After several months of development, I am happy to announce the official release of OLTP-Bench, an extensible "batteries included" DBMS benchmarking testbed! The project is meant to be an aggregator of popular and research-oriented OLTP benchmarks. It provides a portable framework for workload generation and an API for integrating benchmark queries. Moreover, it uses the JDBC API, which lets you connect to any DBMS that has a proper driver.
OLTP-Bench has a modular architecture for hooking in new benchmarks. Hopefully I'll write a detailed how-to guide, but for now you can already look at the implemented benchmarks to get an idea of how to write your queries and use the workload generator. We ported several popular and interesting benchmarks of varying complexity and application domain: TPC-C-like, TATP, SEATS, AuctionMark, YCSB, Wikipedia, Twitter, JPAB, Epinions and ResourceStresser. More information on each benchmark is available here.
The workload generator is driven by an XML configuration file: users define phases of execution, each composed of a target rate (expressed in transactions per second), the duration for which to apply that rate, and the weight of each procedure (or query) of the benchmark. By combining phases, one can simulate very complex situations to stress and test the database system. Doing so, we have conducted hundreds of experiments on different systems and configurations; more details are available here.
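As an illustration, a phase-based configuration might look like the sketch below. The element names here are indicative only and may not match OLTP-Bench's actual schema, so check the sample configuration files shipped with the project:

```xml
<!-- Hypothetical sketch: two phases, each with a target rate (tx/s),
     a duration (seconds), and per-procedure weights. -->
<works>
  <work>
    <rate>100</rate>                 <!-- target transactions per second -->
    <time>60</time>                  <!-- apply this rate for 60 seconds -->
    <weights>45,43,4,4,4</weights>   <!-- weight of each benchmark procedure -->
  </work>
  <work>
    <rate>500</rate>                 <!-- spike phase to stress the system -->
    <time>30</time>
    <weights>45,43,4,4,4</weights>
  </work>
</works>
```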
Hopefully this will get the database community excited: our goal was not to write "yet another benchmark" but rather to engage everyone in sharing their configurations and results.
Memcached is a key-value object cache used by many websites to relieve the load on their databases and provide faster answers to clients. It's very interesting since it needs only commodity PCs and can scale almost indefinitely. Key distribution is handled entirely by your application, usually with a hash function; the different instances share nothing and have no knowledge of each other!
I installed memcached on my Ubuntu machine using the PPA; you can start your instance right away. In fact, you can start multiple instances of memcached on different ports of the same machine. With the following commands I started two instances, on ports 11211 and 11311:
ded@ubuntu:~$ memcached -p 11211 -d
ded@ubuntu:~$ memcached -p 11311 -d
-p specifies the listening port (11211 is the default); -d tells memcached to run as a daemon.
You can test a running memcached instance with telnet. The following example connects to the first instance, sets the key "key1" to "hello", then gets that key back.
Let's connect to the first instance:
ded@ubuntu:~$ telnet localhost 11211
Connected to localhost.
Escape character is ‘^]’.
Now we want to store the word "hello" under the key "key1". The first digit is the flags field, the second is the expiry time (0 means never expire), and the last is the number of bytes expected (5 for the word "hello"); if you get that wrong, you will receive a "bad data chunk" error message!
set key1 0 0 5
hello
STORED
Let's now retrieve that value:
get key1
VALUE key1 0 5
hello
END
We close the connection using quit:
quit
Connection closed by foreign host.
Next I used libmemcached, through its C++ interface, to access and use the memcached instances. Here I relied on MyCache, a class that Padraig O'Sullivan wrote on his blog. Here is a simple main that makes use of that class to cache a string:
std::string text = "Hello world"; // our string to cache
std::vector<char> raw(text.begin(), text.end()); // cached objects need to be in vector form
MyCache::singleton().set("key1", raw); // cache the vector under the key "key1"
std::vector<char> data = MyCache::singleton().get("key1"); // retrieve the object as a vector
To compile a program that uses libmemcached, don't forget the -lmemcached linker flag!
g++ -o test test.cc -lmemcached
Note: It's very easy to write your own MyCache class; it's just a wrapper around memcache::Memcache. But if you use Padraig's, you need to set num_of_clients in the MyCache class to the number of instances you have, and replace the GetCache() method, which randomly selects a client, with a deterministic one, since you'll need to retrieve your cached object from exactly the instance where you put it!
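A minimal sketch of such a deterministic selection, assuming the num_of_clients and GetCache() names from Padraig's class: hash the key and take it modulo the number of instances, so a given key always maps to the same memcached server.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: deterministically pick the client index for a key.
// The same key always yields the same index, so gets hit the instance
// that served the original set.
std::size_t pick_client(const std::string& key, std::size_t num_of_clients) {
    return std::hash<std::string>{}(key) % num_of_clients;
}
```

GetCache() would then simply return the client at pick_client(key, num_of_clients) instead of a random one.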
I will write a MyCache version that takes care of that shortly… stay tuned!
Figured I'd put down some of my new insights about memcpy when dealing with strings, char arrays and vectors in C++.
First, initializing char arrays; it's important to specify the size of the array!
char c[] = "hello"; // auto-sizes the array from the string literal (6 chars, including the '\0')
char *d = new char[text.size() + 1]; // allocates a char array of a given size (+1 for the terminator)
To transform a string into a char array:
// by using c_str()
const char* c = text.c_str(); // c_str() returns a const char*, so c must be a const char*
// by using memcpy
char* buf = new char[text.size() + 1];
memcpy(buf, text.c_str(), text.size() + 1); // copies the characters plus the '\0' terminator
Dealing with a vector of char:
std::vector<char> vec(text.size()); // need to specify the size of the vector
Although a vector can be dynamically expanded using push_back, when using memcpy we have to make sure its size matches the source. Furthermore, memcpy takes void* pointers; that is why we pass the underlying char* of the string or vector, since a char* converts implicitly to void*.
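Putting the pieces together, here is a small self-contained sketch of copying a string into a pre-sized vector of char with memcpy (&vec[0] gives the char* that converts to memcpy's void*):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Copy a std::string into a std::vector<char> with memcpy.
// The vector must be sized to match the source before the copy.
std::vector<char> to_vector(const std::string& text) {
    std::vector<char> vec(text.size());                 // size must match the source
    std::memcpy(&vec[0], text.c_str(), text.size());    // char* converts implicitly to void*
    return vec;
}
```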
Google Summer of Code 2010 student list is out, and I am in!! ^_^
Yet another objective I had wanted to achieve for a long time; although, after all, being accepted is just the beginning.
My project abstract is “Memcached query cache plug-in for Drizzle” and my mentor is Toru Maesaka.
This summer will be quite busy, between travels and my research project! I'll spend late nights drizzling and dwelling in Drizzle's IRC channel (nickname: ded) for my GSoC project! Youpala!
As a reference for future GSoC students, below is my initial project proposal, excluding the schedule:
A Memcached Query Cache Plugin for Drizzle
Synopsis
Caching is a key ingredient for scaling web applications: a simple principle that avoids DBMS access when the same query is executed over and over, which reduces computation and IO time. The goal of this project is to create a query cache plug-in for Drizzle that will make it possible to scale out memory by storing the results of redundant queries in a cache repository like memcached, and to return the cached results to clients when the same request is executed again, without having to parse and execute the query.
Benefits to the Drizzle Community
The ability to scale is not a luxury but a requirement for a database backend intended for web and cloud applications. This project will set the standard API for Drizzle's query cache, and will be carried out hand in hand with the replication initiative to minimize the number of invalidations incurred when the database is altered: a problem that has hindered the adoption of such a feature, and for which we propose a new approach.
Create a plugin that works with any external cache-based software (primarily focused on memcached)
Create the q_cache system view (holds local metadata about the cache content)
Implement the SQL cache syntax: SQL_CACHE (SELECT SQL_CACHE * FROM table1)
Allow recognizing uncacheable queries (optional)
The query cache stores the result of a SELECT statement in a hash-based repository like memcached, so that when an identical query is executed the server simply sends back the entry's content.
Deciding whether there is a cache hit depends on the key employed. For obvious reasons, it must at least contain the original query. We can use md5() or MurmurHash to create a key from the query, or add extra information: server, user, etc.
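A sketch of such key construction, with std::hash standing in for the md5()/MurmurHash mentioned above (the "qc:" prefix and the inclusion of the user are my own assumptions for illustration):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical cache-key construction: hash the query together with extra
// context (here, the user) so identical queries from the same context hit
// the same memcached entry.
std::string make_cache_key(const std::string& query, const std::string& user) {
    std::size_t h = std::hash<std::string>{}(user + "\x1f" + query);
    return "qc:" + std::to_string(h);
}
```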
Some rules apply to decide if a query is cacheable:
- Must be explicitly specified by the user, using a hint (SQL_CACHE)
- The query result must be deterministic (queries using rand() or sysdate() are not)
In order to keep the cache content consistent with a dynamic database, we have to implement a cache invalidation policy. In general, a table that has been altered must evict every cache entry referring to it; this trivial invalidation behavior, however, reduces the benefits of such an implementation.
It is useful to notice that changing the content of some rows does not necessarily affect the cache content. We propose to enhance the invalidation behavior by separating the entries affected by an update from those that are not. We maintain locally some metadata about the current cache content: key, tables, projection and selection fields. The idea is then an algorithm that decides whether a DML statement (UPDATE/DELETE/INSERT) falls within the range of a cache entry by checking that metadata.
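A minimal sketch of the table-level part of that idea, under the assumption that each metadata record stores the tables its query touched (the finer projection/selection filtering described above is omitted here):

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical local metadata record for one cache entry.
struct CacheMeta {
    std::string key;               // memcached key of the cached result
    std::set<std::string> tables;  // tables referenced by the query
};

// On a DML statement against altered_table, return only the cache keys
// whose queries reference that table; all other entries survive.
std::vector<std::string> keys_to_invalidate(
        const std::vector<CacheMeta>& metadata,
        const std::string& altered_table) {
    std::vector<std::string> keys;
    for (const CacheMeta& m : metadata)
        if (m.tables.count(altered_table))
            keys.push_back(m.key);
    return keys;
}
```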
To be continued ..