I'll start writing a series of articles about my GSoC project, the "Drizzle Query Cache plugin". At the current stage, Siddharth and I, together with our respective mentors Toru Maesaka and Padraig O'Sullivan, and with the help of Jay Pipes, are trying to come up with the design ideas for the plug-in. As agreed, I'll take care of the caching part and Siddharth will manage the invalidation/replication; our common objective is to reduce the number of invalidations.
The basic capabilities of the Query Cache Plugin are:
- Cache SELECT query result sets into Memcached (not INSERT, UPDATE, DELETE, embedded SELECTs, or non-cacheable SELECTs).
- Return the result set from the cache when a query identical to a cached one is executed.
- Invalidate a cache entry when a table referenced by its query is altered (with some enhancements).
What still has to be discussed (possibly at a later stage):
- Is it possible to handle complex queries: multiple joins, nested queries, etc.? We can certainly cache them, but can we apply our smart invalidation to them?
I have to mention that MySQL has its own implementation of a query cache, and it's not a bad idea to be "inspired" by some of its concepts, and to drop, or at least be aware of, the unsuccessful ones. It's also important to mention that MySQL keeps its cache in memory only; using Memcached removes the memory-fragmentation problem, but also means communicating over the network if one decides to scale out.
The overall idea of our Query Cache would require the following:
- A query cache mode: MySQL offers deactivated, permanent, and on-demand modes (the latter via the SELECT SQL_CACHE directive). I would rather have just an on-demand mode, though that might be cumbersome for developers.
- Metadata: this is the most critical part. To act smartly during the invalidation process, the plugin has to have some knowledge about the content of the cached queries. Here are some points stating my ideas about the importance of each item:
Tables and their aliases: this information is mandatory; if a change is made to a table, then all cached queries referring to it must be checked for invalidation. A first step would be to work on that.
Selection fields: the selected fields can become an important factor later on. Imagine the following scenario:
cached query: "select name from employee"
update query: "update employee set salary=salary+100"
Here, modifying the salary has no impact on the name, and should therefore not invalidate our cache!
Condition fields: our approach is to identify overlapping range selections and decide, based on that, whether it is necessary to invalidate a cached query:
Cached query: "select * from employee e where e.salary > 10000"
Update query: "update employee e set e.salary = e.salary * rate where e.salary …"
Obviously the ranges of the two queries do not overlap, but one must be careful: if rate = 4, the updated salaries can cross into the cached query's range, so the cached entry becomes stale and must be invalidated! The range method can also be used with DELETE statements, but only for simple queries. Finally, I hope to come up with a technique that makes adequate use of the condition fields!
Now that the basic functions and ideas have been exposed, we need to think about how to store and process this information. My initial proposal was to add a qcache table with the following fields:
+ query + list of tables in the query + list of selection fields + list of condition fields and their ranges +
With this method the query cache adds the necessary information to that table whenever a query is added to Memcached, and the checker simply looks at that table to know exactly what is in the cache and decide how to behave (invalidate or not). It turns out that using a single table is not realistic, because the list of tables can grow large and would involve long parsing: for example, to decide whether the table employee is used by some cache entry, we would need to sweep through the whole qcache table and parse each table list.
I think we need a more elegant (albeit more complex) design: either a relational schema for the query cache system, or a "cached" column added to TABLES, COLUMNS, etc. in the information schema.
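To make the discussion concrete, here is a rough C++ sketch of what one metadata entry and that costly "sweep" might look like (all struct, field, and function names are illustrative assumptions of mine, not the actual design):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative sketch of one qcache metadata row, mirroring the proposed
// columns: the query, its tables, its selection fields, and its condition
// fields with their ranges.
struct ConditionRange {
    std::string field;  // e.g. "salary"
    double lo, hi;      // e.g. [10000, +inf) for "salary > 10000"
};

struct QCacheEntry {
    std::string query;                       // the cached SELECT text
    std::vector<std::string> tables;         // tables (and aliases) it reads
    std::vector<std::string> select_fields;  // projected columns
    std::vector<ConditionRange> conditions;  // WHERE-clause ranges
};

// The linear "sweep" the single-table design implies: scan every entry
// and search its table list for the altered table.
std::vector<const QCacheEntry*>
entries_for_table(const std::vector<QCacheEntry>& cache, const std::string& table) {
    std::vector<const QCacheEntry*> hits;
    for (const QCacheEntry& e : cache)
        if (std::find(e.tables.begin(), e.tables.end(), table) != e.tables.end())
            hits.push_back(&e);
    return hits;
}
```

A relational schema (or the "cached" column idea) would essentially replace this linear scan with an indexed lookup from table name to cache keys, which is the motivation stated above.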
Your comments/ideas are welcome!
Memcached is a key-value object cache used by many websites to relieve the load on their databases and provide faster answers to clients. It's very interesting since it needs only commodity PCs and can scale almost indefinitely. Key distribution is done solely by your application, usually with a hash function; thus the different instances share nothing and have no knowledge of each other!
I installed memcached on my Ubuntu machine from the PPA; you can start an instance right away, and in fact you can run multiple memcached instances on different ports on the same machine. With the following commands I started two instances on ports 11211 and 11311:
ded@ubuntu:~$ memcached -p 11211 -d
ded@ubuntu:~$ memcached -p 11311 -d
-p specifies the listening port (11211 is the default), and -d tells memcached to start as a daemon.
You can test a running memcached instance with telnet. The following example connects to the instance, sets the key "key1" to "hello", then gets that key back.
Let's connect to the first instance:
ded@ubuntu:~$ telnet localhost 11211
Connected to localhost.
Escape character is ‘^]’.
Now we want to store the word "hello" under the key "key1". The first number is the flags value, the second is the expiry time (0 means never expire), and the last is the number of bytes expected (5 for the word "hello"); if you get that count wrong you will receive a "bad data chunk" error message!
set key1 0 0 5
hello
STORED
Let's now retrieve that value:
get key1
VALUE key1 0 5
hello
END
We close the connection with quit:
quit
Connection closed by foreign host.
Next I used libmemcached, a C/C++ client library, to access and use memcached instances. Here I relied on MyCache, a class that Padraig O'Sullivan presented on his blog. Here is a simple main that uses that class to cache a string:
std::string text = "Hello world";               // our string to cache
std::vector<char> raw(text.size());             // cached objects need to be in vector form
memcpy(&raw[0], text.c_str(), text.size());     // copy the string's bytes into the vector
MyCache::singleton().set("key1", raw);          // cache the vector under the key "key1"
std::vector<char> data = MyCache::singleton().get("key1"); // retrieve the object as a vector
To compile a program that uses libmemcached, don't forget the -lmemcached linker flag!
g++ -lmemcached -o test test.cc
Note: it's very easy to write your own MyCache class; it's just a wrapper around memcache::Memcache. But if you use Padraig's version, you need to set the number of instances you have in num_of_clients, and replace the GetCache() method, which randomly selects a client, with a deterministic one, since you'll need to retrieve a cached object from exactly the instance where you put it!
I will write a MyCache version that takes care of that shortly… stay tuned.
I figured I'd put down some of my new insights about memcpy when dealing with strings, char arrays, and vectors in C++.
First, initializing char arrays; it's important to specify the size of the array!
char c[] = "hello"; // auto-sizes the array of chars from the length of the literal
char *c = new char[text.size()]; // allocates a char array of a given size
To transform a string into a char array:
// by using c_str()
const char* c = text.c_str(); // c_str() returns a const char*, so c must be a const char*
// by using memcpy
char* buf = new char[text.size() + 1];
memcpy(buf, text.c_str(), text.size() + 1); // +1 also copies the terminating '\0'
Dealing with vectors of char:
vector<char> vec(text.size()); // need to specify the size of the vector
Although a vector can be expanded dynamically with push_back, when using memcpy we have to make sure its size matches the source. Furthermore, memcpy takes void* pointers, which is why we pass the addresses of our char buffers; a char* converts implicitly to void*.
Google Summer of Code 2010 student list is out, and I am in!! ^_^
Yet another objective I wanted to realize long ago, because, after all, being accepted is just the beginning.
My project abstract is “Memcached query cache plug-in for Drizzle” and my mentor is Toru Maesaka.
This summer will be quite busy, between travels and my research project! I'll have late nights drizzling and dwelling in Drizzle's IRC channel (nickname: ded) for my GSoC project! Youpala!
As a reference for future GSoC students, below is my initial project proposal, excluding the schedule:
A Memcached Query Cache Plugin for Drizzle
Synopsis: Caching is a key ingredient for scaling web applications: a simple principle that avoids DBMS access when the same query is executed over and over, which reduces computation and I/O time. The goal of this project is to create a query cache plug-in for Drizzle that makes it possible to scale out memory by storing the results of redundant queries in a cache repository such as memcached, then returning the cached results to clients whenever the same request is executed, without having to parse and execute the query again.
Benefits to the Drizzle Community
The ability to scale is not a luxury but a requirement for a database backend intended for web and cloud applications. This project will set the standard API for Drizzle's query cache, and will be carried out hand in hand with the replication initiative to minimize the number of invalidations incurred when the database is altered, a problem that has hindered the adoption of such features and for which we propose a new approach.
- Create a plugin that works with any external cache software (primarily focusing on memcached)
- Create the q_cache system view (holding local metadata about the cache content)
- Implement the SQL cache syntax: SQL_CACHE (SELECT SQL_CACHE * FROM table1)
- Allow recognizing uncacheable queries (optional)
The query cache stores the result of a SELECT statement in a hash-based repository like Memcached, so that when an identical query is executed the server simply sends back the entry's content.
Deciding whether there is a cache hit depends on the key employed. For obvious reasons, it must at least contain the original query. We can use md5() or MurmurHash to create a key from the query, or add extra information: server, user, etc.
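As a sketch of that key construction (the function name, and the use of FNV-1a in place of md5()/MurmurHash, are simplifications of mine), the extra context can simply be folded into the hashed text alongside the query:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// FNV-1a: a small, stable stand-in for md5()/MurmurHash, for illustration.
uint32_t fnv1a(const std::string& s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Hypothetical key builder: fold the user and schema into the hashed text,
// so the same SQL issued in a different context gets its own cache entry.
std::string make_cache_key(const std::string& query,
                           const std::string& user,
                           const std::string& schema) {
    std::ostringstream out;
    out << std::hex << fnv1a(user + '\0' + schema + '\0' + query);
    return out.str();
}
```

The '\0' separators prevent ambiguous concatenations (e.g. user "ab" + schema "c" colliding with user "a" + schema "bc").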
Some rules apply to decide if a query is cacheable:
- Must be explicitly specified by the user, using a hint (SQL_CACHE)
- The query result must be deterministic (queries using rand() or sysdate() are not)
In order to keep the cache content consistent with a dynamic database, we have to implement a cache invalidation policy. Generally, altering a table must evict every cache entry referring to it. However, this trivial invalidation behavior reduces the benefits of the whole implementation.
It is useful to notice that changing the content of some rows does not necessarily affect the cache content. We propose to enhance the invalidation behavior by separating the entries affected by an update from those that are not. We have to maintain locally some metadata about the current cache content: key, tables, projection and selection fields. The idea is then to create an algorithm that decides whether a DML statement (update/delete/insert) falls within the range of a cache entry by checking that metadata.
To be continued ..
I've got this frenzied habit of typing "ls" at every command prompt! It sounds crazy, but I can't help it. It's even weirder when I'm in front of a DOS shell, still typing my ls command and of course getting an error. Let's face the truth: I just cannot type "dir" on the first try unless an enormous concentration effort is involved. So I decided to look for a workaround.