Archive
I just finalized a first version of the …
I just finalized a first version of the query cache plugin for Drizzle. The good news is that the version is functional (YES you can try it out) but I guess it’s just a starting point for finding bugs and adding the new stuff in , in short … make it really functional.
In a previous article I have exposed the plugin query cache interface and hooks in the Drizzle kernel. Few modifications were made to that part and I will talk about that and all the technical details in a next article, for now I am just going to give a “how-to” to get you started using the plugin.
First we need to get the branch where the latest code is pushed from my launchpad directory with the command:
$ bzr branch lp:~dedzone/drizzle/query-cache-gsoc
The plugin has two dependencies, Memcached and libmemcached, so make sure have these two installed before going further otherwise the compilation would fail. Then compile the branch and specify the query cache plugin (it’s not by default):
$ ./config/autorun.sh
$ ./configure --with-memcached-query-cache-plugin
$ ./make -j2 && make install
If you wish you can run the test plugin’s test suite (try to have a memcached instance runing on localhost:11211, if the test poses problems)
$./tests/dtr --suite=memcached_query_cache
After that you would have an operational Drizzle server, and since we are going to use the memcached query cache plugin, we will also need a runing instance of memcached (it can be local or remote). Note that the option –add-plugin would cause our plugin to get loaded, again it’s not loaded by default, and don’t forget to create a directory for your data exp: /home/user/db
$ memcached -l 127.0.0.1 -p 11211 -d
$ drizzled --datadir=/home/user/db/ --add-plugin=memcached_query_cache
Now that we are all set, lets try out the memcached query cache plugin by starting a new client (drizzle c++ client or boots) and connect to drizzle. At this point the plugin is loaded and pointing to a default memcached server localhost:11211, if you have a different configuration you can change it dinamically with
drizzle> set global query_cache_servers="localhost:11311 remotehost:11211"
You can also set the expiration time of a memcached entry (default is 0, never), that is a global parameter that everybody will be bound to, but can also be changed dynamically.
drizzle> set global query_cache_expiry= 1000;
So, where is the fun part? I have launched a discussion on Drizzle mailing list to decide on the syntax to cache or not a query. My very first orientation was to extend the sql syntax to and SQL_CACHE and SQL_NO_CACHE hints, but (thanks to all the contributor) switched to a server command that will switch the on or off the cache on demand and per Session.
drizzle>set query_cache_enable= on;
Now your next Select statement will be cached and if you execute the exact same command (including spacing) the results you will get will come out from memcached \o/. Well, not totally true, a Select statement to be cached has to be cacheable, i.e. time functions, random, user, database .. are not deterministic, thus must not be cached. Also user defined functions are not supported (we just don’t know what you’ve done out there), and queries referring to Data_dictionary schema, that will make the server stuck in a time point … not good
But, How would you know if your query is really cached! there is in fact a local cache structure that stores the current cached queries that you can freely query
drizzle> select * from emp;
+----------+------+
| name | age |
+----------+------+
| ded | 28 |
| ded2 | 27 |
| vjsamuel | 19 |
+----------+------+
3 rows in set (0 sec)
drizzle> select * from emp where name like "ded";
+------+------+
| name | age |
+------+------+
| ded | 28 |
+------+------+
1 row in set (0 sec)
drizzle> select * from emp where age < 25;
+----------+------+
| name | age |
+----------+------+
| vjsamuel | 19 |
+----------+------+
1 row in set (0 sec)
drizzle> select sysdate() from emp where age < 25;
+---------------------+
| sysdate() |
+---------------------+
| 2010-08-16 03:19:52 |
+---------------------+
1 row in set (0 sec)
drizzle> select * from data_dictionary.query_cache_entries;
+----------------------------------+--------+-----------------------------------------+
| key | schema | sql |
+----------------------------------+--------+-----------------------------------------+
| 376555ba10dae3d09ca6df52a0839be6 | cool | select * from emp where name like "ded" |
| b1a551e4aa3f717798380e41dcee0960 | cool | select * from emp |
| e2e2402c2b29ac0c2fad1769a31d2d0a | cool | select * from emp where age < 25 |
+----------------------------------+--------+-----------------------------------------+
3 rows in set (0.01 sec)
Here we issued 5 select statement and the 3 of them is in the cache except the ones calling sysdate(), or the data_dictionary. The hash key that you see is an md5 of your query, it’s the sole identification of your query cache in memcached and locally. One can play and look at the local cache content (It is a lovely protobuff message) with the following function:
drizzle> select print_query_cache_meta("e2e2402c2b29ac0c2fad1769a31d2d0a");
key: "e2e2402c2b29ac0c2fad1769a31d2d0a"
schema: "cool"
sql: "select * from emp where age < 25"
select_header {
table_meta {
schema_name: "cool"
table_name: "emp"
}
field_meta {
field_name: "name"
field_alias: "name"
table_name: "emp"
table_alias: "emp"
schema_name: "cool"
}
field_meta {
field_name: "age"
field_alias: "age"
table_name: "emp"
table_alias: "emp"
schema_name: "cool"
}
}
That’s nice, but what happen if the data cached gets tempered with (DML, DDL) ! don’t worry, the invalidation will take that in charge, in fact the plugin will keep a track of all the tables referenced in a query and will build a table similar to query_cache_entries, but will contain the reverse information, thus the table and all the entries of the cache making reference to it, that will avoid the burden of iterating overall the cache.
drizzle> select * from data_dictionary.query_cached_tables;
+---------+--------------------------------------------------------------------------------------------------------+
| Table | Cache_Keys |
+---------+--------------------------------------------------------------------------------------------------------+
| coolemp | ::b1a551e4aa3f717798380e41dcee0960::376555ba10dae3d09ca6df52a0839be6::e2e2402c2b29ac0c2fad1769a31d2d0a |
+---------+--------------------------------------------------------------------------------------------------------+
1 row in set (0 sec)
Next, if one will change (or drop) the content of the table emp in the schema cool the replication system will cause the invalidation of all the cache entries of the cache:
drizzle> delete from emp where name like "ded1";
Query OK, 1 row affected (0 sec)
drizzle> select * from data_dictionary.query_cached_tables;
Empty set (0 sec)
drizzle> select * from data_dictionary.query_cache_entries;
Empty set (0 sec)
Final capability is the query_cache_flush() function, it will basically wipe out the whole content the local cache and the content of the memcached instances registred (make sure your instances are not used for another purpose)
drizzle> select query_cache_flush();
+---------------------+
| query_cache_flush() |
+---------------------+
| 1 |
+---------------------+
1 row in set (0 sec)
drizzle> select * from data_dictionary.query_cached_tables;
Empty set (0 sec)
drizzle> select * from data_dictionary.query_cache_entries;
Empty set (0 sec)
Conclusion:
That was a small kinda manual to get you started with the query cache plugin, I hope to extend the functionalities to support fine grained invalidation in the near future. But for now there is still many work to be done to get rid of all the withstanding problems.
Drizzle QueryCache plugin [Plugin Interface]
It has been a while since I didn’t wrote on my GS0C project “Query Cache”, though many advancements have been done. So far I can say that I have a much clearer idea of the query processing flow and the different data structures used. I have been able to write a first plugin that can actually cache a query into memcached and keep some metadata on a local map object. Another cool option is to use data dictionary views to check int real time the cache content (thanks to the very easy interface). The next step from here is to write the retrieval functionality, this is probably the most difficult part for me cuz I still need to understand the functioning of the client interface.
Before making any further step, here is wiki I have posted on the of drizzle’s wiki page. It is basically an explanation of the query cache plugin interface and the hooks, this has yet to be approved and merged into drizzle trunk.
Principle of the Query Cache
The Query Cache permits to return the results of a query from a cache repository if the query has been already cached; or, to add its results to a cache repository if it’s not cached yet. Therefore the simplest capability that a Query Cache plugin interface needs to provide is the ability to say if the received query is cached or not.
Then:
If the query is cached:
- Retrieve the resultset from the cache and skip the query parsing/processing
If the query is not cached:
- Initiate the Query Cache plugin, where the specific plugin will:
- Check if the query is cacheable
- Reserve a temp area and return a pointer to the session
- Build the hash key
- etc
- Add the results to the current temp area.
- Push the the temp variable and the respective cache key to the cache.
The Query Cache plugin Interface design
In the current design we have introduced the following functions to the interface (drizzled/plugin/query_cache.h)
static bool isCached(Session *session);
This method is responsible to probe the used cache system for an entry corresponding to the received query.
”Note: The plugin is responsible to check if the query is a Select.”
Static bool sendCachedResultset(Session *session);
This method is called directly if the query is cached and skip the lexical and query processing.
static bool prepareResultset(Session *session);
This method is called before the query handling (see: handle_select() method) in order to prepare the resultset that will receive the generated data. An opened session will have its own resultset variable and current query key, the plugin will have to initiate these two variables added to the Class Session.
static bool insertRecord(Session *session, List<Item> &item);
The result rows generated will ultimately go though the client interface, the query cache system intercept these calls and populate the resultset with the data being send to the client (see: select_send() method)
static bool setResultset(Session *session);
Once the query processing is terminated this method will send the resultset variable to the cache repository and finalize the environment variables.
The query resultset being stored and populated in the session is of type Google Protobuf:
drizzled/message/resultset.proto
Plugin Hooks
The following branch contains the plugin interface hooks in the kernel drizzle:
[lp:~dedzone/drizzle/query-cache-hook/changes]
The query interception
Upon reception of a query, (sql_parse.cc:717) in the mysql_parse method, and before even proceeding to the lexical and query processing, a call to ”isCached(session)” is made, a session pointer is passed for the query text, schema, and may be other parameters (for future use).
If isCached returns ”’true”’ then we’ll try to retrieve the results from the cache with ”sendCachedResultset”. If the retrieving operation succeeds (returns false) then we skip the remaining steps.
In the contrary – ie:
- If fetching the results fails or
- If the query is not cached
Then => proceed normally with the query parsing and processing
void mysql_parse(Session *session, const char *inBuf, uint32_t length)
{
lex_start(session);
session->reset_for_next_command();
/* Check if the Query is Cached if and return true if yes
* TODO the plugin has to make sure that the query is cacheble
* by setting the query_safe_cache param to TRUE
*/
bool res= true;
if (plugin::QueryCache::isCached(session))
{
res= plugin::QueryCache::sendCachedResultset(session);
}
if (not res)
{
#if defined(DEBUG)
errmsg_printf(ERRMSG_LVL_DBUG,_("Results retrieved from cache"));
#endif /* DEBUG */
return;
}
LEX *lex= session->lex;
Lex_input_stream lip(session, inBuf, length);
bool err= parse_sql(session, &lip);
........
........
lex->unit.cleanup();
session->set_proc_info("freeing items");
session->end_statement();
session->cleanup_after_query();
}
The Cache preparation and reception of the Resulset
In our journey executing a query, if the query is a Select will land in the (sql_parse.cc:494) execute_sqlcom_select method, this is basically the entry point of the query processing because it prepares to launch the handle_select method. before making this call will insert our prepareResultset hook, which will instruct the plugin implementation to:
- Prepare a cache to receive the resultset, and link it to the session::resultset pointer.
- create a hash key for our query, and link it to the session::query_cache_key member.
At this point the Session will be ‘aware’ that there is a plugin trying to cache the results of the query.
Immediatly after executing the handle_select, where all the resultset were generated, send to the client and added to the current resultset (As we shall see in the next session), we make a setResultset call, this will:
- Push the resultset to the cache area initiated by the plugin
- Reset the session members, resultset and query_cache_key.
bool execute_sqlcom_select(Session *session, TableList *all_tables)
{
LEX *lex= session->lex;
select_result *result=lex->result;
bool res= false;
.....
.....
if (!result && !(result= new select_send()))
return true;
/* Init the Query Cache plugin */
plugin::QueryCache::prepareResultset(session);
res= handle_select(session, lex, result, 0);
/* Send the Resultset to the cache */
plugin::QueryCache::setResultset(session);
if (result != lex->result)
delete result;
}
}
return res;
}
Populating the resultset
The idea behind adding data to the resultset is to intercept the send_data() method in (select_send.h:87) When sending a row to the client interface. So the hook will just pass the row (items) to the query cache plugin using insertResultset(session, items).
/* Send data to client. Returns 0 if ok */
bool send_data(List<Item> &items)
{
.......
.......
while ((item=li++))
{
if (item->send(session->client, &buffer))
{
my_message(ER_OUT_OF_RESOURCES, ER(ER_OUT_OF_RESOURCES), MYF(0));
break;
}
}
/* Insert this record to the Resultset into the cache */
if (session->query_cache_key != "" && session->getResultsetMessage() != NULL)
plugin::QueryCache::insertRecord(session, items);
session->sent_row_count++;
......
}
Drizzle QueryCache plugin [Design notes]
I’ll start writing a series of articles concerning my GSoC project “Drizzle Query Cache plugin”. At the current stage me and Siddharth and our respective mentors Toru Maesaka, Padraig O’Sullivan and the help of Jay Pipes, will all try to come up with the design ideas for the plug-in. As agreed I’ll take care of the caching part and Siddharth will manage the invalidation/replication, our common objective is to reduce the number of invalidation.
The basic capabilities of the Query Cache Plugin are:
- Cache Select queries result sets into Memcached. (Not insert, update ,delete, embedded select, non-cacheable select)
- Return the result set from the cache if an exact similar query (to one cached) is executed.
- Invalidate the cache entry if a table referenced by the query is altered. (with some enhancements)
What is still is to be discussed (but may be at a latter stage):
- Is it possible to handle complex queries: multiple joins, nested queries etc. For sure we can cache them, but can we apply our smart invalidation on them?
I have to mention that MySQL has its own implementation of query cache, and it’s not a bad idea to be “inspired” by some of its concepts, and leave or at least be aware of the unsuccessful ones. It’s also important to mention that MySQL handles the cache in memory only, using Memcached will remove the segmentation problem but also mean communicate over the network if one decides to scale out.
The overall idea of our Query Cache would require the following:
- A query cache mode: MySQL has a deactivated, permanent or an on-demand mode (with SELECT SQL_CACHE directive). I would rather go for just an on demand one, but might be cumbersome for a developer.
- Metadata: And here is the most critical part, in fact to be able to act smart during the invalidation process the plugin have to have some knowledge about the content of the cached queries. Here is some points stating my ideas concerning the importance of each item:
Tables and their aliases: This information is mandatory if a change is made to a table then all cached queries referring to it should be checked for invalidation. A first step would be to work on that.
Selection fields: the selection field can be an important factor later on. imagine the following scenario
cached query: “select name from employee”
update query “update employee set salary=salary+100″.
Here modifying the salary has no impact on the name and should therefore not invalidate our cache!
Condition fields: Our approach is to identify overlapping range selections, and decide if it is necessary to invalidate a cached query based on that
Example:
Cache Query: “Select * from employee e where e.salary>10000″
Update Query: “Update employee e set e.salary=e.salary*rate where e.salary”
Obviously the range of both queries is not overlapping but one must be careful, if rate=4 then the cached query will be deprecated and thus must be invalidated! The range query method can be used with Delete statements but for simple queries. Finally I hope to come up with a sufficient technique to make use of the condition fields adequately!
Now that the basic functions and ideas has been exposed, we need the think about how to store and process these information. My initial proposition was to add a qcache table with following fields:
+Query+ list of tables in the query+ list of selection field+ list of condition fields and their ranges+
With this method the query cache adds the necessary information to that table whenever some query is added to memcached, whereas the Checker will simply look at that table and know exactly what is in the cache and decide how to behave (invalide or not). It turns out that using one table is not realistic, because the list of tables can grow bigger and will involve long parsing, for example to decide if the table employee is used by some cache entry we need to sweep through all the qcache table and parse the table list.
I think that we need a more elegant (complex) design with either a relational schema for the query cache system or add a “cached” column to TABLES, COLUMNS ..ect in the information schema.
Your comments/ideas are welcome!