Surprising MySQL queries in Drupal

May 27, 2010 in Web Development

Warning: This blog post is unusually technical compared to most posts here.

Slow MySQL query in Drupal

The hosting provider I work for in Iceland has been having problems with one of its servers. It turns out the problems are caused by a single high-traffic website running Drupal.

Most of our other websites have no issues even with quite a bit of traffic, so I wanted to dig into this one.

My coworker enabled the MySQL slow query log, which logs all SQL queries that take a long time to execute. What I found was pretty interesting.
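For reference, the slow query log is turned on in my.cnf (the variable names changed between MySQL 5.0 and 5.1; the path and threshold below are illustrative, not what we actually used):

```ini
# my.cnf fragment -- MySQL 5.1+ names; on 5.0 use log-slow-queries instead
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 2   # log any query running longer than 2 seconds
```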

The recurring theme of the log file was this query:

SELECT DISTINCT n.nid, n.uid, n.title, n.type, e.event_start, e.event_start AS event_start_orig, e.event_end, e.event_end AS event_end_orig, e.timezone, e.has_time, e.has_end_date, tz.offset AS offset, tz.offset_dst AS offset_dst, tz.dst_region, tz.is_dst, e.event_start - INTERVAL IF(tz.is_dst, tz.offset_dst, tz.offset) HOUR_SECOND AS event_start_utc, e.event_end - INTERVAL IF(tz.is_dst, tz.offset_dst, tz.offset) HOUR_SECOND AS event_end_utc, e.event_start - INTERVAL IF(tz.is_dst, tz.offset_dst, tz.offset) HOUR_SECOND + INTERVAL 0 SECOND AS event_start_user, e.event_end - INTERVAL IF(tz.is_dst, tz.offset_dst, tz.offset) HOUR_SECOND + INTERVAL 0 SECOND AS event_end_user, e.event_start - INTERVAL IF(tz.is_dst, tz.offset_dst, tz.offset) HOUR_SECOND + INTERVAL 0 SECOND AS event_start_site, e.event_end - INTERVAL IF(tz.is_dst, tz.offset_dst, tz.offset) HOUR_SECOND + INTERVAL 0 SECOND AS event_end_site, as timezone_name FROM node n
INNER JOIN event e ON n.nid = e.nid
INNER JOIN event_timezones tz ON tz.timezone = e.timezone
INNER JOIN node_access na ON na.nid = n.nid
LEFT JOIN domain_access da ON n.nid = da.nid
LEFT JOIN node i18n ON n.tnid > 0 AND n.tnid = i18n.tnid AND i18n.language = 'en'
WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ((da.realm = "domain_id" AND da.gid = 4) OR (da.realm = "domain_site" AND da.gid = 0)) AND (n.language ='en' OR n.language ='' OR n.language IS NULL OR n.language = 'is' AND i18n.nid IS NULL) AND (  n.status = 1 AND ((e.event_start >= '2010-01-31 00:00:00' AND e.event_start <= '2010-03-01 23:59:59') OR (e.event_end >= '2010-01-31 00:00:00' AND e.event_end <= '2010-03-01 23:59:59') OR (e.event_start <= '2010-01-31 00:00:00' AND e.event_end >= '2010-03-01 23:59:59')) )
GROUP BY n.nid HAVING (event_start >= '2010-02-01 00:00:00' AND event_start <= '2010-02-28 23:59:59') OR (event_end >= '2010-02-01 00:00:00' AND event_end <= '2010-02-28 23:59:59') OR (event_start <= '2010-02-01 00:00:00' AND event_end >= '2010-02-28 23:59:59')
ORDER BY event_start ASC;

That’s one hell of a query. It’s from the events system in Drupal, which has a few plugins, such as internationalization and so forth. The query usually took around 5-6 seconds to execute. Bear in mind that we’re running a Dell PowerEdge machine with plenty of processing power and memory.

I ran MySQL’s EXPLAIN command on the query to find out what’s going on, and this is what I found:

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE da index PRIMARY,nid PRIMARY 775 NULL 2146 Using where; Using index; Using temporary; Using f…
1 SIMPLE n eq_ref PRIMARY, node_status_type PRIMARY 4 dbname.da.nid 1 Using where
1 SIMPLE i18n ref tnid tnid 4 dbname.n.tnid 240 Using where
1 SIMPLE na eq_ref PRIMARY PRIMARY 775 dbname.n.nid,const,const 1 Using where
1 SIMPLE e eq_ref PRIMARY, event_start, event_end, timezone PRIMARY 4 1 Using where
1 SIMPLE tz eq_ref PRIMARY PRIMARY 4 dbname.e.timezone 1

My “EXPLAIN” understanding is limited, but a rule of thumb I once heard is that you can multiply the numbers in the “rows” column together to estimate how many rows are actually being read: 2146 * 240 = 515,040 rows. That would certainly explain the slowness, up to a point. Another hint is that the query is not using a selective index for the i18n table (the internationalization plugin), which means it scans around 240 rows every time without a usable key.
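If that missing key really is the bottleneck, one fix worth testing is a compound index covering the columns the i18n join filters on (tnid and language; the i18n alias in the query is a self-join on the node table). This is only a sketch, not something applied here; test it on a copy of the database first:

```sql
-- Hypothetical index on the node table (aliased as i18n in the query above);
-- covers the join condition (tnid) and the filter (language) in one key.
ALTER TABLE node ADD INDEX node_tnid_language (tnid, language);

-- Re-run EXPLAIN afterwards to confirm the i18n join now uses the new key.
```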

Number of queries in Drupal

This is clearly a problem with the i18n plugin, so you can’t really blame Drupal for this. However, when I googled things like “drupal slow” or “drupal mysql queries”, I found this ridiculous quote in the Drupal forums:

160 [mysql queries per request] doesn’t sound like very many to me. But there may be a few really bad ones. I would enable the devel module and see if there are any queries taking much longer than the others. Also the total time spent on queries is also an important stat.

Written by “dalin”

This person is probably not one of Drupal’s core developers, but it’s certainly a hint at the kind of performance standards accepted in the Drupal community.

Increasing performance

There’s a lot of information out there regarding Drupal performance, especially in the official documentation, so if you’re running a Drupal site it’s definitely worth a look.

Drupal has a caching module that is not enabled by default. It has two modes, basic and advanced; advanced caching isn’t available in this case because some modules (notably i18n, among others) don’t support it, so I enabled basic caching.

I also increased our server’s MySQL query cache size to 256 MB, which isn’t all that much, but it helped a lot.
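Roughly, the my.cnf fragment involved looks like this (the sizes are examples to illustrate the settings, not recommendations for your workload):

```ini
# my.cnf fragment -- query cache sizing (examples, tune for your workload)
[mysqld]
query_cache_type  = 1      # cache query results where possible
query_cache_size  = 256M   # total memory reserved for the cache
query_cache_limit = 128K   # skip caching any single result larger than this
```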

Clearly, if this site’s user base grows any further, more adjustments will be needed; the site is not sustainable as it stands.

Number of MySQL queries

There aren’t any international standards or best practices for the number of queries per request on a website, so there’s not really anything I can complain about. However, 160 queries sounds like an awfully high number to me. I try not to use more than around 10-20, maybe 30 queries per request and I stay away from large joins like Drupal seems to use. Sometimes you need to sacrifice normalization for performance.

Any thoughts?

  • CraniumSlows

    Wow. This has been really informative. Thank you for posting this and explaining things. I enjoyed it and will share it with others.

  • Cafuego

    I expect your query is having to create a temporary table and it’s *probably* doing that as a disk table, so you’re waiting for disk I/O. Ouch. You can try increasing the max tmp table size or making MySQL use a tmpfs mount as temp storage.

    Be careful setting a huge query cache; in that case a change to any of the involved tables means the query can’t be grabbed from the cache and needs to re-run anyway. Generally you’re better off making sure large query results can’t be cached, so the cache fills up with a *lot* of tiny queries (like, say, node_load or user_load results).

    Also, which storage engine are you using and has it been properly configured?
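    The temp-table knobs mentioned here map to a my.cnf fragment along these lines (the values and the tmpfs path are illustrative, not settings from this server):

    ```ini
    # my.cnf fragment -- in-memory temp tables are capped by the SMALLER of these two
    [mysqld]
    tmp_table_size      = 64M
    max_heap_table_size = 64M
    # Alternatively, point temp files at a tmpfs mount to avoid disk I/O:
    tmpdir = /dev/shm
    ```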

    • arnorhs

      Thanks a lot for the tips, Cafuego. That’s an excellent article you have there.

      I did specify a “query_cache_limit”, so the queries that get cached are never larger than this limit. I have it set to the default value, which is 8192. The storage engine being used is MyISAM – Drupal specifies this itself.

      This isn’t really a website that I manage, but I found the load and performance intriguing since I haven’t had much experience with Drupal websites. I will probably not take any additional steps in managing this load.

      Your article on MySQL configuration is excellent. I will put those settings into use and see how well they do.

  • Cafuego

    Hmm… 8192 without a qualifier means 8K. That’s probably a bit smaller than you really want. Mine is set to 128K. 8M (8192K), on the other hand, would be too big imo.

    Actually, adding an index on the i18n table (tnid and language) is a very good idea, and that should probably be submitted as a bug report on that module.

  • Nicholas Moline

    I think it is worth noting that this query isn’t Drupal’s fault at all: only 2 of the tables mentioned in that query are in a stock Drupal installation (node and node_access), and in all but the most complex installations, node_access has only 1 record in it, indicating that nodes should be accessible as a general rule.

    Event, i18n, event_timezone, and Domain Access are all third-party modules. event in particular is known for being a bit of a resource hog with a lot of slow queries, and it has not been well maintained in a long time (the latest “stable” release is for Drupal 5, back in 2007; all Drupal 6 releases are considered unstable).

    i18n and Domain Access are particularly complex modules. i18n allows multiple language translations of a site by joining against a translation table, which, as you pointed out, is not done with query optimization in mind. Domain Access is about hosting multiple domains on the same installation and database, with logic to decide which content should be available on which “site”, and it is known to break a lot of caching.

    Now, Drupal is not perfect; some of the core modules, such as the user module, are known to have queries that can be frightening once you get above a few hundred users in your database, but it’s hard to blame Drupal for modules written by third parties.

  • Nathan

    I ran into something like this. The combination of modules we had enabled filled up node_access, which did not have any indexes. Adding them turned 5-minute queries into sub-second queries.

  • Joe

    The event module isn’t exactly the best, nor is it supported as stable in the current release of Drupal. I would suggest using the CCK + Views + Calendar modules; using date field objects, you should be able to generate smarter queries and cache more info, resulting in much faster queries. Furthermore, quite a few more Drupal modules support CCK than support events (also, CCK is built into Drupal 7, so there’s a cleaner upgrade path).

    Hope this helps.

  • Rob Ristroph

    In Drupal, the node_access, domain_access, and i18n modules will each cause an additional INNER JOIN to be added to certain queries. This is not easy to change without rewriting a lot of code, so you should focus on other things first; however, whoever built the site may have enabled those modules without really needing them.

    From the other tables in the query, you seem to be using the Drupal event module. I presume the query is for a view of some type of upcoming events, or something like that.

    This type of query is very cacheable. I would advise using advanced caching in spite of the warning about problems with i18n; in my experience you probably won’t have problems.

    If the site in question, like most sites, receives mostly anonymous traffic, you can put a reverse proxy cache such as Varnish in front of it. The easiest way to get a Drupal / Varnish setup is to use the Pantheon / Project Mercury stack from Chapter Three.

    As a hosting provider, you may want to look more into Varnish. It is incredibly fast; as long as you configure it so that logged-in users (look for the session cookie) punch through it to the back end, setting up Varnish will take a huge amount of load off your servers. A single Varnish box can front many web servers.

  • Stef

    160 is not a lot for a Drupal site. I’ve worked on Drupal sites that had over 1,000 queries on the homepage. No kidding.

    I find it a slow shitty-ass CMS but it is powerful.

  • Miguel Pérez

    The primary keys for tables da and na are huge. The data for the first/main table to read will be read just from indexes, not actual row data (though not a big improvement if it’s an InnoDB table). The biggest problem here is not the number of rows to evaluate, but the fact it’ll have to use a temporary table (MEMORY if it fits within a configurable limit; MyISAM if it doesn’t – big performance impact) due to GROUP BY, then it’ll have to sort rows (in memory if it fits, on disk if it doesn’t – big performance impact again).

    I haven’t bothered to analyze the query and tables themselves, i.e. whether what it does makes any sense and is done efficiently, though they’d want to move some processing to the PHP side. I’d reduce superfluous fields and handle time zones a different way. Contrary to popular belief, when working in a PHP+MySQL architecture, you should move as much processing off the database as possible, as PHP is ridiculously easy to scale when necessary, while MySQL is not. In fact, all relational databases scale moderately poorly (and MySQL isn’t the best in the group), while mostly stateless servers will always scale very easily.

  • Raja

    Add an index on the `domain_access` and `node` tables — maybe use memcache for that site.

    Email me if you want to discuss in detail

  • Colin Dean

    I have not used Drupal, so I’m not sure of its internals. However, it sounds like the data assembled by this query isn’t very dynamic, meaning the query doesn’t really need to be executed on every page load.

    Does Drupal have a memcached plugin? If not, perhaps you could modify where this query is executed, do a poor man’s cache to disk, and only run the query every hour or so.

    I know this only dodges the problem at hand, but it’s another workaround.

  • Matt

    I usually don’t mind Drupal’s large number of queries because the vast majority are only run when it’s about to cache a page. The flexibility of the plugin system breeds many queries, and I think that’s a valuable trade-off given effective caching.

  • Liehann

    I have no idea about Drupal or the tables in your example, but I find SQL optimization problems interesting, and I would be surprised if there aren’t a few small changes you can make to the query or schema (adding indexes) that would give you a great performance increase.

    I would be interested in some more info on what the tables look like – mostly the domain_access table and also the node_access table. Questions that come to mind are:
    – how many rows are in the tables
    – how many rows are expected to be returned from the query and from the table specific sub-clauses
    – how are the tables indexed

    For example, the WHERE clause seems to be four ANDed clauses, and the joins are all simple. You would want MySQL to start with the table where the fewest rows need to be evaluated, as the rest of the joins should then be simple indexed lookups.

    It seems surprising that so many rows are evaluated for the domain_access table – how many rows are returned if you execute:
    select count(*) from domain_access da where (da.realm = "domain_id" AND da.gid = 4) OR (da.realm = "domain_site" AND da.gid = 0);
    select count(*) from node_access na where na.grant_view >= 1 AND na.gid = 0 AND na.realm = 'all';

    What do the indexes on those tables look like:
    describe domain_access
    describe node_access

    First thing to check: is there a compound key on domain_access over realm and gid? Second: do both tables have indexes on nid? Adding a compound key should look something like the following (syntax could be a little off, going from memory). Depending on your uptime requirements, run the query on a copy of your DB first to check run time and impact; if the table is sufficiently large and the DB is under load, adding an index could run long enough to have end-user impact.
    alter table domain_access add index da_realm_gid(realm, gid)

    Beyond that, assuming the query can’t be changed (I’m guessing it’s auto-generated), I would muck about with other indexes on the tables in question.

    Good luck. Post some results if you tinker.

  • Ilan Hazan

    If your hosting account doesn’t have access to the my.cnf file, you can log the Drupal site’s queries from the Drupal code itself.