Capillary’s cloud platform has a Service Oriented Architecture, and MySQL is our primary data store. Each service depends on one or more MySQL instances, and the latency of each service depends heavily on the performance of its SQL queries. On average, 200 million SQL queries are fired each day by the core services of our platform. Hence it’s crucial to monitor the performance of the queries to identify badly written ones, create new indexes in the schema, and add more caching at the application layer.

To this end, we collect all the SQL queries from the application logs and MySQL slow query logs through a log collection framework (which is outside the scope of this post) and dump them into an 8-node HDFS cluster. Along with the queries, we also collect other metrics like query times, lock wait times, rows scanned etc. Once the data is pushed into HDFS, we use Hive to analyze the queries and obtain various metrics such as the frequency of the queries, the percentile distribution of the query latency etc. Hive provides many aggregate functions (called UDAFs) like Sum, Average and Percentile that can be used to extract these metrics. But these functions are effective only if the strings, SQL queries in our case, are an exact match; otherwise the “GROUP BY” clause won’t be able to aggregate them. For example, consider the two SQL queries given below:

SELECT * FROM entities WHERE id = 10357613 AND org_id = 253
SELECT * FROM entities WHERE id = 20481 AND org_id = 655

The two queries differ only in their values/constants, and both should be treated as the same query when applying the aggregation functions. Hence, we need a mechanism to normalize all the queries so that similar ones can be identified and accurate performance metrics generated.

We have written a Hive UDF, called “query_hash”, that takes a SQL query string and computes a hash that is identical for all queries that are logically similar and differ only in their input values. The hash is computed for each query with the UDF before the query is fed into HDFS, and is stored along with other parameters like query time, rows scanned etc. The UDAFs are then used with the GROUP BY clause applied on the query hash, instead of the query string, to generate aggregate-level statistics. The rest of the post explains the implementation details of the UDF.


The UDF computes the query hash in two steps:

1) Syntax Tree Evaluation:

We employ a SQL parser to build a syntax tree of the query. The tree is traversed recursively to find and replace all the values/constants in the query with a placeholder character, yielding a constant-free query string. For the two queries given above, using ‘?’ as the placeholder, the resulting string would be:

SELECT * FROM entities WHERE id = ? AND org_id = ?
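The UDF does this replacement by walking the parser’s syntax tree. As a rough illustration of the same constant-stripping idea, here is a minimal Python sketch that strips quoted strings and numeric literals with a regular expression and hashes the result; a real implementation would rely on a proper SQL parser rather than a regex:

```python
import hashlib
import re

# Matches single-quoted string literals (including escaped quotes)
# and standalone numeric literals.
_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'|\b\d+(?:\.\d+)?\b")

def normalize(sql):
    """Replace every constant in the query with the '?' placeholder."""
    return _LITERAL.sub("?", sql)

def query_hash(sql):
    """Hash of the constant-free query: identical for similar queries."""
    return hashlib.md5(normalize(sql).encode("utf-8")).hexdigest()

# Two similar queries that differ only in their constants:
q1 = "SELECT * FROM entities WHERE id = 10357613 AND org_id = 253"
q2 = "SELECT * FROM entities WHERE id = 20481 AND org_id = 655"
```

Both queries normalize to the same constant-free template, so `query_hash(q1)` and `query_hash(q2)` are equal and the GROUP BY clause can aggregate them.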

The parser used to build the syntax tree is based on the grammar of ANSI SQL and thus fails to evaluate several queries that are valid in MySQL but do not comply with ANSI SQL. For instance, functions like “DATE_ADD(‘2008-01-02’, INTERVAL 31 DAY)” are not part of ANSI SQL but are provided by MySQL. Similarly, ANSI SQL mandates the presence of an aggregate function in the query when the “GROUP BY” clause is used, but it’s possible to write queries in MySQL without honouring this restriction. The positional index hashing approach helps to overcome this limitation.

2) Positional Index Hashing:

For all the queries whose syntax tree cannot be built, we use the positional index hashing approach. We maintain a list of keywords, each assigned a unique integer value. These keywords are:

  1. MySQL reserved words
  2. Column names obtained by crawling all the database tables on a regular basis.

Each query is scanned to find the keywords, recording the position at which each keyword occurs together with its associated integer value. Since the keywords occur at the same positions in all similar queries, hashing this sequence of positional index-value pairs yields the same hash for similar queries.
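A minimal Python sketch of this fallback, using a toy keyword table (the real table holds all MySQL reserved words plus the crawled column names; the integer assignments here are purely illustrative):

```python
import hashlib

# Toy keyword table: MySQL reserved words and crawled column names,
# each mapped to a unique integer (values are illustrative).
KEYWORDS = {
    "SELECT": 1, "FROM": 2, "WHERE": 3, "AND": 4,
    "entities": 100, "id": 101, "org_id": 102,
}

def positional_index_hash(sql):
    # Record a (position, keyword-id) pair for every token that is a
    # known keyword. Values/constants are not in the keyword table, so
    # similar queries differing only in their constants yield the same
    # sequence of pairs and hence the same hash.
    pairs = [(pos, KEYWORDS[tok])
             for pos, tok in enumerate(sql.split())
             if tok in KEYWORDS]
    return hashlib.md5(repr(pairs).encode("utf-8")).hexdigest()
```

Because this approach never parses the query, it works even for MySQL-specific syntax that the ANSI SQL parser rejects.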


As mentioned above, 200 million queries are analyzed every day to generate various reports. A sample Hive query using query_hash to generate a report is shown below:
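A sketch of the shape such a report query takes, assuming the loghive table carries host, module, query, query_hash and query_time columns and a dt partition column (all names and the date here are illustrative):

```sql
SELECT host,
       module,
       MIN(query)  AS sample_query,   -- one representative query per hash
       COUNT(*)    AS query_count,
       percentile_approx(query_time, 0.95) AS p95,
       percentile_approx(query_time, 0.90) AS p90
FROM loghive
WHERE dt = '2014-06-01'
GROUP BY host, module, query_hash
ORDER BY query_count DESC;
```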

The queries here are stored in a Hive table named loghive, with one partition for each day’s query dump. This report shows the most frequently executed queries by different services and the percentile distribution of their execution times.

| Host | Module | Sample Query | Query template | Count | 95 percentile | 90 percentile |
| --- | --- | --- | --- | --- | --- | --- |
| appserver | import service | SELECT, cfd.value FROM c_fields_data cfd JOIN c_fields cf ON = cfd.cf_id WHERE cfd.org_id = 655 | SELECT, cfd.value FROM c_fields_data cfd JOIN c_fields cf ON = cfd.cf_id WHERE cfd.org_id = ? | 52376 | 12.411 | 12.113 |
| apiserver | api | SELECT * FROM entities WHERE id = 10357613 AND org_id = 253 | SELECT * FROM entities WHERE id = ? AND org_id = ? | 14879 | 13.840 | 5.407 |
| host1 | communication engine | UPDATE e_status SET hard_bounce= hard_bounce+1 WHERE email = ‘’ | UPDATE e_status SET hard_bounce= hard_bounce+1 WHERE email = ‘?’ | 5043 | 3.235 | 1.356 |

This report highlights frequently fired queries that take a significant time to execute. Many such reports are generated and visualized on a dashboard to observe the trend over time.

The graph shows the number of slow queries fired by different services on a daily basis. The detailed list of the exact queries can be downloaded, and the queries can then be optimized to help scale the services and the platform.

Future work:

We are planning to open source the “query_hash” UDF very soon. Stay tuned for more updates!

If working on challenging problems of a similar nature excites you, we’re always looking for new members to join the team. Take a look at our current openings or contact