I was recently given the opportunity to optimize a query that processed a total of 660 million rows. The problem with this query was that it took 150 minutes to complete, provided that it did not time out (which it did ~40% of the time).
The query timing out caused two key problems, namely: 1) another 150 minutes of execution time was required (excluding debugging), and 2) the missing output data affected downstream processes, which were then unable to meet their SLAs.
With my query optimization, I managed to get the run time down to an average of ~73 minutes. This 51.3% reduction in execution time allowed us to reliably execute the query and deliver data to our downstream processes. In this post, I will generalize the problem and explain how I optimized the query.
To place the problem in a fictional context, let’s say I own 10 hospitals globally (I wish this were true), and each hospital has a database of available medicine, vaccines, etc. As their stakeholder, I would like a weekly Report of prescribed medicine in descending order (perhaps to identify trends).
The challenge is that the range of medicine prescribed by each hospital is a subset of all the medicine in a master list. To complicate matters further, not all attributes of each medicine are stored by the hospitals.
Given the schemas as follows:
- Hospital_A_Records (medicine_id, description, symptoms, units_prescribed, …)
- Hospital_B_Records (medicine_id, description, symptoms, units_prescribed, …)
- Full_Medicine_Info (medicine_id, alias_name, manufacturer, embargo, …)
With each table having a non-trivial number of rows.
The initial query to produce the Report looked something like this (simplified):
SELECT alias_name, manufacturer, SUM(units_prescribed)
FROM (
    SELECT * FROM Hospital_A_Records
    LEFT JOIN Full_Medicine_Info
        ON Hospital_A_Records.medicine_id = Full_Medicine_Info.medicine_id
    UNION ALL
    SELECT * FROM Hospital_B_Records
    LEFT JOIN Full_Medicine_Info
        ON Hospital_B_Records.medicine_id = Full_Medicine_Info.medicine_id
    UNION ALL
    ...
) AS combined
GROUP BY alias_name, manufacturer
Note: LEFT JOIN due to business requirements
And the new query, which takes roughly half the time, looks something like this (simplified):
SELECT alias_name, manufacturer, SUM(units_prescribed)
FROM (
    SELECT alias_name, manufacturer, units_prescribed
    FROM Hospital_A_Records
    LEFT JOIN Full_Medicine_Info
        ON Hospital_A_Records.medicine_id = Full_Medicine_Info.medicine_id
    UNION ALL
    SELECT alias_name, manufacturer, units_prescribed
    FROM Hospital_B_Records
    LEFT JOIN Full_Medicine_Info
        ON Hospital_B_Records.medicine_id = Full_Medicine_Info.medicine_id
    UNION ALL
    ...
) AS combined
GROUP BY alias_name, manufacturer
Note: LEFT JOIN due to business requirements
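To make the optimized shape concrete, here is a minimal runnable sketch using Python’s sqlite3 module. The table columns follow the schemas above, but the sample rows and values are invented for illustration, and UNION ALL is used so duplicate rows are not collapsed before summing.

```python
import sqlite3

# Toy versions of the tables described above; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Hospital_A_Records (medicine_id INTEGER, description TEXT, symptoms TEXT, units_prescribed INTEGER);
CREATE TABLE Hospital_B_Records (medicine_id INTEGER, description TEXT, symptoms TEXT, units_prescribed INTEGER);
CREATE TABLE Full_Medicine_Info (medicine_id INTEGER, alias_name TEXT, manufacturer TEXT, embargo TEXT);

INSERT INTO Full_Medicine_Info VALUES
  (1, 'Aspirin', 'AcmePharma', NULL),
  (2, 'Ibuprofen', 'MediCorp', NULL);

INSERT INTO Hospital_A_Records VALUES
  (1, 'pain relief', 'headache', 10),
  (2, 'anti-inflammatory', 'sprain', 5);
INSERT INTO Hospital_B_Records VALUES
  (1, 'pain relief', 'fever', 7);
""")

# Optimized form: project only the columns the final aggregate needs
# inside each branch, then combine the branches and group.
report = conn.execute("""
SELECT alias_name, manufacturer, SUM(units_prescribed) AS total
FROM (
    SELECT alias_name, manufacturer, units_prescribed
    FROM Hospital_A_Records
    LEFT JOIN Full_Medicine_Info USING (medicine_id)
    UNION ALL
    SELECT alias_name, manufacturer, units_prescribed
    FROM Hospital_B_Records
    LEFT JOIN Full_Medicine_Info USING (medicine_id)
) AS combined
GROUP BY alias_name, manufacturer
ORDER BY total DESC
""").fetchall()

print(report)  # [('Aspirin', 'AcmePharma', 17), ('Ibuprofen', 'MediCorp', 5)]
```

The descending order of the report falls out of the `ORDER BY total DESC`, matching the weekly Report requirement described earlier.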
Why is this Faster?
By limiting the number of columns in the sub-queries, only the pieces of data that contribute to the final result are carried into later steps. As a general rule, less data == faster processing.
The underlying cause of the long execution time is likely the amount of memory available on the instance. Given an infinite amount of memory with negligible lookup overhead, I believe the first query would take less than twice as long as the second.
Because memory is limited and the first query did not filter out unnecessary columns, it resulted in a larger number of disk I/Os, which are significantly more expensive than reading from or writing to memory. The second query did the filtering early, in the sub-queries, allowing a larger number of rows to be processed before having to write to disk. This resulted in faster processing.
To illustrate the concept in an overly simplified manner, let’s assume that:
- every attribute is of one “unit” in length,
- “Hospital_A_Records” table has 10 attributes and 100 rows
- “Full_Medicine_Info” table has 20 attributes and 1000 rows
Then, assuming an instance with 50 available “units” of memory, processing a single row of the “Hospital_A_Records” table in the worst case scenario* would require 502 disk I/O operations**.
By doing the SELECT early and bringing in only the necessary attributes, the same row takes only 65 disk I/O operations**!
* No index available, full table sequential scan
** Breakdown: read 1 row from “Hospital_A_Records”, read all rows from “Full_Medicine_Info”, write 1 result row
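The arithmetic above can be reproduced in a few lines of Python. The exact split of memory between the current row, the scan buffer, and the output row is my assumption; it is one plausible accounting that yields the quoted figures.

```python
import math

MEMORY = 50                 # available "units" of memory
A_WIDTH = 10                # width of one Hospital_A_Records row
F_WIDTH, F_ROWS = 20, 1000  # Full_Medicine_Info: 20 attributes, 1000 rows

# Old query: for each Hospital_A_Records row, scan all of
# Full_Medicine_Info at its full 20-unit width.  The 10-unit A row
# stays resident, leaving 40 units of memory for the scan.
free = MEMORY - A_WIDTH
io_old = 1 + math.ceil(F_ROWS * F_WIDTH / free) + 1  # read A row + scan + write result

# New query: project early, keeping only medicine_id, alias_name and
# manufacturer (3 units) from Full_Medicine_Info, and medicine_id plus
# units_prescribed (2 units) from Hospital_A_Records.
free = MEMORY - 2
io_new = 1 + math.ceil(F_ROWS * 3 / free) + 1

print(io_old, io_new)  # 502 65
```

Under this model the early projection cuts the per-row I/O count by almost 8x, which is where the real-world halving of run time plausibly comes from once indexes, caching, and buffering are layered on top.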
Of course, many other factors affect performance: the presence of indexes, the type of database (e.g. RDBMS, NoSQL), the specific database product, how the query engine optimizes execution, and competing processes in memory that may cause increased paging or thrashing, just to name a few.
With the Cloud, it is very tempting to throw computing power at a problem to solve it. After all, if a long-running query takes 150 minutes on a 16-core instance, why not speed it up by running it on 32 cores?
However, this approach can be both excessive and limiting. In the real world, there is always an upper bound on resources: how much more will it cost the business to move the instance to 32 cores? As the business grows, how many more cores can we add before the overheads become costlier than the savings?
To stay aligned with long-term business needs, it is more important to understand why the problem is happening and to seek opportunities to remove or optimize the bottleneck, rather than chasing a quick short-term win (such as throwing compute resources at it).