Good day, and welcome back! Today I will share about SQL Query optimization. A step which is always important, but also always getting forgotten.
So, you might be already did some works on SQL. For somebody who doesn’t know what SQL is and how is it different, you can read my post here. Let’s imagine a simple use case, you want to get the list of employees at your company which get the increment last month, it might be look like this:
SELECT E.ID, E.NAME FROM EMPLOYEE E WHERE E.ID IN ( SELECT EMPLOYEE_ID FROM INCREMENTS WHERE DELETED_AT IS NULL )
That was a nice looking query, and you can find queries like that on your everyday job. But, is there any problem? Yes!
The EXPLAIN statement
There is a nice statement on SQL that you can use. The EXPLAIN statement. This statement will help you find out about how big your query really is. How to use it? You can just put the EXPLAIN keyword in front of the query you have. Like this:
EXPLAIN SELECT E.ID, E.NAME FROM EMPLOYEE E WHERE E.ID IN ( SELECT EMPLOYEE_ID FROM INCREMENTS WHERE DELETED_AT IS NULL )
And it will analyze your query into something that looks like this:
The select_type column describe your use of query, which is both the select statements from the query above. The table column is straight forward. The rows column here is the most important column you need to read.
You can basically multiply all the rows there to get how many rows which was processed in the query. And it is an incredible 5 Billions rows processed, while you have only 150k rows there.
If I can put the query into English, it basically tells the machine to get a list of employee, and check it one by one, is he/she got the increment or not, by looking at the increment data from the top of the list for each employee.
Oh shit! Yes, that was shit. Optimization is the only way to go. And the parameter of optimization is, easily, how to make that EXPLAIN statement gives you the least number of rows possible.
Let’s get on the optimization vehicle
Let’s take a look on how we can optimize the query.
1. The best practices
Following the SQL query guidelines can be a good start. Here are the things you can do according to the best practices.
- Index every columns you used as JOIN predicate, WHERE predicates, ORDER BY, and GROUP BY
- Avoid wildcard (%) on predicates
- Please don’t use * (star) as often as possible. Remove unnecessary columns from SELECT statement.
- Change OUTER JOIN to INNER JOIN if applicable
- Don’t rely on DISTINCT and UNION if possible
2. Move predicates to OUTER JOIN if possible
Don’t do this
SELECT A.COLUMN_1, B.COLUMN_2 FROM A LEFT JOIN B ON A.ID=B.ID WHERE A.COLUMN_3=31 AND B.COLUMN_4=40
Do this instead
SELECT A.COLUMN_1, B.COLUMN_2 FROM A LEFT JOIN B ON A.ID=B.ID AND A.COLUMN_3=31 AND B.COLUMN_4=40
Because it will reduce the number to be processed, as the column should be already there when the JOIN takes place, so its like throwing one stone and hitting two birds. For INNER JOIN, you can skip this tips.
3. Avoid SUB-QUERIES, change it into a simple JOIN if possible
This could be a little trick, but is also powerful. A sub-query (especially on where clause) tells your machine to do the sub-query once for every single row returned by the main query. This is less important than the two above.
Our employee query is also using the sub-query. Can you make the optimized version of it?
4. Common sense, Like splitting query
You can also rely on your common sense when sensing danger while looking at the query. I also did that when starting my job in data and looking at the queries created by my predecessor.
Some splitting action could be really useful if you think joining via application is much better than via query. You can use this only if you are already follow those three steps I provided above. Use this as your last resort. Or, use this if you are already give up while reading the query your teammates have created. LOL.
So, Let’s take a look on the optimized employee query
SELECT E.ID, E.NAME FROM EMPLOYEE E INNER JOIN INCREMENTS I ON E.ID=I.EMPLOYEE_ID WHERE I.DELETED_AT IS NULL
Use the EXPLAIN keyword, and voila!
By moving the sub-query into a JOIN statement reduce the need to scan the whole INCREMENTS table for each employee row. It is really nice. And leaving DELETED_AT in where predicates is totally doable since it doesn’t contradict with the tips no. 2.
Take a breather, and look back. Maybe you just missed something more important.
Please share it if you like it!