Greetings, my dear readers! This time around, I will share my experience using ETL to help our product team tackle their very slow API, and which database I chose for this particular problem. If you are not familiar with ETL, you can read my post here.
So, here comes the problem. The engineering leader showed me that their detail page took a whole 10 seconds to load. This destroyed the UX of the website and the app; while waiting those 10 seconds, I am sure many users simply left. He asked me about it and wanted some ideas from the data side.
A Lot of Joins
So, I decided to take a look at how the API requested the data from the database. What I found shocked me. The data presented came from 18 (eighteen!) different tables, fetched by 3 different endpoints. A lot of joins happened there. Obviously it was slow, really slow; maybe a turtle is faster.
The tables joined there were no joke. One of them had more than 100,000 rows and was still growing, and the others had 10,000-100,000 rows each. The database was effectively processing millions of rows for every API call, which was insane.
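To make the shape of the problem concrete, here is a toy sketch (not the real schema; all table and column names are made up) of what those endpoints were doing, using SQLite for a self-contained example:

```python
# Toy illustration of the join-heavy query pattern. The real endpoints
# joined 18 tables; every extra join multiplies the work the database
# does on every single API call.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prices   (product_id INTEGER, amount INTEGER);
    CREATE TABLE stocks   (product_id INTEGER, qty INTEGER);
    INSERT INTO products VALUES (1, 'widget');
    INSERT INTO prices   VALUES (1, 500);
    INSERT INTO stocks   VALUES (1, 42);
""")

# Each detail-page request ran a query shaped like this, but with
# 18 tables instead of 3, over tables holding up to 100,000+ rows.
row = cur.execute("""
    SELECT p.name, pr.amount, s.qty
    FROM products p
    JOIN prices  pr ON pr.product_id = p.id
    JOIN stocks  s  ON s.product_id  = p.id
    WHERE p.id = 1
""").fetchone()
print(row)  # ('widget', 500, 42)
```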
This problem reminded me of the upper limits of relational databases. Relational modeling is really great, but its benefits diminish as the data grows, especially once row counts reach the millions.
The so-called solution
So, we thought about how to get rid of those 18 joins. We needed a datastore where all the data would be available in a single table, without joining.
Oh, that sounds like a NoSQL idea. We did not really like NoSQL initially, since there was no managed service available back then, and we did not want to risk losing our data if the database crashed. Still, we needed to let go of our ego and try it.
Here comes Elasticsearch. Elasticsearch is an open-source search engine built on top of the popular search library Lucene. It boasts great throughput, which greatly reduces response time. This was exactly what we needed. Moreover, it is optimized for indexing text, walls of text. Another plus point for this technology.
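The core idea is denormalization: merge the rows that used to come from many tables into one pre-joined document per item, then fetch it by id in a single lookup. Here is a minimal sketch of that idea (all field names are illustrative, and a plain dict stands in for the Elasticsearch index so the example runs on its own):

```python
# Denormalization sketch: one flat document replaces the join.
index = {}  # stand-in for an Elasticsearch index

def build_document(product, price, stock, reviews):
    """Merge rows from several source tables into one flat document."""
    return {
        "name": product["name"],
        "price": price["amount"],
        "in_stock": stock["qty"] > 0,
        "review_count": len(reviews),
    }

doc = build_document(
    product={"id": 1, "name": "widget"},
    price={"amount": 500},
    stock={"qty": 42},
    reviews=[{"rating": 5}, {"rating": 4}],
)
index[1] = doc           # with the real client: index a document by id
print(index[1]["name"])  # one key lookup replaces all the joins
```

With the real elasticsearch client, the dict assignment becomes an index call and the lookup a get by id; the document shape is the part that matters.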
Let’s start building the solution!
Looking at the diagram above, I used another technology, Kafka, which I will cover later. Basically, the logic went like this:
- Every time a change happened in MySQL, it notified Kafka about the change
- The streaming ETL job subscribed to the Kafka topic to learn about the change
- When a change arrived from Kafka, the job pulled the relevant data from MySQL and put it into Elasticsearch
- The API could then get the newest data from Elasticsearch without joining those freaking 18 tables in MySQL
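The steps above can be sketched as a simple consume-transform-load loop. Real code would use a Kafka consumer plus MySQL and Elasticsearch clients; here in-memory stand-ins keep the example self-contained, and all names are illustrative:

```python
# Minimal sketch of the streaming ETL loop described above.

kafka_topic = [            # stand-in for change events published from MySQL
    {"table": "products", "id": 1},
    {"table": "products", "id": 2},
]
mysql = {                  # stand-in for the joined query result per id
    1: {"name": "widget", "price": 500},
    2: {"name": "gadget", "price": 900},
}
elasticsearch = {}         # stand-in for the target index

def fetch_joined_row(product_id):
    """In reality: run the join-heavy query once, on write, not per API call."""
    return mysql[product_id]

def run_etl(events):
    # 1. consume a change event, 2. pull fresh data, 3. index the document
    for event in events:
        doc = fetch_joined_row(event["id"])
        elasticsearch[event["id"]] = doc

run_etl(kafka_topic)
print(elasticsearch[2]["name"])  # the API now reads this index directly
```

The key design point is that the expensive join now runs once per change, in the background, instead of once per user request.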
This change was really successful for the product: it reduced the API response time from 10 seconds to half a second. A 20x improvement! Great deal.
Is it done?
This solution was really great. But is it really that great? Weren’t there concerns about the technology? As I said above, we didn’t trust NoSQL initially. And an ETL process is bound to fail somewhere. The first concern was answered by AWS Elasticsearch, the managed Elasticsearch from Amazon Web Services. Even the original creators of Elasticsearch offer their own managed service on AWS.
How about the second concern? Yes, the job is bound to fail somewhere. But we can put someone in charge of moving the data manually when it does, and it works just fine. There hasn’t been any other problem so far, and we can be sure this is a good solution for us. At least for the time being, until a better one comes along.
Skepticism toward newer technology often obstructs our view of a solution that is only one meter away.
Please share this post if you liked it!