I have said before, working in this field is like being a janitor. Yes, a janitor. Software engineers build machines that produce the mess (data), and you manage it. That means you are responsible for data integrity. Every decision in a data-driven organization comes from the data team, whether the report is right or wrong. And so, as a janitor, data cleaning is one of the most important tasks you have to do.
So, what are the steps? There are only four simple steps you actually need to follow.
First, use your common sense: does the data look correct?
Look at this figure.
Do you find something interesting?
Oh, yes, J4ck. His name is so cool. Wait, wait. It is not cool at all: if our common sense says a name can only be constructed from alphabetic characters, then this value is wrong. We need to clean the data.
Anything else? The phone numbers look absurd; they come in many different formats. We can read and analyze them more easily (maybe we can even find patterns in the phone numbers) if the formats are all the same.
In general, these are the things you need to look at in this step.
- Take a look at a sample of your data. You can simply shuffle your data and take, for example, 50 rows.
- Look at the column names and the values of those 50 rows. Do the values match what you expect? For example, a name should contain only alphabetic characters and perhaps spaces, an age should be a number, an email should follow the correct format, and so on.
- Compute simple statistics on the numeric columns of those 50 rows. Check the mean, min, max, and mode: do they make sense?
- If you find empty values, for example an age of 0 or an address that is an empty string, take note of them. Are those empty values acceptable?
- Look for duplicates. For example, if there are 5 people with exactly the same name, is that acceptable?
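The checks above can be sketched in plain Python. This is a minimal illustration, not a full pipeline; the rows and column names (name, age, phone) are made up for the example, and in practice you would load the sample from your database or a CSV file.

```python
import random
import statistics

# Hypothetical rows; in practice, load these from your database or CSV.
rows = [
    {"name": "Jack", "age": 29, "phone": "0812-345-678"},
    {"name": "J4ck", "age": 31, "phone": "+62 812 345 679"},
    {"name": "Mary", "age": 0,  "phone": "0812345680"},
    {"name": "Mary", "age": 45, "phone": "(0812) 345-681"},
]

# 1. Shuffle and take a sample (k=2 here; use ~50 rows on real data).
sample = random.sample(rows, k=2)

# 2. Do the values match expectations? Flag names with non-alphabetic characters.
bad_names = [r["name"] for r in rows if not r["name"].replace(" ", "").isalpha()]

# 3. Simple statistics on a numeric column: do they make sense?
ages = [r["age"] for r in rows]
print("mean:", statistics.mean(ages), "min:", min(ages), "max:", max(ages))

# 4. Take note of suspicious empty values, e.g. age == 0.
suspicious = [r for r in rows if r["age"] == 0]

# 5. Count duplicate names.
names = [r["name"] for r in rows]
dupes = {n for n in names if names.count(n) > 1}
print("bad names:", bad_names, "duplicates:", dupes)
```

Running this on the toy rows flags J4ck as a bad name, age 0 as suspicious, and Mary as a duplicate, which is exactly the kind of shortlist you want before you start correcting anything.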
Second, replace the supposedly wrong values with correct ones.
You have already found the problems in your data. The next step is to correct them. So, for our previous table, we should correct it to look like this.
Now, J4ck has been corrected to Jack, and all phone numbers share the same format. This is obviously better.
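As a sketch of how that correction might look in code: the digit-to-letter substitutions and the target phone format below are assumptions made for this example, not a universal rule, so adapt them to whatever patterns your own data shows.

```python
import re

# Hypothetical digits-typed-as-letters substitutions; extend as your data demands.
DIGIT_TO_LETTER = {"4": "a", "3": "e", "1": "i", "0": "o"}

def fix_name(name: str) -> str:
    """Replace digits that were typed in place of letters."""
    return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in name)

def normalize_phone(phone: str, country_code: str = "62") -> str:
    """Keep digits only, then render one consistent format."""
    digits = re.sub(r"\D", "", phone)   # drop dashes, spaces, parentheses
    if digits.startswith("0"):          # local prefix -> country code
        digits = country_code + digits[1:]
    return "+" + digits

print(fix_name("J4ck"))                  # -> Jack
print(normalize_phone("0812-345-678"))   # -> +62812345678
```

Note that `fix_name` is only safe when you know digits never legitimately appear in the column; otherwise you would corrupt valid values, which is why the first step (looking at the data) comes before this one.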
Of course, reality is not as simple as that. I only gave you 7 rows of data. Real-world databases hold millions to billions of rows. So these are the things you need to do in this step.
- Do you have enough information to fix the wrong value? For example, if a name should contain only alphabetic characters, this is easy: you can do a string replacement with the correct letters. Or you can strip all dashes '-' from all phone numbers.
- For the wrong numbers you have found, can you fix them somehow? For example, if there is a man whose age is 300, you simply remove him from the rest of your data, since he is an outlier and will distort your data distribution.
- Null values. There are several methods for filling nulls. If only a few rows contain nulls, you can safely drop them from your data. If not, there are two ways of filling them. First, if the distribution is well behaved, meaning there are no outliers, you can fill the nulls with the mean. But if there are outliers and it is impossible to remove them, fill the nulls with the median instead.
- Duplicate values. Remove duplicates only when the column is supposed to hold a unique id.
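The null-filling and duplicate rules above can be sketched with Python's standard library. The ages and ids here are invented for illustration; the point is how the outlier (300) pulls the mean away from the median, which is why the median is the safer fill value in that case.

```python
import statistics

# Hypothetical ages column with an outlier (300) and missing values (None).
ages = [25, 31, None, 29, 300, None, 27]

present = [a for a in ages if a is not None]

# The mean is pulled up by the outlier, so here the median is the safer fill.
mean_fill = statistics.mean(present)      # distorted by the 300
median_fill = statistics.median(present)
filled = [a if a is not None else median_fill for a in ages]

# Duplicates: drop them only on a column that is supposed to be unique, e.g. an id.
ids = [101, 102, 102, 103]
unique_ids = list(dict.fromkeys(ids))     # preserves order, removes repeats
print(filled, unique_ids)
```

With these values the mean comes out above 80 while the median stays at 29, so filling nulls with the mean would silently inflate every missing age.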
Third, correct tools = productive
Of course, you need to use the correct tools to increase your productivity. A single Excel spreadsheet may be enough if your data only reaches the thousands, or tens of thousands, of rows. But for millions of rows, you may need a SQL client or some Python scripting. And of course, for billions of rows, you might need a real big data solution.
And note the equation I wrote: correct tools equals productive, not correct tools equals correct result. Of course, incorrect tools can also give you the same correct result, but it will be utterly unproductive. Just imagine using a single spreadsheet to clean billions of rows. You might spend your entire lifetime just waiting for the spreadsheet to open, not to mention the time needed to correct the data.
Fourth (optional), the data still does not make sense. Contact the creator of the system that produced the data.
This is an optional step. If your data looks correct after the three steps above, you can skip it. But if the data still does not make sense, you will want to talk to whoever produced it. Most of the time, this will be the software engineers.
For example, you create a report on how many visitors a merchant page receives on an e-commerce site, but the result does not match the statistics shown to the merchant on their dashboard. In that case, you need to contact the engineers, take note of how they compute those numbers for the merchant, and fix your report accordingly.
Now that the cleaning is done, you can take the next step in building a Machine Learning solution. For reference, you can take a look at the list of top Machine Learning projects here.
Congratulations on cleaning your data. Now look at the faucet, I mean, your data. It is clean, you are happy, other teams are satisfied, and you can enjoy your holiday.
Saving the data you want, cleaning the data you have. Data cleaning is as important as saving it.
Please share it if you like it!