#1 Principles of Data Cleaning: Big to Small
Tidying data doesn't go by the book. You need to find principles that guide your data cleaning decisions.
(Thanks to Ciel for asking me about a series on data cleaning!)
Data scientists spend most of their time cleaning data.
It is arguably the most important, but least sexy, step of any data science project. Worst of all, data cleaning depends on your data. While an obvious statement, it is also a disappointing one, because it implies that there is no step-by-step recipe for cleaning your dataset.
Be it text, image, video, tabular, or graph data: Even those high-level differences in kind imply major changes in the cleaning process. Not to mention the nitty-gritty details.
As data cleaning is too complex a topic to solve in a 2-min blog post, see this more as a primer, both for your own thinking and for future posts on this page.
Much has been written on how to clean and wrangle data; one of the most prominent articles is Hadley Wickham's 2014 paper, Tidy Data. Below, we use some takeaways from this paper.
Each messy dataset is messy in its own way
As the paper points out:
“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.”
Hence, we will focus on some principles, rather than concrete techniques, as there are no “one-size-fits-all” solutions.
My hope is that you can use these principles as north stars for your data science projects. After all, writing code gets easier and easier thanks to AI, so principles gain in importance.
(If you are looking for applications: in his article, Hadley also points to many practical tricks.)
Standard for Tidy Data
What is tidy data? Hadley argues that tidy data should have one variable per column, one observation per row, and one level of aggregation per table.
Arguably, this definition is targeted at tabular data, but it can also be extended to, e.g., text data (one document per row, one token per variable) or image data (one image per row, one pixel per column).
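To make the definition concrete, here is a minimal pandas sketch (the patient/year columns and values are made up): a wide table with one column per measurement year is reshaped so that each row holds exactly one observation.

```python
import pandas as pd

# Hypothetical wide table: one row per patient, one column per year of measurement.
# "Year" is a variable, so tidy data wants it in its own column, not spread across headers.
wide = pd.DataFrame({
    "patient": ["A", "B"],
    "2022": [71.0, 80.5],
    "2023": [72.3, 79.8],
})

# melt() turns the year columns into (year, weight_kg) pairs:
# one variable per column, one observation (patient-year) per row.
tidy = wide.melt(id_vars="patient", var_name="year", value_name="weight_kg")
print(tidy)
```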
I think that there can be different types of “mess” going on: either on the dataframe level or on the variable level. See these examples:
What are messy dataframes?
Nested dataframes i.e. dataframes inside a variable
Same dataset scattered across multiple files (see the sketch after this list)
Trouble reading in the data (shoutout to “,” and “;”)
Several layers of aggregation merged together (while sometimes necessary, these should be stored in separate dataframes; at least prior to analysis)
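As a sketch of the “scattered across multiple files” and separator problems from the list above, assuming hypothetical monthly exports named survey_2023_*.csv, the snippet below lets pandas sniff the delimiter per file and stitches the parts back into one dataframe.

```python
import glob
import pandas as pd

# Hypothetical setup: the same survey was exported once per month, some files
# comma-separated, some semicolon-separated (the classic "," vs ";" trouble).
frames = []
for path in glob.glob("survey_2023_*.csv"):
    # sep=None with engine="python" lets pandas sniff the delimiter per file.
    df = pd.read_csv(path, sep=None, engine="python")
    df["source_file"] = path  # keep provenance so issues can be traced back
    frames.append(df)

# One dataset scattered across files becomes one dataframe again.
survey = pd.concat(frames, ignore_index=True)
```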
What are messy variables?
Units written inside the value, e.g. “1cm” for the variable length instead of the value 1. Bonus: different units within the same variable, e.g. “1cm”, “0.02m”, … (see the sketch after this list)
Coding errors in the data
Missing values (sometimes, as missingness can itself carry information)
Impossible values (e.g. human of 3.6m height)
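Here is a small pandas sketch of this variable-level mess, with made-up height values: units written inside the value are parsed to a single unit, missing values are kept as missing, and impossible values are flagged rather than silently dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing units, bare numbers, and missing values.
df = pd.DataFrame({"height": ["1.76m", "182cm", "3.6m", None, "175"]})

def to_cm(raw):
    """Parse strings like '1.76m' or '182cm' into centimetres; keep NaN for unknowns."""
    if pd.isna(raw):
        return np.nan              # missing stays missing (it may be informative later)
    s = str(raw).strip().lower()
    if s.endswith("cm"):
        return float(s[:-2])
    if s.endswith("m"):
        return float(s[:-1]) * 100
    return float(s)                # assume bare numbers are already in cm

df["height_cm"] = df["height"].map(to_cm)

# Impossible values (a 3.6 m human) are flagged for inspection, not deleted.
df["height_plausible"] = df["height_cm"].between(40, 250)
print(df)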
Does the kind of mess ultimately make a difference?
Not really, as we have to deal with them either way.
I think it makes sense to start with the dataframe mess, as it gives you a bird's-eye view of the data and helps you identify issues with the variables themselves.
After getting the data into a tidy format, we can take a deeper look at the variables themselves, fix issues, and validate the values.
One exception to this order is when key variables, which identify observations in the data (e.g. an individual-id or a group-id like City), are messed up. Then it can be necessary to first fix this key variable, improve the dataframe structure, and only then repair the other features.
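A minimal sketch of this exception, assuming a hypothetical City key with inconsistent spellings: the key variable is repaired first, and only then is the dataframe restructured (here, aggregated to the city level).

```python
import pandas as pd

# Hypothetical example: the key variable "City" is messed up, so the same city
# appears under several spellings and the grouping level is broken.
df = pd.DataFrame({
    "City": [" Berlin", "berlin", "BERLIN ", "Hamburg"],
    "value": [1, 2, 3, 4],
})

# Repair the key first: trim whitespace and normalise case ...
df["City"] = df["City"].str.strip().str.title()

# ... only then does restructuring (e.g. aggregating to the city level) make sense.
city_level = df.groupby("City", as_index=False)["value"].sum()
print(city_level)
```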
Hence, we get a first principle of data cleaning: Big to small.
Fix the macro structure of your data first, then zoom in and fix the micro issues.
References
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10