ds-econ

#1 Principles of Data Cleaning: Big to Small


Tidying data doesn't go by the book. You need to find principles that guide your data cleaning decisions.

Finn
May 24, 2023

(Thanks to Ciel for asking me about a series on data cleaning!)

Data scientists spend most of their time cleaning data.

It is arguably the most important, but least sexy, step of any data science project. Worst of all, data cleaning depends on your data. While an obvious statement, it is also a disappointing one, because it implies that there is no step-by-step recipe for cleaning your dataset.

Be it text, image, video, tabular, or graph data: Even those high-level differences in kind imply major changes in the cleaning process. Not to mention the nitty-gritty details.

As data cleaning is too complex a topic to cover in a 2-minute blog post, see this more as a primer, both for your own thinking and for future posts on this page.

Much has been written on how to clean and wrangle data; one of the most prominent contributions is Hadley Wickham's 2014 paper, Tidy Data. Below, we use some takeaways from this paper.

Each messy dataset is messy in its own way

As the paper points out:

“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.”

Hence, we will focus on some principles, rather than concrete techniques, as there are no “one-size-fits-all” solutions.

My hope is that you can use these principles as north stars for your data science projects. After all, implementing code gets easier and easier thanks to AI, so principles gain in importance.

(If you are looking for applications: Hadley's article also points to many practical tricks.)

Standard for Tidy Data

What is tidy data? Hadley argues that tidy data should have one variable per column, one observation per row, and one level of aggregation per table.

Arguably, this definition targets tabular data, but it can also be extended to, e.g., text data (one document per row, one token per variable) or image data (one image per row, one pixel per column).
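To make this concrete, here is a minimal sketch in Python with pandas (the post itself is language-agnostic; the library choice, column names, and values are my own illustration). A wide table that stores years in its column names breaks the one-variable-per-column rule; melting it recovers the tidy form.

```python
import pandas as pd

# Wide (messy) format: the years are data, but they live in column names.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1999": [100, 200],
    "2000": [110, 220],
})

# Tidy format: one variable per column (country, year, cases),
# one observation per row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
```

Each of the four country-year observations now occupies exactly one row, which is what most modelling and plotting tools expect.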

I think there can be different types of “mess” going on: either on the dataframe level or on the variable level. See these examples:

What are messy dataframes?

  • Nested dataframes, i.e. dataframes stored inside a variable

  • Same dataset scattered across multiple files

  • Trouble reading in the data (shoutout to “,” and “;”)

  • Several layers of aggregation merged together (while sometimes necessary, these should be stored in separate dataframes; at least prior to analysis)
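For the “scattered files” and separator problems above, a short sketch, again assuming Python with pandas (the file names and contents are hypothetical): pandas can sniff the delimiter per file, so parts written with “,” and “;” can still be read and concatenated into one dataframe.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical scenario: the same dataset scattered across two CSV
# files, one comma-separated and one semicolon-separated.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "part_1.csv"), "w") as f:
    f.write("id,value\n1,10\n2,20\n")
with open(os.path.join(tmp, "part_2.csv"), "w") as f:
    f.write("id;value\n3;30\n4;40\n")

frames = []
for path in sorted(glob.glob(os.path.join(tmp, "part_*.csv"))):
    # sep=None with the python engine lets pandas sniff "," vs ";".
    frames.append(pd.read_csv(path, sep=None, engine="python"))

# One dataset, one dataframe.
df = pd.concat(frames, ignore_index=True)
```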

What are messy variables?

  • Units are written inside the value, e.g. “1cm” for the variable length, instead of the value 1. Bonus: Different units within the same variable e.g. “1cm”, “0.02m”, …

  • Coding errors in the data

  • Missing values (sometimes, as these could be viewed as information themselves)

  • Impossible values (e.g. human of 3.6m height)
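The unit and validation problems above might be handled like this; a sketch assuming Python with pandas, where the toy length column and the plausible-height range are my own illustration, not from the post.

```python
import pandas as pd

# Toy column mixing units inside the value, plus an impossible height.
raw = pd.Series(["1cm", "0.02m", "150cm", "3.6m"])

# Split the number from the unit, then convert everything to centimetres.
parts = raw.str.extract(r"(?P<num>[\d.]+)\s*(?P<unit>cm|m)")
factor = parts["unit"].map({"cm": 1.0, "m": 100.0})
length_cm = parts["num"].astype(float) * factor

# Validate: flag values outside a plausible range for human height.
plausible = length_cm.between(40, 250)
```

Whether you drop, impute, or merely flag the implausible rows depends on the analysis; the point is to make the unit explicit and the check reproducible.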

Does the kind of mess ultimately make a difference?
Not really, as we have to deal with both either way.

I think it makes sense to start with the dataframe mess, as it gives you a bird's-eye view of the data and helps you identify issues with the variables themselves.

After getting the data into a tidy format, we can take a deeper look at the variables themselves, fix issues, and validate the values.

One exception to this order is when key variables, which identify observations in the data (e.g. an individual ID or a group ID like City), are messed up. Then it can be necessary to first fix this key variable, then improve the dataframe structure, and only then repair the other features.

Hence, we get a first principle of data cleaning: Big to small.

Fix the macro structure of your data first, then zoom in and fix the micro issues.

References

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10
