Once you have your data, check out these free online tipsheets and tutorials for advice on how to inspect and clean it before you start analyzing.
This story (2021) is a great example of what to do when official bodies leave gaps in the available data.
Data Biographies: How to Get to Know Your Data (2017) is a blog post by Heather Krause of the Canadian data journalism consulting website idatassist that walks through the process of interrogating the contents and collection process (as well as potential shortcomings) of a dataset before analyzing the data.
The Quartz Guide to Bad Data (2018) is a file on GitHub that discusses the most common problems found in datasets and how to solve them. It has been translated into Chinese, Japanese, Portuguese, and Spanish.
ProPublica’s Guide to Bulletproofing Data (updated 2018), put together by Jennifer LaFleur with many contributions, lays out best practices for checking your data. It’s a work in progress, so add your suggestions.
This tutorial by Belgian journalist Stijn Debrouwere explains how to find common flaws in data and avoid misinterpreting datasets. It is available with a free subscription to Datajournalism.com.
Get Started with OpenRefine (2017) is a quick tutorial with screenshots that walks through the basic features of the data cleaning tool OpenRefine. It was created by UCLA professor Miriam Posner.
Cleaning Data in OpenRefine (2018) is a detailed online guide with hands-on examples and video tutorials that walks users through the process of cleaning and standardizing data in OpenRefine. It was created by John Little, a data science librarian at Duke University.
This tutorial, taught by Belgian data journalist Maarten Lambrechts, is an introduction to using Excel to clean and standardize messy data. The training requires a free account with Datajournalism.com.