Tools for Scraping, Cleaning, and Prepping Data

Got dirty data or pesky PDFs? These programs can help you get your data into a format you can use.

OpenRefine is a free tool for exploring, cleaning, and matching data. It is particularly useful for dealing with messy data. It is available in English, Chinese, Spanish, French, Russian, Portuguese (Brazil), German, Japanese, Italian, Hungarian, Hebrew, Filipino, Cebuano, and Tagalog. Here is a good tutorial on OpenRefine.
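To give a feel for what "matching" messy values involves, here is a minimal Python sketch of key-based clustering, loosely modeled on the fingerprint idea behind OpenRefine's clustering feature. It is a simplification under stated assumptions, not OpenRefine's actual code: the function names `fingerprint` and `cluster` and the sample names are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified key that ignores case, accents, punctuation, and word order."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, trim, and drop anything that is not a letter, digit, or space.
    text = re.sub(r"[^a-z0-9\s]", "", text.lower().strip())
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return {k: vs for k, vs in groups.items() if len(set(vs)) > 1}


# Hypothetical sample data, invented for this example.
names = ["Acme Corp.", "acme corp", "Corp Acme", "Café Río", "Cafe Rio", "Globex"]
for key, variants in cluster(names).items():
    print(key, "<-", variants)
```

Values that collapse to the same key are candidates for merging. A person should still review each group, since a key this aggressive can also pair names that are genuinely different.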

Extracting data from PDFs is a task that many journalists have to deal with, and several free tools make the job easier. Tabula is an open-source tool designed to extract tabular data from PDFs. Another free tool is XPDF, which supports several languages other than English. CometDocs offers limited free accounts, as well as paid plans with more online storage and larger file upload sizes.

CSVkit is a suite of command-line tools for converting to and working with CSV, the most common tabular file format.
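CSVkit itself is run from the command line. As a rough illustration of the kind of task it automates, the sketch below uses only Python's standard `csv` module to keep a subset of columns and trim stray whitespace, similar in spirit to CSVkit's `csvcut` command. The function name `select_columns` and the sample data are assumptions made up for this example.

```python
import csv
import io


def select_columns(csv_text: str, columns: list[str]) -> str:
    """Return CSV text containing only the named columns, with values trimmed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Missing cells come back as None, so fall back to an empty string.
        writer.writerow({c: (row[c] or "").strip() for c in columns})
    return out.getvalue()


# Hypothetical sample: a quoted name containing a comma, and padded city text.
sample = 'name,city,amount\n"Smith, Jane", Lagos ,100\nLee,Seoul,250\n'
print(select_columns(sample, ["name", "city"]))
```

Using a CSV parser rather than splitting lines on commas matters here, because a quoted value such as "Smith, Jane" contains a comma that is part of the data.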

Workbench, from Columbia’s School of Journalism, is a set of tools for scraping, cleaning, and analyzing data.

Material from GIJN’s website is generally available for republication under a Creative Commons Attribution-NonCommercial 4.0 International license. Images usually are published under a different license, so we advise you to use alternatives or contact us regarding permission. Here are our full terms for republication. You must credit the author, link to the original story, and name GIJN as the first publisher. For any queries or to send us a courtesy republication note, write to hello@gijn.org.
