Tools for Scraping, Cleaning, and Prepping Data



Got dirty data or pesky PDFs? These programs can help you get your data into a format you can use.

OpenRefine is a free tool for exploring, cleaning, and matching data. It is particularly useful for dealing with messy data. It is available in English, Chinese, Spanish, French, Russian, Portuguese (Brazil), German, Japanese, Italian, Hungarian, Hebrew, Filipino, Cebuano, and Tagalog. Here is a good tutorial on OpenRefine.
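To give a feel for what "matching" messy data involves, here is a minimal Python sketch of key-based clustering: each value is reduced to a normalized key, and values that share a key are grouped as likely variants of the same thing. This is an illustration only, not OpenRefine's own code; the function names `fingerprint` and `cluster` are hypothetical, and the normalization steps are assumptions chosen for the example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so near-identical spellings collide."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then replace punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", ascii_text.lower())
    # Sort the unique tokens so word order and repeats do not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with 2+ distinct spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = ["New York Times", "Times, New York", "new york times ", "Guardian"]
for group in cluster(names):
    print(group)
```

Each printed group is a set of spellings that a person should review before merging, since a shared key suggests a match but does not prove one.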

Extracting data from PDFs is a task that many journalists have to deal with. Several free tools make the job easier. Tabula is an open-source tool designed to extract tabular data from PDFs. Another free tool is XPDF, which supports several languages other than English. CometDocs offers limited free accounts, as well as paid plans with more online storage and larger file uploads.

CSVkit is a suite of command-line tools for converting to and working with CSV, one of the most common tabular file formats.
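For readers who have not worked with CSV before, the sketch below shows the two kinds of task described here, converting another tabular format to CSV and then working with the result. It uses Python's standard `csv` module rather than CSVkit itself, and the function names `tsv_to_csv` and `select_columns` are hypothetical, chosen only for this example.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-separated text to comma-separated text."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    # The writer quotes any field that itself contains a comma.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


def select_columns(csv_text: str, columns):
    """Return each row as a dict holding only the named columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] for name in columns} for row in reader]


tsv = "city\tpopulation\nSpringfield, IL\t114000\n"
csv_text = tsv_to_csv(tsv)
print(csv_text)
print(select_columns(csv_text, ["city"]))
```

The sample data is made up for illustration. Note how the city name is wrapped in quotes in the CSV output because it contains a comma; mishandled quoting is a frequent source of broken columns in files from real-world sources.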

Workbench, from Columbia’s School of Journalism, is a set of tools for scraping, cleaning, and analyzing data.
