Global Investigative Journalism Network

Reporting Tools & Tips

A Poor Journalist’s Text-Mining Toolkit

by Friedrich Lindenberg • June 27, 2016

How can you search and analyze collections of documents on your own computer with simple tools? At DataHarvest, Robert Gebeloff and I ran a workshop to answer that question. As people seemed interested, here’s a write-up of the two key tools we worked with: Apache Tika for content extraction and regular expressions in Sublime Text as an advanced search tool.

Data Journalism Methodology

The People and the Technology Behind the Panama Papers

by Mar Cabra & Erin Kissane • May 10, 2016

The trove of files that make up the Panama Papers is likely the largest dataset of leaked insider information in the history of journalism. For ICIJ’s Data and Research Unit, it offered a unique set of challenges. The overall size of the data (2.6 terabytes, 11.5 million files), the variety of file types (from spreadsheets, emails and PDFs to obscure and old formats no longer in use), and the logistics of making it all securely searchable for more than 370 journalists around the world are just a few of the hurdles faced over the course of the 12-month investigation.

Tag

Apache Tika
