A Poor Journalist’s Text-Mining Toolkit

How can you search and analyze collections of documents on your own computer with simple tools? At DataHarvest, Robert Gebeloff and I ran a workshop to answer that question. As people seemed interested, here’s a write-up of the two key tools we worked with: Apache Tika for content extraction and regular expressions in Sublime Text as an advanced search tool.

The People and the Technology Behind the Panama Papers

The trove of files that make up the Panama Papers is likely the largest dataset of leaked insider information in the history of journalism. For ICIJ’s Data and Research Unit, it offered a unique set of challenges. The overall size of the data (2.6 terabytes, 11.5 million files), the variety of file types (from spreadsheets, emails and PDFs to obscure and old formats no longer in use), and the logistics of making it all securely searchable for more than 370 journalists around the world are just a few of the hurdles faced over the course of the 12-month investigation.