When it comes to document management, one of my favorite tools is DocumentCloud. Some of the many great features in DocumentCloud include its ability to create text versions of PDFs and build up stats from your uploaded files. But what if you need to analyze thousands, or even hundreds of thousands, of documents? Then you’ll probably want to look at Overview, a document-mining tool built with investigative journalists in mind.
Overview can import documents from a range of sources, including DocumentCloud, and can handle anywhere from a few dozen pages to millions of pages. Overview can visualize the data in the document collection in several ways, from word clouds to network graphs. It also has a set of search tools that make it relatively simple to filter your data for specific information. One of its best features is that it can automatically group your documents into folders based on their content. And, like DocumentCloud, it has built-in optical character recognition (OCR) so you can view your documents in their original format or as plain text. Add to that the ability to add tags and notes, and suddenly mining thousands of documents doesn’t seem so daunting.
Who Tweeted That?
Twitter can be a great source of data for journalists, but mining the conversations on Twitter effectively is quite difficult unless you know your way around programming and how to use the Twitter API. One of my recent favorite discoveries, however, makes much of this work a lot easier. Treeverse is a Chrome extension that makes finding all the participants in a Twitter thread as easy as clicking on a tweet.
Treeverse can be downloaded from the Chrome Web Store. Once you’ve installed the extension, open up Twitter, find an active thread and open the original tweet. With the tweet open, click on the Treeverse extension icon and you’ll be given a tree view of all the participants in the conversation. Click on a participant and their tweet will be loaded in the sidebar for you, along with the preceding tweets that led to theirs. Threads are also colored based on when they were posted, which adds some nice visual cues.
Mining Twitter Data
While we’re on Twitter, one of the best tools for mining information about Twitter users is FollowerWonk. FollowerWonk has a broad range of tools for analyzing Twitter users. For example, you could compare the profiles of a number of Twitter users and see their relative activity on Twitter, the overlap in the users they follow or their followers. One of the most useful features of FollowerWonk is the option to find users by searching their Twitter bios. This is great for finding users in particular industries or with particular expertise.
At some point in most investigative projects you’ll need to share documents, either with collaborators or with sources. Of course, there are dozens of online services like Dropbox and Google Drive that could be used, and email is also good in many cases. But what if the document is sensitive and you don’t want to risk it being intercepted? Or you need to share the document anonymously? One of the easiest tools to use in these cases is OnionShare.
OnionShare is open source software and runs on most operating systems, including Windows, Mac OS and Linux. Once the software is installed the user can drag and drop files into OnionShare. When they start sharing the files, OnionShare will set up a secure Tor server and generate a URL. The recipient then opens this URL using the secure and anonymous Tor Browser to download the files.
By default OnionShare will close the connection as soon as the files are downloaded so they don’t linger around for others to find.
A Step Further
If you need something more permanent than OnionShare then you’ll need to look at setting up a Virtual Private Network (VPN). VPNs are useful for protecting communication within an organization but have a reputation for being costly or tricky to set up. Outline, however, is neither.
Created by Jigsaw, Outline makes it simple to set up a VPN using services like DigitalOcean, which cost just a few dollars a month. Outline includes a management tool for setting up as many VPNs as you want, as well as a client which is used to access your newly set up VPN.
There are many good reasons you may want to monitor a website for changes. Perhaps you need to see edits made to a profile page, a listing, a regulatory page or a corporate website. Rather than opening the site every morning to scan for changes, simplify your life with a monitoring tool. There are a number of good website monitoring tools available and I particularly like Versionista and VisualPing. Both are simple enough to use: add a URL and your email address and the service will send you an alert whenever it detects a change on your chosen site. Both services will also store versions of the site you’re monitoring so you can see the evolution of the page over time. Both offer a free basic tier as well as more advanced plans for a fee.
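If you are curious about the basic idea behind change monitoring, it can be sketched in a few lines of Python. This is only an illustration and not how Versionista or VisualPing actually work: it assumes you fetch a page yourself, store a fingerprint of its text, and compare that fingerprint on the next visit. The function names (`fingerprint`, `has_changed`, `fetch`) are made up for this example.

```python
import hashlib
import urllib.request


def fingerprint(page_text: str) -> str:
    """Return a SHA-256 hex digest of the text, ignoring whitespace-only differences."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, page_text: str) -> bool:
    """True if the page text no longer matches the stored fingerprint."""
    return fingerprint(page_text) != previous_fingerprint


def fetch(url: str) -> str:
    """Download a page and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

In practice you would call `fetch` on a schedule, save the result of `fingerprint` somewhere, and send yourself an alert when `has_changed` returns `True`. A real monitoring service does far more, such as ignoring ads and timestamps and showing exactly what changed, which is why the hosted tools are usually the better choice.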
Have any tools or tips that you think are worth sharing? Email them to me at firstname.lastname@example.org.
Alastair Otter is GIJN’s IT Coordinator. He is also a managing partner of Media Hack Collective, a data journalism initiative based in Johannesburg, where he programs interactive data visualizations and manages a number of online media sites.