Artificial intelligence for journalism is both promising and overhyped. Major newsrooms are using machine-learning methods to personalize story recommendations, and text-generation technology to automatically write sports and business stories. But it’s taken a lot longer to make AI work for investigative journalism. I researched what’s been done so far, why it’s been hard to make progress, and where the near-future opportunities lie.
Most discussions of AI in journalism, especially investigative journalism, focus on the possibility of “finding patterns” or “making connections” or even “uncovering social problems.” The idea is that these new algorithmic techniques will save reporters a lot of time in the analysis phase of their work, perhaps enabling new types of stories that were previously too difficult or expensive.
Does this actually work? Sometimes, yes. There are a handful of examples where AI techniques were critical. For the story License to Betray, the Atlanta Journal-Constitution scraped over 100,000 doctor disciplinary records from every state, looking for instances where doctors who had sexually abused patients were allowed to continue to practice. A custom machine learning analysis flagged 6,000 documents as likely cases, and the reporters then read and categorized those documents manually.
BuzzFeed used machine learning to find government surveillance planes from public flight plan data. The Washington Post used sentiment analysis to determine that hundreds of negative statements had been removed from the final version of US Agency for International Development audit reports. But there are only a dozen or so success stories like these. There are a number of reasons why investigative journalism can be especially challenging for AI methods.
First, you can’t simply throw everything relevant to your investigation into one big database and let the AI go at it. Even “public” data often must be scraped, requested, negotiated, or purchased, sometimes even purchased one record at a time. Assembling the relevant data is one of the major challenges of journalism.
The required engineering can also be expensive. Unlike many business applications, where the same system can be reused many times, a reporter might use custom AI code for only a single story. There isn’t another pile of 100,000 unread doctor disciplinary reports for the Atlanta Journal-Constitution to analyze.
It’s also important to have realistic expectations. Many investigative journalism problems are still beyond what is possible with state-of-the-art techniques. Current AI systems are not up to the task of summarizing legal documents or backgrounding a list of companies. Progress on these types of problems will be slow because it’s difficult to assemble the large quantities of specialized training data that would be needed to create such algorithms. Today’s natural language processing methods require hundreds of thousands or even millions of individual examples to learn from.
Finally, there is the issue of accuracy. You can’t publicly accuse someone of wrongdoing based on a model that is right 95% of the time, which means AI output requires manual checking. This can erode the computational advantages of speed and scale.
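To see why, it helps to run the numbers. The sketch below is illustrative only: it borrows the 100,000-document figure from the doctor discipline example and assumes, hypothetically, that "right 95% of the time" applies evenly across every document. The function name is made up for this example.

```python
def expected_misclassified(n_documents: int, accuracy: float) -> int:
    """Expected number of documents a classifier gets wrong.

    Assumes the error rate is uniform across documents, which is a
    simplification; real models tend to err more on some kinds of cases.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(n_documents * (1.0 - accuracy))


# Assumption: a 95%-accurate model applied to 100,000 records.
wrong = expected_misclassified(100_000, 0.95)
print(f"{wrong} documents would be misclassified and need human review")
```

Even a model that sounds accurate leaves thousands of individual errors at this scale, and nobody knows in advance which documents they are.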
More fundamentally, it will be difficult to encode “news values” computationally. Which fact patterns are newsworthy? The answer depends on a huge range of contextual social and political factors. You can program story criteria manually, like The Los Angeles Times’ earthquake bot that produced a story if there was a magnitude 3.0 quake or above. Or you can train a system on previous human choices, which is how the Reuters Tracer system determines the newsworthiness of a tweet. Hard-coded criteria will always be somewhat arbitrary and inflexible, while training from human examples will silently replicate any existing biases in coverage. There is no perfect solution.
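The hard-coded approach is easy to express, which is part of its appeal. Here is a minimal sketch of a threshold rule in the spirit of the earthquake example. It is not the Los Angeles Times' actual code, and the function and constant names are hypothetical.

```python
# Cutoff taken from the earthquake example in the text.
MAGNITUDE_THRESHOLD = 3.0


def should_write_story(magnitude: float,
                       threshold: float = MAGNITUDE_THRESHOLD) -> bool:
    """Hard-coded news value: publish only at or above the cutoff."""
    return magnitude >= threshold


# Made-up magnitudes, purely for illustration.
quakes = [2.4, 3.0, 4.7]
stories = [m for m in quakes if should_write_story(m)]
```

A rule like this knows nothing about where a quake struck or who felt it. That context would have to be added by hand, one criterion at a time.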
But there is an application of investigative AI that sidesteps most of these problems: data cleaning and wrangling. For most projects, it takes far longer to prepare data than to analyze it, which means the potential gains from automation are large.
AI could be used to solve several important data wrangling problems for investigative journalists. For example, American TV stations disclose the political advertising they broadcast, but there are hundreds of local stations and they publish this information using wildly different form layouts. There are tens of thousands of these PDF documents published every election, but it’s difficult to extract this data into a spreadsheet using conventional programming techniques. My experiments show that “deep learning” methods — the type of technology used to create modern AI systems such as self-driving cars and machine translation — can be used to interpret these diverse and messy forms.
AI can also be used to help fuse multiple databases together. A person or company name may appear in datasets from different sources, but it will often be spelled slightly differently, and in any case names are not unique. It’s necessary to use other information such as addresses to determine if two records refer to the same entity. In other words, this sort of “record linkage” requires judgement calls at scale, and is an excellent application for machine learning. Automated record linkage was used on the Panama Papers and is available in tools like Dedupe.io.
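As a rough illustration of the idea, the sketch below compares two records on both name and address using simple string similarity from Python's standard library. It is a toy stand-in, not the method used on the Panama Papers or in Dedupe.io, which learn how to weigh fields from labeled examples. The record fields, the sample records, and the 0.85 thresholds are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower()
    )
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity score between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_entity(rec_a: dict, rec_b: dict,
                       name_threshold: float = 0.85,
                       address_threshold: float = 0.85) -> bool:
    """Require BOTH name and address to be close, since names alone
    are not unique. The thresholds are arbitrary illustrative values."""
    name_ok = similarity(rec_a["name"], rec_b["name"]) >= name_threshold
    addr_ok = similarity(rec_a["address"], rec_b["address"]) >= address_threshold
    return name_ok and addr_ok


# Hypothetical records: a misspelled name at the same address, and an
# identically spelled name at a different address.
a = {"name": "Jonathan Smith", "address": "12 Main Street, Atlanta"}
b = {"name": "Jonathon Smith", "address": "12 Main Street, Atlanta"}
c = {"name": "Jonathan Smith", "address": "480 Ocean Avenue, Miami"}

same_ab = likely_same_entity(a, b)  # spelling differs, address matches
same_ac = likely_same_entity(a, c)  # same spelling, different address
```

Fixed thresholds like these are exactly the kind of judgement call that a machine learning system would instead learn from pairs a reporter has already labeled as matches or non-matches.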
In short, I am optimistic about the use of AI in investigative journalism. While we’re a long way from asking a computer to find the story in the data, AI methods can be applied today to speed up data preparation and cleaning – the time-consuming “wrangling” that every data story requires.
I cover all of these issues and more in Making AI Work for Investigative Journalism.
Jonathan Stray is a computational journalist at Columbia University, where he teaches the dual master’s degree in computer science and journalism. He’s contributed to The New York Times, The Atlantic, Wired, Foreign Policy, and ProPublica. He was formerly an editor at the Associated Press, a reporter in Hong Kong, and a research scientist in Silicon Valley.