Beyond the Hype: Using AI Effectively in Investigative Journalism

by Jonathan Stray • September 9, 2019

Image: Shutterstock

Artificial intelligence for journalism is both promising and overhyped. Major newsrooms are using machine-learning methods to personalize story recommendations, and text-generation technology to automatically write sports and business stories. But it’s taken a lot longer to make AI work for investigative journalism. I researched what’s been done so far, why it’s been hard to make progress, and where the near-future opportunities lie.

Most discussions of AI in journalism, especially investigative journalism, focus on the possibility of “finding patterns” or “making connections” or even “uncovering social problems.” The idea is that these new algorithmic techniques will save reporters a lot of time in the analysis phase of their work, perhaps enabling new types of stories that were previously too difficult or expensive.

Does this actually work? Sometimes, yes. There are a handful of examples where AI techniques were critical. For the story License to Betray, the Atlanta Journal-Constitution scraped over 100,000 doctor disciplinary records from every state, looking for instances where doctors who had sexually abused patients were allowed to continue to practice. A custom machine learning analysis flagged 6,000 documents as likely cases, which the reporters then read and categorized manually.
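
The triage pattern in that example (a model ranks documents, reporters read the top of the list) can be sketched in a few lines. This is only an illustration, not the Atlanta Journal-Constitution's method: a tiny naive Bayes classifier is swapped in as a stand-in, and the training snippets, function names, and variable names are all invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real documents would need more care."""
    return text.lower().split()

def train(labeled_docs):
    """Count words per class from (text, is_case) pairs labeled by a reporter."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_case in labeled_docs:
        doc_counts[is_case] += 1
        word_counts[is_case].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def case_score(model, text):
    """Log-odds that a document is a likely case (above 0 leans 'case')."""
    word_counts, doc_counts, vocab = model
    totals = {label: sum(word_counts[label].values()) for label in (True, False)}
    score = math.log(doc_counts[True] / doc_counts[False])
    for word in tokenize(text):
        if word not in vocab:
            continue  # ignore words never seen in training
        # Add-one (Laplace) smoothing avoids zero probabilities.
        p_case = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
        p_other = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
        score += math.log(p_case / p_other)
    return score

# Hypothetical, invented snippets standing in for hand-labeled records.
TRAINING = [
    ("board finds sexual misconduct with patient", True),
    ("inappropriate sexual contact with patient during exam", True),
    ("failure to renew license on time", False),
    ("late payment of license renewal fee", False),
]
model = train(TRAINING)

# Unread documents are sorted so the likeliest cases are read first.
UNREAD = [
    "license renewal fee was paid late",
    "physician admitted sexual contact with a patient",
]
review_queue = sorted(UNREAD, key=lambda doc: case_score(model, doc), reverse=True)
```

The model only orders the reading list. Every document it surfaces still has to be read and categorized by a person, as in the original investigation.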

BuzzFeed used machine learning to find government surveillance planes from public flight plan data. The Washington Post used sentiment analysis to determine that hundreds of negative statements had been removed from the final version of US Agency for International Development audit reports. But there are only a dozen or so success stories like these. There are a number of reasons why investigative journalism can be especially challenging for AI methods.

First, you can’t simply throw everything relevant to your investigation into one big database and let the AI go at it. Even “public” data often must be scraped, requested, negotiated, or purchased, sometimes even purchased one record at a time. Assembling the relevant data is one of the major challenges of journalism.

The required engineering can also be expensive. Unlike many business applications, where the same system can be used many times, a reporter might use custom AI code for only a single story. There isn’t another pile of 100,000 unread doctor disciplinary reports for the Atlanta Journal-Constitution to analyze.

It’s also important to have realistic expectations. Many investigative journalism problems are still beyond what is possible with state-of-the-art techniques. Current AI systems are not up to the task of summarizing legal documents or backgrounding a list of companies. Progress on these types of problems will be slow because it’s difficult to assemble the large quantities of specialized training data that would be needed to create such algorithms. Today’s natural language processing methods require hundreds of thousands or even millions of individual examples to learn from.

Finally, there is the issue of accuracy. You can’t publicly accuse someone of wrongdoing based on a model that is right 95% of the time, which means AI output requires manual checking. This can erode the computational advantages of speed and scale.
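
A little arithmetic shows why "right 95% of the time" still means a lot of manual checking. The sketch below uses entirely hypothetical numbers: a pile of 100,000 documents, an assumed 2% that are real cases, and a model that catches 95% of real cases and correctly clears 95% of the rest. None of these figures come from an actual investigation, and the function name is made up for illustration.

```python
def review_workload(n_docs, prevalence, sensitivity, specificity):
    """Estimate what a reporter faces after a classifier flags documents.

    prevalence:  assumed share of documents that are real cases
    sensitivity: share of real cases the model flags
    specificity: share of non-cases the model correctly leaves alone
    """
    true_cases = n_docs * prevalence
    non_cases = n_docs - true_cases
    true_positives = true_cases * sensitivity
    false_positives = non_cases * (1 - specificity)
    flagged = true_positives + false_positives
    return {
        "flagged": round(flagged),
        "false_positives": round(false_positives),
        "missed_cases": round(true_cases - true_positives),
        "precision": true_positives / flagged,
    }

# All four inputs are assumptions chosen for illustration.
result = review_workload(n_docs=100_000, prevalence=0.02,
                         sensitivity=0.95, specificity=0.95)
```

Under these assumptions the model flags about 6,800 documents, of which about 4,900 are false alarms, so only around 28% of flagged documents are real cases, and about 100 real cases are missed entirely. A rarer problem makes the ratio worse, which is why flagged output has to be verified before anyone is named.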

More fundamentally, it will be difficult to encode “news values” computationally. Which fact patterns are newsworthy? The answer depends on a huge range of contextual social and political factors. You can program story criteria manually, like The Los Angeles Times’ earthquake bot that produced a story if there was a magnitude 3.0 quake or above. Or you can train a system on previous human choices, which is how the Reuters Tracer system determines the newsworthiness of a tweet. Hard-coded criteria will always be somewhat arbitrary and inflexible, while training from human examples will silently replicate any existing biases in coverage. There is no perfect solution.
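
Hard-coded story criteria are easy to write, which is part of their appeal. The sketch below is a hypothetical rule in the spirit of the earthquake example, not the Los Angeles Times' actual code. Only the 3.0 threshold comes from the description above; the function and constant names are invented.

```python
MAGNITUDE_THRESHOLD = 3.0  # threshold taken from the earthquake example above

def should_publish(magnitude: float) -> bool:
    """Hard-coded news judgement: a quake is a story at or above the threshold."""
    return magnitude >= MAGNITUDE_THRESHOLD

# The rule knows nothing about location, damage, or timing (hypothetical readings).
decisions = {m: should_publish(m) for m in (2.9, 3.0, 6.4)}
```

A magnitude 2.9 quake under a city center is ignored while a 3.0 in an empty desert gets a story. That is the sense in which hand-written criteria are arbitrary and inflexible.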

But there is an application of investigative AI that sidesteps most of these problems: data cleaning and wrangling. For most projects, it takes far longer to prepare data than to analyze it, which means the potential gains from automation are large.

AI could be used to solve several important data wrangling problems for investigative journalists. For example, American TV stations disclose the political advertising they broadcast, but there are hundreds of local stations and they publish this information using wildly different form layouts. There are tens of thousands of these PDF documents published every election, but it’s difficult to extract this data into a spreadsheet using conventional programming techniques. My experiments show that “deep learning” methods — the type of technology used to create modern AI systems such as self-driving cars and machine translation — can be used to interpret these diverse and messy forms.

AI can also be used to help fuse multiple databases together. A person or company name may appear in datasets from different sources, but it will often be spelled slightly differently, and in any case names are not unique. It’s necessary to use other information such as addresses to determine if two records refer to the same entity. In other words, this sort of “record linkage” requires judgement calls at scale, and is an excellent application for machine learning. Automated record linkage was used on the Panama Papers and is available in tools like Dedupe.io.
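
To make the record linkage problem concrete, here is a minimal sketch using only Python's standard library. It is not how Dedupe.io or the Panama Papers pipeline works: fixed similarity thresholds stand in for the judgement a machine learning tool would learn from examples, and the records, field names, and threshold values are all hypothetical.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())

def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_entity(rec_a, rec_b, name_threshold=0.85, address_threshold=0.8):
    """Link two records only if BOTH name and address are close matches.

    The thresholds are arbitrary assumptions; a learned model would
    weigh the fields from labeled examples instead.
    """
    return (similarity(rec_a["name"], rec_b["name"]) >= name_threshold
            and similarity(rec_a["address"], rec_b["address"]) >= address_threshold)

# Invented records from two imaginary datasets.
registry = {"name": "Acme Holdings Ltd.", "address": "12 Harbour Road, Hong Kong"}
leak = {"name": "ACME Holdings Limited", "address": "12 Harbour Rd, Hong Kong"}
namesake = {"name": "Acme Holdings Ltd", "address": "400 Main Street, Springfield"}

linked = same_entity(registry, leak)               # spelling differs, same place
not_linked = same_entity(registry, namesake)       # same name, different place
```

The first pair links despite the spelling differences, while the namesake with an identical name but a different address does not. That is the "names are not unique" problem in miniature. At the scale of millions of records, choosing those thresholds by hand stops working, which is where machine learning helps.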

In short, I am optimistic about the use of AI in investigative journalism. While we’re a long way from asking a computer to find the story in the data, AI methods can be applied today to speed up data preparation and cleaning – the time-consuming “wrangling” that every data story requires.

I cover all of these issues and more in Making AI Work for Investigative Journalism.

Jonathan Stray is a computational journalist at Columbia University, where he teaches the dual master’s degree in computer science and journalism. He’s contributed to The New York Times, The Atlantic, Wired, Foreign Policy, and ProPublica. He was formerly an editor at the Associated Press, a reporter in Hong Kong, and a research scientist in Silicon Valley.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License
