How Machine Learning Can (And Can’t) Help Journalists

by Floris Wu • March 19, 2019

Kevin Wall, a visual journalist at The Boston Globe, is just beginning to use machine learning in his reporting. But the large amount of data he needs to leverage this kind of artificial intelligence isn’t always easy to find.

“We need a lot of data for machine learning and deep learning, so it can be tough because you will need teams of people to get [that] amount of data,” he says.

Like Wall, the journalism industry is still scratching the surface of these cutting-edge data science tools. There are only a handful of projects out there – such as the search for surveillance aircraft by BuzzFeed News, the analysis of the misclassification of serious assaults by The Los Angeles Times and image recognition of members of Congress by The New York Times.

“For now, journalists and the media industry as a whole are recognizing that AI and [machine learning] can benefit them, but it also represents this drastic shift from what otherwise has been a very stable industry for the last couple hundred years,” says Alex Siegman, an AI technical program manager at Dow Jones. “This is something that’s still very new, and a lot of newsrooms are exploring what it means for them and how they can derive benefits from it.”

What Is Machine Learning?

Simply put, machine learning is when a computer model is trained on a “teaching set” of data so that it can identify patterns and make predictions substantially faster and more effectively than a human being. An example of this is training a model on a large set of labeled cat and dog images and then asking it to distinguish between new pictures of cats and dogs with a high level of accuracy.
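To make the idea of a teaching set concrete, here is a minimal sketch in Python of one of the simplest classifiers, a one-nearest-neighbor model. It is an illustration only. The two features (weight and ear length) and the handful of labeled animals are invented for the example, and real image classifiers work on pixels, with far larger data sets and more sophisticated models.

```python
import math

# Invented "teaching set": (weight_kg, ear_length_cm) -> label.
# These numbers are made up for illustration, not real measurements.
TRAINING_SET = [
    ((4.0, 6.5), "cat"),
    ((3.5, 6.0), "cat"),
    ((5.0, 7.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((30.0, 12.5), "dog"),
    ((12.0, 10.0), "dog"),
]


def predict(features, training_set=TRAINING_SET):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(
        training_set,
        key=lambda example: math.dist(example[0], features),
    )
    return nearest_label


# Two animals the model has not seen before.
print(predict((4.2, 6.2)))
print(predict((25.0, 12.0)))
```

The model never receives a rule such as “dogs are heavier.” It only compares a new example with the labeled ones it was given, which is why the size and quality of the teaching set matter so much.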

As Siegman puts it, machine learning is “finding patterns in large amounts of data and making predictions based on historical data.” There are two aspects to using machine learning in journalism: as part of investigative reporting, or as a day-to-day tool to make journalists’ lives easier.

Machine Learning for Investigative Reporting

“There are probably relatively few circumstances under which reporters are going to need … to acquire machine learning – it’s really where you’ve got a classification task,” says Peter Aldhous, a reporter on the science desk at BuzzFeed News.

Aldhous is behind Hidden Spy Planes, an investigative project for which he used machine learning – specifically a “random forest” algorithm, described at https://buzzfeednews.github.io/2017-08-spy-plane-finder/ – to identify, from a massive amount of flight data, which aircraft might be covert spy planes. The project won a 2018 Data Journalism Award for innovation in data journalism.

Some findings from Peter Aldhous’s spy planes investigation. Photo: Peter Aldhous.

Aldhous says his plane project was a rare case in which machine learning was actually a good fit, because there was a large enough data set to train the model. “I had very good data on these aircraft, and a lot of it,” he says.

Aldhous acquired four months of flight data from more than 100 known government aircraft. From that, he was able to build a model that could flag planes that might have been surveillance aircraft based on “their turning rates, speeds and altitudes flown, the areas of rectangles drawn around each flight path and the flights’ durations.”
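The kinds of measurements quoted above can be computed from a flight track with a few lines of code. The sketch below is not Aldhous’s code. The `TrackPoint` fields, the units and the choice to measure the rectangle in square degrees are all assumptions made for illustration, and the four-point track is invented.

```python
from dataclasses import dataclass


@dataclass
class TrackPoint:
    """One position report. Field names and units are hypothetical."""
    t: float         # seconds since the start of the flight
    lat: float       # degrees
    lon: float       # degrees
    altitude: float  # feet
    speed: float     # knots
    heading: float   # degrees, 0-360


def heading_change(a, b):
    """Smallest absolute angle between two headings, in degrees."""
    diff = abs(b - a) % 360
    return min(diff, 360 - diff)


def flight_features(track):
    """Summarize one flight as a handful of numbers a classifier could use."""
    duration = track[-1].t - track[0].t
    total_turn = sum(
        heading_change(p.heading, q.heading) for p, q in zip(track, track[1:])
    )
    lats = [p.lat for p in track]
    lons = [p.lon for p in track]
    return {
        "duration_s": duration,
        "mean_speed": sum(p.speed for p in track) / len(track),
        "mean_altitude": sum(p.altitude for p in track) / len(track),
        "turn_rate_deg_per_s": total_turn / duration if duration else 0.0,
        # Area of the latitude/longitude box drawn around the flight path.
        "bbox_area_deg2": (max(lats) - min(lats)) * (max(lons) - min(lons)),
    }


# Invented example: a plane turning steadily over a small area.
circling = [
    TrackPoint(0, 42.00, -71.00, 5000, 100, 0),
    TrackPoint(60, 42.01, -71.00, 5000, 100, 90),
    TrackPoint(120, 42.01, -71.01, 5000, 100, 180),
    TrackPoint(180, 42.00, -71.01, 5000, 100, 270),
]
features = flight_features(circling)
print(features)
```

A plane that circles for a long time over a small area shows up as a high turning rate and a small rectangle. A row of numbers like this for each aircraft is the sort of input a classifier such as a random forest works from.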

But Aldhous warns that there is a danger of data reporters getting too excited about this shiny new tool. He says Rachel Shorey, a software engineer in the interactive news department of The New York Times, summarized this sentiment well at the 2018 National Institute for Computer-Assisted Reporting (NICAR) conference: Sometimes, simple things like a keyword alert or standard statistical sampling techniques might do just as good a job in even less time.
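For comparison, a keyword alert of the kind Shorey mentions needs no training data at all. The following is a minimal sketch, assuming the documents are already available as plain-text strings. The function name and the sample filings are invented for illustration.

```python
import re


def keyword_alert(documents, keywords):
    """Return (index, matched keywords) for each document mentioning any keyword."""
    # Whole-word, case-insensitive patterns, one per keyword.
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    hits = []
    for i, text in enumerate(documents):
        matched = [kw for kw, pattern in patterns.items() if pattern.search(text)]
        if matched:
            hits.append((i, matched))
    return hits


# Invented sample records, not real filings.
filings = [
    "Committee reports a payment for catering services.",
    "Disbursement to a hotel for event space rental.",
    "Refund issued to donor.",
]
print(keyword_alert(filings, ["hotel", "catering"]))
```

If a handful of known terms catches the records a reporter cares about, this is easier to explain to readers and quicker to build than a trained model.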

A slide from Rachel Shorey’s 2018 NICAR talk.

“We need to use the right tool for the right job,” says Aldhous. “[For much of what we do], we don’t need machine learning; we need good data reporting.”

Although the need for machine learning in the newsroom is relatively rare, Shorey pointed out what actually happens when journalists implement this technology in their reporting. The process is “much more haphazard than is desirable,” Shorey wrote in an email. First, reporters find a good library in their favorite programming language; second, they read the documentation; third, they confirm that the methods are a good approach and they understand the inputs and outputs (even if not all the underlying math); fourth, they spend days to weeks cleaning data; and last, they write about 10 lines of code to execute the machine learning process.

Machine Learning as a Day-To-Day Tool

“There’s a lot to what journalists have to do,” says Siegman at Dow Jones. “If you can use technology or machine learning to automate or even semi-automate any part of that, that is a great benefit to journalists.”

Machine learning can help journalists with their day-to-day tasks, such as finding stories, doing photography and videography work, or editing and publishing their work on social media, he says. This can mean small tasks, such as automatically transcribing recordings, using image recognition to identify someone in a photo and captioning videos, or larger ones, such as finding useful information in a huge influx of content from sources such as social media.

Siegman thinks machine learning, or artificial intelligence, is nothing more than a tool. Ten or 20 years from now, he says, people will think about machine learning the way we think about Microsoft Excel today: “It’s [just] a tool that we are using to perform certain job functions.”

The Ethics of Machine Learning in Journalism

“I would not be happy, in journalism, using black box machine learning methods [where] I don’t know what they are doing,” says Aldhous, referring to the critique that many algorithms lack transparency in how they are designed and trained.

Aldhous says transparency is crucial in journalism – reporters should be able to explain what they did. And at the same time, readers should be able to repeat what reporters did.

Algorithmic accountability is also vital. “One of the most important things journalists need to be doing is actually doing watchdog reporting on how machine learning algorithms are being used by companies and by government,” says Aldhous.

Aldhous thinks watchdog reporting around those issues is even more important than journalists using the algorithms themselves. He says there is a “potential for bias in any algorithmic decision.”

This can happen when a training set includes societal bias that machine learning picks up on, says Carlos Scheidegger, a computer scientist from the University of Arizona.

“There’s very little you can do to validate your results if there’s a problem with the way that a classifier you are using worked,” he says.

Both Siegman and Aldhous cited Amazon’s use of a recruitment algorithm that turned out to be biased against women. The system was trained on resumes submitted over a 10-year period, mostly by male applicants. It then started penalizing resumes that included the word “women.”
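A toy example shows how a skewed training set can produce this kind of penalty. The sketch below is not Amazon’s system. The records are invented, and the scoring rule (a word’s hire rate minus the overall hire rate) is a deliberately crude stand-in for a real model.

```python
from collections import defaultdict

# Invented historical records: (words on the resume, was the applicant hired?).
# The skew is deliberate: no past hire has a resume containing "women".
HISTORY = [
    ({"python", "chess"}, True),
    ({"java", "rugby"}, True),
    ({"python", "rugby"}, True),
    ({"java", "chess"}, False),
    ({"python", "women", "chess"}, False),
    ({"java", "women", "rugby"}, False),
]


def word_scores(history):
    """Score each word by how far its hire rate sits from the overall rate."""
    overall = sum(hired for _, hired in history) / len(history)
    seen = defaultdict(int)
    hired_with = defaultdict(int)
    for words, hired in history:
        for word in words:
            seen[word] += 1
            hired_with[word] += hired  # True counts as 1, False as 0
    return {word: hired_with[word] / seen[word] - overall for word in seen}


scores = word_scores(HISTORY)
# Lowest-scoring (most penalized) words first.
print(sorted(scores.items(), key=lambda item: item[1]))
```

Nothing in the code mentions gender, yet the word ends up with the lowest score simply because of who was hired in the past. A model trained on such scores would carry that pattern into new decisions.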

“The bias precipitated through the algorithm, and into the real world,” says Siegman.

Siegman thinks privacy concerns are also alarming. “To use any machine learning, you need lots and lots of data,” he says. “And there are privacy concerns around how you are collecting that data from users.”

The Future of Machine Learning in Journalism

Aldhous thinks machine learning has a future in journalism, but more on the publishing side – such as how to organize, distribute, share and display content to attract more readers.

“But as time goes on, we will get a better idea of when it’s the right tool for the job, and when it just overkills or is not necessary,” he says.

Siegman agrees. “Don’t think about where we can use AI,” he says. “Think about what problems you are facing on a day-to-day [basis], and then evaluate whether or not AI might be a possible solution to that.”

This post originally appeared on Storybench and is reproduced here with permission.

Floris Wu is a graduate student at Northeastern’s School of Journalism, where she studies journalism with a focus on data science. She is a regular contributor to Storybench.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License
