Tips to Guide Investigative Journalists in Document Text Analysis

by Patrick Egwu • November 30, 2023

Investigative journalists often face the challenge of reviewing and combining large documents or data in text form. This can be exhausting and labor-intensive.

Ideally, data is accessible in a friendly format, such as a spreadsheet, CSV, or JSON file, for easy analysis. But data is often stuck in hard-to-extract sources like PDFs, emails, articles, and social network posts, noted Fernanda Aguirre, a data analyst at The Examination, a nonprofit newsroom that investigates preventable health threats, and at Data Critica, a data journalism organization focused on Latin America.

“These kinds of documents are not as intuitive to analyze as they are when they come in a structured way,” Aguirre explained during a session at the 13th Global Investigative Journalism Conference (#GIJC23). “[This is] when we are dealing with what is known as unstructured data specifically in text-based form.”

At GIJC23, Aguirre emphasized that structured data is easier to analyze than information in text form. This is because structured data, she pointed out, is easy to manipulate: journalists can apply arithmetic operations, count categories, and calculate percentages or rates of change.
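
As a rough illustration of those three operations, here is a minimal Python sketch using only the standard library. The records, field names, and helper functions are hypothetical and stand in for rows a reporter might load from a spreadsheet or CSV file.

```python
from collections import Counter

# Hypothetical structured records, e.g. rows loaded from a CSV file.
records = [
    {"year": 2021, "category": "fine"},
    {"year": 2021, "category": "warning"},
    {"year": 2022, "category": "fine"},
    {"year": 2022, "category": "fine"},
    {"year": 2022, "category": "closure"},
    {"year": 2022, "category": "warning"},
]


def count_by(rows, field):
    """Count how many rows fall into each value of `field`."""
    return Counter(row[field] for row in rows)


def percentages(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    return {key: 100 * value / total for key, value in counts.items()}


def rate_of_change(old, new):
    """Percent change from `old` to `new`."""
    return 100 * (new - old) / old


by_category = count_by(records, "category")   # counting categories
by_year = count_by(records, "year")
share = percentages(by_category)              # percentages
change = rate_of_change(by_year[2021], by_year[2022])  # rate of change
```

With this made-up data, half of the records are fines and the number of records doubles from 2021 to 2022. Each question is a one-line calculation once the data is structured.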

To get the data into this easily usable state, Aguirre said it must be converted into something measurable. “This is done simply by extracting insights that point us in the direction we need to investigate further,” she said, noting that “the purpose of analyzing documents with text is to extract useful information from them.”

This process can be daunting, however, so Aguirre offered tips to guide investigative journalists through it.

Ask Questions

To analyze any document, Aguirre said the first step should be for journalists to ask a series of relevant questions, just as they would with any other source of information. Depending on the document, the questions could reveal the names of those involved, locations, or the dates of events. This initial stage is all about establishing a baseline of facts and, possibly, a chronology of events.

Process the Text

During the next stage of analysis, Aguirre said, it is important to remove “stop words,” terms that don’t contain semantic content. She added that these kinds of words, which include articles and conjunctions, are widely used but carry no real information or meaning. There is no official list, Aguirre said, and stop word lists vary by language.
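
The stop-word step can be sketched in a few lines of Python. The word list below is a tiny, hypothetical one written for this example; a real project would use a fuller list suited to the language of the documents.

```python
import re

# A tiny, hypothetical English stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "but", "in", "of", "or", "the", "to", "was"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


sample = "The contract was awarded to a company in the capital."
tokens = remove_stop_words(tokenize(sample))
# tokens == ["contract", "awarded", "company", "capital"]
```

What remains after filtering are the words most likely to say what the document is about.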

Consider Natural Language Processing

Sometimes, text analysis can be complex. In situations like these, Aguirre recommended the use of Natural Language Processing (NLP), a branch of AI that helps computers understand the way humans write and speak. She highlighted a few handy NLP techniques and tools to use in analyzing documents.

Named Entity Recognition (NER): This technique extracts different kinds of entities from documents. Aguirre said it is useful for understanding what a document is talking about and for classifying named entities into predefined categories such as names, locations, nationalities, religious or political groups, organizations and companies, and agencies and institutions.

Topic Modeling: This technique identifies groups of similar words within a body of text. “I like to think of topic modeling not only as a technique to analyze your data, but also to organize,” Aguirre said, adding that algorithms such as LSA, LDA, BERTopic, and Top2Vec can be used with this technique. Kurtis Pykes, a data scientist, has similarly described topic modeling as a type of statistical modeling that scans documents, detects word and phrase patterns within them, and groups similar words into topics. It is used to discover hidden topics within a set of text documents.

DocumentCloud: This is a popular, open source platform that allows journalists to upload, organize, analyze, annotate, and publish source documents to the open web. Gumshoe is one of the tools recently integrated with DocumentCloud. This AI tool is aimed at the challenge journalists and newsrooms face in reviewing, for instance, massive document releases received through Freedom of Information requests.

Hugging Face: A recent article published in MUO explained that Hugging Face is an open source platform that provides tools and resources for working on language and computer vision projects. (It has both free and paid tiers.) In essence, Aguirre said, Hugging Face helps users “find very specific modules available for a wide variety of languages.” The platform also enables users to build, train, and deploy machine learning models, and it acts as a community hub where AI experts, machine learning engineers, and data scientists share ideas and get support.
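
To make the named entity recognition idea above more concrete, here is a deliberately simple Python sketch. It is not how trained NER models work: it swaps in a dictionary lookup (a "gazetteer") of already-known names, which is an assumption made for this example. The names, category labels, and function are hypothetical; the sample sentence restates details from this article.

```python
import re

# Hypothetical gazetteer: known names mapped to entity categories.
# Trained NER models infer these labels from context instead.
GAZETTEER = {
    "Fernanda Aguirre": "PERSON",
    "The Examination": "ORGANIZATION",
    "Data Critica": "ORGANIZATION",
    "Latin America": "LOCATION",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, name, label) for each known name found, in reading order."""
    hits = []
    for name, label in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            hits.append((match.start(), name, label))
    return sorted(hits)


text = ("Fernanda Aguirre is a data analyst at The Examination and "
        "Data Critica, which focuses on Latin America.")
entities = [(name, label) for _, name, label in find_entities(text)]
```

The output has the same shape as what a real NER tool returns, a list of text spans with a category attached, which is what lets a reporter count or cross-reference people, organizations, and places across many documents.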

Collaborate

The process of analyzing documents sometimes requires technical skills that journalists may not have. In situations like this, Aguirre urged journalists to collaborate with academics or with other newsrooms that have data teams or programmers, who can help them handle complex investigations and extract the different kinds of entities in documents.

Aguirre said that collaboration “is like a great way to deal with complex investigations regarding documents,” especially for journalists who don’t do programming. She added that journalists don’t have to read several pages before analyzing a document; they can use NLP techniques to go to specific pages within the documents and extract the information they need.

“These techniques help to find out where to look for [specific] information [in the documents],” explained Aguirre.
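
In its simplest form, finding where to look can be a plain keyword search over page text, sketched below in Python. This is ordinary string matching rather than an NLP model, and the pages and function name are hypothetical; it assumes the text of each page has already been extracted.

```python
def pages_mentioning(pages, term):
    """Return 1-based page numbers whose text contains `term`, ignoring case."""
    needle = term.lower()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.lower()]


# Hypothetical document: one string per page, e.g. text extracted from a PDF.
pages = [
    "Summary of the inspection program.",
    "The ministry approved the contract in March.",
    "Appendix: list of inspectors.",
    "Payments under the contract were delayed.",
]
hits = pages_mentioning(pages, "contract")
# hits == [2, 4]
```

The same pattern extends to the techniques described above: instead of matching a fixed word, a reporter could flag pages where an entity extractor finds a particular person or organization.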

Watch the full GIJC23 panel on Text Analysis for Investigative Reporting.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License


