New Document Tools to Unearth Redacted Text, Personal Information, and More

by Rowan Philp • April 10, 2023

New features from DocumentCloud can help investigative journalists reveal or protect poorly redacted text as well as quickly scrape personal information embedded in large files. Image: Shutterstock

One of the biggest technological leaps for investigative journalists in recent years has been the development of free tools to make large document bundles searchable and manageable for small teams.

In earlier years, reporters needed stacks of multi-colored sticky notes, data input volunteers, and lots of time to manage boxes of public records that arrived in every format, from longhand script to unstructured data tables and partially redacted reports.

But tools powered by machine learning and the ingenuity of open source program developers have not only tamed giant leaks, but can also unearth hidden data in those bundles, and reduce the risk of inadvertently publishing sensitive information.

For instance, attendees at the 2022 Investigative Reporters & Editors conference were amazed to learn that the AI-powered Google Pinpoint tool — in addition to its many time-saving parsing functions — could also transcribe, and search, the tiny text on brass plaques in the distant background of photographs. Indeed, journalists at the environmental newsroom Floodlight were recently named finalists for the Goldsmith Investigative Reporting Prize after they used Pinpoint to auto-analyze thousands of pages of leaked documents to identify individuals allegedly behind a media corruption scandal.

There was a similarly enthusiastic response at the recent NICAR23 data journalism conference in Tennessee, when reporters learned about powerful new digging features on the open source DocumentCloud platform.

A service of the nonprofit MuckRock Foundation, the largely free DocumentCloud platform is already popular for its base document management features, which include easy upload of 70 formats, from PDFs to spreadsheets and graphics; annotating reports; and — its best-known feature — the ability to embed processed documents directly into your story. You can also keyword search its public database of some five million documents added by other researchers and reporters, using familiar Google-type syntax like “AND” and “OR.” And its embedding function is especially important in the current era of declining trust in media, as audiences can directly check your claim that you did indeed find X or Y from a report, effectively turning documents into on-the-record sources.

But DocumentCloud now offers many more cutting-edge functions, which range from importing files from services like Google Drive to transcribing YouTube audio and even peering through weak blackout redactions (see the list below).

Tools to Address Real-World Data Challenges

In a single-speaker presentation at NICAR23, Sanjin Ibrahimovic, Open Source Fellow at the MuckRock Foundation, said add-ons to the core functions have been created by the DocumentCloud community — users, fellows, data science grantees, and journalists — to address problems and opportunities they encountered during live projects.

DocumentCloud’s PII Detector add-on feature can extract key information previously hidden in huge data files. Image: Screenshot, DocumentCloud

For instance, Ibrahimovic said users noticed that it took a long time to pick out personally identifiable information (PII) scattered throughout thick files, and that some could be missed in embedded information, like email addresses, social security numbers, ZIP codes, credit card numbers, and physical addresses in the small print.

So DocumentCloud has added a feature that automatically finds and highlights PII terms.
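To make the idea concrete, here is a minimal sketch of pattern-based PII spotting in Python. It is not the Add-On's actual code: the pattern names, the `find_pii` helper, and the sample text are assumptions made for illustration, and the US-style patterns would miss many real-world formats.

```python
import re

# Illustrative, US-centric patterns (assumptions for this sketch only).
# Real PII detection needs broader, locale-aware rules and human review.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
}


def find_pii(text):
    """Return (label, matched_text) pairs in the order they appear in text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    # Sort by position so results read top to bottom, like highlights on a page.
    return [(label, value) for _, label, value in sorted(hits)]


sample = (
    "Contact Jane Doe at jane.doe@example.org or (478) 555-0134. "
    "SSN on file: 123-45-6789."
)
for label, value in find_pii(sample):
    print(f"{label}: {value}")
```

Simple patterns like these produce both false positives and misses, which is one reason flagged terms should be checked by a person before anything is published or redacted.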

Meanwhile, Ibrahimovic said users were also struck by the weakness and inaccuracy of redactions in documents from government agencies — where officials often use black highlighter pens or poor redaction software to conceal sensitive or secret information. This presents a risk for newsrooms seeking to embed documents, as sensitive information on victims, for instance, could be digitally extracted by bad actors.

So DocumentCloud implemented a “Bad Redactions” add-on feature, which helps journalists in two crucial ways:

It automatically analyzes and surfaces all the supposedly redacted passages in a single spreadsheet, so you can sometimes reveal what the agency intended to conceal.
It gives you the option to complete the redaction job: to permanently scrub all the digital information beneath the blacked out sections, and fully redact them for public documents, or pages you embed. Ibrahimovic warned that reporters should think carefully before clicking on the “Confirm Redaction” button for passages they choose to redact — “because this is a permanent procedure — it’s not reversible.”
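Why can a blacked-out passage still be read? In many PDFs the black box is just a shape drawn over text that remains in the file. The sketch below shows the underlying geometry check in Python, assuming you have already extracted word positions and filled-rectangle positions with a PDF library. The `Box` class, the function names, and the 0.9 threshold are hypothetical choices for this illustration, not the Add-On's own implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float


def covered_fraction(word: Box, rect: Box) -> float:
    """Fraction of the word's area that lies under the rectangle."""
    width = min(word.x1, rect.x1) - max(word.x0, rect.x0)
    height = min(word.y1, rect.y1) - max(word.y0, rect.y0)
    if width <= 0 or height <= 0:
        return 0.0
    word_area = (word.x1 - word.x0) * (word.y1 - word.y0)
    return (width * height) / word_area if word_area > 0 else 0.0


def find_bad_redactions(words, rects, threshold=0.9):
    """Return text of words that still exist underneath a drawn rectangle.

    words: list of (text, Box) pairs from the PDF's text layer.
    rects: list of Box objects for dark filled rectangles on the page.
    """
    return [
        text
        for text, box in words
        if any(covered_fraction(box, rect) >= threshold for rect in rects)
    ]


words = [
    ("Name:", Box(10, 10, 50, 22)),
    ("Jane", Box(60, 10, 90, 22)),
    ("Doe", Box(95, 10, 120, 22)),
]
black_boxes = [Box(58, 8, 125, 24)]
print(find_bad_redactions(words, black_boxes))
```

A proper redaction removes the underlying text from the file rather than only covering it, which is why the permanent "Confirm Redaction" step matters before embedding a document.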

For his recent Organized Crime and Corruption Reporting Project (OCCRP) investigation into the trafficking of endangered brazilwood, Luiz Fernando Toledo used Bad Redactions to learn the names of small Brazilian companies fined for smuggling.

His story involved obtaining hundreds of reports on environmental fines from government agencies and then organizing those documents, explained Toledo — who is also project coordinator for the Data Fixers environmental crimes nonprofit. “The Bad Redactions Add-On helped me to find out the names of several people and companies charged. The Import Documents function was also very important. It was easy to parse through so many documents and find the key parts that I needed. I also used DocumentCloud to fact check the whole project.”

User-Friendly Digging Features

Ibrahimovic acknowledged that, although Add-Ons are transparent and open source, creating one requires coding skills. They are built with tools like the DocumentCloud application programming interface (API) and GitHub Actions. But he said Add-Ons are only accepted for the service if they are easy to use.

The DocumentCloud Bad Redactions Add-On can both reveal poorly redacted info and help journalists protect confidential information. Image: Screenshot, DocumentCloud

“Users don’t need any programmatic knowledge to run an Add-On,” he pointed out. “The idea for data extraction and analysis procedures is so smaller newsrooms can use this without needing programming skills.”

Nevertheless, running Add-Ons may present technical challenges to non-data reporters — so users should check out MuckRock’s YouTube tutorial channel on the topic.

Access to DocumentCloud requires creating an account — ideally, using your institutional email address — which is followed by a quick verification step. Access to the growing library of new features involves clicking on “Add-Ons,” and then “Browse All Add-Ons.”

Ibrahimovic said some of the newer Add-On tools can:

Import documents from Google Drive, Dropbox, WeTransfer, and Mediafire.
Convert email files (EML and MSG formats) into PDFs.
Pull data from websites with its Scraper function, which can also automatically download and index newly uploaded documents from your target site.
Detect and display poorly redacted text.
Back up projects to The Internet Archive.
Bulk-edit a large set of documents.
Transcribe audio files — including YouTube — and automatically upload transcriptions to your account.
Extract tables within PDFs using a Tabula-based tool.
Recognize and highlight PII terms, such as phone numbers, social security information, and physical addresses.

For several attendees, this last function, the “PII Detector,” was the most exciting, partly because it can instantly provide a database of contact details for potential sources from a massive sheaf of court filings or audit reports.

Laura Corley, an investigative reporter at nonprofit The Macon Newsroom in the US state of Georgia, said new Add-Ons had already proved essential for her research into racial and economic equity at two local charter schools. Minutes of meetings posted by the school governing boards, she said, ran to hundreds of pages, and rarely listed the topics under discussion in headings.

“Without knowing specifically when certain business items were discussed, it could take hours or even days to find the right documents,” she explained. “The DocumentCloud Scraper Add-On allowed me to cull all of the meeting minutes from both websites within minutes. I was able to keyword search a decade’s worth of meeting notes to locate the information.”

“It yielded more results than expected, and gave me further context,” she added.
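The core of the scraping step Corley describes, collecting document links from a page and keeping only those not seen before, can be sketched in Python with the standard library. This is an illustration under stated assumptions rather than the Scraper Add-On's code: the class name, the `new_pdf_links` helper, and the example.org URLs are all hypothetical, and a real scraper would also fetch the page and handle errors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of links that point to PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def new_pdf_links(html, base_url, seen):
    """Return PDF links found in html that are not already in seen."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    # dict.fromkeys drops duplicate links while keeping page order.
    return [url for url in dict.fromkeys(parser.links) if url not in seen]


page = (
    '<a href="/minutes/2023-01.pdf">January</a>'
    '<a href="/minutes/2023-02.pdf">February</a>'
    '<a href="/about">About</a>'
)
already_indexed = {"https://example.org/minutes/2023-01.pdf"}
print(new_pdf_links(page, "https://example.org/board/", already_indexed))
```

Running a check like this on a schedule and remembering what has already been collected is what lets a scraper pick up newly posted documents without re-downloading old ones.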

Summing up, Ibrahimovic said: “Collectively, we think these features really lower the barrier of entry for deep document analysis for journalists and researchers with limited resources.”

Additional Resources

Free, Game-Changing Data Extraction Tools that Require No Coding Skills

Testing the Potential of Using ChatGPT to Extract Data from PDFs

New Investigative Tools for Monitoring Social Media Platforms

Rowan Philp is a reporter for GIJN. He was formerly chief reporter for South Africa’s Sunday Times. As a foreign correspondent, he has reported on news, politics, corruption, and conflict from more than two dozen countries around the world.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License
