Source Protection: Sensitive Document Checklist

by Ted Han and Quinn Norton • June 19, 2017

Extraordinary documentation can make for an extraordinary story — and terrible trouble for sources and vulnerable populations if handled without enough care. Recently, The Intercept published a story about a leaked NSA report, posted to DocumentCloud, that alleged Russian hacker involvement in a campaign to phish American election officials.

Simultaneously, the FBI arrested a government contractor, Reality Winner, for allegedly leaking documents to an online news outlet. The affidavit partially revealed how the FBI identified Winner, including details such as a postmark and physical characteristics of the document that The Intercept posted.

Russian Spearphishing: DocumentCloud view of leaked NSA report.

The Intercept isn’t alone in leaving digital footprints in their article material. In a post called “We Are with John McAfee Right Now, Suckers,” Vice posted a picture of then-fugitive John McAfee, complete with GPS coordinates pinpointing their source’s location; McAfee was in official custody shortly afterward. In 2014, The New York Times improperly redacted an NSA document from the Snowden trove, revealing the name of an NSA agent.

The first step with any sensitive material is to consider what will happen when the subjects or the public see that material. It can be hard to pause in the rush of getting a story out, but giving some thought to the nature of the information you’re releasing, what needs to be released, what could be used in unexpected ways and what could harm people can prevent real problems.

A Checklist for Sensitive Documents

Removing potentially harmful information from documents is difficult. To make it a little easier, DocumentCloud is creating a checklist of what to think about when making a sensitive document public. But even when the material isn’t on DocumentCloud, this checklist can help reporters and news organizations protect their sources, or other vulnerable people, from getting hurt by the materials posted along with a story.

✔ Have you scrubbed the document metadata?

Many modern file formats contain metadata to support popular features. If you’ve used track changes or geotagged a photo, you’ve created metadata that can persist invisibly in a file and may reveal details about vulnerable people or sources. Beyond those two examples, most modern file formats carry some form of metadata, from email headers to the ID3 tags embedded in MP3s. It can seem daunting, but a search on the formats of the files you have plus the word “metadata” can help you find tools to analyze and, if needed, remove metadata.
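As one illustration of checking a file for embedded metadata, the sketch below uses only Python's standard library to report whether a JPEG carries an EXIF block at all. The function name `jpeg_has_exif` is hypothetical, and the check is deliberately minimal: it walks the segment headers that precede the image data and looks for an APP1 segment tagged `Exif`. It does not decode GPS fields or remove anything, so a `True` result is only a prompt to inspect the file with a dedicated tool.

```python
import struct


def jpeg_has_exif(data: bytes) -> bool:
    """Return True if the JPEG bytes contain an APP1 segment marked as EXIF."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # unexpected byte; stop rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:
            break  # start of scan: compressed image data follows
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            break  # malformed segment length
        payload = data[pos + 4:pos + 2 + length]
        # APP1 (0xE1) segments that begin with "Exif\0\0" hold EXIF metadata.
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False
```

To try it on a local file (the filename here is a placeholder), read the bytes and pass them in: `jpeg_has_exif(open("photo.jpg", "rb").read())`. Because the file never leaves your machine, this avoids uploading sensitive material to an online viewer.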

A few examples…

Microsoft Word documents: These documents may contain a few types of hidden information. Here’s a primer.
Images: EXIF is the metadata attached to digital photos. There are quite a few free online EXIF viewers, but if you can’t afford to upload sensitive material, you can also view EXIF data on your own machine via these browser plugins for Firefox and Chrome.
PDFs: Here’s an overview of PDF properties and metadata. In DocumentCloud’s case, its platform will convert images, Word and Excel documents, and HTML pages into PDFs. In these conversions, DocumentCloud removes the metadata from the original when creating the PDF. However, DocumentCloud currently does not remove metadata from documents uploaded directly as PDFs.
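For Word documents in particular, some of the hidden information can be listed without opening the file in Word. A `.docx` file is a ZIP archive, and author-style properties are normally stored in a part named `docProps/core.xml`. The sketch below, with the hypothetical helper name `docx_core_properties`, reads that part and returns whatever non-empty properties it finds. It is a quick look, not a scrubber: it does not examine tracked changes or comments, which are stored in other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET


def docx_core_properties(path_or_file):
    """Return the non-empty core properties (author, last editor, ...) of a .docx."""
    with zipfile.ZipFile(path_or_file) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))

    properties = {}
    for element in root:
        # Tags look like "{namespace}creator"; keep only the local name.
        name = element.tag.rsplit("}", 1)[-1]
        if element.text and element.text.strip():
            properties[name] = element.text.strip()
    return properties
```

Calling `docx_core_properties("draft.docx")` (a placeholder filename) returns a dictionary such as creator and last-modified-by values when they are present, which is often enough to show whether a name is travelling with the document.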

✔ Have you checked for identifiers?

Identifiers may include:

Printer dots
Watermarks
Text/font variations
Unusual spacing

Along the Dotted Line: An example of printer microdots.

Documents can be modified to allow the author to track a document’s life after creation. The oldest technique for doing this is a faint print on the paper: the traditional watermark. With digital documents, variations in text, spacing, spelling or even phrasing can allow an author to create versions that link back to specific people or groups of people in order to investigate the origin of a potential leak. Additionally, printers can “sign” paper documents, adding physical metadata through microdots that are printed directly on the page and are barely visible to the human eye.

Defeating these techniques requires a careful inspection of the documents, looking for telltale signs and modifying the document to obscure its origin. Sometimes, recreating the document may be necessary, but that’s a judgement call that you have to make on a case-by-case basis. Inspection is never foolproof, but spotting and correcting the spacing, spelling, and physically identifying features of a document can go a long way toward mitigating danger to the people who would become vulnerable once a document is published.
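Part of that inspection can be assisted by a script when the document's text is available. The sketch below is an assumption about what "unusual spacing" might look like in practice: it flags runs of multiple spaces and characters that render as blank or invisible but are not an ordinary space (for example a no-break space or a zero-width space). The function name `find_unusual_spacing` is hypothetical, and a clean result does not prove a document is free of identifiers; spelling or wording variations need a human reader or a comparison against another copy.

```python
import re
import unicodedata


def find_unusual_spacing(text):
    """Return sorted (line, column, description) tuples for suspicious spacing."""
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            category = unicodedata.category(ch)
            # Zs: space separators; Cf: invisible formatting characters.
            if category in ("Zs", "Cf") and ch != " ":
                name = unicodedata.name(ch, "U+%04X" % ord(ch))
                findings.append((line_no, col, name))
        # Runs of two or more ordinary spaces can also encode a pattern.
        for match in re.finditer(r" {2,}", line):
            description = "run of %d spaces" % len(match.group())
            findings.append((line_no, match.start() + 1, description))
    findings.sort()
    return findings
```

Each tuple gives a 1-based line and column, so the flagged spots can be checked by hand in the original before deciding whether to normalize the spacing or recreate the document.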

✔ Have you accounted for other information that could reveal vulnerable people when combined with this document?

In considering the newsworthiness of a document, it’s also worth considering what will happen when the public or subjects of a document see that document. Sometimes details that aren’t personally identifying on their own can be patched together with other publicly available information, in articles or public webpages, and reveal identities or unintentional details.

It’s hard to know in advance if this is possible, but it’s worth taking some time to consider. Uniquely identifying information, such as geographical or life details, can often narrow down an anonymous person quickly. Harassers (or worse) can find vulnerable people.
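A toy calculation can make that narrowing concrete. In the sketch below, every record and field name is invented for illustration; the hypothetical function `remaining_candidates` applies one published detail at a time to a small list and counts how many entries still match.

```python
def remaining_candidates(people, published_details):
    """Count the records that still match after each detail is applied in turn."""
    matches = list(people)
    counts = []
    for field, value in published_details:
        matches = [person for person in matches if person.get(field) == value]
        counts.append((field, len(matches)))
    return counts


# Entirely hypothetical records, used only to show the effect.
staff = [
    {"office": "north", "joined": 2015, "role": "analyst"},
    {"office": "north", "joined": 2015, "role": "clerk"},
    {"office": "north", "joined": 2016, "role": "analyst"},
    {"office": "south", "joined": 2015, "role": "analyst"},
    {"office": "south", "joined": 2016, "role": "clerk"},
]
details = [("office", "north"), ("joined", 2015), ("role", "analyst")]

print(remaining_candidates(staff, details))
# [('office', 3), ('joined', 2), ('role', 1)]
```

None of the three details identifies anyone alone, yet together they leave a single record in this made-up list. Running the same thought experiment against what a document and its accompanying story disclose is one way to judge which details to withhold.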

✔ Is the document properly redacted?

Documents can contain sensitive content which you wish to redact. These could be addresses, phone numbers, personally identifying information or information which could reveal a source. There are a number of redaction tools, DocumentCloud included, which will expunge text and visible content in a document. But it is important to understand how your redaction tools work, and to verify the results. It’s not enough to draw black boxes over digital text — the text itself must be expunged from the document.

For example, DocumentCloud will remove a digital page from a PDF, and replace that page with an image snapshot of that page. DocumentCloud will then use optical character recognition (OCR) on the image, and use the resulting text in the document. This ensures that there is no way for the text which you wish to remove to become inadvertently included in your document. In DocumentCloud, you can check the results by clicking on the text tab in the viewer, as well as checking the original document link.

Whatever tool you use, read its instructions and double-check the redactions before the document is made public.
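One way to double-check is to search the text layer of the redacted file for the strings you meant to remove. The sketch below assumes you have already extracted that text with whatever tool you use (for instance by copying it from a text view), and the helper name `find_unredacted_terms` is hypothetical. It ignores case and collapses whitespace so that a name split across a line break is still caught.

```python
import re


def _normalize(text):
    # Lowercase and collapse whitespace so line breaks or odd spacing
    # cannot hide a match.
    return re.sub(r"\s+", " ", text).strip().lower()


def find_unredacted_terms(extracted_text, sensitive_terms):
    """Return the sensitive terms that still appear in the extracted text."""
    haystack = _normalize(extracted_text)
    leaked = []
    for term in sensitive_terms:
        needle = _normalize(term)
        if needle and needle in haystack:
            leaked.append(term)
    return leaked
```

An empty list means none of the listed terms were found in the text you supplied, nothing more. OCR can misread characters and the list of terms may be incomplete, so this complements a visual check of the published file rather than replacing it.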

✔ Is the document the minimum needed for the story?

Publishing only what the story needs, in content and context, minimizes the possibility of harm and focuses reader attention on what matters the most.

It’s our hope that by following this checklist, and thinking carefully about how the document will be perceived and used in public, journalists can maximize the effectiveness of the evidence that supports their stories while minimizing the harm to sources and bystanders.

This post first appeared on Source, an Open News website, and is cross-posted here with permission. It has also been translated into Arabic and Russian by GIJN.

Ted Han is director of technology for @DocumentCloud. He studied computational linguistics and has worked in technology and startups for more than a decade. He was a participant in the Knight Mozilla Journalism Challenge and has worked on DataMapper, Merb and a variety of data-based projects.

Quinn Norton is a technology journalist who started studying hackers in 1995. She has been published in Wired, The Atlantic and Maximum PC, and covers science, copyright law, robotics, body modification and medicine, but no matter how many times she tries to leave she always comes back to hackers.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License


