Passports often contain a treasure trove of information — and can help reporters trying to investigate the ownership of offshore companies and trusts. Image: Shutterstock

Passports Are Key to Uncovering Offshore Secrecy — We Use Machine Learning to Find Them Efficiently

by Clément Doumouro • May 14, 2025

Passports are often the key to unlocking secret ownership of offshore companies and trusts, but finding them among millions of leaked documents can be challenging. To streamline this process, we partnered with machine learning (ML) scientists from the AI Journalism Resource Center at OsloMet University and the Norwegian public broadcaster NRK to develop a passport detection tool.

This tool partly automates the identification of documents containing passports and extracts key information — such as the name of the passport holder, the nationality, and date of birth. This data is all held in the Machine Readable Zone (MRZ) — the two lines of text, letters, and code on the bottom of a passport photo page.

Being able to search that information lets data journalists validate their findings and share them efficiently with reporters, generating higher-quality leads than before.

How Does Passport Detection Help Investigative Reporters?

Passports are often a critical piece of the jigsaw for investigative reporters digging into the ownership of offshore companies and trusts, and frequently the missing link when tracing who owns hidden entities in secrecy jurisdictions. Journalists regularly use them to identify the end clients of offshore service providers during investigative projects involving massive data leaks, as in ICIJ’s collaborative investigations exposing offshore secrecy, such as the Pandora Papers, Panama Papers, and Paradise Papers.

In the Pandora Papers investigation, ICIJ and its media partners sifted through millions of documents to unearth the linkages between offshore companies, trusts, and the people connected to them across dozens of countries. In the first months of the investigation, ICIJ’s data team worked on providing lists of client names to media partners so that journalists could efficiently find leads.

The team reviewed thousands of pages of corporate records to eventually identify offshore dealings of 36 current and former world leaders and more than 300 other current and former public officials and politicians around the world. Passports were an important part of the puzzle.

“Passports inside large document leaks are an invaluable resource for finding individuals of public interest and for parceling out work among partners from many different countries,” explains Agustin Armendariz, senior data reporter at ICIJ. “Country lists of passport owners and beneficial owners are often the best starting point for reporters new to a leak to begin searching for a story relevant to their audience.”

However, locating and reviewing such documents is often a daunting task, akin to finding a needle in a haystack.

Why Can Passport Detection Be a Hassle?

The machine readable zone, highlighted in red in this mock-up, contains key passport data, such as the name of the passport holder, the nationality, and date of birth. Image: ICIJ

While powerful, our previous workflow for identifying passports among massive numbers of leaked documents proved both cumbersome and, at times, unreliable.

To find passports, journalists used Datashare, ICIJ’s open source search engine for documents linked to a particular investigation, and ran two kinds of search queries. They either searched for terms commonly found inside passports (country names, “date of expiration,” “place of birth,” “passport no.,” “visa…” and so on) or hunted for common passport file names, such as “passport.pdf” or “passport.jpg.”

To improve accuracy, they could restrict the search to images and PDF file types, which are typically used to store passport scans, but both methods lacked robustness and were inefficient.

Many passports were missed because files containing passports are not always explicitly named as such. In addition, because scan quality varies, text extraction in Datashare (which is powered by OCR, or Optical Character Recognition) can struggle to read the passport text correctly, reducing the number of search matches. Finally, many of the passport keyword terms also appear in non-passport documents, triggering false positive matches.
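The two failure modes described above can be reproduced with a toy version of the keyword and file-name heuristic. This is only an illustrative sketch: the patterns, the function name `looks_like_passport`, and the three sample documents are invented for the example and do not reflect Datashare’s actual query syntax or any real leaked file.

```python
import re

# Hypothetical patterns, loosely based on the search terms quoted in the article.
KEYWORDS = ("date of expiration", "place of birth", "passport no")
FILENAME_PATTERN = re.compile(r"passport", re.IGNORECASE)


def looks_like_passport(filename: str, extracted_text: str) -> bool:
    """Flag a document if its file name or its OCR text matches a passport pattern."""
    text = extracted_text.lower()
    name_match = bool(FILENAME_PATTERN.search(filename))
    keyword_match = any(keyword in text for keyword in KEYWORDS)
    return name_match or keyword_match


# Invented documents: (file name, OCR text, is it really a passport?)
documents = [
    ("passport.pdf", "PASSPORT NO. 123 PLACE OF BIRTH", True),
    ("scan_0042.jpg", "P4SSP0RT N0 I23 PLACF 0F B1RTH", True),  # generic name, poor OCR
    ("visa_form.pdf", "Applicant place of birth and passport no.", False),
]

for filename, text, is_passport in documents:
    flagged = looks_like_passport(filename, text)
    print(f"{filename}: flagged={flagged}, actual passport={is_passport}")
```

The second document is a real passport that the heuristic misses, because neither its name nor its garbled OCR text matches. The third is a false positive, because an ordinary form mentions the same terms.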

Identifying passports using this workflow often required weeks of careful review, scrolling through thousands of document pages hoping to find actual passport scans.

How Did Our Machine Learning Collaboration with Researchers Help?

The goal of our collaboration with OsloMet and NRK was to leverage state-of-the-art Computer Vision algorithms to speed up and partly automate the passport detection process.

How does it work? Files are first converted into images, and passport scans are then detected in those images using the open-source YOLO object detection model. When a passport is detected, the tool reads its Machine Readable Zone (MRZ) using a tailored OCR, capturing essential details such as the passport holder’s name, date of birth, passport number, country, and date of expiry.
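To give a sense of what reading an MRZ involves once the OCR has produced its two lines of text, here is a minimal sketch of a parser for the standard passport layout (ICAO Doc 9303, format TD3: two lines of 44 characters, with check digits computed using repeating weights 7, 3, 1). It is not ICIJ’s code, whose internals are not described in this article. The names `check_digit`, `MrzRecord`, and `parse_td3` are hypothetical, and the sample lines are the fictional “Utopia” specimen used in the ICAO standard. Note that the standard layout stores an expiry date.

```python
from dataclasses import dataclass


def check_digit(field: str) -> int:
    """ICAO 9303 check digit: repeating weights 7, 3, 1, sum taken modulo 10."""
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch == "<":
            value = 0  # filler character
        else:
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        total += value * (7, 3, 1)[i % 3]
    return total % 10


@dataclass
class MrzRecord:
    surname: str
    given_names: str
    issuing_country: str
    passport_number: str
    nationality: str
    birth_date: str   # YYMMDD
    expiry_date: str  # YYMMDD
    checks_ok: bool   # False suggests an OCR error somewhere in line 2


def parse_td3(line1: str, line2: str) -> MrzRecord:
    """Split the two 44-character MRZ lines of a passport into named fields."""
    if len(line1) != 44 or len(line2) != 44:
        raise ValueError("a TD3 MRZ has two lines of 44 characters")
    surname, _, given = line1[5:].partition("<<")
    number, birth, expiry = line2[0:9], line2[13:19], line2[21:27]
    checks_ok = (
        str(check_digit(number)) == line2[9]
        and str(check_digit(birth)) == line2[19]
        and str(check_digit(expiry)) == line2[27]
    )
    return MrzRecord(
        surname=surname.replace("<", " ").strip(),
        given_names=given.replace("<", " ").strip(),
        issuing_country=line1[2:5],
        passport_number=number.rstrip("<"),
        nationality=line2[10:13],
        birth_date=birth,
        expiry_date=expiry,
        checks_ok=checks_ok,
    )


# Fictional specimen from the ICAO standard (country code UTO, "Utopia").
line1 = "P<UTOERIKSSON<<ANNA<MARIA".ljust(44, "<")
line2 = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
record = parse_td3(line1, line2)
print(record)
```

The check digits are what make the MRZ useful for automated extraction: a failed check is a cheap signal that the OCR misread a character and that the prediction deserves closer human review.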

Accurately extracting passport information and ensuring no passport is missed was still challenging. The YOLO model, originally trained for generic object detection, had to be fine-tuned to detect passports specifically.

The OsloMet and NRK team spent months reviewing and annotating documents shared by ICIJ, training models, and calibrating detection thresholds for optimal performance. Using a large and diverse dataset of passport images, we estimated that the model recovers 100% of the passport pages found inside documents (its recall) with a precision of 86%: only 14% of the images classified as passports are false positives.
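The trade-off behind threshold calibration can be illustrated with a small sketch. Assuming each image gets a detection score and that a set of annotated images is available, one simple strategy is to pick the highest threshold that still catches every annotated passport, then measure the precision that results. The scores and labels below are invented, and the function names are hypothetical; the article does not describe the team’s actual calibration procedure.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold counts as a detection."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def highest_threshold_with_full_recall(scores, labels):
    """Largest threshold that still detects every annotated passport."""
    return min(s for s, y in zip(scores, labels) if y)


# Invented validation data: detection scores and whether the image is a passport.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.20]
labels = [True, True, True, False, True, False, False]

threshold = highest_threshold_with_full_recall(scores, labels)
print(threshold, precision_recall(scores, labels, threshold))
print(0.7, precision_recall(scores, labels, 0.7))
```

On this toy data, keeping recall at 100% means accepting one false positive (precision 0.8), while a stricter threshold of 0.7 removes the false positive but misses a real passport. Favoring recall fits a workflow where humans review every prediction afterwards: a false positive costs a few seconds of review, whereas a missed passport may never be found.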

How Did We Integrate This Into Our Workflow?

To integrate the researchers’ model into our investigation workflows, ICIJ turned it into a fully-fledged service that can be deployed on servers and processes up to 500 document pages per minute on a machine with a GPU with 16GB of memory. The service code is open source, while the model is not publicly available for confidentiality reasons.

After the tool identifies potential passports and reads their machine readable zone, predictions are uploaded to the Prophecies fact-checking platform. Team members can review and correct predictions — mock-up documents are reviewed here. Image: ICIJ

When the passport detection is complete, the tool’s predictions are reviewed by ICIJ’s data team using Prophecies, our open source fact-checking platform. Thanks to the model’s high precision, the vast majority of images detected as passports are actual passport scans, making the review more efficient. After validation, ICIJ’s data team tags documents that contain passports in Datashare, allowing them to be used by the reporting team immediately.

So far, we have run the tool on hundreds of thousands of Datashare document pages. The tool has not only made passport detection easier and more efficient, it also lets us perform it more systematically. We are also exploring ways to kickstart automated entity resolution (matching documents to actual individuals) using passport information extracted by the tool.

As an example, during an ongoing investigation, ICIJ’s data team accurately identified around 500 passport scans among more than 110,000 documents. We first used Datashare to narrow the search to 75,000 documents containing images, and the tool then flagged about 1,000 images as passports. Each prediction was reviewed three times by different journalists using Prophecies, through multiple validation rounds.

After removing duplicates and keeping only pages with country information, we ended up with about 500 unique passports and their countries of issuance. The workflow reduced the task of analyzing more than 110,000 documents to just 3,000 targeted reviews (about 1,000 predictions, each reviewed three times). Delegating detection to an algorithm saved precious hours, preserving data journalists’ time for high-value tasks such as fact-checking.
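The final filtering step and the review count can be sketched in a few lines. The article does not say how duplicates were identified, so the key used here (issuing country plus passport number) is an assumption, as are the function name `unique_passports` and the sample records.

```python
def unique_passports(records):
    """Keep one record per (country, passport number); skip records with no country."""
    seen = {}
    for record in records:
        country = record.get("country")
        if not country:
            continue  # country could not be read, so the page is dropped
        key = (country, record["passport_number"])
        seen.setdefault(key, record)  # keep the first scan of each passport
    return list(seen.values())


# Invented validated predictions (fictional country code UTO).
records = [
    {"country": "UTO", "passport_number": "L898902C3", "page": "doc1.pdf#3"},
    {"country": "UTO", "passport_number": "L898902C3", "page": "doc7.pdf#1"},  # same passport
    {"country": None, "passport_number": "X0000000", "page": "doc9.pdf#2"},    # no country
    {"country": "UTO", "passport_number": "A1234567", "page": "doc4.pdf#5"},
]
unique = unique_passports(records)
print(len(unique))  # 2

# Review workload from the figures in the article.
flagged_images = 1_000
reviews_per_prediction = 3
print(flagged_images * reviews_per_prediction)  # 3000
```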

“The passport identification tool is an extremely fast way to sift through large document sets and quickly identify potential passports,” says Armendariz. “Investigators can then quickly identify any people of public interest in the collection of passports as well as identify sections of the leak to comb through by hand.”

For now, the tool’s model is only available to ICIJ staff, members, and partners for security and confidentiality reasons. However, we believe it could benefit other journalistic organizations. We are currently discussing the next steps for sharing and detailing the methodology used to train the tool’s model, which would allow other organizations to train their own models on their own data.

How Do We Maintain Confidentiality and Security?

While very powerful for investigations, passport data is also highly sensitive and, of course, confidential. To protect our sources and the data we receive from them, we take privacy and security very seriously, as we always do when working with leaked data. No data left our infrastructure during the development of the project or while using the tool, and we didn’t rely on third parties. Our infrastructure’s security rests on several pillars, combining technical solutions with user training. ICIJ staff members and partners working on the project are bound by non-disclosure agreements.

Because machine learning models are vulnerable to Membership Inference Attacks, we decided not to publish or share the model’s weights, as doing so could help attackers learn which passports the model was trained on.

Machine Learning With a Human in the Loop

Developing the passport detection tool showed that relying on machine learning with a human in the loop can help reporters efficiently address critical investigative journalism challenges. Collaborating with partners from academia proved to be incredibly fruitful in implementing these state-of-the-art machine learning solutions. We are extremely grateful to our partners for their support and are already collaborating with them — and other academic partners — on new projects, exploring ways to combine data journalists’ expertise with the latest advances in this field.

Clément Doumouro is a machine learning engineer at the International Consortium of Investigative Journalists (ICIJ), where he focuses on integrating machine learning tools and algorithms into Datashare, ICIJ’s search engine, and supporting journalists in analyzing documents during their investigations. He also collaborates with academic researchers to ensure ICIJ benefits from the latest advancements in machine learning. Before joining ICIJ, he studied artificial intelligence and robotics, then worked as a machine learning engineer for Collective Thinking and Sonos, where he specialized in natural language processing and understanding, making sense of human oral and written language.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License

Material from GIJN’s website is generally available for republication under a Creative Commons Attribution-NonCommercial 4.0 International license. Images usually are published under a different license, so we advise you to use alternatives or contact us regarding permission. Here are our full terms for republication. You must credit the author, link to the original story, and name GIJN as the first publisher. For any queries or to send us a courtesy republication note, write to hello@gijn.org.
