Testing the Potential of Using ChatGPT to Extract Data from PDFs

by Brandon Roberts • March 29, 2023

I convert a ton of text documents like PDFs to spreadsheets. It’s tedious and expensive work. So every time a new iteration of AI technology arrives, I wonder if it’s capable of doing what so many people ask for: to hand off a PDF, ask for a spreadsheet, and get one back. After throwing a couple of programming problems at OpenAI’s ChatGPT and getting a viable result, I wondered if we were finally there.

Back when OpenAI’s GPT-3 was the hot new thing, I saw Montreal journalist Roberto Rocha attempt a similar test. The results were lackluster, but ChatGPT, OpenAI’s newest model, has several improvements that make it better suited to extraction: It’s 10 times larger than GPT-3 and is generally more coherent as a result, it’s been trained to explicitly follow instructions, and it understands programming languages.

To test how well ChatGPT could extract structured data from PDFs, I wrote a Python script (which I’ll share at the end!) to convert two document sets to spreadsheets:

- A 7,000-page PDF of New York data breach notification forms. There were five different forms, bad OCR, and some freeform letters mixed in.
- 1,400 memos from internal police investigations. These were completely unstructured and contained emails and document scans. Super messy.

My overall strategy was the following:

1. Redo the OCR, using the highest quality tools possible. This was critically important because ChatGPT refused to work with poorly OCR’d text.
2. Clean the data as well as I could, maintaining physical layout and removing garbage characters and boilerplate text.
3. Break the documents into individual records.
4. Ask ChatGPT to turn each record into JSON.
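The cleaning and splitting steps above can be sketched roughly as follows. This is a minimal illustration, not the author's script: the `RECORD_START` marker is hypothetical and assumes every record begins with a recognizable heading line, which real document sets may not offer.

```python
import re

# Hypothetical marker: assumes each record starts with a line like this.
RECORD_START = re.compile(r"^[ \t]*NOTIFICATION FORM\b", re.MULTILINE)


def clean_text(text: str) -> str:
    """Drop non-printable garbage but keep newlines and tabs (the layout)."""
    kept = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Trailing whitespace is noise; leading whitespace carries layout, so keep it.
    return "\n".join(line.rstrip() for line in kept.split("\n"))


def split_records(text: str) -> list[str]:
    """Split text into records, one per RECORD_START match.

    Anything before the first marker is discarded in this sketch.
    """
    starts = [m.start() for m in RECORD_START.finditer(text)]
    if not starts:
        return [text] if text.strip() else []
    bounds = starts + [len(text)]
    return [text[a:b].strip("\n") for a, b in zip(bounds, bounds[1:])]


raw = (
    "NOTIFICATION FORM\nName:   Acme\x00 Corp\x0c\n"
    "NOTIFICATION FORM\nName:   Beta LLC  \n"
)
cleaned = clean_text(raw)
records = split_records(cleaned)
```

Keeping leading whitespace intact matters here because column alignment is often the only hint of which label a value belongs to.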

I spent about a week familiarizing myself with both datasets and doing all this preprocessing. Once that’s done, getting ChatGPT to convert a piece of text into JSON is really easy. You can paste in a record and say “return a JSON representation of this” and it will do it. But doing this for multiple records is a bad idea because ChatGPT will invent a new schema each time, picking field names from the text unpredictably. It will also decide on its own way to parse values. Addresses, for example, will sometimes end up as a string and sometimes as a JSON object or an array, with the constituent parts of an address split up.

Prompt design is the most important factor in getting consistent results, and your language choices make a huge difference. One tip: Figure out what wording ChatGPT uses when referring to a task and mimic that. (If you don’t know, you can always ask: “Explain how you’d _____ using _______.”)

Because ChatGPT understands code, I designed my prompt around asking for JSON that conforms to a given JSON schema. This was my prompt:

Image: Screenshot of the prompt, OpenNews: Source
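The prompt itself was published only as a screenshot, so its wording is not reproduced here. As a rough sketch of the general idea (asking for JSON that conforms to a given JSON schema), something like the following could work. The schema fields and the instruction text are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema: these field names are illustrative, not the author's.
SCHEMA = {
    "type": "object",
    "properties": {
        "organization_name": {"type": "string"},
        "breach_date": {"type": "string"},
        "affected_count": {"type": "integer"},
    },
    "required": ["organization_name", "breach_date", "affected_count"],
}


def build_prompt(record_text: str, schema: dict) -> str:
    """Combine a fixed instruction, the schema, and one record into a prompt."""
    return (
        "Return a JSON object that conforms to the JSON schema below, "
        "using only values found in the text. Return JSON only.\n\n"
        "JSON schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        "Text:\n"
        f"{record_text}"
    )


prompt = build_prompt("Acme Corp reported a breach on 2020-01-15.", SCHEMA)
```

Sending the same schema with every record is what keeps field names and value types consistent from one response to the next.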

I tried to extract a JSON object from every response and run some validation checks against it. Two checks were particularly important: 1) making sure the JSON was complete, not truncated or broken, and 2) making sure the keys and values matched the schema. I retried if the validation check failed, and usually I’d get valid JSON back on the second or third attempts. If it continued to fail, I’d make a note of it and skip the record. Some records ChatGPT just doesn’t like.
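The two checks and the retry loop described above might look something like this sketch. It assumes a flat schema in which every property is required, and `ask` is a hypothetical stand-in for whatever function sends a prompt to the model and returns its text reply. A full JSON Schema validator would cover far more cases than this hand-rolled check.

```python
import json
from typing import Callable, Optional

TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def extract_json_object(response: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}', or return None."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        return None  # no object at all, or one that was cut off
    try:
        parsed = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def matches_schema(obj: dict, schema: dict) -> bool:
    """Flat check: exactly the expected keys, each value of the expected type."""
    props = schema["properties"]
    if set(obj) != set(props):
        return False
    for key, spec in props.items():
        value = obj[key]
        # bool is a subclass of int in Python, so reject it for other types.
        if isinstance(value, bool) and spec["type"] != "boolean":
            return False
        if not isinstance(value, TYPE_MAP[spec["type"]]):
            return False
    return True


def extract_with_retries(
    ask: Callable[[str], str], prompt: str, schema: dict, max_attempts: int = 3
) -> Optional[dict]:
    """Ask up to max_attempts times; return None so the caller can log and skip."""
    for _ in range(max_attempts):
        obj = extract_json_object(ask(prompt))
        if obj is not None and matches_schema(obj, schema):
            return obj
    return None
```

Note that passing both checks only shows the response is well-formed. It says nothing about whether the values are correct.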

Results

Impressively, ChatGPT built a mostly usable dataset. At first glance, I even thought I had a perfectly extracted dataset. But once I went through the pages and compared values, I started to notice errors. Some names were misspelled. Some were missing entirely. Some numbers were wrong.

The errors, although subtle and relatively infrequent, were enough to prevent me from doing the basic analyses that most data journalists want to do. Averages, histograms, mins and maxes were out.

But for my projects, the mistakes were tolerable. I wanted to find big players in the breach database, so I didn’t care if some of the names were wrong or if some numeric values were off by a zero. For the police data, I was basically looking for a summary to identify certain incidents and the individuals involved. If I missed something, it would be OK.

Overall, these are the types of errors ChatGPT introduced:

- ChatGPT hallucinated data, meaning it made things up, often in subtle and hard-to-detect ways. For example, it turned “2222 Colony Road, Moorcroft” (note the “r”) into “2222 Colony Road, Mooncroft.” The word “Mooncroft” (with an “n”) doesn’t appear anywhere in the text. ChatGPT seemed to be making a connection between the words colony and moon. How quaint.
- It stumbled on people’s names and assumed gender. Some forms had a “salutation” field, which seemed to cause ChatGPT to add salutations (“Miss,” “Mr”) when inappropriate and omit them even when given (“Dr” and “Prof”). It also failed to use the correct name when multiple names appeared in a record, preferring whichever came last.
- ChatGPT remembered previous prompts, causing mixups. Occasionally it would use a name or a business entity from an earlier record, despite a perfectly valid one appearing in the current record’s text. For example, in one record it used the names of a lawyer and law firm last seen 150 and 30 pages earlier, respectively. This problem forced me to make sure names and entities actually existed in the current record.
- Words it thought were typos got “corrected.” Usually this was helpful, but sometimes it introduced an error. This was particularly problematic with email addresses.
- Errors were scattered seemingly randomly throughout the data. While certain columns contained more errors than others, every column had an error rate somewhere between 1% and upwards of 6%. The errors were scattered across rows, too. Combined, this meant that I’d basically need to compare every row against its source record to get a fully valid dataset, which is the very work I was trying to avoid in the first place.
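One of the checks mentioned above, making sure names and entities actually exist in the current record, can be approximated with a plain substring test. The sketch below is an assumption about how such a check could work, not the author's code, and the function names are hypothetical. It catches values that appear nowhere in the record, such as “Mooncroft,” but it cannot tell when the model picked the wrong one of several names that do appear.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't cause false alarms."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(extracted: dict, record_text: str) -> list[str]:
    """Return keys whose string values do not appear in the record text."""
    haystack = normalize(record_text)
    missing = []
    for key, value in extracted.items():
        if isinstance(value, str) and value.strip():
            if normalize(value) not in haystack:
                missing.append(key)
    return missing


record = "Incident at 2222 Colony Road,\nMoorcroft. Contact: Jane Doe"
extracted = {"address": "2222 Colony Road, Mooncroft", "contact": "Jane Doe"}
flagged = ungrounded_fields(extracted, record)
```

Values the model legitimately reformats, such as dates, would be flagged too, so a check like this is best treated as a list of rows to review rather than a verdict.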

Problems with large language models have been well documented by now. Even with the great advances in ChatGPT, some of them reared their heads in my experiments. Attempts to ignore these problems and shovel ChatGPT-derived work directly to readers will inevitably lead to disastrous failures.

Sometimes ChatGPT simply refuses to work with a document and gives a boilerplate response. It responded with concerns about “sensitive information” in both the police memos and the New York data breach datasets, despite them both being public documents. Image: Screenshot, OpenNews: Source

Will ChatGPT Revolutionize Data Journalism?

I don’t think so, for three reasons:

1. No, for technical reasons: Working with ChatGPT via OpenAI’s API is painfully slow. It took nearly three weeks to extract approximately 2,500 records from the data breach PDF alone. This is even more significant considering I started this project before ChatGPT hit the mainstream and was able to use it for two weeks before rate limiting was imposed. The API is also unreliable and exhibits frequent downtime and interruptions, although this may improve in the future.
2. No, for economic reasons: With ChatGPT I’m convinced we’re trading one form of manual labor for another. We’re trading programming and transcription for cleaning, fact-checking, and validation. Because any row can potentially be incorrect, every field must be checked in order to build confidence. In the end, I’m not convinced we save much work.
3. No, for editorial reasons: The problems with data hallucination and other mixups restrict this approach to internal or journalist-facing uses, in my opinion. It’s a better tip generator than story generator. Putting ChatGPT at the end of a journalistic workflow risks exchanging more speed and quantity for less credibility.
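The figures quoted in this article allow a quick back-of-envelope calculation. The sketch below treats “nearly three weeks” as 21 days, and it assumes a hypothetical record with 10 fields whose errors are independent of one another. Neither the field count nor the independence comes from the original experiment.

```python
def records_per_day(records: int, days: float) -> float:
    """Average throughput over the whole extraction run."""
    return records / days


def p_row_has_error(field_error_rate: float, n_fields: int) -> float:
    """Chance that at least one of n independent fields in a row is wrong."""
    return 1 - (1 - field_error_rate) ** n_fields


throughput = records_per_day(2500, 21)  # about 119 records per day
low = p_row_has_error(0.01, 10)         # roughly 10% of rows affected
high = p_row_has_error(0.06, 10)        # roughly 46% of rows affected
```

Under those assumptions, even the low end of the per-column error range leaves about one row in ten with at least one bad value, which is consistent with the conclusion that every field has to be checked.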

The totality of these problems makes most uses of ChatGPT editorially impractical, especially at scale. But I think it still has a place. For small, under-resourced newsrooms that need to turn a small PDF into a table, this could be workable (Hey ChatGPT, can you turn this text into an array of JSON objects?).

Some PDFs are also just so messy and non-uniform that writing an extraction script is too time-consuming. I’ve had countless projects die due to problems like that. ChatGPT extraction has the potential to breathe life into such projects.

ChatGPT extraction could also serve well as an exploratory tool or a lead generator, in use cases where mistakes and missing values are tolerable, or speculative situations where you want to get a taste of the data before sinking weeks into a real cleanup and analysis.

Try It Yourself

I made my ChatGPT extractor script available on GitHub (https://github.com/brandonrobertz/chatgpt-document-extraction). Maybe you have a troublesome data project and want to try this for yourself. Or maybe you want to see the possibilities and limitations face-to-face. I’m secretly hoping someone will finally crack the FCC TV and cable political ad disclosure dataset, closing the chapter left open since ProPublica’s Free The Files project.

Either way, I have a feeling we’ll be reporting on and using this technology for some time to come. And the best way to get acquainted with any technology is to use it.

This article was first published on OpenNews: Source and is reprinted here under a Creative Commons license.

Additional Resources

10 Things You Should Know About AI in Journalism

Journalists’ Guide to Using AI and Satellite Imagery for Storytelling

Beyond the Hype: Using AI Effectively in Investigative Journalism

Brandon Roberts is an independent data journalist specializing in open source and bringing computational techniques to journalism projects.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License


