I convert a ton of text documents like PDFs to spreadsheets. It’s tedious and expensive work. So every time a new iteration of AI technology arrives, I wonder if it’s capable of doing what so many people ask for: to hand off a PDF, ask for a spreadsheet, and get one back. After throwing a couple of programming problems at OpenAI’s ChatGPT and getting a viable result, I wondered if we were finally there.
Back when OpenAI’s GPT-3 was the hot new thing, I saw Montreal journalist Roberto Rocha attempt a similar test. The results were lackluster, but ChatGPT, OpenAI’s newest model, has several improvements that make it better suited to extraction: It’s 10 times larger than GPT-3 and is generally more coherent as a result, it’s been trained to explicitly follow instructions, and it understands programming languages.
To test how well ChatGPT could extract structured data from PDFs, I wrote a Python script (which I’ll share at the end!) to convert two document sets to spreadsheets:
- A 7,000-page PDF of New York data breach notification forms. There were five different forms, bad OCR, and some freeform letters mixed in.
- 1,400 memos from internal police investigations. These were completely unstructured and contained emails and document scans. Super messy.
My overall strategy was the following:
- Redo the OCR, using the highest quality tools possible. This was critically important because ChatGPT refused to work with poorly OCR’d text.
- Clean the data as well as I could, maintaining physical layout and removing garbage characters and boilerplate text.
- Break the documents into individual records.
- Ask ChatGPT to turn each record into JSON.
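To make the cleaning and record-splitting steps concrete, here is a minimal Python sketch. It is not the script used for this project: the `RECORD_START` marker, the function names, and the cleanup rules are all assumptions for illustration, and real documents would need their own delimiter and boilerplate rules.

```python
import re

# Hypothetical marker for the first line of each record; an assumption,
# not taken from the actual documents.
RECORD_START = re.compile(r"^BREACH NOTIFICATION FORM", re.MULTILINE)


def clean_text(text: str) -> str:
    """Remove non-printable characters while keeping the line layout."""
    # Keep printable characters plus newlines and tabs.
    kept = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Strip trailing whitespace on each line.
    lines = [line.rstrip() for line in kept.split("\n")]
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines))


def split_records(text: str, start_pattern: re.Pattern = RECORD_START) -> list[str]:
    """Split text into records, each beginning at a match of start_pattern.

    Anything before the first match (a cover page, for example) is dropped.
    """
    starts = [m.start() for m in start_pattern.finditer(text)]
    if not starts:
        # No marker found: treat the whole text as one record, if non-empty.
        return [text.strip()] if text.strip() else []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```

Splitting on a known first line is only one option. Documents without a reliable marker might need page-based splitting or manual boundaries instead.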
I spent about a week familiarizing myself with both datasets and doing all this preprocessing. Once that’s done, getting ChatGPT to convert a piece of text into JSON is really easy. You can paste in a record and say “return a JSON representation of this” and it will do it. But doing this for multiple records is a bad idea because ChatGPT will invent its own schema, using randomly chosen field names from the text. It will also decide on its own way to parse values. Addresses, for example, will sometimes end up as a string and sometimes as a JSON object or an array, with the constituent parts of an address split up.
Prompt design is the most important factor in getting consistent results, and your language choices make a huge difference. One tip: Figure out what wording ChatGPT uses when referring to a task and mimic that. (If you don’t know, you can always ask: “Explain how you’d _____ using _______.”)
Because ChatGPT understands code, I designed my prompt around asking for JSON that conforms to a given JSON schema. This was my prompt:
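The exact prompt is not reproduced in this text, so what follows is not it. As a rough illustration of the schema-based approach, here is a minimal Python sketch that assembles a prompt from a record and a JSON schema. The schema fields, the wording, and the `build_prompt` name are all hypothetical.

```python
import json

# Hypothetical schema for illustration only; these field names are
# assumptions, not the ones used in the project.
SCHEMA = {
    "type": "object",
    "properties": {
        "organization_name": {"type": "string"},
        "date_of_breach": {"type": "string"},
        "number_affected": {"type": "integer"},
    },
    "required": ["organization_name"],
}


def build_prompt(record_text: str, schema: dict = SCHEMA) -> str:
    """Combine one record and a JSON schema into a single prompt string."""
    return (
        "Text document:\n"
        f"{record_text.strip()}\n\n"
        "JSON schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        "Return a single JSON object that conforms to the JSON schema above, "
        "using only values found in the text document. "
        "Omit fields that are not present in the text."
    )
```

Pinning the field names and types in the prompt is what keeps the model from inventing a new schema for every record.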
I tried to extract a JSON object from every response and run some validation checks against it. Two checks were particularly important: 1) making sure the JSON was complete, not truncated or broken, and 2) making sure the keys and values matched the schema. I retried if the validation check failed, and usually I’d get valid JSON back on the second or third attempts. If it continued to fail, I’d make a note of it and skip the record. Some records ChatGPT just doesn’t like.
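The validate-and-retry loop described above can be sketched in Python as follows. This is an assumption-laden illustration rather than the project's code: `ask_model` stands in for whatever function sends a record to ChatGPT and returns its text reply, and `FIELDS` is a hypothetical, simplified stand-in for a full JSON schema check.

```python
import json
from typing import Callable, Optional

# Hypothetical, simplified schema: expected field name -> Python type.
FIELDS = {"organization_name": str, "number_affected": int}


def extract_json_object(response: str) -> Optional[dict]:
    """Return the first complete JSON object found in the response, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            value = None  # truncated or broken JSON at this position
        if isinstance(value, dict):
            return value
        start = response.find("{", start + 1)
    return None


def matches_schema(obj: dict, fields: dict = FIELDS) -> bool:
    """Check that the keys match exactly and each value has the expected type."""
    if set(obj) != set(fields):
        return False
    return all(isinstance(obj[key], kind) for key, kind in fields.items())


def extract_with_retries(
    record_text: str,
    ask_model: Callable[[str], str],  # assumed: sends text, returns the reply
    max_attempts: int = 3,
) -> Optional[dict]:
    """Ask until a reply passes both checks; return None after max_attempts."""
    for _ in range(max_attempts):
        obj = extract_json_object(ask_model(record_text))
        if obj is not None and matches_schema(obj):
            return obj
    return None  # the caller can note the failure and skip this record
```

A `None` result corresponds to the records that get noted and skipped. A real schema check would also need to handle nested objects and optional fields.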
Impressively, ChatGPT built a mostly usable dataset. At first glance, I even thought I had a perfectly extracted dataset. But once I went through the pages and compared values, I started to notice errors. Some names were misspelled. Some were missing entirely. Some numbers were wrong.
The errors, although subtle and relatively infrequent, were enough to prevent me from doing the basic analyses that most data journalists want to do. Averages, histograms, mins and maxes were out.
But for my projects, the mistakes were tolerable. I wanted to find big players in the breach database, so I didn’t care if some of the names were wrong or if some numeric values were off by a zero. For the police data, I was basically looking for a summary to identify certain incidents and the individuals involved. If I missed something, it would be OK.
Overall, these are the types of errors ChatGPT introduced:
- ChatGPT hallucinated data, meaning it made things up, often in subtle and hard-to-detect ways. For example, it turned “2222 Colony Road, Moorcroft” (note the “r”) into “2222 Colony Road, Mooncroft.” The word “Mooncroft” (with an “n”) doesn’t appear anywhere in the text. ChatGPT seemed to be making a connection between the words colony and moon. How quaint.
- It stumbled on people’s names and assumed gender. Some forms had a “salutation” field, which seemed to cause ChatGPT to add salutations (“Miss,” “Mr”) when inappropriate and omit them even when given (“Dr” and “Prof”). It also failed to use the correct name when multiple names appeared in a record, preferring whichever came last.
- ChatGPT remembered previous prompts, causing mixups. Occasionally it would use a name or a business entity from an earlier record, despite a perfectly valid one appearing in the current record’s text. For example, in one record it used the names of a lawyer and law firm last seen 150 and 30 pages earlier, respectively. This problem forced me to make sure names and entities actually existed in the current record.
- Words it thought were typos got “corrected.” Usually this was helpful, but sometimes it introduced an error. This was particularly problematic with email addresses.
- Errors were scattered seemingly randomly throughout the data. While certain columns contained more errors than others, all columns had error rates ranging from 1% to upwards of 6%. The errors were scattered across rows, too. Combined, this meant that I’d basically need to check every row against its source record to get a fully valid dataset, which is the very work I was trying to avoid in the first place.
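Several of these errors, including hallucinated words and names carried over from earlier records, share one symptom: the extracted value doesn't appear in the record's own text. Here is a minimal Python sketch of that kind of check. The function names and the whitespace and case normalization are assumptions, not the project's actual validation code.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(extracted: dict, record_text: str) -> list[str]:
    """Return keys whose string values do not appear in the record text."""
    haystack = _normalize(record_text)
    flagged = []
    for key, value in extracted.items():
        # Only non-empty strings are checked; numbers and nested values
        # would need their own comparison rules.
        if isinstance(value, str) and value.strip():
            if _normalize(value) not in haystack:
                flagged.append(key)
    return flagged
```

For example, `ungrounded_fields({"address": "2222 Colony Road, Mooncroft"}, "Address: 2222 Colony Road, Moorcroft")` flags `address`. Values the model legitimately reformatted, such as dates, would be flagged too, so the output is best treated as a list of fields to review by hand rather than a verdict.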
Problems with large language models have been well documented by now. Even with the great advances in ChatGPT, some of them reared their heads in my experiments. Attempts to ignore these problems and shovel ChatGPT-derived work directly to readers will inevitably lead to disastrous failures.
Will ChatGPT Revolutionize Data Journalism?
I don’t think so, for three reasons:
- No, for technical reasons: Working with ChatGPT via OpenAI’s API is painfully slow. It took nearly three weeks to extract approximately 2,500 records from the data breach PDF alone. This is even more significant considering I started this project before ChatGPT hit the mainstream and was able to use it for two weeks before rate limiting was imposed. The API is also unreliable and exhibits frequent downtime and interruptions, although this may improve in the future.
- No, for economic reasons: With ChatGPT I’m convinced we’re trading one form of manual labor for another. We’re trading programming and transcription for cleaning, fact-checking, and validation. Because any row can potentially be incorrect, every field must be checked in order to build confidence. In the end, I’m not convinced we save much work.
- No, for editorial reasons: The problems with data hallucination and other mixups restrict this approach to internal or journalist-facing uses, in my opinion. It’s a better tip generator than story generator. Putting ChatGPT at the end of a journalistic workflow risks trading credibility for speed and quantity.
The totality of these problems makes most uses of ChatGPT editorially impractical, especially at scale. But I think it still has a place. For small, under-resourced newsrooms that need to turn a small PDF into a table, this could be workable (Hey ChatGPT, can you turn this text into an array of JSON objects?).
Some PDFs are also just so messy and non-uniform that writing an extraction script is too time consuming. I’ve had countless projects die due to problems like that. ChatGPT extraction has the potential to breathe life into such projects.
ChatGPT extraction could also serve well as an exploratory tool or a lead generator, in use cases where mistakes and missing values are tolerable, or speculative situations where you want to get a taste of the data before sinking weeks into a real cleanup and analysis.
Try It Yourself
I made my ChatGPT extractor script available on GitHub. Maybe you have a troublesome data project and want to try this for yourself. Or maybe you want to see the possibilities and limitations face-to-face. I’m secretly hoping someone will finally crack the FCC TV and cable political ad disclosure dataset, closing the chapter left open since ProPublica’s Free The Files project.
Either way, I have a feeling we’ll be reporting on and using this technology for some time to come. And the best way to get acquainted with any technology is to use it.
Brandon Roberts is an independent data journalist specializing in open source and bringing computational techniques to journalism projects.