Beginner’s Guide to Extracting Data from PDFs

by Laura Grant • July 17, 2017

Journalists get lots of data in PDF format — they can be tables of data that are embedded in reports, or spreadsheets that have been thoughtfully saved as PDFs before they’re emailed to you — but until you can get that data into a spreadsheet, there’s not much you can do with it.

Luckily, there are a few great tools that can liberate your data quickly and relatively easily. I’ve listed the ones I’ve tried out here (though there are no doubt loads more out there), along with tips on the more fiddly parts of scraping PDFs: rotated tables, scanned PDFs and password-protected PDFs.

Tabula

I love Tabula. It’s my go-to option, firstly because it’s free, and secondly because it’s really easy to use. Its website says it was created “by journalists for journalists,” which is probably why it’s so popular with non-techie people like me.

I often need to extract tables of data from biggish PDF reports. Tabula lets you upload an entire document and select just the tables you want. Depending on the layout of your document, you can convert one table at a time, or a few at once, into a CSV, TSV or JSON file, which you can import into Google Sheets (free), LibreOffice Calc (free), Excel (not free), or whatever program you prefer.

The only times I don’t go straight to Tabula are when I have PDFs that have been scanned in, or when the tables I want to convert are rotated 90 degrees. But I’ll deal with those later.
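
If you would rather tidy a CSV export with a script before it goes into a spreadsheet, the following is a minimal Python sketch, not part of Tabula itself. It assumes the export has a header row and that commas inside numbers are thousands separators; the function names `clean_number` and `read_table` are made up for illustration.

```python
import csv
import io


def clean_number(cell):
    """Turn a text cell such as '1,234' or ' 56.7 ' into a number.

    Cells that Python cannot parse as a number come back as stripped text.
    """
    text = cell.strip()
    candidate = text.replace(",", "")  # assumption: commas are thousands separators
    try:
        return int(candidate)
    except ValueError:
        pass
    try:
        return float(candidate)
    except ValueError:
        return text


def read_table(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, (clean_number(cell) for cell in row))) for row in body]
```

For a real export you would pass in the file’s contents, for example `read_table(open("tabula-export.csv", encoding="utf-8").read())`, where the filename is a placeholder.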

Cometdocs

This one is also popular with journalists — not least because Investigative Reporters and Editors members get free premium membership — and it’s really easy to use. You can convert up to five documents a week for free, but you have to subscribe if you want to do more. I quite like the fact that you can subscribe for a month at a time for $9.99, but if you really like it you can get a lifetime membership for about $130.

This is how it works: upload or import the PDF you want to convert, click the convert button and choose between Excel and .ODS (which you can open in LibreOffice). Unfortunately, .CSV isn’t an option. If you don’t have either of those spreadsheet packages, you can upload the file to Google Drive and open it in Google Sheets.
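
If you do need a CSV from an .ODS download, a spreadsheet program can save one for you, but it can also be scripted. The sketch below is a rough illustration using only Python’s standard library. It relies on the fact that an .ods file is a zip archive containing a `content.xml`, reads only the first sheet, skips completely empty rows, and ignores repeated rows and cell comments. The function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used inside an .ods file's content.xml
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
CELL_TAGS = {f"{{{TABLE_NS}}}table-cell", f"{{{TABLE_NS}}}covered-table-cell"}


def ods_first_sheet_rows(ods_file):
    """Return the first sheet of an .ods file as a list of rows of strings.

    `ods_file` can be a path or a binary file object.
    """
    with zipfile.ZipFile(ods_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    sheet = root.find(f".//{{{TABLE_NS}}}table")
    if sheet is None:
        return []
    rows = []
    for row in sheet.iter(f"{{{TABLE_NS}}}table-row"):
        cells = []
        for cell in row:
            if cell.tag not in CELL_TAGS:
                continue
            paragraphs = cell.iter(f"{{{TEXT_NS}}}p")
            text = "\n".join("".join(p.itertext()) for p in paragraphs)
            # One XML element can stand for a run of identical cells.
            repeat = int(cell.get(f"{{{TABLE_NS}}}number-columns-repeated", "1"))
            cells.extend([text] * repeat)
        # Drop the padding of empty cells that spreadsheets add to each row.
        while cells and cells[-1] == "":
            cells.pop()
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialise rows of strings as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Calling `rows_to_csv(ods_first_sheet_rows("converted.ods"))` would give you CSV text to save, where `converted.ods` stands in for whatever file you downloaded.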

It works quickly and well, but the really nice thing about Cometdocs is that it does optical character recognition (OCR), so it can convert scanned PDFs. You need to check the converted document against the original, though, just to be sure it picked everything up correctly. Like Tabula, it can’t handle tables that are rotated.

Adobe Export PDF

This one’s not free, but it’s not terribly expensive either, at about $24 a year. If you use Acrobat Reader, Adobe’s free PDF reader, Export PDF lets you convert a PDF document that you’ve opened in it to Excel, Word, PowerPoint or RTF. It works well and quickly with fairly big documents. But, like Tabula, it can’t do scanned documents or rotated tables.

Nitro Pro

If you have a Windows machine, Nitro is a great tool for editing and converting PDFs to useful formats, but it’s not free (about $160) and the fact that it only works with Windows means it’s out of reach for me and my MacBook. I have tried it out on somebody else’s machine, though, and I was suitably impressed.

Acrobat Pro

This one is accessible for Mac users, but it’s also not free (about $15 a month and it requires an annual commitment).

Zanran

This UK-based company has developed software to automate PDF processing. It’s not free, but you can see what it can do by trying out its demo document converter, as long as your document is 1.5MB or smaller. You upload your PDF, tell them what you want it converted to, give them your email address and they’ll email you the converted document.

Zamzar

This is another online conversion tool where you can upload your document, choose the format you want to convert it to and it’ll email the converted document to the email address of your choice.

Rotated Tables

Sometimes the tables in PDF documents have been rotated 90 degrees. You need to be able to rotate the tables back to a normal orientation before any conversion tool will be able to identify them as text. Just rotating the page in Acrobat Reader or Preview, for example, won’t work. You need to rotate the table itself. To do this you need a proper PDF editor such as Acrobat Pro or Nitro Pro.

If you have Acrobat Pro, here’s what you do:

If your tables are part of a larger document, open your document and, using the Organize Pages option, extract the pages with the tables you want to rotate. If you want to extract a number of consecutive pages, it’s easier to extract them into separate files.
Open the page with the table on it. Go to the View menu and rotate until your table is upright.
If there are headers and footers or any other text that is not rotated in the same direction as your table, remove them using the Edit PDF function. You need to delete them; covering them up doesn’t work.
Go to the Enhance Scans option and choose Recognize Text; check the settings to make sure the option “Save as editable text and images” is selected. This may take a few minutes, and when it’s finished your table may be rotated 90 degrees again.
Go back to View and rotate your page until the table is upright again. Then save your file.
You can try to convert your page to an Excel spreadsheet using the Export PDF function, but I find that Tabula generally does the job better.
Always check the converted data against the original documents because sometimes 8s can be mistaken for 6s or Bs. But even if your converted document isn’t absolutely perfect, converting it this way will be much quicker than manually typing it into a spreadsheet.
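
Part of that checking can be scripted once the table is in CSV form. The sketch below is one possible approach, not a substitute for reading the original. It flags cells in numeric columns that are not plain numbers (which catches an 8 read as a B), and it compares each row’s values with a printed total (which can catch an 8 read as a 6). It assumes your table has a total column to check against and that commas are thousands separators; the function and column names are invented for the example.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(cell):
    """Parse a cell like '1,234.5' into a Decimal, or return None if it is not a plain number."""
    try:
        value = Decimal(cell.strip().replace(",", ""))
    except InvalidOperation:
        return None
    return value if value.is_finite() else None


def suspicious_cells(rows, numeric_columns):
    """Return (row_index, column, value) for non-empty cells that should be numbers but are not."""
    flagged = []
    for index, row in enumerate(rows):
        for column in numeric_columns:
            cell = row[column]
            if cell.strip() and to_decimal(cell) is None:
                flagged.append((index, column, cell))
    return flagged


def total_mismatches(rows, value_columns, total_column):
    """Return the indexes of rows whose values do not add up to the stated total."""
    mismatched = []
    for index, row in enumerate(rows):
        values = [to_decimal(row[column]) for column in value_columns]
        total = to_decimal(row[total_column])
        if total is None or None in values:
            continue  # unparseable cells are reported by suspicious_cells
        if sum(values) != total:
            mismatched.append(index)
    return mismatched
```

Here `rows` is a list of dicts of text cells, such as the output of Python’s `csv.DictReader`. Any row either function reports is one to look up in the original PDF.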

Converting Scanned PDFs

In a scanned PDF, a table will be identified as an image rather than text, so if you want to extract the data from a table you first need to convert it to text with something that has optical character recognition (OCR). You can use Cometdocs, Acrobat Pro or Nitro Pro. Acrobat Pro’s Enhance Scans tool should recognize the text in your PDF as long as the quality of the scan isn’t terrible. Sometimes it helps to save a snapshot of the table you want to extract into its own PDF before you use the Enhance Scans tool. Once the scan is converted to text and images I still save it as a PDF and convert it to a CSV with Tabula. And, of course, always check your data against the original.

Password Protected PDFs

Sometimes PDFs are password protected so that you can’t edit them or convert them to any other format. If you have a Mac with Preview, try opening your PDF in Preview, then select the Export as PDF option under the File menu. Open the new version of your PDF and see if you’re able to convert it to a spreadsheet now.
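
If you are not sure whether protection is what’s blocking a conversion, a quick script can give you a hint. Protected PDFs point to an encryption dictionary through an `/Encrypt` entry, so the sketch below simply searches the file’s raw bytes for that name. It is a rough heuristic that can give false positives, it does not remove any protection, and the function names are hypothetical.

```python
def looks_password_protected(pdf_bytes):
    """Heuristic: report whether raw PDF bytes contain an /Encrypt entry."""
    # A PDF header is expected near the start of the file.
    if b"%PDF-" not in pdf_bytes[:1024]:
        raise ValueError("this does not look like a PDF file")
    # Plain byte search; the same bytes could in theory appear elsewhere.
    return b"/Encrypt" in pdf_bytes


def file_looks_password_protected(path):
    """Read a file from disk and apply the heuristic above."""
    with open(path, "rb") as handle:
        return looks_password_protected(handle.read())
```

You would call `file_looks_password_protected("report.pdf")` with the path to your own document; `report.pdf` is only a placeholder.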

Do you have a favorite tool for extracting data from PDFs? Let me know. You can find me on Twitter: @laurajgrant.

This is part three of an occasional series on useful tools for data journalists on the Media Hack Collective’s Journalism Toolbox. It is reposted here with permission.

Laura Grant is a data journalist and a managing partner of the Media Hack Collective, a collaboration dedicated to digital storytelling. She has been a journalist for more than 20 years, and is the former associate editor of digital and data projects at South Africa’s Mail & Guardian, where she produced data-driven stories, interactive graphics and maps.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License
