
Why Web Scraping Is Vital to Democracy

by The Markup Staff • December 17, 2020



Editor’s Note: The Markup, a New York-based investigative newsroom that covers the tech industry, recently argued for web scraping in an amicus curiae brief (literally, a “friend of the court” brief) in a US Supreme Court case that threatens to make scraping illegal. Here’s why they did it.

The fruits of web scraping — using code to harvest data and information from websites — are all around us.

People build scrapers that can find every Applebee’s on the planet or collect congressional legislation and votes or track fancy watches for sale on fan websites. Businesses use scrapers to manage their online retail inventory and monitor competitors’ prices. Lots of well-known sites use scrapers to do things like track airline ticket prices and job listings. Google is essentially a giant, crawling web scraper.
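To make the idea concrete, here is a minimal sketch of a scraper in Python, using only the standard library. It is an illustration under stated assumptions, not a method described in the brief: the sample HTML, the `listing`, `title`, and `city` class names, and the `ListingParser` and `scrape` names are all hypothetical. A real scraper would first download the page over HTTP and would need to match that site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical page content, standing in for HTML fetched from a real site.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="title">Reporter</span> <span class="city">Atlanta</span></li>
  <li class="listing"><span class="title">Data Editor</span> <span class="city">New York</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collects the text of each title and city span, one dict per listing."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per <li class="listing">
        self._field = None   # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "listing":
            # A new listing begins: start a new, empty record.
            self.rows.append({})
        elif tag == "span" and attrs.get("class") in ("title", "city"):
            # Remember which field the upcoming text belongs to.
            self._field = attrs["class"]

    def handle_data(self, data):
        # Store text only while inside one of the spans we care about.
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(html):
    """Turn raw HTML into a list of structured records."""
    parser = ListingParser()
    parser.feed(html)
    return parser.rows


if __name__ == "__main__":
    for row in scrape(SAMPLE_HTML):
        print(row)
```

The point of the sketch is the shape of the work: a scraper reads the same public page a browser would and turns its markup into structured rows that can be counted, compared, or tracked over time.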

Scrapers are also the tools of watchdogs and journalists, which is why The Markup filed an amicus brief in a case before the United States Supreme Court that threatens to make scraping illegal.

The case itself, Van Buren v. United States, is not about scraping. It turns on a legal question regarding the prosecution of a Georgia police officer, Nathan Van Buren, who was bribed to look up confidential information in a law enforcement database. Van Buren was prosecuted under the Computer Fraud and Abuse Act (CFAA), which prohibits unauthorized access to a computer network, as in hacking, where someone breaks into a system to steal information (or, as dramatized in the 1980s classic movie “WarGames,” potentially start World War III).

In Van Buren’s case, since he was allowed to access the database for work, the question is whether the court will broadly define his troubling activities as “exceeding authorized access” to extract data, which is what would make it a crime under the CFAA. And it’s that definition that could affect journalists.

Or, as Justice Neil Gorsuch put it during Monday’s oral arguments, lead in the direction of “perhaps making a federal criminal of us all.”

Investigative journalists and other watchdogs often use scrapers to illuminate issues big and small, from tracking the influence of lobbyists in Peru by harvesting the digital visitor logs for government buildings to monitoring and collecting political ads on Facebook. In both of those instances, the pages and data scraped are publicly available on the internet, with no hacking necessary, but the sites involved could easily change the fine print in their terms of service to label the aggregation of that information “unauthorized.” And the Supreme Court, depending on how it rules, could decide that violating those terms of service is a crime under the CFAA.

“A statute that allows powerful forces like the government or wealthy corporate actors to unilaterally criminalize newsgathering activities by blocking these efforts through the terms of service for their websites would violate the First Amendment,” The Markup wrote in the brief.

What sort of work is at risk? Here’s a roundup of some recent journalism made possible by web scraping:

The COVID Tracking Project, from The Atlantic, collects and aggregates data from around the country on a daily basis, serving as a means of monitoring where testing is happening, where the pandemic is growing, and the racial disparities in who’s contracting and dying from the virus.

This project, from Reveal, scraped extremist Facebook groups and compared their membership rolls to those of law enforcement groups on Facebook, and found a lot of overlap.

Reveal also used scrapers to find that hundreds of millions of dollars in property taxes should have never been charged to Detroit residents who then lost their homes through foreclosure.

The Markup’s recent investigation into Google’s search results found that it consistently favors its own products, leaving some websites from which the web giant itself scrapes information struggling for visitors and, therefore, ad revenue. The United States Department of Justice cited the issue in an antitrust lawsuit against the company.

In Copy, Paste, Legislate, USA Today found a pattern of cookie-cutter laws, pushed by special interest groups, circulating in legislatures around the country.

Reuters scraped social media and message boards to find an underground market for adopted children whose parents, who had usually adopted the children from abroad, decided the children were too much for them. A couple featured in the piece was later convicted of kidnapping as a result of the investigation.

Gizmodo was able to use similar tools to find the probable locations of tens of thousands of Ring surveillance cameras.

The Trace and The Verge, using scrapers, found people using an online market to sell guns without a license and without performing background checks.

This article was originally published on The Markup and is republished under the Creative Commons Attribution-NonCommercial-NoDerivatives license.

Additional Reading

Document of the Day: In Defense of Data Scraping

Web Scraping: A Journalist’s Guide

On the Ethics of Web Scraping and Data Journalism

The Markup is a nonprofit newsroom that investigates how powerful institutions use technology to change society. It is staffed with “quantitative journalists who pursue meaningful, data-driven investigations.”

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License


