Web Scraping: A Journalist’s Guide
Do you remember when Twitter lost $8 billion in just a few hours earlier this year? It was because of a web scraper, a tool used by companies and many data reporters alike.
A web scraper is simply a computer program that reads the HTML code of webpages and analyzes it. With such a program, or “bot,” it’s possible to extract data and information from websites.
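To make this concrete, here is a minimal sketch of what a scraper does, written in Python with the widely used requests and BeautifulSoup libraries. The URL and the table structure it looks for are hypothetical, for illustration only.

```python
import requests
from bs4 import BeautifulSoup

# Download the raw HTML of a page (hypothetical URL).
response = requests.get("https://example.com/inspection-results")
response.raise_for_status()

# Parse the HTML into a document tree we can navigate.
soup = BeautifulSoup(response.text, "html.parser")

# The "analysis" step: pull the text out of every table row,
# turning markup meant for browsers into structured data.
for row in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
    print(cells)
```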
Let’s go back in time. Last April, Twitter was supposed to announce its quarterly financial results once the stock markets closed. Because the results were a little disappointing, Twitter wanted to avoid a sudden loss of confidence among traders. Unfortunately, because of a mistake, the results were published online for 45 seconds while the stock markets were still open.
Those 45 seconds were enough for a web-scraping bot to find the results, format them and automatically publish them on Twitter itself. (Nowadays, even bots get scoops from time to time!)
#BREAKING: Twitter $TWTR Q1 Revenue misses estimates, $436M vs. $456.52M expected
— Selerity (@Selerity) April 28, 2015
Once the tweet was published, traders went crazy. It was a disaster for Twitter. Selerity, the company behind the bot, specializes in real-time analysis and became the target of heavy criticism. It explained the situation a few minutes later.
Today’s $TWTR earnings release was sourced from Twitter’s Investor Relations website https://t.co/QD6138euja. No leak. No hack.
— Selerity (@Selerity) April 28, 2015
For a bot, 45 seconds is an eternity. According to the company, it took only three seconds for its bot to publish the financial results.
Web Scraping and Journalism
As more and more public institutions publish data on websites, web scraping has become an increasingly useful tool for reporters who know how to code.
For example: for a story for Journal Métro, I used a web scraper to compare the prices of 12,000 products from the Société des alcools du Québec with those of 10,000 products from the LCBO in Ontario.
Another time, when I was in Sudbury, I decided to investigate restaurant food inspections. All the results of those inspections are published on the Sudbury Health Unit’s website. However, it’s impossible to download them all at once; you can only look up restaurants one by one.
I asked for the entire database where the results are stored. After an initial refusal, I filed a freedom-of-information request, after which the Health Unit asked for a $2,000 fee to process it.
Instead of paying, I decided to code my own bot, one that would extract all the results directly from the website. Here is the result:
Coded in Python, my bot takes control of Google Chrome with the Selenium library. It clicks on each result for the 1,600 facilities inspected by the Health Unit, extracts the data and then writes the information to an Excel file.
To do all of that by yourself would take you weeks. For my bot, it was one night of work.
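For readers curious what such a bot looks like, here is a simplified sketch. The URL, the CSS selectors and the openpyxl Excel export are assumptions for illustration; only the overall approach, Selenium driving Chrome, clicking through each result and saving the data to a spreadsheet, follows the description above.

```python
from openpyxl import Workbook
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical list page and selectors, for illustration only.
URL = "https://example.com/inspection-results"

driver = webdriver.Chrome()  # Selenium opens and controls a real Chrome window
driver.get(URL)

workbook = Workbook()
sheet = workbook.active
sheet.append(["Facility", "Result"])  # header row of the Excel file

# Count the result links once, then re-find them on every pass, because
# navigating to a detail page invalidates previously located elements.
count = len(driver.find_elements(By.CSS_SELECTOR, "a.facility-link"))

for i in range(count):
    links = driver.find_elements(By.CSS_SELECTOR, "a.facility-link")
    links[i].click()  # open the detail page for one facility

    name = driver.find_element(By.CSS_SELECTOR, "h1.facility-name").text
    result = driver.find_element(By.CSS_SELECTOR, "div.inspection-result").text
    sheet.append([name, result])

    driver.back()  # return to the list for the next facility

workbook.save("inspections.xlsx")
driver.quit()
```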
But while my bot was tirelessly extracting data from thousands of lines of HTML code, one thought kept bothering me: what are the ethical rules of web scraping?
Do we have the right to extract any information found on the web? Where is the line between scraping and hacking? And how can you ensure that the process is transparent, both for the institutions targeted and for the public reading the story?
As reporters, we have to respect the highest ethical standards. Otherwise, how can readers trust the facts we report to them?
Unfortunately, the code of conduct of the Fédération professionnelle des journalistes du Québec, adopted in 1996 and amended in 2010, is showing its age and offers no clear answers to my questions.
The ethics guidelines of the Canadian Association of Journalists, although more recent, don’t shed much light on the matter, either.
As Université du Québec à Montréal journalism professor Jean-Hugues Roy puts it: “These are new territories. There are new tools that push us to rethink what ethics are, and the ethics have to evolve with them.”
So I decided to find the answers myself, by contacting several data reporters across the country.
Stay tuned; the results of that survey will be published in an upcoming instalment.
Note: If you’d like to try web scraping yourself, I published a short tutorial last February. You will learn how to extract data from the Parliament of Canada website!
This post originally appeared on J-Source.CA and is reprinted with permission. To see this story in Chinese, check GIJN’s Chinese-language site.
Nael Shiab is an MA graduate of the University of King’s College digital journalism program. He has worked as a video reporter for Radio-Canada and is currently a data reporter for Transcontinental. @NaelShiab