Web Scraping: A Journalist’s Guide

Do you remember when Twitter lost $8 billion in just a few hours earlier this year? It was because of a web scraper, a tool that companies use, as do many data reporters.

A web scraper is simply a computer program that reads the HTML code of webpages and analyzes it. With such a program, or “bot,” it’s possible to extract data and information from websites.
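As a rough illustration of what “reading the HTML and analyzing it” can look like, here is a minimal Python sketch that uses only the standard library. The sample page, the product names and the `extract_rows` helper are all made up for the example. A real scraper would first download the HTML from a website.

```python
from html.parser import HTMLParser

# Hypothetical page content, standing in for HTML downloaded from a website.
SAMPLE_HTML = """
<html><body>
  <h1>Price list</h1>
  <table>
    <tr><td>Widget</td><td>12.50</td></tr>
    <tr><td>Gadget</td><td>8.00</td></tr>
  </table>
</body></html>
"""


class CellExtractor(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new table row starts: open a new list of cells.
            self.rows.append([])
        elif tag == "td" and self.rows:
            # A new cell starts inside the current row.
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep only the text found inside a cell.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_rows(html):
    """Return the table cells found in an HTML string as a list of rows."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

The page is just text to the program: it walks through the tags, keeps what sits inside the table cells and ignores the rest.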

Let’s go back in time. Last April, Twitter was supposed to announce its quarterly financial results once the stock markets closed. Because the results were a little disappointing, Twitter wanted to avoid a sudden loss of confidence among traders. Unfortunately, because of a mistake, the results were published online for 45 seconds while the stock markets were still open.

Those 45 seconds were enough for a web-scraping bot to find the results, format them and automatically publish them on Twitter itself. (Nowadays, even bots have scoops from time to time!)

Once the tweet was published, traders went crazy. It was a disaster for Twitter. The bot’s company, Selerity, which specializes in real-time analysis, became the target of much criticism. The company explained the situation a few minutes later.

For a bot, 45 seconds is an eternity. According to the company, it took only three seconds for its bot to publish the financial results.

Web Scraping and Journalism

As more and more public institutions publish data on websites, web scraping has become an increasingly useful tool for reporters who know how to code.

For example: for a story for Journal Métro, I used a web scraper to compare the price of 12,000 products from the Société des alcools du Québec with the price of 10,000 products of the LCBO in Ontario.

Another time, when I was in Sudbury, I decided to investigate food inspections in restaurants. All the results from these inspections are published on the Sudbury Health Unit’s website. However, it’s impossible to download all the results; you can only look up the restaurants one by one.

I asked for the entire database where the results are stored. After a first refusal, I filed a freedom-of-information request—after which the Health Unit asked for a $2,000 fee to process my request.

Instead of paying, I decided to code my own bot, one that would extract all the results directly from the website.

Coded in Python, my bot takes control of Google Chrome with the Selenium library. It clicks on each result for the 1,600 facilities inspected by the Health Unit, extracts the data and then saves the information to an Excel file.

Doing all of that by hand would take weeks. For my bot, it was one night of work.
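For readers curious about what such a bot might look like, here is a simplified sketch. It is not the actual bot. It assumes each facility has its own page URL rather than clicking through results, it writes CSV text (which Excel can open) instead of an Excel file, and the CSS selectors and function names are placeholders, since the structure of the Health Unit’s pages is not described here. Running `scrape_facilities` requires the third-party `selenium` package and a local Chrome installation.

```python
import csv
import io


def scrape_facilities(urls, name_selector, result_selector):
    """Visit each URL in Chrome and return one (name, result) row per page.

    The selectors are hypothetical CSS selectors supplied by the caller;
    they must match the real page layout.
    """
    # Imported here so the rest of the sketch runs without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    rows = []
    try:
        for url in urls:
            driver.get(url)  # load the facility's page in the browser
            name = driver.find_element(By.CSS_SELECTOR, name_selector).text
            result = driver.find_element(By.CSS_SELECTOR, result_selector).text
            rows.append((name, result))
    finally:
        driver.quit()  # always close the browser, even after an error
    return rows


def rows_to_csv(rows, header=("facility", "result")):
    """Serialize the scraped rows as CSV text, a format Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Example with made-up rows, so no browser is needed to try the export step.
print(rows_to_csv([("Cafe A", "Pass"), ("Diner B", "Fail")]))
```

Keeping the browsing step and the export step in separate functions makes it possible to check the output format on a few hand-written rows before letting the bot run overnight.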

But while my bot was tirelessly extracting thousands of lines of code, one thought kept bothering me: what are the ethical rules of web scraping?

Do we have the right to extract any information found on the web? Where is the line between scraping and hacking? And how can you ensure that the process is transparent for both the institutions targeted and the public reading the story?

As reporters, we have to respect the highest ethical standards. Otherwise, how can readers trust the facts we report to them?

Unfortunately, the code of conduct of the Fédération professionnelle des journalistes du Québec, adopted in 1996 and amended in 2010, is showing its age and offers no clear answers to my questions.

The ethics guidelines of the Canadian Association of Journalists, although more recent, don’t shed much light on the matter, either.

As Université du Québec à Montréal journalism professor Jean-Hugues Roy puts it: “These are new territories. There are new tools that push us to rethink what ethics are, and the ethics have to evolve with them.”

So I decided to find the answers myself by contacting several data reporters across the country.

Stay tuned; the results from that survey will be published in a following instalment.

Note: If you’d like to try a web scrape yourself, I published a short tutorial last February. You will learn how to extract data from the Parliament of Canada website! 


This post originally appeared on J-Source.CA and is reprinted with permission. To see this story in Chinese, check GIJN’s Chinese-language site.

Nael Shiab is an MA graduate of the University of King’s College digital journalism program. He has worked as a video reporter for Radio-Canada and is currently a data reporter for Transcontinental. @NaelShiab

