On the Ethics of Web Scraping and Data Journalism

by Nael Shiab • August 12, 2015

Web scraping is a way to extract information presented on websites. As I explained in the first instalment of this article, web scraping is used by many companies.

It’s also a great tool for reporters who know how to code, since more and more public institutions publish their data on their websites.

With web scrapers, which are also called “bots,” it’s possible to gather large amounts of data for stories. For example, I created one to compare the alcohol prices between Quebec and Ontario.

My colleague Florent Daudens, who works for Radio-Canada, also used a web scraper on Kijiji ads to compare rent prices in several Montreal neighbourhoods.

But what are the ethical rules that reporters have to follow while web scraping?

These rules are particularly important because, to non-technical people, web scraping can look like hacking.

Unfortunately, neither the Code of Ethics of the Fédération professionnelle des journalistes du Québec (FPJQ) nor the ethical guidelines of the Canadian Association of Journalists gives a clear answer to this question.

So I asked a few data reporter colleagues, and looked for some answers myself.

Public Data, or Not?

This is the first consensus from data reporters: if an institution publishes data on its website, this data should automatically be public.

Cédric Sam works for the South China Morning Post, in Hong Kong. He also worked for La Presse and Radio-Canada. “I do web scraping almost every day,” he says.

For him, bots have as much responsibility as their human creators. “Whether it’s a human who copies and pastes the data, or a human who codes a computer program to do it, it’s the same. It’s like hiring 1000 people that would work for you. It’s the same result.”

However, government servers also host personal information about citizens. “Most of this data is hidden because it would otherwise violate privacy laws,” says William Wolfe-Wylie, a developer for CBC and journalism teacher at Centennial College and the Munk School at the University of Toronto.

This is the crucial line between web scraping and hacking: respect for the law.

Reporters should not pry into protected data. If a regular user can’t access it, journalists shouldn’t try to get it. “It’s very important that reporters acknowledge these legal barriers, which are legitimate ones, and respect them,” says William Wolfe-Wylie.

Roberto Rocha, who was until recently a data reporter for the Montreal Gazette, adds that journalists should always read a website’s terms and conditions of use to avoid any trouble.

Another important detail to verify is the robots.txt file, which can be found at the root of the website and which states what may and may not be scraped. For example, here is the file for the Royal Bank of Canada: http://www.rbcbanqueroyale.com/robots.txt
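For reporters who code, the robots.txt check can be automated before any request is sent. The following is only a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the bot name and the example.org URLs are all invented for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical name under which the scraper announces itself.
BOT_NAME = "newsroom-scraper"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether the bot may fetch each page before scraping it.
public_ok = parser.can_fetch(BOT_NAME, "https://example.org/reports/2015.html")
private_ok = parser.can_fetch(BOT_NAME, "https://example.org/private/list.html")

# Some sites also state how long a bot should wait between requests.
delay = parser.crawl_delay(BOT_NAME)

print(public_ok, private_ok, delay)
```

Against a live website, the same parser can be pointed at the real file with `set_url()` followed by `read()`. A robots.txt file only expresses the site owner’s wishes; it does not replace reading the terms of use.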

Identify Yourself, or Not?

When you are a reporter and you want to ask someone questions, the first thing to do is to present yourself and the story you are working on.

But what should you do when it’s a bot that is sending queries to a server or a database? Should the same rule apply?

For Glen McGregor, national affairs reporter for the Ottawa Citizen, the answer is yes. “In the http headers, I put my name, my phone number and a note saying: ‘I am a reporter extracting data from this webpage. If you have any problem or concern, call me.’

“So, if the web administrator suddenly sees a huge amount of hits on his website, freaks out and thinks he’s under attack, he can check who’s doing it. He will see my note and my phone number. I think it’s an important ethical thing to do.”
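What McGregor describes could look something like the sketch below, which uses Python’s standard `urllib.request` module. This is not his actual code: the function name, the bot name, the contact details and the URL are placeholders, and putting the note in the User-Agent header is just one possible choice.

```python
from urllib.request import Request


def build_identified_request(url: str, name: str, phone: str) -> Request:
    """Prepare a request whose User-Agent header says who is scraping and why."""
    note = (
        "I am a reporter extracting data from this webpage. "
        "If you have any problem or concern, call me."
    )
    # "reporter-bot/0.1" is a hypothetical bot name chosen for this example.
    user_agent = f"reporter-bot/0.1 ({name}; {phone}) {note}"
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder contact details and URL, for illustration only.
request = build_identified_request(
    "https://example.org/data", "Jane Doe", "+1-555-0100"
)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

The request would then be sent with `urllib.request.urlopen(request)`. Because the header travels with every hit, an administrator who inspects the server logs sees the note next to each request.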

Jean-Hugues Roy, a journalism professor at the Université du Québec à Montréal and himself a web scraper coder, agrees.

But not everybody is on the same page. Philippe Gohier, web editor-in-chief at L’Actualité, does everything he can to avoid being identified.

“Sometimes, I use proxies,” he says. “I change my IP address and I change my headers too, to make it look like a real human instead of a bot. I try to respect the rules, but I also try to be undetectable.”

Not identifying yourself when extracting data from a website could be compared, in some ways, to conducting interviews with a hidden microphone or camera. The FPJQ Code of Ethics sets out some rules on this.

4 a) Undercover procedures

In certain cases, journalists are justified in obtaining the information they seek through undercover means: false identities, hidden microphones and cameras, imprecise information about the objectives of their news reports, spying, infiltrating…

These methods must always be the exception to the rule. Journalists use them when:

* the information sought is of definite public interest; for example, in cases where socially reprehensible actions must be exposed;

* the information cannot be obtained or verified by other means, or other means have already been used unsuccessfully;

* the public gain is greater than any inconvenience to individuals.

The public must be informed of the methods used.

Best practice would generally be to identify yourself in your code, even if it’s a bot that does all the work. However, if there’s a possibility that the targeted institution would change the availability of the data because a reporter tries to gather it, you should make yourself more discreet.

And those who are afraid of being blocked if they identify themselves as reporters need not worry: it’s quite easy to change your IP address.

For some reporters, best practice is also to ask for the data before scraping it. For them, it’s only after a refusal that web scraping should be an option.

This approach has an advantage: if the institution answers quickly and gives you the raw data, it will save you time.

Publish Your Code, or Not?

Transparency is another very important aspect of journalism. Without it, the public wouldn’t trust the reporters’ work. From the FPJQ Code of Ethics:

The vast majority of data reporters publish the data they used for their stories. This act of transparency shows that their reports are based on real facts that the public can check if it wants to. But what about their code?

An error in a web scraper script can completely skew the analysis of the data obtained. So should the code be public as well?

For open-source software, revealing the code is a must. The main reason is to allow others to improve the software, but it also gives confidence to users, who can check in detail what the software is doing.

However, for coder-reporters, to reveal or not to reveal is a difficult choice.

“In some ways, we are businesses,” said Sam. “I think that if you have a competitive edge and if you can continue to find stories with it, you should keep it to yourself. You can’t reveal everything all the time.”

For Roberto Rocha, the code shouldn’t be published.

However, Rocha has a GitHub account where he publishes some of his scripts, as Chad Skelton, Jean-Hugues Roy and Philippe Gohier do.

“I really think that the tide lifts all boats,” said Gohier. “The more we share scripts and technology, the more it will help everybody. I’m not doing anything that someone can’t do with some effort. I am not reshaping the world.”

Jean-Hugues Roy agreed, and added that journalists should allow others to replicate their work, like scientists do by publishing their methodology.

Nonetheless, the professor specifies that there are exceptions. Roy is currently working on a bot that would extract data from SEDAR, where documents from Canadian publicly traded companies are published.

“I usually publish my code, but this one, I don’t know. It’s complicated and I put a lot of time into it.”

Glen McGregor, for his part, doesn’t publish his scripts, but sends them to anyone who asks for them.

When a reporter has a source, he will do everything in his power to protect it. The reporter will do so to earn the confidence of his source, who will hopefully give him more sensitive information. But the reporter also does this to keep his source to himself.

So, in the end, a web scraper could be viewed as the bot version of a source. Another question to consider is whether reporters’ bots will be patented in the future.

Who knows? Perhaps one day a reporter will refuse to reveal his code the same way Daniel Leblanc refused to reveal the identity of his source called “Ma Chouette.”

After all, these days, bots are starting to look more and more like humans.

Note: This is more a technical detail than an ethical dilemma, but respecting web infrastructure is, of course, another golden rule of web scraping. Always leave several seconds between your requests, and don’t overload servers.
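The pause between requests can be built into the scraper itself. Here is a rough sketch in Python, under a few assumptions: the function `fetch_politely`, its five-second default and the example.org URLs are invented for illustration, and the downloading step is passed in as a function so that any HTTP library can be used.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Demonstration with a fake downloader and a recorded (not real) pause,
# so the example runs instantly and touches no server.
pauses: List[float] = []
pages = fetch_politely(
    ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=pauses.append,
)
print(len(pages), pauses)
```

In real use, the `sleep` argument would be left at its default and `fetch` replaced with a function that actually downloads the page. If the site’s robots.txt states a crawl delay, that value is a sensible choice for `delay_seconds`.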

This post originally appeared on J-Source.CA and is reprinted with permission.

Nael Shiab is an MA graduate of the University of King’s College digital journalism program. He has worked as a video reporter for Radio-Canada and is currently a data reporter for Transcontinental. @NaelShiab

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License


