How a BBC Data Unit Scraped Airport Noise Complaints

by Daniel Wainwright • December 14, 2016

I’d wondered for a while why no one who had talked about scraping at conferences had actually demonstrated the procedure. It seemed to me to be one of the most sought-after skills for any investigative journalist.

Then I tried to do so myself in an impromptu session at the first Data Journalism Conference in Birmingham (#DJUK16) and found out why: it’s not as easy as it’s supposed to look.

To anyone new to data journalism, a scraper is as close to magic as you get with a spreadsheet and no wand.

Numbers and text on page after page after page after page just effortlessly start to appear neatly in a spreadsheet you can sort, filter and interrogate.

You can even leave the scraper running while you ring a contact or just make a cup of tea.

Scraping Heathrow’s Noise Complaints

I used a fairly rudimentary scraper to gather three years’ worth of noise complaint data from the Heathrow Airport website. With the third runway very much on the news agenda that week, I wanted to quickly get an idea of how much of an issue noise already was.

The result was this story, which was widely picked up by other outlets.

But how did I do it?

Complaints data for each day of the year was published on a separate URL. Creating the spreadsheet by hand would have taken me hours or even days.

Using Google Sheets, I created a standard formula to import the data from HTML tables on each of the pages of the operational data site. (This is always best done in a new spreadsheet; at Data Journalism UK I tried to do this by modifying the existing spreadsheet, which generated a sheet full of #REF! errors.)

Note how the first two numbers correspond to the number of telephone, email and letter contacts and the total number of web contacts:

Column E contains a basic sum to add the cells in columns C and D together.

The Formula that Grabs the Data

Now let’s break down the formula.

Starting from the middle, the ImportHTML is telling the sheet to drag in something within the HTML of a web address in cell A2.

The “table” is telling the sheet to look for a table. The following numbers mean this: 1 = the first table it finds. 33 = row 33 of that table. 2 = column 2 of that table.

The substitute relates to the bits on the end. It tells the sheet that if it finds an asterisk, it should replace it with the contents of the “”, which in this case means putting nothing in its place. As it happens, there was no asterisk to replace, so it’s a bit redundant. But the same function can be used to replace spaces with %20, which a URL needs in place of spaces to work properly.
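For readers who would rather see the same three steps outside a spreadsheet, here is a minimal Python sketch: take the first table on a page, pick one cell by row and column, and strip any asterisk. It is not the formula used for the story. The names `FirstTableParser`, `cell_from_first_table` and `SAMPLE_PAGE` are hypothetical, the sample table and its numbers are made up, and nested tables are not handled.

```python
from html.parser import HTMLParser


class FirstTableParser(HTMLParser):
    """Collect the rows of the first <table> on a page as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []           # finished rows of the first table
        self._tables_seen = 0    # number of <table> tags opened so far
        self._in_table = False   # True only while inside the first table
        self._row = None         # cells of the row currently being read
        self._cell = None        # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
            self._in_table = self._tables_seen == 1
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if not self._in_table:
            return
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self._in_table = False


def cell_from_first_table(html, row, column, strip_char="*"):
    """Return one cell of the first table, counting rows and columns from 1."""
    parser = FirstTableParser()
    parser.feed(html)
    value = parser.rows[row - 1][column - 1]
    # Same idea as the substitute step: swap the character for nothing.
    return value.replace(strip_char, "")


# Made-up page for illustration only; the real pages have many more rows.
SAMPLE_PAGE = """
<table>
  <tr><td>Departures</td><td>650</td></tr>
  <tr><td>Phone, email and letter contacts</td><td>120*</td></tr>
  <tr><td>Web contacts</td><td>45</td></tr>
</table>
<table><tr><td>ignored</td><td>999</td></tr></table>
"""

print(cell_from_first_table(SAMPLE_PAGE, row=2, column=2))
```

On a real page you would fetch the HTML first and pass row 33, column 2, as in the sheet.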

Which Row Is It?

To find this, we have to look on the website itself. Right-click the mouse and select “view page source.” This brings up something that looks like this:

Don’t Panic

Press Ctrl+F and search for “complaints”, which is the bit we want.

You’ll see it says “row-33”, with the actual number we want just after the bit that says “column-2”.

It’s the same for the other data we want at row 34. We’ll change that number when we copy the formula into the adjoining cell (column D) of the spreadsheet.
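The search through the page source can also be sketched in a few lines of Python. This assumes each table row carries a class such as `row-33`, as described above. The sample markup, its labels and its numbers are invented for illustration, and `find_rows` is a hypothetical helper name.

```python
import re

# Matches a <tr> whose class attribute contains "row-<number>" and captures
# the number plus everything up to the closing </tr>.
ROW_PATTERN = re.compile(
    r'<tr[^>]*class="[^"]*\brow-(\d+)\b[^"]*"[^>]*>(.*?)</tr>',
    re.DOTALL | re.IGNORECASE,
)


def find_rows(page_source, keyword):
    """Return the row numbers whose HTML mentions the keyword (any case)."""
    matches = []
    for number, body in ROW_PATTERN.findall(page_source):
        if keyword.lower() in body.lower():
            matches.append(int(number))
    return matches


# Invented snippet standing in for the real page source.
SAMPLE_SOURCE = """
<tr class="row-32 even"><td class="column-1">Departures</td><td class="column-2">650</td></tr>
<tr class="row-33 odd"><td class="column-1">Complaints by phone, email and letter</td><td class="column-2">120</td></tr>
<tr class="row-34 even"><td class="column-1">Complaints via the web</td><td class="column-2">45</td></tr>
"""

print(find_rows(SAMPLE_SOURCE, "complaints"))  # [33, 34]
```

It is still worth opening the source yourself to confirm the rows it reports are the ones you expect.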

Copying for All Dates

If you were to copy the URL for every date into column A by hand, you could easily spend just as long as you would filling in the spreadsheet manually.

Every date has the same basic start to the URL, namely http://heathrowoperationaldata.com/

We can copy the dates from the drop down list on the right hand side.

We do that and put them into column B.

Then back in column A, we start off with an = and paste the start of the URL. We then use an & and put the reference of the next cell, B2 in this case.

What this is telling it to do is append the date onto the rest of the URL, thus creating a clickable link.

But you’ll notice we have spaces between each of the day, month and year. We need them to have dashes (-) instead, otherwise the URL won’t work.

You can then copy down the formula in Column A so a URL is created for each individual date.
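The URL-building step can be sketched in Python as follows. It assumes the dates are copied as text like "14th November 2016" and that each page address is the base URL followed by the date in lower case with dashes and a trailing slash, which is the pattern of the daily page linked from the original post. `date_to_url` is a hypothetical helper name.

```python
BASE_URL = "http://heathrowoperationaldata.com/"


def date_to_url(date_label, base_url=BASE_URL):
    """Turn a date label such as '14th November 2016' into a page URL."""
    # Split on any whitespace, rejoin with dashes, and lower-case the result.
    slug = "-".join(date_label.split()).lower()
    return f"{base_url}{slug}/"


# Example labels, as they might be copied from the drop-down list.
dates = ["14th November 2016", "15th November 2016"]
for label in dates:
    print(date_to_url(label))
```

Clicking through a handful of the generated links is a quick way to confirm the pattern holds before scraping every date.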

Once that’s done, our scraper should spring to life and start populating the sheet.

“Should” being the operative word.

Get It Right

Remember, there’s no substitute for thoroughly checking your facts.

A scraper allows you to pull in information that is all stored in the same format. But it’s up to you to make sure that what you are relying on is accurate.

And that’s just as applicable in writing and publishing news stories as it is in giving an impromptu demonstration about something you only managed to do successfully yourself once.

This post first appeared on the Online Journalism Blog and has been reproduced with permission.

Daniel Wainwright is a data journalist at BBC News Online.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License


