
How a BBC Data Unit Scraped Airport Noise Complaints

I’d wondered for a while why no one who had talked about scraping at conferences had actually demonstrated the procedure. It seemed to me to be one of the most sought-after skills for any investigative journalist.

Then I tried to do so myself in an impromptu session at the first Data Journalism Conference in Birmingham (#DJUK16) and found out why: it’s not as easy as it’s supposed to look.

To anyone new to data journalism, a scraper is as close to magic as you get with a spreadsheet and no wand.

Numbers and text on page after page after page after page just effortlessly start to appear neatly in a spreadsheet you can sort, filter and interrogate.

You can even leave the scraper running while you ring a contact or just make a cup of tea.

Scraping Heathrow’s Noise Complaints

I used a fairly rudimentary scraper to gather three years’ worth of noise complaint data from the Heathrow Airport website. With the third runway very much on the news agenda that week, I wanted to quickly get an idea of how much of an issue noise already was.

The result was this story, which was widely picked up by other outlets.

But how did I do it?

Complaints data for each day of the year was published on a separate URL. Creating the spreadsheet by hand would have taken me hours or even days.

Using Google Sheets, I created a standard formula to import the data from HTML tables on each of the pages of the operational data site. (This is always best done in a new spreadsheet; at Data Journalism UK I tried to do it by modifying an existing spreadsheet, which generated a sheet full of #REF! errors.)

Note how the first two numbers correspond to the number of telephone, email and letter contacts and the total number of web contacts.

Column E contains a basic sum to add the cells in columns C and D together.
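In E2, for example, that sum would be something like:

=C2+D2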

The Formula that Grabs the Data

Now let’s break down the formula.
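The formula itself appeared only as a screenshot in the original post. Reconstructed from the breakdown that follows, and assuming the page’s URL sits in cell A2 with INDEX used to pick a single cell out of the imported table, it would look something like this:

=SUBSTITUTE(INDEX(IMPORTHTML(A2,"table",1),33,2),"*","")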

Starting from the middle, IMPORTHTML tells the sheet to drag in something from within the HTML of the web address in cell A2.

The “table” tells the sheet to look for a table. The numbers mean this: 1 = the first table it finds; 33 = row 33 of that table; 2 = column 2 of that table.

The SUBSTITUTE relates to the bits at either end. It tells the scraper that if it finds an asterisk, it should replace it with the contents of the “”: in this case, nothing. As it happens, there was no asterisk to replace, so here it’s a bit redundant. But the same function can be used to replace spaces with %20, which a browser will need for a URL to work properly.
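For instance, a formula along these lines would make a space-containing value safe to use in a URL:

=SUBSTITUTE(A2," ","%20")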

Which Row Is It?

To find this, we have to look on the website itself. Right-click the mouse and select “view page source.” This brings up the raw HTML behind the page.

Don’t Panic

Use Ctrl+F to search for “complaints”, which is the bit we want.

You’ll see it says “row-33”, with the actual number we want just after the bit that says “column-2”.
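In the source, the relevant markup looks something like this (an illustrative reconstruction rather than the exact Heathrow code, with a made-up figure of 12):

<tr class="row-33"><td class="column-1">Telephone, email and letter contacts</td><td class="column-2">12</td></tr>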

It’s the same for the other data we want at row 34. We’ll change that number when we copy the formula into the adjoining cell (column D) of the spreadsheet.
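So the column D version of the formula would look something like this (again a reconstruction, with only the row number changed):

=SUBSTITUTE(INDEX(IMPORTHTML(A2,"table",1),34,2),"*","")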


Copying for All Dates

Copying the URL for every date into column A by hand could easily take just as long as filling in the spreadsheet manually.

Every date has the same basic start to the URL, namely http://heathrowoperationaldata.com/
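That means a finished URL for a given date would look something like http://heathrowoperationaldata.com/19-october-2016 (an illustrative guess at the format; the exact date pattern comes from the site’s own drop-down list).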

We can copy the dates from the drop-down list on the right-hand side.

We do that and paste them into column B.

Then back in column A, we start off with an = and paste in the start of the URL. We then use an & followed by the reference of the adjoining cell, B2 in this case.

What this is telling it to do is append the date onto the rest of the URL, thus creating a clickable link.
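In A2, for example, the formula would look something like:

="http://heathrowoperationaldata.com/"&B2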

But you’ll notice we have spaces between each of the day, month and year. We need them to have dashes (-) instead, otherwise the URL won’t work.
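SUBSTITUTE handles that too. Folded into the column A formula, it would look something like:

="http://heathrowoperationaldata.com/"&SUBSTITUTE(B2," ","-")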

You can then copy down the formula in column A so a URL is created for each individual date.

Once that’s done, our scraper should spring to life and start populating the sheet.

“Should” being the operative word.

Get It Right

Remember, there’s no substitute for thoroughly checking your facts.

A scraper allows you to pull in information that is all stored in the same format. But it’s up to you to make sure that what you are relying on is accurate.

And that’s just as applicable in writing and publishing news stories as it is in giving an impromptu demonstration about something you only managed to do successfully yourself once.


This post first appeared on the Online Journalism Blog and has been reproduced with permission.

Daniel Wainwright is a data journalist at BBC News Online.
