How I Built a Scraper To Measure MP Activity

by Maarten Lambrechts • October 12, 2016

Flemish parliament website: list of MPs

When the president of the parliament states that there are some MPs (Members of Parliament) “doing nothing,” you know what to do as a data journalist: you turn to the numbers. This is how I did that and how I got a scatter plot in a printed paper and an interactive one online.

The Data

I knew that the Flemish parliament has a strong open data policy and publishes all parliamentary activities of the members of parliament, so I decided to check out their API. But the API proved to be a bit difficult for me:

To get the info I was interested in, I had to make a lot of API calls, store the results and make a lot of other API calls.
The response file formats are JSON and XML. I don’t have a lot of experience getting data out of these formats, and this proved to be challenging.

After a while I gave up on the API, the XML and the JSON, and decided to just scrape the website instead. Luckily, the website is very well structured and presents all the information I wanted in a consistent way.

I used the rvest R package for scraping. I took some time in the summer to learn R and some of its useful packages. I’m very glad I did that: it is paying off already.

What the scraper does (you can find all the code at the bottom of this page):

1. It visits the page where all the MPs are listed and stores their names, the party they belong to and the URLs of their personal profile pages.
2. It then goes to all the profile pages and collects the URLs of the pages where the activity of the MPs is listed (questions they asked, things they said in parliament and proposals they made).
3. It then changes a parameter in these URLs so that only the activity from the current term is shown.
4. It visits the filtered URLs and gets the number of activities listed on these pages.
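The first and third steps above can be sketched in a few lines. The actual scraper is an R script built on rvest; what follows is only a Python sketch using the standard library, run against a made-up HTML snippet rather than the live site. The markup, the class names (`mp`, `party`), the `example.org` URL and the `legislatuur` query parameter are all assumptions for illustration, not the real structure of the parliament's website.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Made-up markup standing in for the page that lists all MPs.
SAMPLE_LISTING = """
<ul>
  <li class="mp"><a href="/mp/mp-one">MP One</a> <span class="party">Party A</span></li>
  <li class="mp"><a href="/mp/mp-two">MP Two</a> <span class="party">Party B</span></li>
</ul>
"""


class MPListParser(HTMLParser):
    """Collects name, party and profile URL for every <li class="mp">."""

    def __init__(self):
        super().__init__()
        self.mps = []          # finished records
        self._current = None   # record currently being filled
        self._field = None     # text field we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "mp":
            self._current = {"name": "", "party": "", "url": ""}
        elif self._current is not None and tag == "a":
            self._current["url"] = attrs.get("href", "")
            self._field = "name"
        elif (self._current is not None and tag == "span"
              and attrs.get("class") == "party"):
            self._field = "party"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._current is not None:
            self.mps.append(self._current)
            self._current = None


def with_term_filter(url, term):
    """Return url with the (hypothetical) 'legislatuur' parameter set to term."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["legislatuur"] = term
    return urlunsplit(parts._replace(query=urlencode(query)))


parser = MPListParser()
parser.feed(SAMPLE_LISTING)
mps = parser.mps
filtered_url = with_term_filter(
    "https://example.org/mp/mp-one/vragen?page=1", "2014-2019"
)
print(mps)
print(filtered_url)
```

Rewriting the query string with `urllib.parse` rather than by string replacement keeps any other parameters in the URL intact, which matters when the same change is applied to many profile pages.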

Fairly simple, all in all. I wasted much more time trying to collect the data with the API than writing the HTML scraper.

Then What?

I decided to analyze two measures: how often an MP spoke in parliament and how many official documents (proposals, amendments, …) they filed. An obvious choice then was to make a scatter plot. I used ggplot2, another great R package I learned to work with, to do that.

A clear trend, but also with some outliers in all directions: not bad for building a story. But how to do it?

Key was to add lines for both medians. This divides the plot into 4 quadrants and I used these quadrants to classify the MPs as Busy Bees (a lot of interventions in parliament, a lot of documents filed), Silent Workers (few interventions, lot of docs), Chatterers (lot of interventions, few docs) and the Passive MPs (few interventions and few docs).
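The quadrant classification can be sketched as follows. This is a Python illustration, not the R code behind the published graphic, and the counts are made up. One assumption worth flagging: the text does not say how MPs lying exactly on a median line were labeled, so this sketch puts them on the "few" side.

```python
from statistics import median

# Made-up counts for illustration: (name, interventions, documents filed).
mps = [
    ("MP A", 120, 40),
    ("MP B", 15, 55),
    ("MP C", 90, 5),
    ("MP D", 10, 3),
    ("MP E", 60, 20),
]

# The two median lines that split the plot into four quadrants.
median_interventions = median(m[1] for m in mps)
median_documents = median(m[2] for m in mps)


def classify(interventions, documents):
    """Assign one of the four quadrant labels used in the story."""
    talks_a_lot = interventions > median_interventions
    files_a_lot = documents > median_documents
    if talks_a_lot and files_a_lot:
        return "Busy Bee"
    if files_a_lot:
        return "Silent Worker"
    if talks_a_lot:
        return "Chatterer"
    return "Passive"


labels = {name: classify(i, d) for name, i, d in mps}
print(labels)
```

Because both thresholds are computed from the data itself, the labels are relative: an MP is "passive" only in comparison with colleagues, not against any absolute standard.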

This added layer of classification, both in the story and in the graphic, proved to be the sugar that helps the scatter plot, a dry graphic to a lot of people (not to me!), go down. Without it, I don’t think I could have convinced the editors to run the graphic, and I think a lot of people would have had a harder time understanding the chart.

Output

For print, I generated the scatter plot with ggplot2 and exported it as a PDF. Further processing for print (which involved the manual placement of the overlapping labels) was done by my colleague Filip Ysenbaert, while reporting was done by two colleagues on the politics desk.

For the online version, I used D3 to make a scatter plot with buttons for highlighting and for zooming in on the ‘passive’ zone of the plot. Details of every MP are shown on hover/tap.

Mobile readers only get a static scatter plot, but they still get the small multiples for comparing the parties in parliament. Those were also generated with ggplot2.

R

As I wrote already: learning R paid off. And not only for getting the data and visualizing it: I now have an R script (see below) that I can run by clicking a button, and it will get all the data, put it in the right format, visualize it and prepare the data for the interactive scatter plot. No tedious manual editing anymore!
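As an illustration of the last step, preparing the data for the interactive plot, here is a minimal Python sketch. The rows and field names are made up, and the choice of a JSON array as the hand-off format is an assumption; the article does not say which format the R script writes for D3.

```python
import json

# Made-up rows standing in for the scraped and classified results.
rows = [
    {"name": "MP B", "party": "Party Y", "interventions": 15, "documents": 55},
    {"name": "MP A", "party": "Party X", "interventions": 120, "documents": 40},
]


def to_plot_json(records):
    """Serialise the records as a JSON array, sorted by name for stable output."""
    ordered = sorted(records, key=lambda r: r["name"])
    return json.dumps(ordered, ensure_ascii=False, indent=2)


plot_json = to_plot_json(rows)
print(plot_json)
```

Sorting before writing means that rerunning the script on unchanged data produces an identical file, which makes it easy to see what actually changed between two runs.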

I actually edited the script and ran it on Friday morning (the graphic was published on Saturday). Getting new data while I still had a lot of work to do for publishing was something I would have never done if there were some manual steps involved in the data gathering and processing.

Bonus: Explaining the Median

I always struggle to explain in words what the median means exactly. But graphically this is surprisingly easy: on the scatter plot, half of the points lie above the horizontal black line and half below it, and likewise half lie to the left of the vertical black line and half to the right. Can’t be easier, I think.
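The same idea can be checked with a few numbers. This is a small Python sketch with made-up values for one axis of the plot; it computes the median and counts the points on either side of it.

```python
from statistics import median

# Made-up values for one axis of the scatter plot.
interventions = [3, 8, 12, 20, 35, 41, 77, 90]

m = median(interventions)
below = sum(1 for v in interventions if v < m)
above = sum(1 for v in interventions if v > m)
print(m, below, above)
```

With an even number of distinct values the split is exactly half and half. With an odd number of values, or with ties at the median, some points sit on the line itself and belong to neither side.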

The Code

The full code is at the bottom of the original post: http://www.maartenlambrechts.be/how-i-built-a-scraper-to-measure-mps-acitivity-and-got-a-scatter-plot-in-the-newspaper/

This post was originally published on Maarten Lambrechts’ website and is reproduced with permission from the author.

Maarten Lambrechts is a data and multimedia editor at De Tijd and L’Echo in Belgium. He is a numbers cruncher, specialist in graphics and data visualization, mapmaker, and interactive builder. @maartenzam

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License


