A data scraping workshop at GIJC23. Image: Smaranda Tolosano for GIJN

No Coding Required: A Step-by-Step Guide to Scraping Websites With Data Miner

by Pınar Dağ • November 3, 2023

Knowing where to look for data — and how to access it — should be a priority for investigative journalists. Effective use of data can not only improve the overall quality of an investigation, but increase its public service value.

Over the last 20 years, the amount of data available has grown at an unprecedented rate. According to the International Data Corporation (IDC), by 2025 the collective sum of the world’s data will reach 175 zettabytes (one zettabyte is one trillion gigabytes; as IDC puts it, if one could store the 2025 datasphere onto DVDs, the resulting line of DVDs would encircle the Earth 222 times).

Some estimates claim that Google, Facebook, Microsoft, and Amazon alone store at least 1,200 petabytes (one petabyte = one million gigabytes) of data between them. Investigative and data journalists are using more quantitative, qualitative, and categorical data than ever before — but obtaining good data is still a challenge.
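
The unit conversions quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where one gigabyte is 10^9 bytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
PETABYTE = 10**15
ZETTABYTE = 10**21

# One zettabyte in gigabytes: 10**21 / 10**9 = 10**12 (one trillion).
gb_per_zb = ZETTABYTE // GIGABYTE

# One petabyte in gigabytes: 10**15 / 10**9 = 10**6 (one million).
gb_per_pb = PETABYTE // GIGABYTE

# IDC's 2025 projection of 175 zettabytes, converted to gigabytes.
datasphere_gb = 175 * gb_per_zb

# The estimated 1,200 petabytes held by the four companies, in gigabytes.
big_four_gb = 1200 * gb_per_pb

print(f"{gb_per_zb:,} GB per zettabyte")
print(f"{gb_per_pb:,} GB per petabyte")
print(f"{datasphere_gb:,} GB in 175 zettabytes")
print(f"{big_four_gb:,} GB in 1,200 petabytes")
```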

Finding structured data (data in a clearly defined, standardized format ready for analysis) among the oceans of bad or incomplete data (including false data; dirty, faulty, or “rogue” data; fake data; scattered data; and unclear data) is still difficult, no matter the field. Part of the solution is increasing data literacy: we need to understand how data is collected, cleaned, verified, analyzed, and visualized, because these steps form an interconnected process. For journalists, data literacy is crucial.

In data journalism, as in any kind of journalistic practice, we look for ways to access all kinds of data, whether from leaks, from thousands of PDF files, or from indexes published on websites, organized or not. Some of these sources are easy to access, while others require technical tools and time.

However, there are tools and methods that make this both enjoyable and simple, such as scraping data from websites. Scraping here means using computer programs or software to extract or copy specific data from websites. The extracted data can then be collected and analyzed, and the process is faster and more efficient than gathering data manually.
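
Tools like Data Miner do this without any code, but for readers curious about what "extracting specific data" looks like under the hood, here is a minimal sketch using only Python's standard library. The HTML fragment, the office names and figures in it, and the `TableScraper` class are all invented for illustration; this is not how Data Miner itself is implemented.

```python
from html.parser import HTMLParser

# A made-up HTML fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Office</th><th>Requests</th></tr>
  <tr><td>Office A</td><td>120</td></tr>
  <tr><td>Office B</td><td>85</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read, or None
        self._in_cell = False   # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *records = scraper.rows
print(header)
print(records)
```

The parser walks through the page tag by tag and keeps only the text inside table cells, which is the same basic idea a point-and-click scraper applies when you tell it which rows and columns to collect.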

The benefits of data scraping for journalists include:

Speed and scope: Data scraping allows journalists to gather information quickly and efficiently. Pulling data from a variety of sources across the internet gives you a broader perspective, and helps you base your stories on a more solid foundation.
Verification: Data scraping can help journalists in the verification process. You can compare data to check information on the web and spot contradictions, which helps verify information and increase its credibility.
Uncovering trends: Data scraping can be used to uncover patterns related to a particular topic or event. By analyzing large datasets, you can, for example, understand trends in social media or public opinion and integrate this information into your reporting.
Data visualization: Visualizing the data collected by data scraping helps journalists present their stories more effectively. By using graphs, charts, and interactive visuals, you can make the data more understandable and give readers a better understanding of the topic.
Enabling in-depth investigation: Data scraping allows journalists to conduct more in-depth research. By analyzing large datasets, such as financial data, you can gain a deeper understanding of company operations or government policies.
Increasing news value: Data scraping can lead to newsworthy stories. Statistics, trends, demographics, or other data can make your stories more engaging and compelling.
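
The verification benefit in the list above can be sketched as a comparison between two sources. The example below is a minimal illustration with invented figures: `scraped` and `official` are hypothetical datasets, and the `tolerance` parameter is an arbitrary assumption about how much difference to ignore.

```python
def find_contradictions(scraped, official, tolerance=0.0):
    """Return entries whose values differ between two sources,
    or that appear in only one of them."""
    issues = {}
    for key in sorted(set(scraped) | set(official)):
        a = scraped.get(key)
        b = official.get(key)
        if a is None or b is None:
            issues[key] = (a, b)      # present in only one source
        elif abs(a - b) > tolerance:
            issues[key] = (a, b)      # the two sources disagree
    return issues


# Invented figures: values scraped from a website vs. an official release.
scraped = {"District 1": 1520, "District 2": 980, "District 3": 455}
official = {"District 1": 1520, "District 2": 1010, "District 4": 300}

contradictions = find_contradictions(scraped, official)
for name, (web_value, official_value) in contradictions.items():
    print(name, web_value, official_value)
```

Each flagged entry is a lead to check by hand, not a finding in itself: a mismatch may come from a typo, a different reporting date, or a genuine discrepancy.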

Data Miner is a free data extraction tool and browser extension that enables users to scrape web pages and collect data quickly and securely. It automatically collects data from web pages and saves it in Excel, CSV, or JSON formats.

However, keep in mind that collecting data from websites in bulk may violate their terms of use, or the law. It’s important to read the website’s terms of use carefully before using a browser extension or plugin, and to act in accordance with all legal rules and regulations. You should also review the terms of service of the extension you are using.

GIJN Turkish editor Pınar Dağ, the author of this story, gives a presentation on using Data Miner at GIJC23 in Gothenburg. Image: Smaranda Tolosano for GIJN

How Journalists Can Use Data Miner

Here are the steps for scraping a website with the Data Miner browser extension.

1. Install the Data Miner add-on to your browser. The add-on is generally available for browsers such as Chrome or Firefox. Find and install the Data Miner add-on from your browser’s add-on store.

2. Open the target website. In your browser, open the website from which you want to scrape data, then launch the extension: find Data Miner in your browser's extensions/plugins menu and open it. The extension icon is usually located in the top right corner of your browser.

3. Create a new task/recipe for web scraping. The Data Miner extension has a “My Recipes” option. Click this option to create a new web scraping task. You will be presented with a command screen to continue the mining process.

4. Set options for scraping the website. Data Miner has various options and settings for scraping a website. For example, you can specify which data you want to scrape, and you can set automatic actions, such as page navigation or form filling.

5. Start scraping the website. Once you have finalized the settings, you can start the data scraping by clicking the “Scrape” button in the Data Miner extension dashboard. The extension will crawl the website and collect the data you have specified. (You can also watch the process in this short video.)

6. Save or export the data. You can usually save your scraped data as a CSV file or Excel spreadsheet. You can also copy the results from the scraping screen using the Clipboard feature, which is convenient and saves time. If your scraped data is more than 10,000 rows, it will be downloaded as two separate files.
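
If you later want to work with an export outside a spreadsheet, the CSV files can be read with a few lines of code. The sketch below assumes that a split export arrives as two CSV files sharing the same header row, which is an assumption about the file layout rather than documented Data Miner behavior. The file names and columns are hypothetical, and the example writes its own small sample files so that it can run anywhere.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_exports(paths):
    """Read several CSV files with identical headers; return (header, rows)."""
    header, rows = None, []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"{path} has a different header")
            rows.extend(reader)
    return header, rows


# Create two tiny sample files standing in for a split export.
workdir = Path(tempfile.mkdtemp())
part1 = workdir / "export_part1.csv"
part2 = workdir / "export_part2.csv"
part1.write_text("name,url\nA,https://example.com/a\n", encoding="utf-8")
part2.write_text("name,url\nB,https://example.com/b\n", encoding="utf-8")

header, rows = merge_csv_exports([part1, part2])
print(header, len(rows))
```

Checking that the headers match before merging guards against accidentally combining exports from two different recipes.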

By following these steps, you can scrape one or multiple websites with Data Miner. You can run any of the 60,000-plus data scraping rules, or create your own customized scraping method to get only the data you need from a web page. Both single-page and multi-page automatic scraping are possible.

You can also automate scraping and run batches of scraping jobs based on a list of website URLs. Plus, you can use 50,000 free, pre-made queries for more than 15,000 popular websites. You can also crawl URLs, paginate them, and scrape a single page from a single location, with no coding required.
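
Batch jobs like these start from a list of URLs. When the pages follow a predictable pattern, such a list can be generated rather than typed by hand. Here is a minimal sketch, assuming a hypothetical site that paginates with a `page` query parameter; the base URL, the parameter name, and the `paginated_urls` helper are illustrative and not taken from Data Miner.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paginated_urls(base_url, first_page, last_page, param="page"):
    """Build one URL per page by setting a page-number query parameter."""
    parts = urlsplit(base_url)
    urls = []
    for page in range(first_page, last_page + 1):
        query = dict(parse_qsl(parts.query))   # keep existing parameters
        query[param] = str(page)               # add or overwrite the page number
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


urls = paginated_urls("https://example.com/records?year=2023", 1, 3)
print("\n".join(urls))
```

The resulting list can be pasted into any tool that accepts one URL per line. Real sites paginate in many different ways, so check the address bar while clicking through a few pages before assuming a pattern.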

Using the extension also has the following advantages.

It is safe and secure to use: It behaves as if you were clicking on the page yourself in your own browser.
It helps you scrape without worry: It’s not a bot, so you won’t get blocked when you make a query.
It keeps your data private: The add-on does not sell or share your data.

Pınar Dağ is the editor of GIJN Turkish and a lecturer at Kadir Has University. She is the co-founder of the Data Literacy Association, Data Journalism Platform Turkey, and DağMedya. She works on data literacy, open data, data visualization, and data journalism, and is on the jury of the Sigma Data Journalism Awards.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License
