How Non-Coding Journalists Can Build Web Scrapers With AI — Examples and Prompts Included

Data is a crucial part of investigative journalism: It helps journalists verify hypotheses, reveal hidden insights, follow the money, scale investigations, and add credibility to stories. The Pulitzer Center’s Data and Research team has supported major investigations, including shark meat procurements by the Brazilian government, financial instruments funding environmental violations in the Amazon, and the “black box” algorithm of a popular ride-hailing app in Southeast Asia.

Before embarking on a data-driven investigation, we usually ask three questions to gauge feasibility:

  1. Is the data publicly available?
  2. How good is the quality of the data?
  3. How difficult is it to access the data?

Even if the data is public and its quality is good, we still can’t celebrate, because the last question is often the most challenging and time-consuming step in the entire process.

When Should You Build a Web Scraper?

Accessing data published online can be as simple as a few clicks to copy-paste tables or download a data file. But we often need to extract large amounts of data from online databases or websites and move it to another platform for analysis. If the site doesn’t offer a download function, we first reach out to the people behind it to request access. In some countries, if the data belongs to the government, you may be able to request access through a Freedom of Information (FOI) or Right to Information (RTI) request. However, this process often comes with its own set of challenges, which we won’t go into here.

If those options are unavailable, we may have to manually extract the data from the website. That could mean repeatedly clicking “next page” and copy-pasting hundreds of times, opening hundreds of URLs to download files, or running many searches with different variables and saving all results. Repetitions can reach the hundreds of thousands, depending on the dataset size and site design. In some cases this is possible if we have the time or budget, but for many journalists and newsrooms, time and money are a luxury.

This is when we consider web scraping, the technique of using a program to automate extraction of specific data from online sources and organize it into a format users choose (e.g., a spreadsheet or CSV). The tool that does this is called a scraper.

There are many off-the-shelf scrapers that require no coding. Many are Chrome extensions, such as Instant Data Scraper, Table Capture, Scraper, Web Scraper, and Data Miner. Most use a subscription model, though some offer free access to nonprofits, journalists, and researchers (e.g., Oxylabs).

Most commercial scrapers are built to target popular sites: e-commerce, social media, news portals, search results, and hotel platforms. However, websites are built in many ways, and journalists often face clunky, unfriendly government databases. When commercial tools can’t handle these, we build custom scrapers. Building our own can be cheaper and faster, since we don’t need the extra features offered by commercial scrapers.

Building scrapers used to require solid coding skills. In our recent investigations, however, Large Language Models (LLMs) like ChatGPT, Google Gemini, or Claude helped us build scrapers for complex databases much faster and without advanced coding skills. With basic web knowledge, guidance, and examples, even non-coding journalists can build a scraper with LLMs. Here’s our recipe:

1. Understand How the Web Page Is Built (Static vs. Dynamic)

Before asking an LLM to write a script, you need a basic sense of how the target page is built, especially how the data you want to scrape is loaded. This brings us to the difference between a static and a dynamic web page.

A static web page is like a printed newspaper: The content doesn’t change once created unless someone edits the page. What you see is exactly what’s stored on the server. If you refresh, you’ll get the same content. An example is a Wikipedia page with a list of countries.

You can imagine a dynamic web page like a social media feed, where content changes automatically depending on who’s visiting, what they click, or what new data is available. When you load or refresh, your browser typically runs scripts and talks to a database or API (Application Programming Interface) to fetch the content. A dynamic page we scraped for the shark meat procurements investigation was São Paulo’s public procurements database, which requires you to fill in the search box to view data.

Static pages are usually easier to scrape because the data is present in the page’s HTML source, whereas dynamic pages often require extra steps because the data is hidden behind scripts or loads only when you interact with the page. That calls for a more advanced scraper, but don’t worry, we’ll show you how to build it.

However, you can’t always tell by just looking. Some pages that appear static are actually dynamic. Do a quick test: Open the page, right-click, and select “View Page Source” (in Chrome) to see the HTML. Use Find (Ctrl+F/Cmd+F) to search for the data you want to scrape. If the data is in the code, it’s likely static. If you don’t see it, the page is probably dynamic.
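If you prefer to run this check from a script, here is a minimal sketch using Python’s Requests library (Python setup is covered in step 2; the URL and sample text are placeholders you replace). It fetches the raw HTML without running any scripts, so whatever it finds is exactly what a static scraper would see:

import requests

# Placeholders: the page you want to test and a value you can see on it in your browser
URL = "https://example.com/some-page"
SAMPLE_TEXT = "a value visible on the rendered page"

# Fetch the raw HTML exactly as the server sends it; no scripts are executed
response = requests.get(URL, timeout=30)
response.raise_for_status()

if SAMPLE_TEXT in response.text:
    print("Found in the raw HTML: the page is likely static.")
else:
    print("Not in the raw HTML: the data probably loads via scripts (dynamic page).")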


Alternatively, you can ask an LLM to check for you. Suggested prompt:

  • Here is the URL of a web page [paste below] that I would like to scrape data from. Can you tell me if the information and data on the page is directly available in the code [static] or if it loads later with scripts [dynamic]?

For LLMs without internet browsing, copy and paste the full HTML source code and replace “URL” in the prompt with “HTML source code.” If the source code exceeds the prompt limit, you need to pick the part containing your target data. Step 3 will explain how to do that.

To make this easier, here’s a real-world example from one of our investigations. We asked ChatGPT (GPT-5 model) to look at The Metals Company’s press releases page. We scraped all the press releases to analyze the company’s public messaging on deep-sea mining. ChatGPT identified it as a static web page.

2. Build Your Scraper with LLMs (for a Static Web Page)

If the web page is static, you can ask an LLM to write a Python scraping script (Python is a common language for scraping). Include the following in your prompt:

  • URL of the web page;
  • HTML source code (optional, but include it for reliability);
  • Which data you need to scrape (fields/columns);
  • Where/how to store the data (e.g., CSV file);
  • Pagination details (Does the page have multiple pages? How to navigate them?);
  • Where you’ll run the scraper (your computer or an online platform like Google Colab);
  • Final check: Ask if the LLM needs any other information before generating the script.

Here’s the prompt we gave to ChatGPT:

  • I need to build a scraper for the web page below. It is a static web page. Write me a Python script to scrape the web page from my computer. https://investors.metals.co/news-events/press-releases  It has a list of press releases with title, date, and URL to the press release page. I need all three of them stored in separate columns in a CSV file. The web page has pagination. I need the full list of press releases from all pages. Here’s the HTML source code: [paste the full HTML source code]

The press releases are listed across multiple URLs, accessible via the pagination bar at the bottom of the page. Even though we only provided ChatGPT with the first-page URL, it inferred the rest by appending “?page=N” to the base URL.

ChatGPT will return a script with a brief explanation. Copy and paste it into a code editor (we use Visual Studio Code), save it as a Python file (.py), and run it in Terminal (macOS) or PowerShell (Windows).
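For orientation, the script an LLM returns for a page like this usually looks something like the sketch below, built on the Requests and BeautifulSoup libraries. Treat it as an illustration rather than a drop-in scraper: the CSS selectors (“.press-release-item”, “.date”) are hypothetical and must match the page’s real HTML, which is exactly what pasting the source code into the prompt helps the LLM get right.

import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://investors.metals.co/news-events/press-releases"

with open("press_releases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "date", "url"])

    page = 0
    while True:
        # The site paginates with ?page=N (the exact numbering scheme is site-specific)
        response = requests.get(BASE_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Hypothetical selectors: the real class names come from the page's HTML
        items = soup.select(".press-release-item")
        if not items:
            break  # an empty page means we have passed the last one

        for item in items:
            link = item.select_one("a")
            date = item.select_one(".date")
            writer.writerow([
                link.get_text(strip=True) if link else "",
                date.get_text(strip=True) if date else "",
                urljoin(BASE_URL, link["href"]) if link else "",
            ])

        print(f"Scraped page {page}")
        page += 1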

If you’ve never run a Python script on your computer, you’ll need a quick setup. LLMs can generate step-by-step instructions with the prompt below. Setup usually includes installing Python and its package manager pip (Homebrew can help with this on macOS), plus common scraping libraries like Requests, BeautifulSoup (beautifulsoup4), and optionally Selenium.

  • I need to run a Python scraping script on macOS/Windows. Show me step-by-step setup instructions.

Pro tip: Keep prompts in the same LLM conversation so it retains context. If the script doesn’t work or setup errors appear, paste the full error messages or logs and ask it to troubleshoot. You may need a few iterations to get everything working.

3. Extra Steps for Dynamic Web Pages

Scraping a dynamic web page requires a few more steps and some extra web development know-how, but LLMs can still do the heavy lifting.

If the page requires you to interact with it to access data, you need to tell the LLM the exact actions to perform and where to find the data you need, because unlike a static page, that information isn’t in the plain HTML source. The LLM will show you how to install Python libraries like Selenium or Playwright, which can open browsers in headed or headless mode (more on that later) and interact with the web page like a human.

We’ll use São Paulo’s procurements database as an example. You need to select or fill in some of the 12 fields in the search box and click the “Buscar” (search) button to view the table.


In this case, a scraper works like a robot: It imitates your actions, waits for the data to load, and then scrapes it. To tell the scraper which fields to fill, options to select, buttons to click, and tables or lists to extract, you need the “names” of those elements on the page. This is where some basic HTML knowledge helps.

HTML is the language that structures a web page. Each piece of content is inside an HTML element. Think of them as boxes that store content. Common elements include <h1> for a header, <a> for a link, <table> for a data table, and <p> for a paragraph. Many elements also have attributes (their “labels”), such as class and id, which help you identify them.

We need to know the elements we must interact with, the ones that hold the data we want, and their attributes (class or id) so we can tell the LLM what to do.

For example, below is the HTML element for the first field, Área, in São Paulo’s search box. It’s a <select> element with the id “content_content_content_Negocio_cboArea” and class “form-control.” You can copy and paste this into your LLM prompt to help it build the scraper.

<select name="ctl00$ctl00$ctl00$content$content$content$Negocio$cboArea"
id="content_content_content_Negocio_cboArea"
class="form-control"
onchange="javascript:CarregaSubareasNegocios();">
<option value=""></option>
<option value="8">Atividade</option>
<option value="6">Imóveis</option>
<option value="3">Materiais e Equipamentos</option>
<option value="2">Obras</option>
<option value="7">Projeto</option>
<option value="4">Recursos Humanos</option>
<option value="1">Serviços Comuns</option>
<option value="5">Serviços de Engenharia</option>
</select>

An HTML element can be nested inside another element, creating a hierarchical structure. For example, a <table> usually contains multiple <tr> elements, the rows of the table, and each <tr> contains multiple <td> elements, the cells in that row. Spelling this out for the LLM can reduce errors in some cases.
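As a rough illustration, this is how a scraper typically walks that hierarchy with BeautifulSoup (a sketch: soup is assumed to be an already-parsed page, and the table id is hypothetical):

# Assuming `soup` is a BeautifulSoup object for a page you already downloaded
table = soup.find("table", id="example-table-id")   # hypothetical id

rows = []
for tr in table.find_all("tr"):            # each <tr> is a row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]   # each <td> is a cell
    if cells:                              # header rows made of <th> come back empty
        rows.append(cells)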

To identify an element and its attributes, right-click it and choose Inspect (in Chrome). This opens DevTools, showing the page’s HTML and attributes. The selected element is highlighted; hovering in DevTools highlights the matching content on the page. To copy it, right-click the element in the Elements panel and choose Copy → Copy element (gets the full HTML, including nested children). To target it in code, choose Copy → Copy selector (or Copy XPath) to grab a unique selector that helps you locate the element.


If the web page loads slowly, tell the LLM to include a waiting time after each action.

To build the scraper for São Paulo’s procurements database, below is our prompt for ChatGPT (GPT-5 model). We want to search for closed tenders (“ENCERRADA”) under “Materiais e Equipamentos” and “Generos Alimenticios” between January 1, 2024, and December 31, 2024.

I would like to write a Python script that I will run from my computer to scrape data from an online database and store it in a CSV file. It is a dynamic web page. Below are the steps required to view the data. I’ve specified the HTML elements that the scraper should interact with. Print messages at different steps to show progress and help debug any errors. Let me know if you need any more information from me.

1. Go to the search page:

https://www.imprensaoficial.com.br/ENegocios/BuscaENegocios_14_1.aspx

2. Fill in the search criteria in nine dropdown menus:

Select “Materiais e Equipamentos” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Negocio$cboArea" id="content_content_content_Negocio_cboArea" class="form-control" onchange="javascript:CarregaSubareasNegocios();">

Select “Generos Alimenticios” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Negocio$cboSubArea" id="content_content_content_Negocio_cboSubArea" class="form-control">

Select “ENCERRADA” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboStatus" id="content_content_content_Status_cboStatus" class="form-control" onchange="fncAjustaCampos();">

Select “1” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoInicioDia" id="content_content_content_Status_cboAberturaSecaoInicioDia" class="form-control">

Select “1” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoInicioMes" id="content_content_content_Status_cboAberturaSecaoInicioMes" class="form-control">

Select “2024” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoInicioAno" id="content_content_content_Status_cboAberturaSecaoInicioAno" class="form-control">

Select “31” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoFimDia" id="content_content_content_Status_cboAberturaSecaoFimDia" class="form-control">

Select “12” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoFimMes" id="content_content_content_Status_cboAberturaSecaoFimMes" class="form-control">

Select “2024” for:

<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoFimAno" id="content_content_content_Status_cboAberturaSecaoFimAno" class="form-control">

3. Use Javascript to click this button:

<input type="submit" name="ctl00$ctl00$ctl00$content$content$content$btnBuscar" value="Buscar" onclick="return verify();" id="content_content_content_btnBuscar" class="btn btn-primary">

4. Wait for the result page to load; wait for the complete loading of the result table:

<table class="table table-bordered table-sm table-striped table-hover" cellspacing="0" rules="all" border="1" id="content_content_content_ResultadoBusca_dtgResultadoBusca" style="border-collapse:collapse;"></table>

5. Scrape the text content in the whole table. The first <tr> is the header. There are another 10 <tr> in the table after the header. In each <tr> there are 7 <td>. In the output CSV, create another column (eighth column) to store the href attribute of the link in the last <td> (the 7th <td>).

6. Go to the next page of the result table and repeat the scraping of the table until the last page. The number of pages appears in:

<span id="content_content_content_ResultadoBusca_PaginadorCima_QuantidadedePaginas">5659</span>

Use javascript to click this button to go to the next page:

<a id="content_content_content_ResultadoBusca_PaginadorCima_btnProxima" class="btn btn-link xx-small" href="javascript:__doPostBack('ctl00$ctl00$ctl00$content$content$content$ResultadoBusca$PaginadorCima$btnProxima','')">próxima &gt;&gt;</a>

Remember to wait for the table to finish loading before scraping.

7. Data from all result pages should be appended to the same CSV file.

Several things to note in the prompt. We explicitly asked the scraper to print messages during the run; this is useful for troubleshooting if errors occur. We also instructed the scraper to click the button using JavaScript, which imitates a human click. If you don’t specify this, the button might be triggered by other methods that won’t work on some pages. Our instructions were very specific about the table content, but such detail may not be necessary as the LLM can often infer it.
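For orientation, the heart of the script an LLM generates from a prompt like this tends to look something like the Selenium sketch below. The element IDs are the ones from our prompt; everything else is simplified for illustration (only two dropdowns shown, no CSV writing or pagination loop):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.imprensaoficial.com.br/ENegocios/BuscaENegocios_14_1.aspx")

# Fill two of the dropdowns via their id attributes (the others follow the same pattern;
# dependent dropdowns like the subarea only load after the parent is set, so the full
# script usually adds a short wait between those selections)
Select(driver.find_element(By.ID, "content_content_content_Negocio_cboArea")).select_by_visible_text("Materiais e Equipamentos")
Select(driver.find_element(By.ID, "content_content_content_Status_cboStatus")).select_by_visible_text("ENCERRADA")

# Click the "Buscar" button with JavaScript, as requested in the prompt
button = driver.find_element(By.ID, "content_content_content_btnBuscar")
driver.execute_script("arguments[0].click();", button)

# Wait for the result table to finish loading before scraping it
table = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
    (By.ID, "content_content_content_ResultadoBusca_dtgResultadoBusca")))

# Print each row's cells; the real script writes them to a CSV instead
for tr in table.find_elements(By.TAG_NAME, "tr"):
    print([td.text for td in tr.find_elements(By.TAG_NAME, "td")])

driver.quit()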

Scraping a dynamic page usually requires additional Python libraries. If you see an error message: “ModuleNotFoundError,” paste the message into the LLM; it will provide the install command. Think of libraries as toolkits the scraper needs to perform certain functions.

If you want to test the scraper and inspect the downloaded data before running the full job (which may take a long time), add an instruction like: “Scrape only the first two pages for testing.”

4. Build Multiple Scrapers for a Multi-Layered Database

Most online databases organize data across multiple layers of pages. In São Paulo’s procurements database, after getting the list of tenders, you click the objeto (object) of each tender to open a new page with the tender’s contents. On each tender page, there may be one or more evento (event). You then click detalhes for each evento to view its contents. Hence, you may need two more scrapers: one to scrape each tender page and another to scrape each evento page. It’s possible to combine all three in one script, but that can get complicated. For beginners, it’s better to break the operation into smaller tasks.

In our case, we built a first scraper to extract the search results, including the URL of each tender. We then fed that URL list to a second scraper, which visited each tender page to extract the tender’s contents and the URLs of its eventos. Finally, we built a third scraper that visited each evento page, extracted the contents, and searched for keywords like “cação,” “pescado,” and “peixe” to determine whether the tender involved shark meat.
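The hand-off between scrapers can be as simple as reading the previous scraper’s CSV. Here is a minimal sketch of that pattern using the keywords from our investigation; the file name and column name are placeholders for whatever your first scraper produced, and in practice the keyword search happened on the evento pages:

import csv
import requests

# Read the URLs collected by the previous scraper (file and column names are placeholders)
with open("search_results.csv", newline="", encoding="utf-8") as f:
    urls = [row["url"] for row in csv.DictReader(f)]

KEYWORDS = ["cação", "pescado", "peixe"]

for url in urls:
    html = requests.get(url, timeout=30).text
    # Flag pages that mention any of the keywords for closer inspection
    hits = [kw for kw in KEYWORDS if kw.lower() in html.lower()]
    if hits:
        print(url, "->", ", ".join(hits))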


Not all websites are built the same, so the prompts shown here may not work verbatim for your targets. The advantage of using LLMs is their ability to troubleshoot errors, suggest fixes, and fold those changes into your scripts. We often go back and forth with ChatGPT to handle more complex pages.

5. Common Strategies to Overcome Blocking

By now, you know what you need to build your first scrapers and start systematically collecting information from the web. Sooner or later, though, when you try a new site or run the same scraper frequently, you’ll likely hit a wall: Your code is correct, you’ve identified the right HTML elements, everything should work … but the server returns an error.

If the error says “Forbidden,” “Unauthorized,” “Too Many Requests,” or similar, your scraper may be blocked. You can ask the LLM what the specific error means. Generally, websites aren’t designed to be scraped, and developers implement techniques to prevent it.

Below are three common types of blocking and strategies to try. This isn’t exhaustive. You’ll often need a tailored solution, but it’s a practical guide.

  • Blocking by geolocation

Some sites restrict access by region or country. For example, a government site may only accept connections from local IP (Internet Protocol) addresses. To work around this, use a VPN (e.g., Proton VPN or NordVPN) to obtain a local IP. You can, for instance, appear to connect from Argentina while you’re in France.

Some sites detect VPN traffic or block known VPN ranges, so the VPN may not do the trick. In that case, you can try a residential proxy provider (e.g., Oxylabs), which routes requests through real household IPs and greatly reduces detection.

  • Blocking due to request frequency

If your scraper makes many rapid requests, the site may flag that behavior as automated, because no human browses that fast. Add random delays in your code, e.g., using the Python function time.sleep(seconds) as in the snippet below, and consider simulating mouse movement or scrolling to appear more human. An LLM can suggest the best approach for your stack.
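A random pause takes only a couple of lines, for example:

import random
import time

# Pause between 2 and 6 seconds so the timing does not look machine-perfect
time.sleep(random.uniform(2, 6))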

Even with delays, a site may block your IP after a threshold of requests. You can rotate IPs with a code-driven VPN setup (e.g., ProtonVPN + Tunnelblick + tunblkctl) or, more simply, use a residential proxy that automates IP rotation across a large pool.
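With the Requests library, routing traffic through a proxy is a single extra parameter; the address and credentials below are placeholders that your provider supplies:

import requests

# Placeholder address and credentials; your proxy provider gives you the real ones
proxies = {
    "http": "http://USERNAME:PASSWORD@proxy.example.com:8080",
    "https": "http://USERNAME:PASSWORD@proxy.example.com:8080",
}

response = requests.get("https://example.com/target-page", proxies=proxies, timeout=30)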

  • Blocking by bot identification

When you visit a website, your browser quietly sends a few behind-the-scenes notes like “I’m Chrome on a Mac,” “my language is English,” and small files called cookies. These notes are called HTTP headers. A bare-bones scraper often skips or fakes them, which looks suspicious. To blend in, make your scraper act like a normal browser: Set realistic HTTP headers (e.g., a believable User-Agent and Accept-Language), use a typical screen size (viewport), and include valid cookies when needed.

A starter prompt for the LLM to set this could be:

Below is the script for a Python scraper collecting information from the web page xxx. Provide a basic configuration for HTTP headers and cookies to minimize detection or blocking, and show where to add them in my code.

[paste the full scraper script]

The LLM will return parameters to place near the start of your script.
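They typically look something like this; the values are illustrative, the LLM will tailor them to your target site, and cookies only matter when the site requires them:

import requests

# Illustrative values; you can copy a real browser's headers from DevTools (Network tab)
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

# Optional: cookies copied from a normal browser session, for sites that require them
cookies = {"session_id": "value-from-your-browser"}

response = requests.get("https://example.com/target-page",
                        headers=headers, cookies=cookies, timeout=30)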

You can toggle a parameter to run a scraper in headless mode (no visible window) or headed mode (with a window). It’s genuinely fun to watch Chrome click around in headed mode, and it’s easier to debug because you can see clicks, errors, and pop-ups, but it’s slower and uses more CPU/RAM. Headless is more efficient, though it can be easier for sites to detect. A good practice is to develop and test in headed mode, then switch to headless for full runs.
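In Selenium, for example, the switch is one line in the browser options (a sketch; Playwright has an equivalent headless flag):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")   # remove this line to watch the browser work
driver = webdriver.Chrome(options=options)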

These strategies are just a starting point. If your scraper is still blocked, copy and paste the error into the LLM to get more clues on how to fix it.

6. Advanced Scraping: Schedule Your Scraper

Sometimes you’ll want to run a scraper on a schedule — for example, when a page frequently adds or removes information and you need to track changes. You could run it manually every day, but that’s cumbersome and error-prone.

A better approach is a tool that runs at set intervals. GitHub Actions is a very practical (and free) platform for executing workflows in the cloud. It brings a key advantage: your computer doesn’t need to stay on, because everything runs on a virtual machine online.

If you’re familiar with GitHub, a site for storing code and collaborating on software, you can upload your scraper to your repository, create a .yml file, and use YAML to configure the virtual machine and specify the run steps.

Again, you can get both the steps and the YAML script from an LLM. You might start with a prompt like:

I have a Python scraper in xxxx.py inside the xxxx repository of my GitHub account. I want to use GitHub Actions to run it once a day at 1 PM UTC. Tell me the steps and the YAML code I need to configure it. This is the source code for my scraper:

[paste the full scraper script]
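The workflow file the LLM returns will look roughly like the sketch below, assuming your scraper’s libraries are listed in a requirements.txt in the same repository (the cron expression “0 13 * * *” means 13:00 UTC every day):

name: daily-scraper

on:
  schedule:
    - cron: "0 13 * * *"    # every day at 13:00 UTC
  workflow_dispatch:         # also allows manual runs from the Actions tab

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python xxxx.py
      # Add a final step to commit the output back to the repository or upload it
      # as an artifact; otherwise the scraped data is discarded when the virtual
      # machine shuts down.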

With the steps, examples, and prompts above, we hope this guide helps non-coding journalists leverage the growing power of LLMs to work faster and smarter. If you have any questions or feedback, please get in touch with us at keng@pulitzercenter.org and frainis@pulitzercenter.org.

Editor’s Note: This story was originally published by the Pulitzer Center and is reposted here with permission. 



Kuek Ser Kuang Keng is a digital journalist, data journalism trainer, and media consultant based in Kuala Lumpur, Malaysia. He is the founder of Data-N, a training program that helps newsrooms integrate data journalism into daily reporting. He has more than 15 years of experience in digital journalism. He started as a reporter at Malaysiakini, the leading independent online media outlet in Malaysia, where he covered high-profile corruption cases and electoral fraud during an eight-year stint. He has also worked in several U.S. newsrooms, including NBC, Foreign Policy, and PRI.org, mostly on data-driven reporting. He holds a master’s in journalism from New York University. He is a Fulbright scholarship recipient, a Google Journalism Fellow, and a Tow-Knight Fellow.

Federico Acosta Rainis is a data specialist at the Pulitzer Center’s Environment Investigations Unit. After a decade as an IT consultant, he started working as a journalist in several independent media outlets in Argentina. In 2017 he joined La Nación, Argentina’s leading newspaper, where he reported on education, health, human rights, inequality, and poverty, and did extensive on-the-ground coverage of the COVID-19 pandemic in Buenos Aires. He has participated in investigations that won national and international awards for digital innovation and investigative journalism from ADEPA/Google and Grupo de Diarios América (GDA). He holds a master’s degree in data journalism from Birmingham City University. In 2021 he was awarded the Chevening Scholarship, and in 2022 he joined The Guardian’s Visuals Team as a Google News Initiative fellow.
