What do you do when authorities won’t give you the dataset you need, or it doesn’t exist? Journalists face this problem too often, and it matters because good data is crucial for investigative work. Without data, it can be difficult for journalists to hold powerful institutions accountable or expose wrongdoing.
At the 13th Global Investigative Journalism Conference (#GIJC23), journalists learned how to create their own datasets from scratch in a class led by data journalism editors Helena Bengtsson of Gota Media and Jennifer LaFleur of the Center for Public Integrity.
The lack of data is not always a bad thing, said Bengtsson: sometimes it’s an opportunity to build data that journalists can use without the pressure of deadlines and competition.
Steps to Building Your Own Data
- Creating data from documents: Journalists can create data from documents obtained from agencies and government offices. Bengtsson said that most of the time the documents are accessible, but they are not in the right format. When journalists get this kind of disorganized material, the first step is to develop a structure, arranging each part of the dataset carefully so that it is easy to access.
- Creating data from human sources: Another effective way to create great datasets is to make use of people. Media outlets and journalists can ask people questions on specific subjects and build data from the responses. Bengtsson talked about how she and her group dispatched a few reporters to the field to observe and document events for an investigation into crime in Swedish cities — speaking with small business owners in areas with crime. “Reporters were asked to fill out a form personally, and this was exported to a spreadsheet to create data,” she said. They got a lot of stories out of the exercise. Other ways to obtain human data include surveys and polls, crowdsourcing, sampling, testing, and scraping large language models — AI models trained on large amounts of text data.
- Research or observation: Journalists can build data from research and observation by developing methodologies based on best practices. For this method, Bengtsson said it is important to work with statisticians who can vet your chosen methodology. A test run, in which the data is analyzed statistically before it is used, is also vital in ensuring that you have the right information.
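As a rough illustration of the "develop a structure" step, the sketch below turns loosely filled-in form responses into rows with fixed, typed columns and writes them out as CSV. It is not the workflow Bengtsson's team used; the `FieldReport` fields, the helper names, and the sample values are all hypothetical placeholders.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class FieldReport:
    """One row per reporter visit. Every field name here is hypothetical."""
    city: str
    business_type: str
    visit_date: str          # ISO format, e.g. "2023-05-14"
    reported_incidents: int


def parse_form(raw: dict) -> FieldReport:
    """Turn one loosely typed form response into a clean, typed record."""
    return FieldReport(
        city=raw["city"].strip().title(),
        business_type=raw["business_type"].strip().lower(),
        visit_date=raw["visit_date"].strip(),
        reported_incidents=int(raw["reported_incidents"]),
    )


def to_csv(reports: list) -> str:
    """Write the records as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(FieldReport)],
        lineterminator="\n",
    )
    writer.writeheader()
    for report in reports:
        writer.writerow(asdict(report))
    return buffer.getvalue()


# Invented placeholder responses, not real reporting data.
raw_forms = [
    {"city": " example town ", "business_type": "Kiosk",
     "visit_date": "2023-05-14", "reported_incidents": "3"},
    {"city": "SAMPLE CITY", "business_type": " bakery",
     "visit_date": "2023-05-15", "reported_incidents": "0"},
]
reports = [parse_form(raw) for raw in raw_forms]
csv_text = to_csv(reports)
```

Deciding the columns and their types up front means every reporter's entry lands in the same shape, which makes the later cleaning and analysis far less painful.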
Pitfalls and Benefits in Building Data from Scratch
Creating datasets from scratch requires time and effort, LaFleur explained — journalists must allocate enough time to create them. “You must also make sure that your dataset is accurate and consistency must be ensured,” she said. This can be especially challenging if the data is not publicly available, or is scattered across multiple sources. It is time-consuming and requires patience, vigilance, and efficiency.
However, despite the challenges of building datasets, journalists gain thorough, original data reporting that no one else has access to. They also control every part of the data, because they choose which variables to use and how to clean, prepare, and analyze the dataset.
“When you collect data, remember to leave time for follow-up after collecting it,” LaFleur said, adding that journalists should verify the data and try to get to the “ground truth” when writing data stories. She also noted that journalists should set cut-off dates while preparing data and let the data guide their reporting.
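The consistency checks and cut-off dates LaFleur mentions could be sketched roughly as follows. This is only an illustration under assumed conventions: the `source_id` and `event_date` column names, the cut-off value, and the `check_rows` helper are hypothetical, and flagged rows are meant to be followed up by a reporter, not silently dropped.

```python
from datetime import date

CUTOFF = date(2023, 6, 30)  # hypothetical cut-off date for the dataset


def check_rows(rows, cutoff=CUTOFF):
    """Split rows into those kept and a list of (row index, problem) flags."""
    kept, problems = [], []
    seen = set()
    for index, row in enumerate(rows):
        key = (row["source_id"], row["event_date"])
        if key in seen:
            problems.append((index, "duplicate"))
            continue
        seen.add(key)
        try:
            event_date = date.fromisoformat(row["event_date"])
        except ValueError:
            problems.append((index, "bad date"))
            continue
        if event_date > cutoff:
            problems.append((index, "after cut-off"))
            continue
        kept.append(row)
    return kept, problems


# Invented placeholder rows, not real data.
rows = [
    {"source_id": "A1", "event_date": "2023-05-01"},
    {"source_id": "A1", "event_date": "2023-05-01"},  # entered twice
    {"source_id": "B2", "event_date": "2023/05/02"},  # inconsistent date format
    {"source_id": "C3", "event_date": "2023-07-15"},  # past the cut-off
]
kept, problems = check_rows(rows)
```

Keeping the list of problems alongside the cleaned rows gives the reporter a concrete follow-up list for the verification time LaFleur recommends setting aside.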
Best Practices for Creating Surveys and Polls
Surveys and polls are common ways to build datasets. LaFleur highlighted some practices journalists should consider when creating them:
- Know the universe and possible subgroups. It is important to understand what group one is analyzing.
- Avoid too many open-ended questions, because they will be a nightmare to deal with at the data analysis stage.
- Think about sampling.
- Test your questions with a small group before rolling out the whole thing. “Make sure you don’t have data that people could misunderstand — test them all before launching,” LaFleur said.
- Documents should be good quality, and be properly scanned and scraped. “Be cautious of documents with lots of tiny numbers,” she warned. Having neat rows and columns is not the whole battle — the accuracy of the data is more important.
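One way to act on the advice about knowing the universe, its subgroups, and sampling is a proportional stratified sample, sketched below. This is a generic technique, not a method LaFleur described; the `stratified_sample` helper, the `area` subgroup, and the 10% fraction are assumptions made for illustration, and a statistician should vet any real sampling plan.

```python
import random
from collections import defaultdict


def stratified_sample(universe, subgroup_of, fraction, seed=0):
    """Draw the same fraction from every subgroup (at least one unit each)."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    groups = defaultdict(list)
    for unit in universe:
        groups[subgroup_of(unit)].append(unit)

    sample = []
    for name in sorted(groups):
        members = groups[name]
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical universe: 80 urban and 20 rural units.
universe = [
    {"id": i, "area": "urban" if i < 80 else "rural"} for i in range(100)
]
sample = stratified_sample(universe, lambda unit: unit["area"], fraction=0.1)
```

Sampling within each subgroup keeps small groups from disappearing by chance, which matters when the story depends on comparing them.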