When using data for investigative stories, it is important to learn how to obtain and clean the information. But it is also vital that you interpret your findings correctly and extract the right conclusions from the numbers, filters, and spreadsheets. If you do the math correctly but fail to read the answers properly, you may end up misleading your audience.
Why does that happen? Sometimes, the data we work with does not really answer the questions we ask. In other cases, we may forget to apply traditional journalism ethics during the data collection and analysis stages of the investigation: We should be careful not to cherry-pick statistics that support our point of view, miss the context, or focus so much on our question that we don’t listen to what the data says. Remember, in data journalism, data is our source and we need to respect it.
Lesson one is to make sure that you do not draw conclusions about individuals based on data that is aggregated by country or by other broad groupings. The reality for people in the country or location you are talking about can be very different from the impressions provided by such broad-brush data.
Data scientist Heather Krause, founder of Canadian company Datassist and the project for equity in data science We All Count, explains this problem — known as the Ecological Fallacy — with an example about cigarettes. When examining life expectancy and cigarettes smoked in a number of countries, there appears to be a positive correlation.
That is, in countries with higher cigarette consumption, people also have a longer life expectancy. However, it would be incorrect to then conclude that smoking makes you live longer, not only because logic and numerous scientific studies say the opposite, but also because in this example the data examined did not evaluate what cigarettes do to individuals; it simply compared two aggregated national indicators.
This fallacy also reveals another problem that can lead to misinterpretation: the two variables being compared are not linked by cause and effect. Another way to put this: correlation is not causation. Because the data was not originally gathered with the purpose of seeing if smoking more cigarettes has an effect on life expectancy, the statistical analysis shows a prediction instead of a causal result.
Reporters looking at these two sets of data then should take into account other variables to find a possible explanation for the correlation, such as the purchasing power that allows people in wealthy countries to buy cigarettes but also to access better healthcare.
Besides making sure our calculations are accurate, as journalists we need to assess if the figures reveal the reality we are trying to report. “Be very careful with averages: they may be misleading if there are too-large intervals of values between the data,” warns Sandra Crucianelli, director of the Data Intelligence Unit at Infobae, an Argentine news site.
This happens quite frequently when reporting about salaries or other issues related to inequality. In countries with high levels of inequality, average salary figures are not representative of reality since this data fails to reflect the gap between the wealthy and the poor.
Miguel Paz, a former Nieman Foundation Fellow and founder of digital subscription agency Reveniu, advises reporters to use the median — the value that lies in the central position of an organized list of data — because it gets closer to what the majority of individuals live on. “We must stop writing in averages!” urges Paz, who has developed a variety of data journalism projects and conducted workshops on how data journalists can make mistakes even when doing the math correctly.
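A short sketch of the difference, using Python's standard `statistics` module. The salary list is hypothetical and chosen only to show how a single very high earner pulls the average away from what most people earn.

```python
from statistics import mean, median

# HYPOTHETICAL monthly salaries for ten workers; one earns far more than the rest.
salaries = [400, 450, 500, 550, 600, 650, 700, 750, 800, 20000]

average = mean(salaries)    # pulled upward by the single top earner
typical = median(salaries)  # middle of the sorted list

# How many workers earn less than the "average" salary?
below_average = sum(1 for s in salaries if s < average)

print(f"Average: {average}")
print(f"Median: {typical}")
print(f"Workers earning less than the average: {below_average} of {len(salaries)}")
```

With these invented figures, nine of the ten workers earn less than the average, while the median sits in the middle of what the group actually earns.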
Percentages and rates are also good allies when describing socio-demographic conditions. Every year, as soon as the police release crime figures in my country, Colombia, I see dozens of media reports saying things like “Medellín is the Most Dangerous City” or “Bogotá is the Worst Place to Have a Cellphone.” But these reports use only absolute values and, therefore, fail to reflect the real security situation. If the journalists working on these stories did a relative analysis, contextualizing the data by population or cellphone usage, they would find that in Bogotá, a metropolis of 8 million, or Medellín, a city of 2.5 million, the figures are not as bad as the absolute values suggest. In fact, insecurity needs more attention in other cities with higher per-capita crime rates.
Those headlines also show how generalization can cause problems. To properly explore which city is the most dangerous, a broader number of indicators is required to paint a more nuanced picture.
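The relative analysis described above can be sketched in a few lines of Python. The populations for Bogotá and Medellín are the rounded figures given in the text; the theft counts and the third city are invented purely for illustration and are not real police data.

```python
def rate_per_100k(count, population):
    """Number of incidents per 100,000 inhabitants."""
    return count * 100_000 / population

# Populations for Bogotá and Medellín come from the article (rounded).
# The theft counts and "Hypothetical smaller city" are INVENTED for this example.
cities = {
    "Bogotá": {"population": 8_000_000, "thefts": 80_000},
    "Medellín": {"population": 2_500_000, "thefts": 20_000},
    "Hypothetical smaller city": {"population": 400_000, "thefts": 6_000},
}

rates = {
    name: rate_per_100k(d["thefts"], d["population"]) for name, d in cities.items()
}

# Ranking by raw counts versus ranking by rate.
worst_by_count = max(cities, key=lambda name: cities[name]["thefts"])
worst_by_rate = max(rates, key=rates.get)

print("Highest absolute number of thefts:", worst_by_count)
print("Highest thefts per 100,000 inhabitants:", worst_by_rate)
```

In this made-up scenario the largest city tops the ranking by raw counts, but the smallest one has the highest rate per 100,000 inhabitants, which is the figure that speaks to an individual resident's risk.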
COVID-19 and the Great Data Learning Curve
Something similar has happened in COVID-19 coverage. Several reports have compared the number of infected people or deaths in different countries, but these comparisons are usually not accurate. The main reason is that the right measurement in this case is the infection rate (the number of infected people relative to the population), but a number of other factors have also made international comparisons difficult.
For example, during the first months of the outbreak, some countries didn’t include the deaths that occurred in care homes, but later did. Counting criteria also vary: some countries add a death to the national coronavirus registry only if the virus was the main cause of death or if a test confirmed the presence of the virus at the time of death, while others use less demanding criteria. It is also important to consider how long the virus has been present in each country, because that factor, along with others such as the strength of the health system, affects how each government learns to respond to the pandemic.
In relation to the coverage of the pandemic, there is another important indicator to bear in mind: the infection fatality rate, or the number of COVID-19 deaths divided by all those infected. The problem with this figure is that the number of infected people is known only through testing, and each nation has established its own diagnostic regime, some testing more than others, some changing the testing frequency as the pandemic evolves, which makes comparisons impossible. And of course, in many countries there is believed to be a huge undercount of registered cases (due to a lack of testing equipment) and of deaths (because not all deaths are registered or linked to COVID-19), which makes international comparisons between countries like the UK or the US and India difficult. In the UK, for instance, at the beginning of the pandemic, only people who were admitted to hospitals got tested, which made the death rate seem higher than it really was, since only the most serious cases were picked up by the testing regime.
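The effect of the testing regime on the death rate can be illustrated with a small Python sketch. All figures are hypothetical: the same outbreak (same infections, same deaths) is observed under two invented testing policies, and the only thing that changes is how many infections are confirmed.

```python
# HYPOTHETICAL outbreak: identical in both scenarios.
true_infections = 10_000
deaths = 100

# HYPOTHETICAL testing regimes: how many of those infections get confirmed.
confirmed_cases = {
    "hospital-only testing": 1_000,  # only the most serious cases are tested
    "broad testing": 8_000,          # most infections are detected
}

# Deaths divided by ALL infections (unknowable in practice without full testing).
true_fatality_rate = deaths / true_infections

# Deaths divided by CONFIRMED cases, which is what registries can actually report.
observed_rates = {
    regime: deaths / confirmed for regime, confirmed in confirmed_cases.items()
}

print(f"Deaths per infection (true): {true_fatality_rate:.2%}")
for regime, rate in observed_rates.items():
    print(f"Deaths per confirmed case with {regime}: {rate:.2%}")
```

Under these assumptions the disease is equally deadly in both scenarios, yet the regime that tests only hospital patients reports a rate several times higher. Comparing the two published figures would say more about testing than about the virus.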
The pandemic has proved that it is important for journalists in every field to become data literate so as to deal correctly with aggregated data, and that we should be careful not to misuse non-comparable variables. Many public officials make this kind of error, and as journalists we must learn to identify these mistakes and avoid them.
Another example, but this time from the pre-pandemic world: Colombiacheck, the first fact-checking media organization in Colombia, examined a claim from a congresswoman who said that rural land in the country was concentrated in the hands of Black and Indigenous communities. The claim caused a huge controversy, since these communities have been frequent victims of land dispossession as a result of the long-running conflict in the country. While checking her statement, reporters learned that even though official numbers did show that those two groups owned more land in total than Colombians of other ethnic groups, it was a mistake to suggest that the collective ownership deeds of those communities meant the individuals within them held greater power as landowners.
Tips for Bulletproofing Your Data Interpretation
So here is a checklist to ensure you are interpreting the data correctly before publishing:
- Always ask yourself whether the data really does relate to your investigative question. Is there enough information? Am I looking at it from the right angle? Am I asking enough questions of the data? Am I breaking it down enough to see all its important nuances? Are the variables comparable?
- Tell the story according to the level of data you have. If you only have country-level information, your findings should only address national trends or predictions. If you have data down at the individual level, then you can draw conclusions about people’s behavior or trends.
- Verify if the variables you are analyzing have a direct causal relation (one causes the other) or if there are intermediary elements that should be taken into account. You can do this by looking at the way the data was collected and processed.
- If the correlation is not causal but predictive, make sure to tell the story that way, with sentences like: “If x increases, it will be more likely that y will fall.” If the correlation is coincidental, consider discarding it.
- Be aware of what each record represents (a person, a fact, a case, a location) and describe your findings accordingly.
- When you find outliers, don’t rush to publish them. First, see if the explanation for any extremely high or low values really makes them newsworthy, if the data is flawed, or if there is an extra element that explains the outliers.
- Consider what statistical operation you will use to analyze your data: a percentage, an average, a rate, or a ratio. Your decision will depend on the characteristics of the data and on the topic.
- Talk to experts. A statistician can help you identify the type of data you are dealing with: predictive, causal, comparable or not, etc. Also, a specialist in your story’s specific area should help you see gaps, misinterpretations, missing elements, and new correlations.
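As one way to act on the outlier tip in the checklist above, the following Python sketch flags unusually high or low values so they can be checked before publication. It uses the interquartile range rule (Tukey's fences), a common rule of thumb that this article does not prescribe; the values and the 1.5 multiplier are assumptions for illustration.

```python
from statistics import quantiles

# HYPOTHETICAL values, e.g. reported incidents per district.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

# Quartiles of the data (default "exclusive" method of statistics.quantiles).
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Tukey's fences: anything beyond 1.5 * IQR from the quartiles gets flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Values to verify before publishing:", outliers)
```

A flagged value is a prompt to go back to the source, not a verdict: it may be a data-entry error, a different unit, or the real story.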
Finally, always keep in mind that the stories that are published as a result of these deep dives into the data shape the way people and governments make decisions. Data processing, and data literacy, matter. If we don’t take into account all the required factors and push conclusions without the necessary contextual analysis, we may draw attention to the wrong focus, unintentionally persuade people to take on habits that may hurt them, or produce a story that excludes a chunk of the population.
Miriam Forero Ariza is a Colombian freelance investigative and data journalist whose work has been published by VICE, Colombiacheck, and El Espectador. She has more than a decade of experience in collaborative investigations, data analysis, and visualizations. She is co-author of the Iberoamerican Data Journalism Handbook.