Big Data in Need of Analytic Rigor by Journalists
Kate Crawford, a visiting professor at the MIT Center for Civic Media and a principal researcher at Microsoft Research, recently warned about the risks of failing to closely scrutinize the results of big data analysis. In a keynote speech at the Strata Conference in Santa Clara, California, she called on “data scientists” to use the methods of social science in examining data to avoid misinterpretations and wrong conclusions.
We decided to seek the thoughts and comments of award-winning journalists Jennifer LaFleur and David Donald. LaFleur is the director of computer-assisted reporting at ProPublica, and Donald is the data editor for the Center for Public Integrity. Each has spent two decades teaching journalists how to use social science methods correctly in investigative reporting.
Do you share Crawford’s concerns and her recommendations?
Jennifer: I do. I think the more anyone can put their data through rigorous tests, the better the results will be. Social scientists approach their work in a way that bolsters it. They have to prove that their results could not be caused by something else. In a sense, they search to prove their own work wrong. All of us could benefit from that frame of mind.
David: Yes. One of her major points is summed up: “Data and data sets are not objective; they are creations of human design.” When working with journalists just entering the world of computer-assisted reporting and data journalism, I try to impress upon them that there is no such thing as an “immaculate database.” Every database somehow somewhere has been touched by humans. That means if we rely just on data as we’re given them and the algorithms we write, we have a great chance of making a mistake. No human interaction is infallible.
Like all scientists worthy of the name, social scientists rely on the scientific method, which recognizes human limitations and, hence, data limitations. The method is where objectivity enters by helping us check our biases. It’s not perfect, but when we toss that method out the window, we do so at our own risk.
Crawford mentions the need to address weaknesses in this new big data science. I’m concerned that the emphasis in the data science movement is on the data and not the science, at least in journalism. What is encouraging is that scientists confronting big data are aware of the impact of big data on their research and on scientific method. The book The Fourth Paradigm: Data-Intensive Scientific Discovery is a good introduction to the issues. It shows that scientists are excited about the possibilities of big data but aren’t exploring big data as wide-eyed innocents. Journalists need to make sure they aren’t either.
Have you seen examples yourselves of analyses of big data that failed to consider the sources of the information or the limitations created by the way the data were gathered?
David: Our world of “big data” isn’t as big as some of what Crawford outlines. I’ve analyzed a database of about 1.7 terabytes of Medicare claims for stories. I don’t know if it’s the largest database a journalist has analyzed, but it’s probably one of the larger ones. I’ve seen estimates that Google processes 20 petabytes of data a day. We’re not at that scale yet. That said, without going into details, we have examples of news organizations posting databases online that, at a minimum, do not even alert users to the limitations in their data, however big or small the database is.
Jennifer: I think in the world of journalism, we’re still working with pretty small data compared to what some researchers are dealing with. In newsrooms, we would regularly receive studies in which the data had not been put through rigorous statistical tests. I think journalists can often avoid that by simply asking for the methodology to make sure they really understand how a study was done.
Some remarks in her talk about using data from social media seem to hearken back to the classic book How to Lie With Statistics. Are there some other basic shortcomings you have seen in some of the data journalism being done now?
Jennifer: In some cases, what passes for “data journalism” is simply the posting of big data sets without checking them or putting them in context. As journalists, our job is to report and present information – whether it is interviews, documents or even data. Data, like any other source, will have flaws, such as missing values or data that were entered incorrectly. As journalists, we need to interview the data and make sure we account for those problems.
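To make that concrete, here is a minimal sketch of what “interviewing the data” can look like in practice, written in Python with the pandas library. The file name, column names and checks are hypothetical, not drawn from any dataset mentioned in this interview; they cover the kinds of flaws LaFleur mentions – missing data and incorrect entries – along with two closely related problems, duplicate rows and inconsistent spellings.

# A rough first "interview" of a dataset with pandas: look for missing,
# duplicated, implausible and inconsistently entered values before
# drawing any conclusions. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("contracts.csv")

# What share of each column is missing?
print(df.isna().mean().sort_values(ascending=False))

# Are there exact duplicate rows that could inflate counts?
print("duplicate rows:", df.duplicated().sum())

# Do key fields contain impossible values, such as negative dollar amounts?
print("negative amounts:", (df["amount"] < 0).sum())

# How many spellings does a supposedly categorical field have?
# Inconsistent data entry usually shows up here.
print(df["agency"].str.strip().str.upper().value_counts().head(20))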
David: I agree with Crawford that there’s a sense that numbers from big data can’t lie, but in reality, we all can learn to lie with numbers and statistics. My bigger fear is that journalists – even some data journalists – really do not have a sense of both the strengths and the weaknesses of reporting with numbers and statistics drawn from data, however large or small the data are.
Are there some recent stellar examples of journalism using social science methods you could suggest that journalists look at?
David: For about 10 years now, the National Institute for Computer-Assisted Reporting and the Knight Chair in CAR at Arizona State University have sponsored the annual Philip Meyer Journalism Award for the best use of social science methods in reporting. Look at any of the winners through the years. And I should mention that NICAR is a joint project of Investigative Reporters and Editors and the Missouri School of Journalism. It’s not a coincidence, I think, that investigative journalists and social scientists share a basic need for evidence for their work to go forward. Any of the Meyer winners will illustrate how social science methods can make the evidence more solid.
Jennifer: I would suggest going to IRE.org. Investigative Reporters and Editors has an annual contest for stories that use these techniques. They post the winning stories online. There has been great work done over the years by journalists using social science techniques, from football injuries to unsolved murders to government spending.
Both of you teach social science methods to journalists. Are there some basic tip sheets and a few books you would recommend so they can avoid embarrassing assumptions or conclusions?
David: If you’re new to data journalism and computer-assisted reporting, go to IRE’s Resource Center online and at the Tip Sheet search page type in “dirty data.” That should be the first dose of reality about problematic data and overreaching. Now that you’re back down to earth, read – no, study – Philip Meyer’s Precision Journalism. A world of possibility will open.
Jennifer: I think the best thing a journalist can do to avoid coming to the wrong conclusion – particularly when it comes to more complicated analysis – is to vet their work with experts. At ProPublica, we develop “white papers” on our analyses that we can then send to experts. That usually makes the analysis better. When we publish, we also provide information about how our analysis was done, including what we do and do not know from the data.
Tip sheets: I would definitely encourage folks to check out the IRE Resource Center. There are tip sheets from years of conferences on many different subjects. That is the first place I go when I start looking into a new subject.
Books: Precision Journalism by Philip Meyer – one of the heroes of data journalism.
A Mathematician Reads the Newspaper by John Allen Paulos
Numbers in the Newsroom by Sarah Cohen
Brant Houston is the Knight Chair in Investigative and Enterprise Reporting at the University of Illinois. He served for more than a decade as executive director of Investigative Reporters and Editors, and is the author of Computer-Assisted Reporting and co-author of The Investigative Reporter’s Handbook.