Big Data in Need of Analytic Rigor by Journalists

by Brant Houston • April 9, 2013

Kate Crawford, a visiting professor at the MIT Center for Civic Media and a principal researcher at Microsoft Research, recently warned about failing to closely scrutinize the results of big data analysis. In a keynote speech at the Strata Conference in Santa Clara, California, she called on “data scientists” to use the methods of social science in examining data to avoid misinterpretations and wrong conclusions.

We decided to seek the thoughts and comments of award-winning journalists Jennifer LaFleur and David Donald. LaFleur is the director of computer-assisted reporting at ProPublica and for two decades has taught how to correctly use social science methods in investigative journalism. David Donald is the data editor for the Center for Public Integrity and has taught social science methods to journalists for two decades.

ProPublica’s LaFleur: The more you test your data, the better your results will be.

Do you share Crawford’s concerns and her recommendations?

Jennifer: I do. I think the more anyone can put their data through rigorous tests, the better the results will be. Social scientists approach their work in a way that bolsters it. They have to prove that their results could not be caused by something else. In a sense, they search to prove their own work wrong. All of us could benefit from that frame of mind.

David: Yes. One of her major points is summed up in this line: “Data and data sets are not objective; they are creations of human design.” When working with journalists just entering the world of computer-assisted reporting and data journalism, I try to impress upon them that there is no such thing as an “immaculate database.” Every database, somehow, somewhere, has been touched by humans. That means if we rely just on data as we’re given them and the algorithms we write, we have a great chance of making a mistake. No human interaction is infallible.

Like all scientists worthy of the name, social scientists rely on scientific method, which recognizes human limitations and, hence, data limitations. The method is where objectivity enters by helping us check our biases. It’s not perfect, but when we toss that method out the window, we do so at our own risk.

Crawford mentions the need to address weaknesses in this new big data science. I’m concerned that the emphasis in the data science movement is on the data and not the science, at least in journalism. What is encouraging is that scientists confronting big data are aware of its impact on their research and on scientific method. The book The Fourth Paradigm: Data-Intensive Scientific Discovery is a good introduction to the issues. It shows that scientists are excited about the possibilities of big data but aren’t exploring big data as wide-eyed innocents. Journalists need to make sure they aren’t, either.

Have you seen examples yourselves of analyses of big data that failed to consider the sources of the information or the limitations created by the way the data were gathered?

CPI’s Donald: Journalists need to make sure they aren’t wide-eyed innocents when it comes to data.

David: Our world of “big data” isn’t as big as some of what Crawford outlines. I’ve analyzed a database of about 1.7 terabytes of Medicare claims for stories. I don’t know if it’s the largest database a journalist has analyzed, but it’s probably one of the larger ones. I’ve seen estimates that Google processes 20 petabytes of data a day. We’re not at that scale yet. That said, without going into details, we have examples of news organizations posting databases online that at the minimum do not even alert users to the limitations in their data, however big or small the database is.

Jennifer: I think in the world of journalism, we’re still looking at pretty little data compared to what some researchers are dealing with. In newsrooms, we would regularly get studies put forth where data were not put through rigorous statistical tests. I think often journalists can avoid that by simply asking for the methodology to make sure they really understand how a study was done.
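One simple instance of the kind of statistical test LaFleur refers to is a permutation test, which asks how often a gap between two groups would appear if the group labels were assigned at random. The sketch below is illustrative only: the function name `permutation_p_value`, the sample numbers, and the default of 10,000 trials are assumptions made for this example, not a method described by either journalist.

```python
import random


def permutation_p_value(group_a, group_b, trials=10_000, seed=0):
    """Estimate how often random relabeling produces a gap in means at
    least as large as the observed one (two-sided)."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:size_a]) / size_a
        mean_b = sum(pooled[size_a:]) / (len(pooled) - size_a)
        # Small tolerance guards against floating-point rounding.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Hypothetical response times (minutes) in two neighborhoods.
north = [12, 15, 14, 16, 13, 17]
south = [11, 14, 13, 15, 12, 16]
print(permutation_p_value(north, south))
```

A large value suggests the observed gap could easily arise by chance, so it should not be reported as a finding without further evidence.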

Some remarks in her talk about using data from social media seem to hearken back to the classic book How to Lie With Statistics. Are there some other basic shortcomings you have seen in some of the data journalism being done now?

Jennifer: In some cases, what passes for “data journalism” is simply the posting of big data sets without checking them or putting them in context. As journalists, our job is to report and present information – whether it is interviews, documents or even data. Data, like any other source, will have flaws, such as missing data or data that was entered incorrectly. As journalists, we need to interview the data and make sure we account for those problems.

David: I agree with Crawford that there’s a sense that numbers from big data can’t lie, but in reality, we all can learn to lie with numbers and statistics. My bigger fear is that journalists – even some data journalists – really do not have a sense of both strengths and weaknesses in reporting with numbers and statistics from the data, however large or small they are.
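LaFleur’s advice to “interview the data” can be made concrete with a few routine checks for the flaws she names: missing values, entries that were typed in wrong, and duplicated records. The following is a minimal sketch under stated assumptions: the records, the field names, the function `interview`, and the `max_age` cutoff are all hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical records, invented for illustration only.
records = [
    {"id": "A1", "age": "34", "amount": "120.50"},
    {"id": "A2", "age": "", "amount": "87.00"},      # missing age
    {"id": "A3", "age": "212", "amount": "45.10"},   # implausible age
    {"id": "A3", "age": "212", "amount": "45.10"},   # duplicate ID
    {"id": "A4", "age": "51", "amount": "n/a"},      # non-numeric amount
]


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def interview(rows, max_age=120):
    """Collect basic data-quality problems; max_age is an assumed cutoff."""
    problems = []
    id_counts = Counter(row["id"] for row in rows)
    for position, row in enumerate(rows):
        # Flag any empty field.
        for field, value in row.items():
            if value.strip() == "":
                problems.append((position, field, "missing"))
        # Flag ages that are present but unparseable or implausible.
        age = row["age"].strip()
        if age and (not is_number(age) or not 0 <= float(age) <= max_age):
            problems.append((position, "age", "out of range"))
        # Flag amounts that are present but not numeric.
        amount = row["amount"].strip()
        if amount and not is_number(amount):
            problems.append((position, "amount", "not a number"))
        # Flag IDs that appear more than once.
        if id_counts[row["id"]] > 1:
            problems.append((position, "id", "duplicate"))
    return problems


for problem in interview(records):
    print(problem)
```

Each flagged row is a question to take back to the agency or person who supplied the data, not something to correct silently.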

Are there some recent stellar examples of journalism using social science methods you could suggest that journalists look at?

David: For about 10 years now, the National Institute for Computer-Assisted Reporting and the Knight Chair in CAR at Arizona State University have sponsored the annual Philip Meyer Journalism Award for the best use of social science methods in reporting. Look at any of the winners through the years. And I should mention that NICAR is a joint project of Investigative Reporters and Editors and the Missouri School of Journalism. It’s not a coincidence, I think, that investigative journalists and social scientists share a basic need for evidence for their work to go forward. Any of the Meyer winners will illustrate how social science methods can make the evidence more solid.

Jennifer: I would suggest going to IRE.org. Investigative Reporters and Editors has an annual contest for stories that use these techniques. They post the winning stories online. There has been great work done over the years by journalists using social science techniques, from football injuries to unsolved murders to government spending.

Both of you teach social science methods to journalists. Are there some basic tip sheets and a few books you would recommend so they can avoid embarrassing assumptions or conclusions?

David: If you’re new to data journalism and computer-assisted reporting, go to IRE’s Resource Center online and, at the Tip Sheet search page, type in “dirty data.” That should be the first dose of reality about problematic data and overreaching. Now that you’re back down to earth, read – no, study – Philip Meyer’s Precision Journalism. A world of possibility will open.

Jennifer: I think the best thing a journalist can do to avoid coming to the wrong conclusion – particularly when it comes to more complicated analysis – is to vet their work with experts. At ProPublica, we develop “white papers” on our analyses that we can then send to experts. That usually makes the analysis better. When we publish, we also provide information about how our analysis was done, including what we do and do not know from the data.

Tip sheets: I would definitely encourage folks to check out the IRE Resource Center. There are tip sheets from years of conferences on many different subjects. That is the first place I go when I start looking into a new subject.

Books: Precision Journalism by Philip Meyer – one of the heroes of data journalism.

A Mathematician Reads the Newspaper by John Allen Paulos

Numbers in the Newsroom by Sarah Cohen

Brant Houston is the Knight Chair in Investigative and Enterprise Reporting at the University of Illinois. He served for more than a decade as executive director of Investigative Reporters and Editors, and is the author of Computer-Assisted Reporting and co-author of The Investigative Reporter’s Handbook.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License

Material from GIJN’s website is generally available for republication under a Creative Commons Attribution-NonCommercial 4.0 International license. Images usually are published under a different license, so we advise you to use alternatives or contact us regarding permission. Here are our full terms for republication. You must credit the author, link to the original story, and name GIJN as the first publisher. For any queries or to send us a courtesy republication note, write to hello@gijn.org.

