Accessibility Settings

color options

monochrome muted color dark

reading tools

isolation ruler

Photo: Pixabay



Struck by Lightning: A Quick Lesson on Cleaning up Your Data

Photo: Pixabay

Being struck by lightning is often used as an example of heavenly retribution because it is so unlikely. The occurrence is simply so rare that it might as well be due to divine intervention – only 49 deaths per year on average are reported by the US National Weather Service. But these fatalities are statistical outliers since most people struck by lightning survive.

So what is the best way to avoid becoming one of these outliers? The following is a step-by-step set of instructions for unpacking a dataset – and being careful about the conclusions we draw.

While correlation certainly does not mean causation, here are a few interesting trends that jump out. If you are male, you might do well to stay indoors during thunderstorms, as men accounted for 79 percent of lightning fatalities between 2006 to 2017.

While this gender breakdown is easy to graph, not all categories – such as activities being performed as individuals were struck – are simple to visualize. Turns out, the activity linked to the most lightning strike fatalities was also the most ubiquitous – walking. Fishing was second!

To understand how I reached that conclusion, let’s examine the raw data and how it was organized. Taken from the National Oceanic and Atmospheric Association’s website and fed into a spreadsheet, the categories I found were date, day, state, city, age, sex, location, activity, victim and year. As an aspiring environmental journalist, this was a treasure trove of weather-related data (of a very specific nature) I could poke around.

After creating a pivot table of the data, I noticed several discrepancies in the “activity” category. Descriptions were entered inconsistently, with even minor differences in word choice creating new data points — i.e., “roofing” and “working on the roof,” while clearly the same, were listed as different activities.

When these seemingly minor inconsistencies added up, they could badly skew the results. Any trend or conclusion drawn from this data would be wrong. Treasure trove it may be, but also an object lesson in the perils of large datasets and not to ever take data at face value.

Before it could be of any use, the data had to be heavily cleaned. Duplicate data points in the “location” and “activity” categories needed to be collapsed, or consolidated. The following graph was created using the unrefined data, and the results are quite different from those graphed above, giving the impression that the most dangerous activity during a lightning storm is fishing, rather than walking.

It would have been easy to assume this was correct, since I was already predisposed to assume that a water-related activity would take the top spot.

However, the filter option on Excel’s pivot table function allowed me to search out all similar activities, regardless of precise wording, and either edit the label or manually factor it into the data visualization. Whether walking your dog or walking to your car, for example, it should still be counted as “walking” for the purposes of this graph. As shown here, there were several variations on the activity “walking” and I collapsed most of those into the single heading. Left as it was, each of these would have been calculated as an individual activity.

With the adjusted data, the new graph showed that walking was actually, by and large, the most dangerous, up from 24 to 36 fatalities.

With data processing tools such as Excel, Google Sheets, Python or R at our fingertips, it’s easy to think that finding this information would be a relatively simple matter of feeding the raw data into a program and publishing the resulting graph. But it’s no surprise that most datasets require a lot of “data wrangling” before conclusions can be derived. In fact, a 2016 survey by Forbes found that data scientists spend about 80% of their time on data preparation alone. Data journalists presenting at the 2019 NICAR conference have been warning about the same thing.

This article first appeared on Storybench and is reproduced here with permission.

Veer Mudambi is a freelance reporter specializing in environmental news. He has been published on and holds a masters of journalism from Northeastern University.  

Republish our articles for free, online or in print, under a Creative Commons license.

Republish this article

Material from GIJN’s website is generally available for republication under a Creative Commons Attribution-NonCommercial 4.0 International license. Images usually are published under a different license, so we advise you to use alternatives or contact us regarding permission. Here are our full terms for republication. You must credit the author, link to the original story, and name GIJN as the first publisher. For any queries or to send us a courtesy republication note, write to

Read Next