Data Cleaning Tools and Techniques for Non-Coders

Every country produces data, but not every country produces it in an organized manner. What matters is not just the volume of data, but how it’s standardized and structured. The messiest, most error-prone data usually comes from manual systems — processes run by humans without standardization. These systems are not only slow, but they also make verification difficult and can lead to major errors.

Even countries that produce massive amounts of data often have datasets that are inaccessible, fragmented, or lack metadata:

  • The United States produces huge volumes of data, but decentralized structures and legacy systems are common.
  • China has massive platforms, but its closed infrastructure limits data sharing.
  • India is a leading data producer, but inconsistent digitization reduces data quality.
  • Brazil has strong transparency laws, but struggles to standardize data.
  • In European countries (including Turkey), conflicting regulations sometimes create data incompatibilities.
  • Countries such as Nigeria have limited infrastructure, which restricts their data ecosystem.

For investigative journalists, this means looking beyond the content of a dataset — considering how it was produced and structured is equally important. Why should journalists care about messy data? Because, just like big companies, public institutions, and NGOs, journalists often see only part of the story. The goal is to uncover what is hidden.

In this context, investigative and data journalism require different approaches depending on the type of data. Structured data — organized, often numeric, and table-based — is ideal for analysis, comparison, and visualization. Today, however, much of the digital world consists of unstructured data: emails, social media posts, customer reviews, videos, audio files, and other irregular content.

These datasets can be treasure troves of information, but their messy nature makes deep analysis difficult unless they are cleaned and organized. Around 80% of digital data today is unstructured, posing a significant challenge for journalists: before they can conduct meaningful analyses or uncover stories, the data must first be put in order.

According to market research firm DataIntelo’s 2024 report, the global unstructured data analytics market was valued at US$7.92 billion in 2024, and it’s expected to reach US$65.45 billion by 2033. This growth is driven by the huge expansion of digital content and AI integration. However, technological advancements do not automatically make data easy to work with — the need for thorough data cleaning is greater than ever.

Even in data-rich countries such as the US or China, messy data, missing metadata, and inconsistent formats make analysis challenging. PDFs, scanned documents, non-standardized Excel files, and restricted-access databases are all examples of the data cleaning challenges journalists must tackle.

Journalists often encounter Excel files, PDFs, complex tables from open data portals, or raw social media datasets published by various institutions. These datasets are typically inconsistent, incomplete, or erroneous. Coders can handle these issues with Python, R, or SQL — but not every journalist codes. Even without coding, failing to engage deeply with data can lead to serious errors.

GIJN’s Struck by Lightning: A Quick Lesson on Cleaning Up Your Data illustrates this perfectly. Using a large dataset of lightning strikes, it highlights how minor differences in the “activity” column — like “roofing” versus “working on the roof” — can result in misclassification. The article demonstrates that visualizing data without cleaning it first can produce misleading results and stories, making data cleaning not just a technical task but an ethical responsibility for journalists.

Fortunately, there are tools and resources to make these processes easier. GIJN’s Using Pinpoint to Organize Unstructured Data explains how the Pinpoint tool helps organize unstructured datasets. Working with messy data can feel like climbing an endless mountain, but such tools make it easier to extract meaningful insights from text, documents, and files.

Quartz’s data cleaning guide provides journalists with a framework, exploring the causes of poor data quality, missing metadata, and conflicting sources, and suggesting how to achieve reliable, meaningful datasets.

These examples show that data cleaning is not merely a technical skill — it’s a fundamental step for trustworthy journalism. Below we discuss the process of data cleaning.

What Is Data Cleaning and Why Is It Important?

Data cleaning (or data wrangling) means identifying and correcting errors, filling gaps, removing duplicates, and resolving inconsistencies in a dataset. This process ensures that data is reliable for analysis and reporting.

For example, if a city’s spending table lists the same department as both “Ankara Belediyesi” and “Ank. Bld.,” calculating total expenses becomes impossible. Similarly, mixed date formats or missing rows lead to misleading results. Dirty data produces dirty stories. That’s why cleaning is one of the most critical, though invisible, steps in a journalist’s investigation.
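To see why, here is a minimal Google Sheets sketch (assuming department names in column B and amounts in column C):

=SUMIF(B:B, "Ankara Belediyesi", C:C) → Totals only the rows spelled exactly this way, silently skipping every “Ank. Bld.” entry.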

The main goal of cleaning is preparing data — deciding what datasets you need, what formats to use, which rows and columns to adjust, and documenting every step. Tracking processes, performing error checks, and maintaining documentation are all part of a sustainable workflow.

Cleaning Data Without Coding

In recent years, no-code tools have been developed to allow journalists to clean, organize, and analyze data using visual interfaces. Instead of writing complex code, these tools provide drag-and-drop features, filters, and automatic cleaning suggestions, freeing journalists to focus on storytelling rather than technical details.

Steps for Data Cleaning

Even without coding, cleaning data should follow a logical sequence:

  • Understand the Data
    Observe before cleaning.
    How many columns?
    Are there missing values?
    Are spelling and formatting consistent?
    Are dates in the same format?
  • Back Up the Original Data
    Always copy the original file before cleaning.
  • Remove Duplicates
    Many datasets contain repeated rows.
    Google Sheets: Data → Remove Duplicates
    OpenRefine: Facet → Duplicates
  • Identify and Handle Missing Values
    Detect empty cells.
    Remove rows or fill missing values logically (e.g., copy the city name from above; see the formula sketch after this list).
  • Standardize Formats
    Correct capitalization.
    Convert dates to a single format.
    Standardize currency, percentages, etc.
  • Merge Categories
    Combine similar categories written differently:
    “F,” “FEMALE,” “female” → “Female”
  • Check for Logical Consistency
    Clean data can still contain errors (e.g., birth years like 1890 or 2060).
  • Save and Document
    Save the cleaned dataset separately (e.g., city_expenses_cleaned.csv).
    Document all cleaning steps for transparency.
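Several of these steps can be done with simple spreadsheet formulas. A minimal Google Sheets sketch, assuming raw values in column A and city names in column B:

=UNIQUE(A2:A100) → Returns the values with duplicate rows removed.
=IF(B2="", C1, B2) → Placed in helper cell C2 and dragged down, repeats the last seen city name whenever a cell in column B is blank.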

Example: Cleaning a City’s Expense Data

Imagine downloading a city’s 2025 spending table in Excel with the following issues:

Date         Department          Expense Item         Amount
12/01/24     Financial Affairs   Cleaning Service     25000
13.01.2024   FINANCIAL AFFAIRS   CLEANING             25.000,00 TL
15/01/24     F.Affair            Garbage Collection   12.5
16/01/2024   Financialaffairs    CLEANING SERVICE     25,000

Problems:

  • Mixed date formats.
  • Department names are inconsistent.
  • Amounts are formatted differently.

Cleaning Steps:

  1. Convert all dates to a single format.
  2. Standardize department names (e.g., use OpenRefine’s “Cluster & Edit” to merge all variants into “Financial Affairs”).
  3. Convert all amounts to a single numeric format.

After cleaning, the data is ready for analysis: categorize expenses, calculate totals, and visualize trends.
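Step 3 can also be done with a formula. A minimal sketch, assuming a Turkish-formatted amount such as “25.000,00 TL” in cell D2 and a spreadsheet locale that uses the period as the decimal separator:

=VALUE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(D2, " TL", ""), ".", ""), ",", ".")) → Strips the currency label, drops the thousands separators, converts the decimal comma to a period, and returns the number 25000.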

Leading Tools for Data Cleaning

Below you will find accessible and practical tools for journalists, along with their advantages:

 

Google Sheets

Google Sheets is one of the easiest tools for data cleaning. In a spreadsheet environment that almost everyone is familiar with, it lets you perform powerful cleaning operations with simple formulas and filters.

Uses: Deleting duplicate rows, correcting text formats, and standardizing dates.

Example:

=TRIM(A2) → Cleans unnecessary spaces in the cell.

=PROPER(A2) → Converts text to title case, capitalizing each word.
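To standardize dates, a minimal sketch (assuming A2 already holds a value the spreadsheet recognizes as a date):

=TEXT(A2, "yyyy-mm-dd") → Displays the date in a single ISO format.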

Image: Screenshot of data cleaning in Google Sheets

The “Remove Duplicates” tool in the “Data” tab identifies repeating rows.

Advantage: Free, cloud-based, easy to share.
Disadvantage: May slow down with large data sets.

For more, see an alternative GIJN article on the subject, plus my video recordings, which are in Turkish but can be watched with subtitles:

My Data Is Dirty! Basic Spreadsheet Cleaning Functions
#2.1 Google E-tablolar İle Veri Temizleme (Data Cleaning with Google Sheets)
#3.1 Google Tablolar ile Veri Düzenleme ve Pivot Tablo Kullanımı (Data Editing and Pivot Table Use with Google Sheets)

 

OpenRefine

OpenRefine is the most widely used free data cleaning tool among data journalists. Formerly known as “Google Refine,” this open-source program can organize thousands of lines of data in seconds. I use it frequently in my classes.

  • Uses: It allows you to merge duplicate records, normalize text formats, convert columns, and more.
  • Standout feature: The “Cluster and Edit” feature automatically groups similar spellings.

Image: Screenshot

For example, you can convert records like “Istanbul,” “İstanbul,” and “Ist” into a single standard format.

Data types: CSV, TSV, Excel, JSON, XML.
Advantage: Simplifies complex cleaning tasks and provides powerful filtering.
Disadvantage: Seems technical at first setup, but is easy to learn with a few examples.

My training recording, which is in Turkish but can be watched with subtitles:

#2.2 OpenRefine ile Veri Temizleme (Data Cleaning with OpenRefine)

 

Excel Power Query

Microsoft Excel’s “Power Query” add-in provides significant convenience for traditional Excel users.

Image: Screenshot

  • Usage: It allows you to perform operations such as merging multiple files, reformatting columns, and converting text.
  • Feature: It records all operations, allowing you to automatically apply the same cleaning steps to new data.

Advantage: A natural transition for Excel users.
Disadvantage: Limited support in older versions, may require a paid license.

Learn to Automate Everything with Power Query in Excel

 

Airtable

Airtable is a hybrid between a spreadsheet and a database. Users can visually organize data, categorize it, and create linked tables.

  • Usage: Organizes source data, maintains data accuracy, and creates story-tracking tables.
  • Features: Filtering, color coding, linking (e.g., person-organization connections).

Advantages: Suitable for teamwork, aesthetically pleasing and intuitive.
Disadvantages: The free version has storage limitations.

How to set up automated data cleaning routines in Airtable

 

Trifacta Wrangler (Alteryx Cloud)

This is a powerful, enterprise-level cleaning tool. It provides AI-powered recommendations, detecting data errors automatically and offering correction options.

Usage area: Cleaning large data sets, automatic conversion.

Image: Screenshot of data cleaning graphs

Advantage: Saves time, supports complex data sources.
Disadvantage: Oriented toward paid plans; the interface is English-only.

 

Tabula (for PDF)

Tabula is a tool for liberating data tables locked inside PDF files. This is a common problem journalists face: public institutions sharing data in PDF format.

Tabula converts tables in PDF files to Excel or CSV format.

  • Use case: Extracting tables from PDFs.

Image: Screenshot

Advantage: Free, open source.
Disadvantage: Errors may occur in complex or visual PDFs.

My training recording is here: #1.2 Tabula ile PDF Dosyalarından Veri Kazıma (Scraping Data from PDF Files with Tabula)

Advanced Techniques in Code-Free Data Cleaning

Filtering and Conditional Cleaning

In Google Sheets or Excel, you can use “Conditional Formatting” to highlight abnormal values in color and quickly spot errors.
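For example, a minimal Google Sheets sketch (assuming birth years in column A; the year bounds are illustrative): select the column, open Format → Conditional formatting, choose “Custom formula is,” and enter:

=OR(A2<1900, A2>2025) → Highlights implausible values such as 1890 or 2060 for manual review.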

Formula-Based Automation

Cleaning can be automated using simple formulas instead of code:

  • =UNIQUE(A:A) → Lists each distinct value once.
  • =CLEAN(A2) → Removes invisible characters.
  • =SUBSTITUTE(A2, ",", ".") → Replaces commas with periods (e.g., to fix decimal separators).
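These formulas can also be nested. A minimal sketch, assuming messy text in A2:

=PROPER(TRIM(CLEAN(A2))) → Removes invisible characters, strips extra spaces, and title-cases the result in a single step.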

Data Validation

In Airtable or Google Sheets, you can ensure that users only enter data from specific categories. This maintains consistency over the long term.
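For example, a minimal Google Sheets sketch (the “Categories” sheet name is a hypothetical placeholder): select the column, open Data → Data validation, and supply either a dropdown list of allowed values or a custom formula such as:

=COUNTIF(Categories!A:A, B2)>0 → Accepts an entry in B2 only if it appears in your approved category list.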

Best Practices and Ethics

Data cleaning is not just technical — it’s ethical. Journalists should maintain the original meaning while ensuring accuracy and consistency.

  • Transparency: Note cleaning steps.
  • Preserve Originals: Keep raw data.
  • Reproducibility: Document steps so others can replicate your work.
  • Don’t Guess: If a value is missing, mark it as “unknown.”

Data journalism and investigative reporting are not just about technical skills. Understanding, organizing, and validating data directly affects the accuracy of your stories. New tools make cleaning accessible to journalists without coding. Think of yourself as a storyteller, not an engineer — but remember: every strong story depends on solid data. With the right tools and methods, even non-coders can clean data and turn it into reliable news.


Pinar Dag is the editor of GIJN Turkish and a lecturer at Kadir Has University. She is the co-founder of the Data Literacy Association (DLA), Data Journalism Platform Turkey, and DağMedya. She works on data literacy, open data, data visualization, and data journalism, and has been organizing workshops on these issues since 2012. She is also on the jury of the Sigma Data Journalism Awards.

