
Look Inside the Open Source ‘Information Laundromat’ Tool for Examining Website Content and Metadata

I love investigating websites. I wrote a chapter about it for the most recent edition of the Verification Handbook, and I’m always looking for new tools and methods to connect sites together, identify owners, and analyze site content, infrastructure, and behavior.

The Information Laundromat is one of the newest and most interesting free website analysis tools I’ve come across. Developed by the German Marshall Fund’s Alliance for Securing Democracy (ASD), it can analyze content and metadata. ASD, with researchers at the University of Amsterdam and the Institute for Strategic Dialogue, used it in their recent report, “The Russian Propaganda Nesting Doll: How RT is Layered into the Digital Information Environment.”

The Information Laundromat can analyze two elements: the content posted to a site and the metadata used to build and run it. Here’s a rundown of how it works, based on initial testing by me and an interview with Peter Benzoni, the tool’s developer.

Peter told me that the Information Laundromat works best for lead generation: “It’s not supposed to automate your investigation.” The Information Laundromat is open source and available on the ASD’s GitHub account.

Content Similarity Analysis

Content vs. Metadata Similarity, Information Laundromat website analysis tool

Image: Screenshot, Digital Investigations

This tool analyzes a link, title, or snippet of text to identify other web properties with similar or identical content. It was useful in the ASD investigation because the researchers wanted to see which sites consistently copied from Russia Today (RT), the Russian state broadcaster. According to the research, they were able to identify sites that consistently reprinted RT content and played a role in laundering and pushing its narratives across the web.

How It Works

  • Enter the URL, title, or snippet of text you want to check.
  • The system looks across search engines, the Copyscape plagiarism checker tool, and the GDELT database to analyze and rank the similarity between your source content and other sites.
  • A results page sorts sites by the percentage of similar content to your original source.

I ran a sample search with a URL that I knew was a near carbon copy of a news article published elsewhere. The Information Laundromat correctly identified the original source of the text, giving it a 97% similarity score.
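
For a sense of what a similarity percentage represents, below is a minimal, hypothetical sketch of one way to score and rank candidate pages against a source text. This is not the Information Laundromat’s scoring code; the URLs, snippets, and simple character-matching metric are assumptions for illustration only.

```python
# Illustrative only -- not the Information Laundromat's actual scoring logic.
# Computes a rough 0-100 similarity score between a source text and candidate
# texts, then ranks candidates from most to least similar, the way the
# results page sorts sites.
from difflib import SequenceMatcher


def similarity_percent(source: str, candidate: str) -> float:
    """Return a 0-100 score based on matching character sequences."""
    return round(SequenceMatcher(None, source.lower(), candidate.lower()).ratio() * 100, 1)


# Hypothetical source snippet and candidate pages.
source_text = "The article text you are checking for copies elsewhere on the web."
candidates = {
    "https://example-site-one.test/story": "The article text you are checking for copies elsewhere on the web, lightly edited.",
    "https://example-site-two.test/post": "An unrelated post about something else entirely.",
}

ranked = sorted(
    ((url, similarity_percent(source_text, text)) for url, text in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for url, score in ranked:
    print(f"{score:5.1f}%  {url}")
```

The real tool aggregates matches from search engines, Copyscape, and GDELT rather than comparing raw strings, but the ranking idea is the same: the closer a candidate page’s text is to your source, the higher it appears in the results.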

Content similarity score checker, Information Laundromat website analysis tool

Image: Screenshot, Digital Investigations

The tool also highlights what it doesn’t do:

Content Similarity Search attempts to find similar articles or text across the open web. It does not provide evidence of where that text originated or any relationship between two entities posting two similar texts. Determination of a given text’s provenance is outside the scope of this tool.

If you get a lot of results, Peter suggested “downloading everything into an Excel and looking through it on your own a little bit with a pivot table.”

Sites with a similarity rating of 70% or higher are likely to be of most interest, according to Peter. The tool also has a batch upload option if you register on the site.

Metadata Similarity Analysis

Metadata similarity search (URLSCAN), Information Laundromat website analysis tool

Image: Screenshot, Digital Investigations

The Information Laundromat’s metadata similarity tool works best when you have a set of sites you want to analyze. It’s possible, but less effective, to use it to analyze a single site.

How It Works

  • Enter a set of domains you want to analyze for shared connections.
  • The tool scans each domain, including infrastructure such as IP addresses and source code, to extract unique indicators and determine whether there’s overlap between domains. (A rough sketch of this extraction step appears after this list.) It flags direct matches for IP addresses and also highlights if sites are hosted in the same IP range, which is a weaker connection but still potentially of note. Along with looking for unique advertising and analytics codes, the tool scans a site’s CSS file to look for similarities. Peter told me that “it has to be greater than 90% similar CSS classes” for the tool to flag it as notable. (View the tool’s full list of website indicators here.)
  • The metadata page sorts the results into two sections.
    • The first table lists which indicators are present on each site.
    • The second table identifies shared indicators across sites.
  • The tool also sorts results in each table according to the relative strength of each indicator. (I explain more in the final section of this post.)
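
As a rough illustration of the extraction step above, the sketch below fetches each domain’s homepage, pulls out two of the higher-value indicator types the tool looks for (Google Analytics and AdSense IDs), and reports which values appear on more than one domain. This is not the Information Laundromat’s own code; the regex patterns are simplified, and the function names and example domains are hypothetical.

```python
# Illustrative only -- a simplified version of the indicator-extraction idea,
# not the Information Laundromat's implementation.
import re
from collections import defaultdict

import requests

# Simplified, assumed patterns for two common tracking identifiers.
INDICATOR_PATTERNS = {
    "google_analytics": re.compile(r"\b(?:UA-\d{4,10}-\d{1,4}|G-[A-Z0-9]{6,12})\b"),
    "google_adsense": re.compile(r"\b(?:ca-)?pub-\d{10,20}\b"),
}


def extract_indicators(url: str) -> dict[str, set[str]]:
    """Fetch a page and return the indicator values found in its HTML."""
    html = requests.get(url, timeout=15).text
    return {
        name: {match.group(0) for match in pattern.finditer(html)}
        for name, pattern in INDICATOR_PATTERNS.items()
    }


def shared_indicators(domains: list[str]) -> dict[str, set[str]]:
    """Map each indicator value to the set of domains where it appears."""
    seen: defaultdict[str, set[str]] = defaultdict(set)
    for domain in domains:
        for name, values in extract_indicators(f"https://{domain}").items():
            for value in values:
                seen[f"{name}:{value}"].add(domain)
    # Keep only indicators that show up on more than one domain.
    return {indicator: sites for indicator, sites in seen.items() if len(sites) > 1}


# Hypothetical input -- replace with the domains you are investigating.
print(shared_indicators(["example.com", "example.org"]))
```

Any identifier that shows up on more than one domain is the kind of lead you would then verify in DNSlytics, SpyOnWeb, or a passive DNS platform before treating it as a connection.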

“The idea is to try and draw out anything that you can tell about the sites that you might use to link sites together,” Peter told me.

If you’re not familiar with the method of linking sites together via analytics and ad codes, you can read this basic guide and this recent post from me (read the guide first!). The Information Laundromat’s metadata module is most useful if you’re familiar with website infrastructure such as IP addresses and if you understand how to connect sites together using indicators. The risk in using this tool comes if you don’t understand the relative strengths and weaknesses of each indicator and connection. (More on that below.)

Peter said the metadata analysis tool is a great starting point for finding connections between a set of sites.

“If you have a set of sites and you would like to get a sense of the potential overlap, then this is a good way to do a quick snapshot of that, as opposed to running them manually in a bunch of other tools,” he said.

I agree that it’s potentially a good starting point if you have a set of sites you think may have connections. The Information Laundromat will give a useful overview of potential connections. Then you can take those and do a deeper dive using tools such as DNSlytics, BuiltWith, SpyOnWeb, and your favorite passive DNS platform.

While the tool works best with a group of domains, you can run a metadata search with a single URL. This is useful if you’d like the system to extract indicators such as analytics codes for you to easily search in places like DNSlytics. You can also see if the URL shares any indicators with the set of roughly 10,000 domains stored in the Information Laundromat database. The tool’s about page lists the sources.

Notably, Peter said that as of now the tool does not add user-inputted domains to the database. So if you’re searching using a set of domains that you consider sensitive, you can take some solace in the fact that the tool will not add your site(s) to the Information Laundromat dataset.

Ranking Technical Website Indicators

As noted above, it’s critical to understand the relative strengths and weaknesses of site indicators surfaced by the tool. Otherwise you risk overstating the connection between sites. Fortunately, the Information Laundromat’s documentation offers a useful breakdown of indicators.

For example, it’s a weak connection if multiple sites use WordPress as their content management system. Hundreds of millions of websites use WordPress; it’s not a useful signal on its own to connect sites together. But the connection between sites is much stronger if they all use the same Google AdSense code.

Ideally, you want to identify multiple technical indicators that connect a set of sites, and to combine that with other information to properly assess the strength of the connections.

To aid with analysis, the Information Laundromat has sorted indicators into three tiers. The results page helpfully uses color coding to point you towards strong, moderate, or weak indicators. You still need to perform your own analysis, but it’s a useful starting point.

A sample metadata search run using RT-related domains. Image: Screenshot, Digital Investigations

Here are the three indicator tiers from the Information Laundromat’s documentation. (A small illustrative sketch of how such tiers might be applied follows the list.)

    • Tier 1: These “are typically unique or highly indicative of the provenance of a website” and include “unique IDs for verification purposes and web services like Google, Yandex, etc as well as site metadata like WHOIS information and certification.”
    • Tier 2: Such indicators “offer a moderate level of certainty regarding the provenance of a website.” They “provide valuable context” and include “IPs within the same subnet, matching meta tags, and commonalities in standard and custom response headers.”
    • Tier 3: The documentation suggests using these indicators in combination with higher-tier indicators. Tier 3 includes “shared CSS classes, UUIDs, and Content Management Systems.”
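
To show how tiering can be applied in practice, here is a small, hypothetical sketch that groups shared indicators by tier, mirroring the strong/moderate/weak color coding on the results page. The tier assignments and example matches below are illustrative assumptions, not the tool’s actual weighting.

```python
# Illustrative only -- hypothetical tier assignments, not the Information
# Laundromat's actual indicator weighting.
# Tier 1 = strong, Tier 2 = moderate, Tier 3 = weak.
INDICATOR_TIERS = {
    "google_analytics_id": 1,  # unique service/verification IDs, WHOIS details
    "whois_registrant": 1,
    "ip_subnet": 2,            # same subnet, matching meta tags, shared headers
    "meta_tag": 2,
    "css_classes": 3,          # shared CSS classes, UUIDs, CMS
    "cms": 3,
}


def group_by_tier(matches: dict[str, list[str]]) -> dict[int, dict[str, list[str]]]:
    """Sort shared indicators into tiers 1-3; unknown indicators default to tier 3."""
    grouped: dict[int, dict[str, list[str]]] = {1: {}, 2: {}, 3: {}}
    for indicator, domains in matches.items():
        grouped[INDICATOR_TIERS.get(indicator, 3)][indicator] = domains
    return grouped


# Hypothetical overlap between two sites under investigation.
example_matches = {
    "google_analytics_id": ["site-a.example", "site-b.example"],
    "cms": ["site-a.example", "site-b.example"],
}
for tier, items in group_by_tier(example_matches).items():
    print(f"Tier {tier}: {items}")
```

Read this way, the example above amounts to one strong connection (a shared analytics ID) plus one weak one (the same CMS), which is exactly the kind of distinction the color coding is meant to surface.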

Editor’s Note: This post was originally published on ProPublica reporter Craig Silverman’s Digital Investigations Substack and is reprinted here with permission.


Craig Silverman is a national reporter for ProPublica, covering voting, platforms, disinformation, and online manipulation. He was previously media editor of BuzzFeed News, where he pioneered coverage of digital disinformation.

