The Internet Archive and its Wayback Machine are invaluable tools for investigative journalists. Image: Shutterstock

Tips for Using the Internet Archive’s Wayback Machine in Your Next Investigation

by Mark Graham • May 5, 2021

The Internet Archive is a nonprofit library that, this year, is celebrating 25 years of advancing the mission of “universal access to all knowledge.” It is best known for the Wayback Machine — the service I currently manage — which archives and makes available much of the public web at the rate of more than 1 billion archived URLs per day.

There are many ways journalists, researchers, fact checkers, activists, and the general public access the free-to-use Wayback Machine every day. Several thousand articles have been written about us, or reference our services. In fact, in GIJN’s My Favorite Tools series wrap for 2020, several leading investigative journalists identified it as a mainstay of their work.

Following is an introduction for reporters interested in trying out the Wayback Machine for their next investigation.

Archiving URLs

If you publish an article that references a website, and the site’s owners later remove key pages or take down the site itself, that material might be lost forever unless it has been archived. Don’t let that happen to you!

Tens of millions of URLs are archived each day by users with the Wayback Machine’s “Save Page Now” service. Anyone can submit URLs and, if you are logged in with a free archive account, you can also ask to archive any “outlinks” — external links within the original page that you want to capture — and to have an overview report of this capture process emailed to you. Another useful feature is that you can download the captured URLs in a WACZ file and review/process it with your own tools.

Save Page Now can do a lot of automated Twitter archiving. For example, you can easily archive up to the 3,200 most recent Tweets from any Twitter profile if you insert its URL and check the relevant option.

Here’s the technical bit: If you have a list of URLs you want to archive, add them to “column A” of a Google Sheet and submit that via the “Save Page Now” Google Sheets service, which you can find at https://archive.org/services/wayback-gsheets/. Columns B, C, and D will be populated with a status code, the archived URL, and a flag showing whether the URL has been archived by the Wayback Machine before.
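As a rough illustration of preparing such a list, the Python sketch below cleans a batch of URLs and writes them as a single-column CSV that could be imported into column A of a sheet. The function name and the decision to drop blanks and duplicates are assumptions made for illustration; they are not part of the Google Sheets service itself.

```python
import csv
import io


def urls_to_column_a(urls):
    """Return CSV text with one cleaned, de-duplicated URL per row."""
    seen = set()
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for raw in urls:
        url = raw.strip()
        # Skip empty entries and URLs that were already written.
        if not url or url in seen:
            continue
        seen.add(url)
        writer.writerow([url])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = ["https://example.com/a ", "https://example.com/a", "", "https://example.com/b"]
    print(urls_to_column_a(sample), end="")
```

The resulting text can be saved as a .csv file and imported into a new sheet, which keeps the URLs in the first column.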

Another option is to submit a single URL by emailing it to “spn@archive.org” and, if you add “capture outlinks” to the subject line, those will be preserved as well. Again, you will get an email report when the process is completed.

Finally, for the more technically proficient, the Wayback Machine provides an API, or programming interface, that lets you integrate archiving into your existing software workflows or build it into new applications to help automate your work. An example of this is how Meedan, the San Francisco-based technology nonprofit that builds software and initiatives to strengthen global journalism, has integrated its “Check” service with the Wayback Machine.
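To give a feel for what such an integration might look like, here is a minimal Python sketch that only builds a Save Page Now request without sending it. The endpoint, the "LOW key:secret" authorization format, and the "capture_outlinks" field are assumptions based on the public Save Page Now API documentation and should be checked against it; the keys shown are placeholders.

```python
import urllib.parse
import urllib.request

# Assumed endpoint for the Save Page Now API; verify against the API docs.
SPN_ENDPOINT = "https://web.archive.org/save"


def build_save_request(url, access_key, secret_key, capture_outlinks=False):
    """Build (but do not send) a POST request asking for a URL to be archived."""
    fields = {"url": url}
    if capture_outlinks:
        # Assumed flag name for also capturing the page's outlinks.
        fields["capture_outlinks"] = "1"
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        SPN_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            # Assumed authorization scheme using archive.org account keys.
            "Authorization": f"LOW {access_key}:{secret_key}",
        },
    )


if __name__ == "__main__":
    req = build_save_request("https://example.com/page", "MY_KEY", "MY_SECRET", True)
    print(req.get_method(), req.full_url, req.data)
```

Sending the request would be a matter of passing it to urllib.request.urlopen; that step is left out so the sketch runs without credentials or network access.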

Compare Changes on Different Archived Versions

Have you ever wanted to discover and display the difference between two versions of the same web page — perhaps to see how a company or individual has changed their site or adapted wording on their page? You can do that with the “Changes” feature.

To try this out, enter any archived URL into the search function on the homepage of the Wayback Machine. Then select the “Changes” option.

You will be shown a list of archived versions from various dates and times, color coded to show the degree of change from one archived version to the next.

Next, select any two time-stamped versions of the URL and they will be rendered side by side, with the text differences highlighted in blue and yellow. This feature was used to show how a British blogger and political adviser tried to rewrite history, and is illustrated in the screenshot below.

The Wayback Machine’s “Changes” feature captured how Dominic Cummings, the former chief adviser to the British prime minister, made stealth additions (in blue, right) to his original blog post (left). Image: Screenshot
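If you have already saved the text of two captures yourself, a similar comparison can be approximated offline. The sketch below uses Python’s standard difflib module to list lines that were added or removed between two versions. It illustrates the general idea only and is not how the “Changes” feature is implemented; the function name and sample text are made up.

```python
import difflib


def text_changes(old_text, new_text):
    """Return (added, removed) lines between two versions of a page's text."""
    added, removed = [], []
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    for line in diff:
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return added, removed


if __name__ == "__main__":
    earlier = "First paragraph.\nSecond paragraph."
    later = "First paragraph.\nSecond paragraph.\nA paragraph added later."
    added, removed = text_changes(earlier, later)
    print("Added:", added)
    print("Removed:", removed)
```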

Deeper Archival Searches

You can use the URLs option of the Wayback Machine to search sub-URLs of any captured URL using keywords and/or mime-types. You can easily filter and sort the results to locate interesting captures.
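A comparable prefix and mime-type search can also be scripted. The Python sketch below builds a query URL for the Wayback Machine’s CDX server and turns its JSON rows into dictionaries. The endpoint and the parameter names (matchType, filter, output, limit) are given as understood from the CDX server documentation and should be treated as assumptions; no request is sent here, and the sample rows are invented.

```python
from urllib.parse import urlencode

# Assumed CDX server endpoint; verify against the CDX documentation.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_prefix, mimetype=None, limit=50):
    """Build a CDX query URL listing captures under a URL prefix."""
    params = [
        ("url", url_prefix),
        ("matchType", "prefix"),
        ("output", "json"),
        ("limit", str(limit)),
    ]
    if mimetype:
        # Restrict results to captures with this mime-type.
        params.append(("filter", f"mimetype:{mimetype}"))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def rows_to_dicts(rows):
    """Convert JSON output (a header row plus data rows) into dictionaries."""
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]


if __name__ == "__main__":
    print(build_cdx_query("agriculture.house.gov", "application/pdf", 10))
    sample = [["timestamp", "original"], ["20200405061401", "https://example.com/a.pdf"]]
    print(rows_to_dicts(sample))
```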

Specific files and collections of websites have been indexed by our engineers and the Wayback Machine offers a full text-search interface for them. Check out “Collection Search” at the bottom of the Wayback Machine homepage. Highlights include lost websites such as poetry.com, Russian Independent Media and a collection of 749M PDFs. Another place where you can see the services available for collections is the Internet Archive home page. If you would like us to index specific collections of archived material (e.g. matching various URL patterns) please reach out to us at info@archive.org.

Using APIs with the Wayback Machine

In addition to an API to support archiving via the “Save Page Now” service, there are also APIs that can be used to query the Wayback Machine to see if specific URLs have been archived. You can read more about them at https://archive.org/help/wayback_api.php.
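As a small illustration of such a query, the Python sketch below builds an availability lookup URL and extracts the closest capture from a JSON response. The endpoint and the response layout ("archived_snapshots", then "closest") follow the availability API as described in the Wayback Machine API documentation, but treat the details as assumptions to verify; the sample response is invented and no request is sent.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(url, timestamp=None):
    """Build a lookup URL; timestamp is an optional YYYYMMDDhhmmss string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the closest archived URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


if __name__ == "__main__":
    print(availability_url("example.com", "20200101"))
    sample = json.dumps({
        "archived_snapshots": {
            "closest": {
                "available": True,
                "url": "http://web.archive.org/web/20200101000000/http://example.com/",
                "timestamp": "20200101000000",
                "status": "200",
            }
        }
    })
    print(closest_snapshot(sample))
```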

Like most of its services, the Wayback Machine does not put formal caps on how frequently its APIs can be used. However, it may occasionally implement throttling measures. If you encounter any issues related to the use of the Wayback Machine, send us an email or DM us on Twitter; supporting journalists is a high priority for us.
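Because throttling can happen, scripts that call these APIs in bulk may want to slow down and retry rather than fail outright. The Python sketch below shows one common pattern, exponential backoff. The function name, the retry count, and the delays are illustrative assumptions, not values recommended by the Wayback Machine.

```python
import time


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on network errors with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            # urllib's URLError and HTTPError are subclasses of OSError.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_fetch():
        # Stand-in for a real request: fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise OSError("temporarily throttled")
        return "ok"

    print(call_with_backoff(flaky_fetch, base_delay=0.01))
```

In real use, fetch would wrap whatever request the script makes, for example a lambda that calls urllib.request.urlopen on a query URL.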

Adding Context to Archived Pages

We recognize that context and provenance are vital for a more complete understanding of any archive. With that in mind we have started to add context banners to help patrons better understand our archived resources. These types of banners might be used when an archived web page has been removed or when the page has been written about by a known research organization.

The Wayback Machine includes yellow headers that link to external uses of archived pages, and features an “About this capture” tab that provides additional historical context about the page. Image: Screenshot

The provenance of each of the archived URLs that make up a web page can be critical to an understanding of that page. For example, were certain images on an archived web page captured at the same time and date as other elements on the page? You can see that information by clicking on the “About this capture” link at the top-right of every archived URL playback page.
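The same question can be checked by hand from the archived URLs themselves, since each one carries a 14-digit capture timestamp after "/web/". The Python sketch below parses those timestamps and reports the gap between a page and one of its embedded resources. The helper names and the example URLs are hypothetical, and the assumption that a resource URL may carry a suffix such as "im_" after the timestamp should be verified against real captures.

```python
import re
from datetime import datetime

# Matches the 14-digit YYYYMMDDhhmmss timestamp in an archived URL.
CAPTURE_RE = re.compile(r"/web/(\d{14})")


def capture_time(archived_url):
    """Return the capture time encoded in an archived URL, or None."""
    match = CAPTURE_RE.search(archived_url)
    if not match:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def capture_gap(page_url, resource_url):
    """Return the absolute time difference between two captures."""
    page_time = capture_time(page_url)
    resource_time = capture_time(resource_url)
    if page_time is None or resource_time is None:
        raise ValueError("both URLs must contain a capture timestamp")
    return abs(page_time - resource_time)


if __name__ == "__main__":
    page = "http://web.archive.org/web/20200405061401/https://example.com/"
    image = "http://web.archive.org/web/20200405061501im_/https://example.com/logo.png"
    print(capture_gap(page, image))
```

A large gap suggests that the image shown in playback may not be the one that appeared on the page at the time of the main capture.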

The care and attention we have paid to the integrity of our archives over the years, and the transparency we bring to their provenance, have contributed to the overall confidence people have in the Wayback Machine, which is why evidence stored on the Wayback Machine has been accepted by multiple courts worldwide.

If you would like us to consider adding context to archives that you have created with our “Save Page Now” feature, please contact us.

Browser Extensions

As you might expect, we have browser extensions available for Safari, Firefox, and Chrome as well as native mobile apps for iOS and Android. And, as a special treat, we partnered with Brave to build native 404 (and other error condition) detection right into its web browser for super-easy Wayback Machine support of web navigation experiences.

Above all else, please know that support for the Internet Archive and the Wayback Machine is just an email or Twitter DM away. Please share your questions, requests, bug reports, and success stories. We especially want to hear what you don’t like about our services, or what features you think we should improve on, or add. That way we can work to do a better job supporting journalists’ needs and desires.

But Wait! There’s More…

In addition to archiving much of the public web, the Internet Archive preserves and makes available other collections of materials, including more than 25 million open access scholarly papers through our Internet Archive Scholar service; nearly 30 million ebooks and texts that can be previewed, borrowed, or downloaded; and millions of hours of archived TV news (dozens of stations for the better part of 10 years) that are searchable via full-text indexing of associated closed captions.

To keep up-to-date on the projects and services of the Internet Archive, and the Wayback Machine, please follow us on Twitter @internetarchive and @waybackmachine and read our blog posts.

Additional Resources

What is the Internet Archive and What Can I Find on It?

How to Use the Internet Archive’s Wayback Machine

Using Archive.org for OSINT Investigations

GIJN Webinar: Using Open Source Info to Report from Home

GIJN Resource Center: Online Research Tools

Mark Graham has managed the Wayback Machine for more than five years. Prior to that, he was a senior vice president with NBC News Digital. Graham also helped run the first US-Soviet email service; started a project to build the first web-based interface for an online discussion system; and helped run iVillage, an early online service for women.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License
