Tips for Using the Internet Archive’s Wayback Machine in Your Next Investigation
The Internet Archive is a nonprofit library that, this year, is celebrating 25 years of advancing the mission of “universal access to all knowledge.” It is best known for the Wayback Machine — the service I currently manage — which archives and makes available much of the public web at the rate of more than 1 billion archived URLs per day.
There are many ways journalists, researchers, fact checkers, activists, and the general public access the free-to-use Wayback Machine every day. Several thousand articles have been written about us, or reference our services. In fact, in GIJN’s My Favorite Tools series wrap for 2020, several leading investigative journalists identified it as a mainstay of their work.
Following is an introduction for reporters interested in trying out the Wayback Machine for their next investigation.
If you publish an article that references a website and the owners of that site remove key pages, or the site itself, they might be lost forever if they haven’t been archived. Don’t let that happen to you!
Tens of millions of URLs are archived each day by users with the Wayback Machine’s “Save Page Now” service. Anyone can submit URLs and, if you are logged in with a free archive account, you can also ask to archive any “outlinks” — external links within the original page that you want to capture — and to have an overview report of this capture process emailed to you. Another useful feature is that you can download the captured URLs in a WACZ file and review/process it with your own tools.
Save Page Now can also automate Twitter archiving. For example, you can archive up to the 3,200 most recent Tweets from any Twitter profile by entering the profile’s URL and checking the relevant option.
Here’s the technical bit: If you have a list of URLs you want to archive, add them to “column A” of a Google Sheet and submit that via the “Save Page Now” Google Sheets service, which you can find here. Columns B, C, and D will be populated with a status code, the archived URL, and a flag indicating whether the URL has been archived by the Wayback Machine before.
Another option is to submit a single URL by emailing it to “email@example.com” and, if you add “capture outlinks” to the subject line, those will be preserved as well. Again, you will get an email report when the process is completed.
Finally, for the more technically proficient, the Wayback Machine provides an API, or programming interface, that lets you integrate archiving into your existing software workflows, or build it into new applications, to help automate your work. One example is Meedan, the San Francisco-based technology nonprofit that builds software and initiatives to strengthen global journalism, which has integrated its “Check” service with the Wayback Machine.
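For readers who want to see what such an integration might look like, here is a minimal Python sketch of a Save Page Now submission. The endpoint, the "capture_outlinks" form field, and the "LOW key:secret" authorization header reflect my understanding of the public Save Page Now documentation and should be treated as assumptions to check against the current docs. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed Save Page Now endpoint; verify against the current API documentation.
SAVE_ENDPOINT = "https://web.archive.org/save"


def build_save_request(target_url, access_key, secret_key, capture_outlinks=False):
    """Build (but do not send) a POST request asking for target_url to be archived."""
    form = {"url": target_url}
    if capture_outlinks:
        # Assumed parameter name for also capturing the page's outlinks.
        form["capture_outlinks"] = "1"
    headers = {
        "Accept": "application/json",
        # Assumed auth scheme using the keys from a free archive.org account.
        "Authorization": f"LOW {access_key}:{secret_key}",
    }
    body = urlencode(form).encode("utf-8")
    return Request(SAVE_ENDPOINT, data=body, headers=headers, method="POST")


def submit(request, timeout=60):
    """Send a prepared request and return the raw response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes it easy to inspect exactly what would be submitted before any network call is made. To send it, pass the result of build_save_request(...) to submit(...).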
Compare Changes on Different Archived Versions
Have you ever wanted to discover and display the difference between two versions of the same web page — perhaps to see how a company or individual has changed their site or adapted wording on their page? You can do that with the “Changes” feature.
To try this out, enter any archived URL into the search function on the homepage of the Wayback Machine. Then select the “Changes” option.
You will be shown a list of archived versions from various dates and times; these versions are color coded to represent the degree of change from one archived URL to the next.
Next, select any two time-stamped versions of the URL and they will be rendered side-by-side, with the text differences highlighted with blue and yellow text. This feature was used to show how a British blogger and political adviser tried to rewrite history, and is illustrated in the screenshot below.
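If you have already saved the text of two archived versions yourself, a similar comparison can be approximated offline. The sketch below uses Python's standard difflib module. It is not how the Wayback Machine's own comparison works, and the sample text is invented purely for illustration.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return (removed, added) lines between two text versions of a page."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    removed, added = [], []
    for line in diff:
        if line.startswith("- "):
            removed.append(line[2:])  # line present only in the older version
        elif line.startswith("+ "):
            added.append(line[2:])  # line present only in the newer version
    return removed, added


# Hypothetical sample text standing in for two captures of the same page.
old = "About us\nOur office is in Leeds.\nContact"
new = "About us\nOur office is in London.\nContact"
removed, added = changed_lines(old, new)
print("Removed:", removed)
print("Added:", added)
```

Unchanged lines are skipped, so the output lists only the wording that differs between the two versions.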
Deeper Archival Searches
You can use the “URLs” option of the Wayback Machine to search the sub-URLs of any captured URL using keywords and/or MIME types. You can then filter and sort the results to locate interesting captures.
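A comparable search can be scripted. The sketch below assumes the Wayback Machine's CDX server endpoint and its "matchType", "filter", "output", and "limit" parameters as I understand them from the public documentation. Treat those names as assumptions to verify, and the helper functions as hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed CDX server endpoint; verify against the current documentation.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_prefix, mime_type=None, limit=50):
    """Build a query URL listing captures whose URL starts with url_prefix."""
    params = [
        ("url", url_prefix),
        ("matchType", "prefix"),  # match every captured URL under the prefix
        ("output", "json"),
        ("limit", str(limit)),
    ]
    if mime_type:
        params.append(("filter", f"mimetype:{mime_type}"))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(body):
    """Turn a JSON response (header row, then data rows) into a list of dicts."""
    rows = json.loads(body)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]
```

For example, build_cdx_query("example.com/reports/", mime_type="application/pdf", limit=5) produces a URL that asks for up to five PDF captures under that path. The response body can then be passed to parse_cdx_json.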
Specific files and collections of websites have been indexed by our engineers and the Wayback Machine offers a full text-search interface for them. Check out “Collection Search” at the bottom of the Wayback Machine homepage. Highlights include lost websites such as poetry.com, Russian Independent Media and a collection of 749M PDFs. Another place where you can see the services available for collections is the Internet Archive home page. If you would like us to index specific collections of archived material (e.g. matching various URL patterns) please reach out to us at firstname.lastname@example.org.
Using APIs with the Wayback Machine
In addition to an API to support archiving via the “Save Page Now” service, there are also APIs that can be used to query the Wayback Machine to see if specific URLs have been archived. You can read more about them here.
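As an illustration of such a query, the following Python sketch builds a request for the availability endpoint and reads the closest snapshot from its JSON reply. The endpoint and the "archived_snapshots" / "closest" response fields reflect my understanding of the public documentation and are assumptions to verify. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed availability endpoint; verify against the current documentation.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(url, timestamp=None):
    """Build a query asking whether url has an archived snapshot.

    timestamp is an optional YYYYMMDD[hhmmss] string; the service is assumed
    to return the snapshot closest to it.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(body):
    """Return the closest snapshot's URL from a JSON response, or None."""
    data = json.loads(body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

Fetching the URL from build_availability_url with any HTTP client and passing the response text to closest_snapshot gives either a playback URL or None when no capture exists.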
Like most of its services, the Wayback Machine does not put formal caps on how frequently its APIs can be used. However, it may occasionally implement throttling measures. If you encounter any issues related to the use of the Wayback Machine, send us an email or DM us on Twitter; supporting journalists is a high priority for us.
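Because throttling can happen, scripts that call these APIs in bulk may benefit from retrying politely. The sketch below shows one generic approach, exponential backoff. It is not an official recommendation, and the Throttled exception and call_with_backoff helper are hypothetical names: your own fetch function would raise Throttled when it detects that the server has asked it to slow down.

```python
import time


class Throttled(Exception):
    """Raised by the caller's fetch function when the server asks it to slow down."""


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), waiting 1x, 2x, 4x... base_delay seconds between throttled tries."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * (2 ** attempt))
```

The sleep argument is injectable so the waiting behavior can be tested without actually pausing.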
Adding Context to Archived Pages
We recognize that context and provenance are vital for a more complete understanding of any archive. With that in mind we have started to add context banners to help patrons better understand our archived resources. These types of banners might be used when an archived web page has been removed or when the page has been written about by a known research organization.
The provenance of each of the archived URLs that make up a web page can be critical to an understanding of that page. For example, were certain images on an archived web page captured at the same time and date as other elements on the page? You can see that information by clicking on the “About this capture” link at the top-right of every archived URL playback page.
The care and attention we have paid to the integrity of our archives, and the transparency we have brought to their provenance over the years, have contributed to the overall confidence people have in the Wayback Machine, which is why evidence stored on the Wayback Machine has been accepted by multiple courts worldwide.
If you would like us to consider adding context to archives that you have created with our “Save Page Now” feature, please contact us.
As you might expect, we have browser extensions available for Safari, Firefox, and Chrome as well as native mobile apps for iOS and Android. And, as a special treat, we partnered with Brave to build native 404 (and other error condition) detection right into its web browser, so Wayback Machine support is available as you navigate the web.
Above all else, please know that support for the Internet Archive and the Wayback Machine is just an email or Twitter DM away. Please share your questions, requests, bug reports, and success stories. We especially want to hear what you don’t like about our services, or what features you think we should improve on, or add. That way we can work to do a better job supporting journalists’ needs and desires.
But Wait! There’s More…
In addition to archiving much of the public web, the Internet Archive preserves and makes available other collections of materials, including more than 25 million open access scholarly papers through our Internet Archive Scholar service; nearly 30 million ebooks and texts that can be previewed, borrowed, or downloaded; and millions of hours of archived TV news (dozens of stations for the better part of 10 years) that are searchable via full-text indexing of associated closed captions.
Mark Graham has managed the Wayback Machine for more than five years. Prior to that, he was a senior vice president with NBC News Digital. Graham also helped run the first US-Soviet email service; started a project to build the first web-based interface for an online discussion system; and helped run iVillage, an early online service for women.