Introducing FBarchive: A New, Searchable Repository of Facebook Whistleblower Documents

In September 2021, the Wall Street Journal began publishing a series of articles exposing the inner workings of Facebook and subsidiaries such as Instagram, including evidence that company insiders knew Instagram made teen girls’ body image issues worse and that Facebook leaders did little to curb recruitment activities of human traffickers and drug cartels.

Much of that reporting was based on a trove of documents and images leaked by former Facebook product manager Frances Haugen, who came forward publicly several weeks after the series published.

(In October 2021, Facebook Inc., the parent company, was rebranded as Meta Platforms Inc., an effort at least six months in the making that some commentators in the news media noted might have had the effect of blunting public backlash following Haugen’s leaks.)

In November 2021, Harvard Kennedy School’s Public Interest Tech Lab received an anonymous drop of information from the Haugen leak, comprising roughly 20,000 images and more than 800 internal Facebook documents dating from 2016 onward, such as chat threads and research.

As of October 18, 2023, that information is available to the public, in a searchable format, via an online tool called FBarchive. Users need to register for a free account to access the archive.

FBarchive is designed to help researchers, journalists, and policymakers understand how, why and when decisions have been made at some of the most influential social media platforms in the world. The project is led by Latanya Sweeney, a technology professor at Harvard who heads the Public Interest Tech Lab.

Sweeney says making these internal deliberations and thought processes public will help policymakers and technology researchers discover solutions to the problem of moderating content on social media platforms that billions of people use.

“We just don’t know how to do moderation at scale — we don’t have the technology, we don’t have the know-how — and that’s something that’s true on all of these platforms where we try to do moderation,” says Sweeney, a pioneer in the field of data privacy. “So, the question is, how should we do that? Can we look at these documents to see where the fault lines are and inspire new technologies, or new technological approaches?”

How to Use FBarchive

Go to fbarchive.org and hit “Enter.” This will bring you to a sign-in page. First-time visitors will receive directions to sign up for a new account via the Public Interest Tech Lab’s MyDataCan platform. Harvard-affiliated users can sign in with their university ID. All other users can click “sign up” to create a username and password.

The primary gateway to the FBarchive materials is a Boolean search bar, meaning that operators such as “and,” “or” and “not” will either broaden or restrict results. Anyone who wants to view a document in FBarchive needs to be logged in.

The search bar is useful for researchers and reporters who already have some focus on what they are interested in — for example, specific keywords or phrases related to body image, gender issues or global conflicts. Journalists and researchers can also get a general sense of what is in the archive by using broader terms — “drug cartels” or “human trafficking,” for example.
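For readers unfamiliar with Boolean operators, the short Python sketch below shows how “and,” “or” and “not” narrow or broaden a set of results. It is a toy illustration only: the four sample documents, the `matching` helper and the simple substring matching are all invented for this example and say nothing about how FBarchive actually indexes or searches its materials.

```python
# Toy illustration only: these sample documents and this matching logic are
# invented for the example and do not reflect how FBarchive works internally.
docs = {
    1: "research on teen body image and instagram",
    2: "chat thread about drug cartels recruiting",
    3: "body image survey results for facebook",
    4: "human trafficking escalation notes",
}


def matching(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text}


body_image = matching("body image")
instagram = matching("instagram")
cartels = matching("drug cartels")

# "and" restricts: documents must contain both terms.
print(sorted(body_image & instagram))  # [1]

# "or" broadens: documents may contain either term.
print(sorted(body_image | cartels))    # [1, 2, 3]

# "not" restricts: documents contain the first term but not the second.
print(sorted(body_image - instagram))  # [3]
```

In practice, this suggests starting with a broad “or” query to see what the archive holds, then adding “and” or “not” terms to cut the results down to a manageable set.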

Users can also search for information about particular people, such as executives at Meta. The FBarchive team redacted names of people who likely have an expectation of privacy — a software engineer outside of top management, for example. Names of public figures, such as C-suite executives, politicians and celebrities, are not redacted.

To help users understand what they’re reading, Sweeney and her team created a glossary of terms and phrases found in the documents. The “audience problem,” for example, is “a term used internally to describe the years-long trend of declining post numbers on Facebook,” according to the glossary.

“There’s a lot of inside Facebook language in there,” Sweeney says.

When using FBarchive, click the book icon, circled above, to see the glossary. Image: Screenshot, FBarchive

Users can bookmark particular documents and images, and create their own tags, which can be used to curate collections of images and documents. For example, a journalist reporting on how social media affects body image could collect relevant images and documents by adding a “bodyimage” tag to them.

To search for a specific topic, enter a phrase and then click the plus button to create a tag and apply it to a document you’re viewing. Image: Screenshot, FBarchive

FBarchive Story Ideas and Research Angles

FBarchive is full of unexplored investigative story ideas and scholarly research topics. To get you started, Sweeney has offered questions that need more journalistic and academic attention, including the following:

  • Is viral content more likely to increase Facebook’s revenues? How does Facebook handle this tension? Under what circumstances are the needs of human users traded for corporate revenue?
  • At least 95 countries are identified in the Facebook documents. What are the top concerns Facebook considers for people in these countries on the platform? Are the concerns and the way Facebook addresses them similar or different across countries?
  • Violence and political unrest exist around the world and are evidenced within the Facebook documents. What is the nature and extent of the role, if any, that Facebook itself plays in the proliferation of these tensions? What role could Facebook play to help reduce these tensions?

Informing Future Regulation

The stories and studies prompted by the archive, along with the content of the archive itself, could inform potential regulation.

For legislators and officials interested in regulating tech, trying to understand how Facebook functions has, so far, been like trying to see what’s going on in a “black box,” Sweeney says.

She likens FBarchive to taking an opaque case off an overheating radio and replacing it with a clear one. Everyone can now see the hot spots inside causing problems.

“I just don’t think policymakers have ever had the opportunity to understand where real leverage points were,” Sweeney says. “They always had to depend on what the tech companies themselves said was possible, not possible. And seeing the inside content gives you a much better sense of, how does this really operate?”

This post was originally published by The Journalist’s Resource and is reprinted here with permission.


Clark Merrefield, The Journalist’s Resource

Clark Merrefield joined The Journalist’s Resource in 2019 after working as a reporter for Newsweek and The Daily Beast, as a researcher and editor on three books related to the Great Recession, and as a federal government communications strategist. He has been selected for fellowships in juvenile justice and solitary confinement at the John Jay College of Criminal Justice and his work has been awarded by Investigative Reporters and Editors.
