
Bloomberg Technology investigative data journalist Leon Yin has trained his reporting focus on how major social networks' algorithms function. Image: Screenshot, In Old News, YouTube


Unlocking the Secrets of Algorithms: Q&A with Investigative Data Journalist Leon Yin

Leon Yin is an investigative data journalist recognized for his work on how technology impacts society. He worked on the investigation that revealed Google didn’t allow advertisers to use racial justice terms like Black Lives Matter while allowing the term White Lives Matter. Another investigation found that Facebook hadn’t been transparent about the popularity of right-wing content on its platform. And he worked on a project showing that Google search prioritizes the company’s own products over competitors’ in top search results. That work was cited in the July 2020 antitrust hearings held by a US Congressional subcommittee.

When we interviewed Leon at the International Journalism Festival, he was working at The Markup, a nonprofit newsroom focused on being the watchdog of the tech industry. He has since joined Bloomberg Technology, where he’s continuing his investigative work.

In addition to his articles, Leon publishes resources that explain how he worked on his investigations. Among the resources is Inspect Element, a guide he put together on how to conduct investigations into algorithms, build datasets, and more.

In Old News: Are you usually working on several projects at once? How do you balance that with learning new skills?

Leon Yin: I tend to like to work on two long-term stories at once, so if something slows down, I can move to something else, either another long-term project or something with a quicker churn, just to keep moving. Because it’s easy to stagnate sometimes. I think that a lot of times with these investigations, you really just need time, and you need review and feedback from others.

ION: One of your stories was mentioned during a congressional hearing. Could you walk us through how that happened, and is that an outcome you were expecting?

LY: The way that started was my reporting partner, Adrianne Jeffries — she’s been reporting on Google for ages and she had done a lot on the featured snippet, which is the text box that’s been scraped from the web and summarizes something. She’s really fascinated by [that].

Example of a featured snippet in Google search. Image: Screenshot, Google

We also developed this way of actually staining the web page and measuring the real estate, because even though we could tell what was Google’s, we still needed to measure something. We ended up finding a way to selectively create bounding boxes around different elements of the page, some that were Google’s and some that linked out, color each a specific color, then take the sum of that color divided by the total real estate of the page. It ended up being roughly 42% of the first page that went to Google.

Image of the web staining technique Leon and his team used to investigate Google search results. Image: Screenshot, leonyin.org
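The team’s published methodology documents the real pipeline; purely as a rough illustration, the sketch below shows one way such a real-estate measurement could work, using Playwright to collect bounding boxes and Pillow to paint them onto a blank canvas so overlapping boxes are only counted once. The CSS selectors and the choice of libraries are hypothetical stand-ins, not The Markup’s actual code.

```python
# A rough sketch of the "staining" idea, not The Markup's actual code.
# Assumptions: Playwright for rendering, Pillow for counting, and
# placeholder CSS selectors standing in for Google-owned result modules.
from PIL import Image, ImageDraw
from playwright.sync_api import sync_playwright

GOOGLE_SELECTORS = ["#appbar", ".kp-wholepage"]  # hypothetical placeholders

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.google.com/search?q=example")
    width = page.evaluate("document.documentElement.scrollWidth")
    height = page.evaluate("document.documentElement.scrollHeight")
    boxes = []
    for selector in GOOGLE_SELECTORS:
        for element in page.query_selector_all(selector):
            box = element.bounding_box()  # {x, y, width, height} or None
            if box:
                boxes.append(box)
    browser.close()

# "Stain" each bounding box onto a blank 1-bit canvas; overlaps count once.
canvas = Image.new("1", (int(width), int(height)), 0)
draw = ImageDraw.Draw(canvas)
for b in boxes:
    draw.rectangle(
        [b["x"], b["y"], b["x"] + b["width"], b["y"] + b["height"]], fill=1
    )

stained = sum(1 for px in canvas.getdata() if px)
total = canvas.width * canvas.height
print(f"Share of first-page real estate stained: {stained / total:.1%}")
```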

That finally gave us a hard number for something that everyone kind of knows in theory. Google self-preferences: it owns the bakery, it makes the bread, of course it’s going to say, “check out our rolls,” right? But to actually find that number was really powerful. And so, I love doing that thing where there’s common knowledge but no evidence, right?

We go in and fill that void. So when that came out, there was also the Congressional hearing about antitrust, where all of the big tech CEOs were being questioned by Congress, and in the opening remarks, I think [then-US Congressman] David Cicilline brought up our work while questioning Google. And it was just unbelievable. We’re like, my God, how’d this happen? So sometimes our work has this impact and we have no idea it’s going to happen, or the timing just works out quite well. I don’t know if we went to his staff and shared those documents. I don’t know if we always do that. I don’t think we did in that case.

A very similar thing happened with our investigation of Amazon, which was a logical follow-up. How could we do this for another big tech platform that self-preferences?

We looked at Amazon private label products and popular Amazon searches, and within days of publication, which was pretty much a year later, [Congress] sent a letter to Amazon leadership [saying]: Under oath, you said that Amazon doesn’t preference its products, but this study says directly the opposite of that. Please explain.

It’s just amazing. When we pitch stories, we always think about that accountability angle. Like, what is the thing that’s being broken? Who’s being impacted? Somebody lied under oath and we can prove that. Or a law is being broken. So I feel that’s the way we frame our stories. In this case, it just so happened to be something that an executive had said under oath.

We didn’t even know that at the time, but it happened. So that’s some of the impact we’ve had.

ION: How do you figure out if a resource-intensive story is viable?

LY: One thing that helps me and my editor determine if the story’s viable is the kind of checklist I mentioned before. One of the questions right off the bat is: can you test something with data? How would you get the data, and how difficult is that going to be? What do you have to classify or categorize? We always end up having to ascribe a value to data, because oftentimes the output variable that we’re looking for — like what’s Google’s, what’s Amazon’s — isn’t a neat column. We have to figure out what it is. So not only do we have to figure out what it is, we have to figure out what that dataset is, what that universe looks like. Where do you stop? What’s enough? What’s a quick test to see if something’s viable?

I’m always thinking about these questions and trying to fill them out and approach my editor with as many as possible before we go fully in. Another thing that we do is a quick test: what’s the minimum viable analysis to prove that something is measurable, that something’s afoot? To see if a pattern is there.

That’s actually how a recent story on internet disparities in the United States started, where we were trying to reproduce an academic article that had looked at a ton of different internet service providers in nine states. To do that, we looked at one provider in one small city. We found there were pockets of fast speeds and slow speeds, which seemed kind of unfair, but also the prices were exactly the same whether you were getting fast speeds or slow speeds. And we joined in census data and found that it was really skewed. Lower-income people were getting slow speeds more often, at the same price.
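The story’s methodology describes the actual analysis; purely as an illustration of that kind of join, here is a minimal sketch with pandas. The file names, column names, and the 25 Mbps cutoff are invented for the example, not taken from the real data.

```python
# Illustrative sketch of joining ISP speed offers to census income data.
# All file names, columns, and thresholds here are hypothetical.
import pandas as pd

# One row per address: census block group, offered speed, monthly price.
offers = pd.read_csv("isp_offers.csv")
# One row per block group: median household income from the census.
census = pd.read_csv("census_income.csv")

df = offers.merge(census, on="block_group", how="inner")
df["slow"] = df["speed_mbps"] < 25  # an example cutoff for "slow" service

# Compare the share of slow offers, and the average price, across
# income quartiles to see whether the pattern is skewed.
df["income_quartile"] = pd.qcut(
    df["median_income"], 4, labels=["Q1", "Q2", "Q3", "Q4"]
)
print(df.groupby("income_quartile", observed=True)[["slow", "price_usd"]].mean())
```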

That quick test not only tells you the viability but whether you have a different story. And so we changed our story angle from reproducing that academic study, which is more of a policy story, to a consumer-side story: you’re getting a bad deal, and it’s happening all across the country.

So within this checklist of questions that we ask, we also ask: what’s a quick analysis you can do to show there’s something there? That’s what I send my editor, and then we discuss. It’s usually clear if you have a story or not. It’s really clear if you have a better story or not. And it’s also clear if you need to kill a story because you have unanswerable questions, things that are just too uncertain.

ION: How important is it for journalists to show their work?

LY: I think the reason why [The Markup founder] Julia Angwin emphasized the necessity to do that is to build trust. To say that we’re not p-hacking, we’re not making shit up. This is the evidence. And you can see exactly how we got to our conclusions. You can draw your own conclusions.

I think there’s a lot of strength in that. I just keep going back to traditional journalism. If you have documents or interviews, you would host them somewhere like MuckRock or DocumentCloud, right? To say, here’s my FOIA [Freedom of Information Act] request. Here is the information that I got from that request.

Fundamentally, it’s similar [to my work], where I’m making a request not to a government agency but to a website. Here’s what I asked them for, here’s the document that they gave me, and here’s what I did with that document. So I just see it building on traditional journalism, just at scale and using a different kind of language to do so.

It all kind of goes back to the fundamentals. I think that I would love to see more newsrooms invest in this evidence creation of building their own data sets to answer questions. I have seen longer write-ups and methodologies become more popular. Consumer Reports, their investigative team does that. The Washington Post does that. So I’m starting to see more of a change, but it’s still a rarity. Whereas in most of the stories I work on, it’s a necessity. Because we want myself, my reporter, my editor, and whoever we investigate to understand that we have the information down. That we all know exactly what’s being measured, where it comes from, our assumptions, the limitations.

Leon Yin. Image: Screenshot, Twitter/X

It’s all important and necessary for making bold claims that are accurate, rather than making bold claims that are actually unsubstantiated by evidence, [where] it’s not clear how you got it or what test you ran. It’s all about the details, and making them really apparent, so that everything important is solid and we did nothing to exaggerate what you’re seeing. The clearest examples of disparities, of patterns, are so apparent. I feel like it’s our job just to put the data together and follow the pattern; anyone will find the same conclusion. That’s a skill that I’m hoping to impart to others.

I have a tutorial for a skill that I use a lot, which is finding undocumented APIs. APIs are how websites interact with servers, and there are a lot of them that aren’t official APIs but still power websites. It’s a subset of web scraping, and almost every story I work on is thanks to finding one of those. To me, data is just another thing you have to fact-check. But not everyone is capable of doing that. That’s why I feel the review process is super important: to make sure that everything is accurate, reproducible, explainable. Hopefully those approaches all land on the same answer. It assures you’re not cherry-picking, and it assures that you’ve chosen a method that’s reasonable. Because we have these long methodologies, we can disclose that, to show that we tried another categorization scheme or another model and the results are interpreted very similarly. Because our data is made public, others can do that too. So that’s a great strength of our work.
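For a sense of what finding an undocumented API looks like in practice, here is a minimal, hypothetical sketch. The endpoint, parameters, and field names are placeholders for whatever a browser’s network tab reveals on a real site, not an actual API.

```python
# Hypothetical sketch of calling an undocumented JSON endpoint spotted
# in the browser's network tab. URL, parameters, and fields are placeholders.
import requests

URL = "https://example.com/api/v2/search"  # found by watching network requests
params = {"q": "air fryer", "page": 1}
headers = {"User-Agent": "research-script (contact: reporter@example.org)"}

resp = requests.get(URL, params=params, headers=headers, timeout=30)
resp.raise_for_status()
data = resp.json()  # structured data straight from the server, no HTML parsing

# Building a dataset becomes a dict walk rather than a scrape.
rows = [
    {"title": item.get("title"), "brand": item.get("brand")}
    for item in data.get("results", [])
]
print(f"Collected {len(rows)} rows from one request.")
```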

This post was originally published on the Substack site In Old News. It has been lightly edited for length and clarity and republished with permission. You can watch the YouTube video of this interview below.


In Old News aims to support a diverse journalism ecosystem through stories and trainings. The organization has trained thousands of journalists in more than 65 newsrooms to use their phones for multimedia journalism, and it publishes stories “of journalism from journalists.”

