The Twitterverse Of Donald Trump in 26,234 Tweets

by Lam Thuy Vo • January 9, 2017

We wanted to get a better idea of where President-elect Donald Trump gets his information. So we analyzed everything he has tweeted since he launched his campaign to take a look at the links he has shared and the news sources they came from.

Getting the Data

so @lamthuyvo just taught me scraping tweets for code-illiterate dummies & now i am just wandering around saying things like 'Mainframe'

— Charlie Warzel (@cwarzel) December 8, 2016

To do this kind of data analysis, we needed an archive of @realDonaldTrump tweets. We started by scraping his feed using Twitter’s API, but Twitter limits the scraping of tweets to roughly 3,240 at a time, which is less than half of his account’s output since he launched his presidential run.

We were able to procure a fuller corpus from the Trump Twitter Archive, built by developer Brendan Brown, who had found a nifty workaround for that problem: scraping tweets for set time frames and adding them up in the end. Brown’s data, available as CSV or JSON, only went to November 9, 2016. We extended his data set through November 17, 2016 by scraping the remaining tweets directly.

Here’s how:

1. Start by getting developer OAuth credentials from Twitter: https://apps.twitter.com/
2. If you don’t already have Python installed, get Python up and running. If you don’t already have a favorite package manager, make sure you have pip.
3. Install tweepy: `pip install tweepy`
4. Copy tweet_dumper.py to wherever you keep your scripts. Edit it to include your developer OAuth credentials at the top and the username you want to scrape at the bottom. (Thank you to Quartz Things reporter David Yanofsky for the original script.)
5. Run it with `python scrapername.py`, substituting the name you saved the script under, to generate a CSV of tweets.
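The core of this kind of scraper is paging backwards through a timeline until the API stops returning tweets. Here is a minimal sketch of that loop, not the actual tweet_dumper.py. The `fetch_page` callable is a hypothetical stand-in for a real API call (for instance, a thin wrapper around tweepy's `api.user_timeline` with a `max_id` argument), and the fake data at the bottom only exists so the sketch runs offline.

```python
from typing import Callable, Dict, List, Optional

# fetch_page is a hypothetical stand-in for a real API call. It takes an
# optional max_id and returns a list of tweets, newest first, each a dict
# with at least an integer "id".
FetchPage = Callable[[Optional[int]], List[Dict]]


def collect_timeline(fetch_page: FetchPage) -> List[Dict]:
    """Page backwards through a timeline until the API returns nothing."""
    all_tweets: List[Dict] = []
    max_id: Optional[int] = None  # None means "start from the newest tweet"
    while True:
        page = fetch_page(max_id)
        if not page:
            break  # no older tweets are available
        all_tweets.extend(page)
        # Next time, ask only for tweets strictly older than the oldest seen.
        max_id = min(t["id"] for t in page) - 1
    return all_tweets


if __name__ == "__main__":
    # Made-up timeline of ten tweets, served four at a time.
    fake_timeline = [{"id": i, "text": f"tweet {i}"} for i in range(10, 0, -1)]

    def fake_fetch(max_id: Optional[int]) -> List[Dict]:
        eligible = [
            t for t in fake_timeline if max_id is None or t["id"] <= max_id
        ]
        return eligible[:4]

    tweets = collect_timeline(fake_fetch)
    print(len(tweets), "tweets collected")
```

Because the loop only stops when a page comes back empty, the API's own cap on how far back a timeline goes still applies, which is why we needed Brown's archive for the older tweets.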

Here is what you’ll find in the resulting CSV:

id: every tweet has a unique ID that you can use to reconstruct the tweet’s URL. The schema is “https://twitter.com/TWITTERHANDLE/status/IDNUMBER.” For instance, to access Donald Trump’s tweet with ID 805804034309427200, you would head to: https://twitter.com/realDonaldTrump/status/805804034309427200
created_at: the date and time the tweet was created, for example 2016-12-01 15:57:15
favorites: the number of times the tweet was favorited. Note that if the entry is a retweet, the favorite count will not be shown.
retweet: how often the tweet was retweeted
retweeted: whether the tweet was a retweet (true) or not (false)
source: how the tweet was posted, e.g. “Twitter for iPhone” or “Twitter Web Client”
text: the content of the tweet
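If you would rather work with the CSV in Python than in a spreadsheet, the fields above can be read with the standard library. The sketch below is only an illustration: the column order and the sample row are assumptions (the row combines the example ID and timestamp from the list above with placeholder values, and is not a real record).

```python
import csv
import io
from datetime import datetime
from typing import Dict, List


def tweet_url(handle: str, tweet_id: str) -> str:
    """Rebuild a tweet's URL from the account handle and the tweet ID."""
    return f"https://twitter.com/{handle}/status/{tweet_id}"


def parse_created_at(value: str) -> datetime:
    """Parse timestamps formatted like 2016-12-01 15:57:15."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def load_rows(csv_text: str) -> List[Dict[str, str]]:
    """Read CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical sample: the header uses the column names described above;
# the row is placeholder data for illustration only.
SAMPLE_CSV = """id,created_at,favorites,retweet,retweeted,source,text
805804034309427200,2016-12-01 15:57:15,0,0,false,Twitter for iPhone,placeholder text
"""

if __name__ == "__main__":
    for row in load_rows(SAMPLE_CSV):
        # IDs stay as strings so no tool rounds the long number.
        print(tweet_url("realDonaldTrump", row["id"]))
        print(parse_created_at(row["created_at"]).year)
```

Keeping the ID as a string is a deliberate choice here: tweet IDs are long enough that some spreadsheet tools round them when they are treated as numbers.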

Parsing the Tweets

The question we wanted to ask about Trump’s tweets was this: is there anything to learn from the URLs that @realDonaldTrump circulated during his campaign?

For that we needed the actual URLs. In Google Spreadsheets, we used a regular expression to extract strings that started with “http”. We expanded these links using the node.js expand-url module.

1. Install dependencies with `npm install async expand-url`
2. Copy your URL array into url-expander.js and run it using this command: `node url-expander.js`
3. Paste the output into a new CSV and merge that with your original spreadsheet.
4. Use more Google Spreadsheets regex to zero in on the domain names.
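The first step, pulling out every string that starts with “http”, can also be done in a few lines of Python. This is a rough sketch rather than the expression we used in the spreadsheet: the pattern and the trailing-punctuation cleanup are assumptions, and the sample text is made up.

```python
import re
from typing import List

# Assumed pattern: "http" or "https", then everything up to the next space.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text: str) -> List[str]:
    """Return every http(s) link in a tweet's text, in order of appearance."""
    # Strip punctuation that often trails a link at the end of a sentence.
    return [url.rstrip(".,;:!?)\"'") for url in URL_PATTERN.findall(text)]


if __name__ == "__main__":
    sample = "Read this https://t.co/abc123 and http://example.com/page."
    print(extract_urls(sample))
```

Shortened links such as t.co addresses still need to be expanded afterwards; this sketch only covers the extraction.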

We added this data back into the larger spreadsheet and stripped the links down to their root URL, again using Spreadsheets regex capabilities. This finally allowed us to group root URLs and count them using pivot tables.
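The same stripping and counting can be sketched in Python with the standard library. Treat this as one possible approach under stated assumptions: here “root URL” is taken to mean the host name with a leading “www.” removed (other subdomains are kept), and the sample URLs are placeholders.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse


def root_domain(url: str) -> str:
    """Reduce an expanded URL to its host name, dropping a leading 'www.'."""
    host = urlparse(url).hostname or ""  # hostname is already lowercased
    return host[4:] if host.startswith("www.") else host


def count_domains(urls: Iterable[str]) -> Counter:
    """Tally how often each domain appears, like a pivot table would."""
    return Counter(root_domain(url) for url in urls)


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    expanded = [
        "https://www.example.com/story-one",
        "http://example.com/story-two",
        "https://example.org/report",
    ]
    for domain, total in count_domains(expanded).most_common():
        print(domain, total)
```

`Counter.most_common()` returns the domains sorted from most to least shared, which is the ranking the pivot table gave us.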

This is by no means the only (or even the best) way to extract a list of domain names from a corpus of tweets (you could always extract all the links programmatically, using Python or MySQL), but it was our strategy given the time and resources we had.

We then modified one of Mike Bostock’s d3.js graphics for our needs, styled it to fit the BuzzFeed look-and-feel, and allowed our audience to explore the data using a zoom function. If you want to learn more about D3, O’Reilly has an excellent primer.

Public Figures and Social Data

The biggest question this project raised is how important social media has become for public figures.

When a president-elect makes official announcements on Twitter, do they become important public documents? If so, should we be able to access an archive of these tweets beyond what a private company has decided to provide? Shoot me your thoughts at lam.vo@buzzfeed.com.

This article first appeared on BuzzFeed News and is reproduced here with the author’s permission. It was also cross-posted on Source.

Lam Thuy Vo is an Open Lab Fellow for BuzzFeed News and is based in San Francisco. She is a German-born Vietnamese reporter who codes, writes, and creates visuals.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License


