Extracting Historical Data from Twitter using Python

Disclaimer: Dear computer science and more computer-literate friends, I want to apologise in advance for what you are about to see. I have probably violated many of your sacred rules about talking about and writing code; for this I am very sorry, but these are the lowly efforts of a Politics and Economics BA student, so please be kind :) Here are the Python-flavoured fruits of my labour.

For one of my modules this year (British Government and Politics – BGP) I embarked on a project looking at how Twitter users react to what party leaders say in their leadership speeches at their respective autumn party conferences. This required access to historical Twitter data, which is rather expensive unless you have already collected it. Wanting to avoid those costs, I built a piece of Python code that extracts tweets from an HTML file.

Twitter allows you to record data live, or to search back approximately a week, fairly easily using its Search and Streaming APIs. Set up correctly, this returns a complete and accurate set of the tweets matching the given parameters. Looking further back for historical data, however, is more difficult.

For historical data you have two options. The first is the paid services listed below which, judging by the prices, are aimed at commercial customers. These prices are not within reach of the average politics student. I even asked Topsy Pro and Gnip whether they could run some relatively small searches for a student working on an academic project (which I am), and they did not seem terribly interested. These commercial solutions are out of reach for many students and academics, and even many small to mid-sized organisations may want an alternative.

Name: Gnip
Website: http://gnip.com/
Pricing: $500 USD minimum fee*
Service: Create a search query and they run it for you. Gnip send you the data and analysis is left to the user.

Name: Topsy Pro
Website: https://pro.topsy.com/
Pricing: $12,000 USD per year*
Service: Run search queries, deliver detailed analysis. Allow export of data.

Name: Datasift
Website: https://datasift.com/
Pricing: $3,000 USD per month minimum**
Service: Run search queries, deliver detailed analysis. Allow export of data.

*prices quoted in email correspondence with sales staff    **price displayed on the website

The alternative option, which I went for, was to use Twitter's search.twitter.com page, detailed below. The disadvantage of this method is that it appears to return only around 25% of the tweets that commercial historical retrieval does (based on comparing results for the same term, "Clegg", over the same period of time on Topsy Analytics Pro and with my method).

The code uses Python and the BeautifulSoup4 module to create a CSV file from a complete saved HTML file. The details recorded in the CSV are the username, the permalink, the time and date, the text of the tweet, and any linked URLs.
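The original script lives on Pastebin; as a rough sketch of the approach, here is how BeautifulSoup4 can turn a saved results page into rows for a CSV. The class names and attributes used here ("tweet", "username", "timestamp", "tweet-text", "data-permalink-path", "twitter-timeline-link") are assumptions about Twitter's markup at the time, so inspect your own saved HTML in a text editor and adjust them to match.

```python
import csv
from bs4 import BeautifulSoup

def parse_tweets(html):
    """Pull username, permalink, timestamp, text and linked URLs out of a
    saved search-results page. The class names and attributes below are
    assumptions about the markup; check your saved HTML and adjust."""
    rows = []
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", class_="tweet"):
        text_el = div.find("p", class_="tweet-text")
        time_el = div.find("span", class_="timestamp")
        name_el = div.find("span", class_="username")
        rows.append({
            "username": name_el.get_text(strip=True) if name_el else "",
            "permalink": div.get("data-permalink-path", ""),
            "timestamp": time_el.get("data-time", "") if time_el else "",
            "text": text_el.get_text(" ", strip=True) if text_el else "",
            # Collect any links embedded in the tweet, space-separated.
            "urls": " ".join(a.get("href", "") for a in
                             div.find_all("a", class_="twitter-timeline-link")),
        })
    return rows

def write_csv(rows, path="tweets.csv"):
    """Write the parsed rows out with one column per recorded detail."""
    fields = ["username", "permalink", "timestamp", "text", "urls"]
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

You would read the saved file with `open("search_results.html", encoding="utf-8").read()`, pass it to `parse_tweets`, and hand the result to `write_csv`.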

Method to scroll down the page on Twitter

Why yes, that is a penknife balanced on the down key. I said home-made, OK?

To get an HTML copy of the search results from Twitter you need to load the entire list of tweets you want to export. This can be achieved by scrolling down the page until all of the available tweets have been loaded. As the image shows, I used a rather home-made method, but I'm sure you can use some kind of macro instead if you prefer. For saving the complete HTML I found that Google Chrome (Version 27.0.1453.94 m) was the most effective: it comfortably handled over 7,000 tweets. I also tested Microsoft Internet Explorer and Mozilla Firefox; both crashed when handling more than 2,000 tweets. There are slight differences between the complete HTML saved by different browsers, so the code needs to be changed a little accordingly; the version currently on Pastebin works with Google Chrome.

Finally, I would encourage anyone who is interested in doing something like this to try it. I have no prior experience of Python, and yet I managed to put together some useful code. The only distantly related experience that I have is some HTML, CSS and some use of Lua (definitely not for Computercraft). The only reason I could do this was because of the wealth of tutorials, videos and blogs about how to use Python. Doing this was fun and highly satisfying, so I encourage anyone to have a go!

So, what was the point in all of this? I will present what I found in a separate post. For now, here is a taster of what I managed to produce using the records of tweets. Coming soon: analysis!

[Chart: three-minute moving average of Tweets Per Minute during the autumn 2012 speeches]

Tweets Per Minute (TPM), as a three-minute moving average, during each speech: each series starts at the black vertical line (0) and ends at the vertical line drawn in that speech's colour. The chart also shows the 15 minutes before and after each speech.
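For anyone curious how a TPM series like the one behind the chart can be computed, here is a minimal sketch (the function names are my own, not from the original script): bucket the tweet timestamps into whole minutes, then take a trailing three-minute average of the counts.

```python
from collections import Counter
from datetime import datetime

def tweets_per_minute(timestamps):
    """Bucket tweet datetimes into whole minutes and count each bucket."""
    return Counter(t.replace(second=0, microsecond=0) for t in timestamps)

def moving_average(counts, window=3):
    """Trailing moving average over an ordered list of per-minute counts.
    Early points average over however many minutes are available so far."""
    averaged = []
    for i in range(len(counts)):
        recent = counts[max(0, i - window + 1): i + 1]
        averaged.append(sum(recent) / len(recent))
    return averaged
```

The timestamps would come from the CSV's time-and-date column; once parsed into `datetime` objects, sorting the minute buckets and feeding the counts to `moving_average` gives the plotted series.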


2 thoughts on “Extracting Historical Data from Twitter using Python”

  1. Andy

    Interesting article! :) I am trying to extract tweets (like you) from years gone by, from various users, to do a content analysis on them. Will Python be a suitable tool for this?



    1. Simon Hudson Post author

      Hi Andy, thanks, glad you found it interesting.

      What sort of content analysis are you hoping to do? I’m working on a dissertation at the moment where I analyse sentiment. I’m far from an expert in this kind of stuff, but from the little I have read and my limited experience, Python is very effective for handling and manipulating large datasets.

