Disclaimer: Dear computer science and more computer-literate friends, I want to apologise in advance for what you are about to see. I have probably violated many of your sacred rules of talking about and writing code; for this I am very sorry, but these are the lowly efforts of a Politics and Economics BA student, so please be kind :) Here are the Python-flavoured fruits of my labour.
For one of my modules this year (British Government and Politics – BGP) I embarked on a project looking at how Twitter users react to what party leaders say in their speeches at their respective autumn party conferences. This required access to historical Twitter data, which is rather expensive unless you have already collected it. Wanting to avoid those costs, I wrote some Python code that extracts tweets from an HTML file.
Twitter makes it fairly easy to record data live, or to search back approximately a week, using its Search and Streaming APIs. Set up correctly, these return a completely accurate set of all the tweets matching the given parameters. Looking further back for historical data, however, is a bit more difficult.
For historical data you have two options. The first is the paid services listed below, which, judging by the prices, are aimed at commercial customers. These prices are not within reach of the average politics student. I even asked Topsy Pro and Gnip whether they could run some relatively small searches for a student working on an academic project (which I am), and they did not seem terribly interested. These commercial solutions are not within the realms of possibility for many students or academics, and even many small to mid-sized organisations may wish to use an alternative.
|         |                       |                       |                       |
|---------|-----------------------|-----------------------|-----------------------|
| Pricing | $500 USD minimum fee* | $12,000 USD per year* | $3,000 USD per month minimum** |
| Service | Create a search query and they run it for you. Gnip sends you the data; analysis is left to the user. | Run search queries, deliver detailed analysis. Allow export of data. | Run search queries, deliver detailed analysis. Allow export of data. |
*Prices quoted in email correspondence with sales staff. **Price displayed on the website.
The alternative option, which I went for, was to use Twitter's search.twitter.com page, detailed below. The disadvantage of this method is that it seems to return approximately 25% of the tweets that historical data retrieval does (based on a comparison of results for the same term – "Clegg" – over the same period of time on Topsy Analytics Pro and with my method).
The code uses Python and the BeautifulSoup4 module to create a CSV file from a complete HTML file. The details recorded in the CSV are the username, the permalink, the time and date, the text of the tweet and any linked URLs.
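To give a flavour of how this works, here is a minimal sketch of the extraction step. The class names ("tweet", "username", "tweet-timestamp", "tweet-text") are my assumptions about how Twitter's search page marked up tweets at the time, not the exact selectors from my script, so you would need to adjust them against your own saved HTML.

```python
import csv
from bs4 import BeautifulSoup


def extract_tweets(html):
    """Parse a saved Twitter search-results page into a list of tweet records.

    The tag/class names below are assumptions about the saved markup and
    will likely need tweaking to match your own complete-HTML file.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tweet in soup.find_all("div", class_="tweet"):
        username = tweet.find("span", class_="username")
        permalink = tweet.find("a", class_="tweet-timestamp")
        text = tweet.find("p", class_="tweet-text")
        rows.append({
            "username": username.get_text(strip=True) if username else "",
            "permalink": permalink["href"] if permalink else "",
            # the timestamp link's title attribute carries the date/time
            "time": permalink["title"] if permalink and permalink.has_attr("title") else "",
            "text": text.get_text(" ", strip=True) if text else "",
            # any links inside the tweet text, joined with semicolons
            "urls": ";".join(a["href"] for a in (text.find_all("a", href=True) if text else [])),
        })
    return rows


def write_csv(rows, path):
    """Write the extracted records to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["username", "permalink", "time", "text", "urls"])
        writer.writeheader()
        writer.writerows(rows)
```

The real page has far more markup than this sketch assumes, but the pattern – find each tweet container, then pull out the handful of child elements you care about – is the whole trick.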
In terms of getting an HTML file of the search results from Twitter, you need to load the entire list of tweets you want to export. This can be achieved by scrolling down the page until all of the available tweets have been loaded. As the image shows, I used a rather home-made method, but I'm sure you could use some kind of macro instead if you prefer. For saving the complete HTML I found that Google Chrome (Version 27.0.1453.94 m) was the most effective; it comfortably handled over 7,000 tweets. I also tested Microsoft Internet Explorer and Mozilla Firefox, both of which crashed when handling more than 2,000 tweets. There are some slight differences between the complete HTML saved by different browsers, so the code needs to be changed a little; the version currently on Pastebin works with Google Chrome.
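Since the saved markup varies slightly by browser, one way to avoid hand-editing the code for each browser is to try a short list of candidate selectors and use whichever one actually matches. This is a sketch, not what my script does, and both selector pairs are assumptions to be checked against your own saved files.

```python
from bs4 import BeautifulSoup

# Candidate (tag, class) pairs for the tweet container. The exact names
# differ between browsers' "save complete page" output; both entries here
# are assumptions to adjust against the HTML you actually saved.
TWEET_SELECTORS = [
    ("div", "tweet"),        # hypothetical Chrome-saved variant
    ("li", "stream-item"),   # hypothetical Firefox/IE variant
]


def find_tweet_nodes(html):
    """Return the tweet container nodes using the first selector that matches."""
    soup = BeautifulSoup(html, "html.parser")
    for tag, cls in TWEET_SELECTORS:
        nodes = soup.find_all(tag, class_=cls)
        if nodes:
            return nodes
    return []
```

The rest of the extraction code can then work from whatever nodes this returns, rather than being hard-wired to one browser's output.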
Finally, I would encourage anyone interested in doing something like this to try it. I had no prior experience with Python, and yet I managed to put together some useful code. The only distantly related experience I have is some HTML, CSS and a little Lua (definitely not for ComputerCraft). The only reason I could do this was the wealth of tutorials, videos and blogs about how to use Python. Doing this was fun and highly satisfying, so I encourage anyone to have a go!
So, what was the point of all this? I will present what I found in a separate post. For now, here is a taster of what I managed to produce using the records of tweets. Coming soon: analysis!