GameTweets: Analyzing Twitter data
Together with two other students, I completed a research project on tweets sent in the Netherlands. The project investigates whether the number of tweets about a given video game is linked to the number of copies sold in Europe. The data was provided by Twiqs.nl and contains most of the tweets from the Netherlands from late 2010 until late 2015. The tweets are mostly in Dutch, because that is the language Twiqs targets when collecting.
The tweets are provided as tar files, each holding one hour of Twitter data. Each line of the file inside an archive is a JSON string that contains all the tweet data, including the user data. We used the CTIT computing cluster at the University of Twente to process the data with MapReduce. The data set contains around 200 million tweets, a couple of terabytes in total. The final MapReduce job that computed our results kept the CTIT cluster busy for around 8 hours.
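To give an idea of what such a job does, here is a minimal single-machine sketch in Python. It is not the code we ran on the cluster. The game titles in GAMES are hypothetical examples, and the "text" field name is an assumption based on the standard Twitter JSON format. The map step emits a (game, 1) pair for every tracked title mentioned in a tweet, and the reduce step sums those pairs per game.

```python
import json
import tarfile
from collections import Counter

# Hypothetical list of tracked titles; not the list used in the project.
GAMES = ["fifa 15", "minecraft"]


def map_tweet(line):
    """Map step: yield (game, 1) for each tracked game mentioned in one JSON line."""
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        return  # skip lines that are not valid JSON
    if not isinstance(tweet, dict):
        return
    # "text" is assumed to hold the tweet body, as in the standard Twitter format.
    text = str(tweet.get("text") or "").lower()
    for game in GAMES:
        if game in text:
            yield game, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per game."""
    counts = Counter()
    for game, value in pairs:
        counts[game] += value
    return counts


def count_mentions(archive):
    """Count game mentions in every file of an open tarfile.TarFile."""
    totals = Counter()
    for member in archive:
        if not member.isfile():
            continue
        handle = archive.extractfile(member)
        for raw in handle:
            line = raw.decode("utf-8", errors="replace")
            totals.update(reduce_counts(map_tweet(line)))
    return totals
```

With an archive opened through tarfile.open(path), calling count_mentions(archive) returns a Counter keyed by game title. On a cluster the same map and reduce functions would run in parallel over many hourly archives instead of in one loop.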
To visualize the results, we made a webpage that displays them in interactive graphs. Follow the link above to go to this website.
For the detailed results and the process, check out the paper.
Code on GitHub
The code for the MapReduce job and for processing the results for display on the website can be found on GitHub.