2.5 Million to 5,000 Tweets: Sifting Through the Noise

By Angela Okune
iHub Research
  Published 16 Jul 2013
One of the first projects to be housed under iHub Research’s new Data Lab is our IDRC-funded research on Developing a Framework for the Viability of Election-centered Crowdsourcing. In the first phase of this research, we’ve built a Kenya-specific spam filter that sifts through raw Twitter data collected during the elections (March 3, 2013 – April 9, 2013) to pull out “newsworthy” events.
We had initially set out to find out whether crowdsourced data has inherent characteristics that can help to validate the information. We’ve found instead that before even looking at the validation question, most news agencies and organisations need to grapple with the sheer volume of crowdsourced data (this is explained in greater detail in a recent post by Patrick Meier, What is Big (Crisis) Data?). Much of the crowdsourced data is irrelevant noise. If an organisation or individual has no capacity to sort the irrelevant from the relevant, using crowdsourced information becomes very difficult indeed.
We experienced this challenge firsthand when we collected over 2.5 million tweets during the 2013 Kenyan elections. We used a third-party Twitter application called DataSift to capture and store tweets matching Kenyan election-related keywords (e.g. kill, dead), user names (e.g. @UhuruKenyatta, @RailaOdinga), place names (e.g. Kawangware, Mathare, Kisumu), and hashtags (e.g. #KenyaDecides). In the past weeks, we’ve used a variety of data mining and machine-learning techniques to filter out the irrelevant (non-‘newsworthy’) information. As a result of this process, we built a spam filter able to accurately boil down our data from 2.5 million tweets to 5,000 ‘newsworthy’ tweets related to an event or activity from the Kenyan elections that can be verified.
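
To make the collection step more concrete, here is a minimal Python sketch of matching tweets against the four kinds of terms mentioned above (keywords, user names, place names, hashtags). It is an illustration only: it does not use DataSift, the term lists contain just the examples quoted in this post rather than our full rule set, and the function name `matched_categories` is hypothetical. It also says nothing about the later machine-learning filtering.

```python
import re

# Example terms quoted in the post; the real rule set was larger
# and is not reproduced here (assumption: exact-token matching).
KEYWORDS = {"kill", "dead"}
USER_NAMES = {"@uhurukenyatta", "@railaodinga"}
PLACE_NAMES = {"kawangware", "mathare", "kisumu"}
HASHTAGS = {"#kenyadecides"}

# A token is a run of word characters, optionally prefixed by @ or #.
TOKEN_RE = re.compile(r"[@#]?\w+")


def matched_categories(text):
    """Return the term categories that a tweet's text matches.

    Matching is case-insensitive and on whole tokens, so "deadline"
    does not match the keyword "dead".
    """
    tokens = set(TOKEN_RE.findall(text.lower()))
    categories = []
    if tokens & KEYWORDS:
        categories.append("keyword")
    if tokens & USER_NAMES:
        categories.append("user")
    if tokens & PLACE_NAMES:
        categories.append("place")
    if tokens & HASHTAGS:
        categories.append("hashtag")
    return categories


if __name__ == "__main__":
    sample_tweets = [
        "Long queues in Kisumu this morning #KenyaDecides",
        "Lovely weather today",
    ]
    for tweet in sample_tweets:
        print(tweet, "->", matched_categories(tweet))
```

A tweet that matches at least one category would be captured; everything else would be ignored at collection time. Broad terms such as "dead" pull in a lot of unrelated chatter, which is part of why a further filtering stage is needed.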
The implications of this work are significant. Building upon the superb work done (and being done) by Chato, Aditi, Patrick Meier, and others, we’ve developed a tool that media agencies and organisations can use in election scenarios, where false information and rumours are much more likely to spread, to quickly find the RELEVANT information. Now we are using the filtered data to run cross-comparisons between different news sources (traditional media, Twitter, and Uchaguzi), looking at the type of information disseminated on each channel. We have also conducted 85 in-depth interviews with citizens in three hotspot locations to better understand the relationship between the online space and the on-the-ground ‘reality’. We look forward to sharing more results soon.