Umati Project: Challenges of Capturing Relevant Data

By Angela Okune
iHub Research
  Published 16 Jul 2013
As part of our Umati Project, which recently released its final report from Phase 1, we have been collecting online hate speech found on social media, forums, and online newspapers. Although the process was initially envisioned to be automated, it turned out to be quite manual due to the nuanced nature of the speech being collected and the lack of an available corpus of Kenyan hate speech. The project therefore had to rely on 11 new media monitors to collect and code data. We had two sets of monitors: six for the weekdays (Swahili/Sheng, Luo, Luhya, Kikuyu, Kalenjin, and Somali; all monitored English) and five for the weekends (all of the aforementioned languages except Somali). Both groups worked from 8 am to 5 pm, with one hour allocated for lunch. The methodology is further detailed in the final report.

Several challenges emerged as a result of the manual data collection process. These included the possibility of misses and false alarms, as described by signal detection theory; in other words, ensuring that the monitors correctly categorized “dangerous speech”. We had to clean the data numerous times, for different reasons, and still encountered categorization errors each time.
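To make the signal detection framing concrete, a monitor's decisions can be scored against a gold-standard review using the four classic outcomes: hits, misses, false alarms, and correct rejections. The sketch below is purely illustrative and is not the project's actual tooling; the function name and the toy labels are invented for the example:

```python
# Illustrative sketch (hypothetical, not the Umati project's tooling):
# scoring a monitor's "dangerous speech" judgments against a gold-standard
# review, using the four outcomes from signal detection theory.

def score_monitor(gold, labelled):
    """gold and labelled are parallel lists of booleans: True = dangerous speech."""
    counts = {"hit": 0, "miss": 0, "false_alarm": 0, "correct_rejection": 0}
    for truth, judged in zip(gold, labelled):
        if truth and judged:
            counts["hit"] += 1
        elif truth and not judged:
            counts["miss"] += 1
        elif not truth and judged:
            counts["false_alarm"] += 1
        else:
            counts["correct_rejection"] += 1
    # Hit rate: share of truly dangerous items the monitor caught.
    hit_rate = counts["hit"] / max(1, counts["hit"] + counts["miss"])
    # False-alarm rate: share of benign items wrongly flagged.
    fa_rate = counts["false_alarm"] / max(1, counts["false_alarm"] + counts["correct_rejection"])
    return counts, hit_rate, fa_rate

# Example: six items, where the monitor misses one dangerous item
# and raises one false alarm on a benign item.
gold     = [True, True, True, False, False, False]
labelled = [True, True, False, True, False, False]
counts, hit_rate, fa_rate = score_monitor(gold, labelled)
```

Tracking these two rates per monitor separates under-flagging (high miss count) from over-flagging (high false-alarm count), which plain error counts conflate.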

The monitors also often displayed fatigue and varying levels of productivity as a result of the dullness of the task. After staring at a computer for hours on end, it is no surprise that the monitors' productivity fluctuated wildly, spiking and crashing. Most of these challenges are inherent in using humans to collect online data systematically. Having noted them, we are interested in developing a more streamlined and efficient process for collecting online data systematically using machine learning and data mining techniques. The development of this tool will make up Phase 2 of the Umati Project. Phase 2 will use humans to calibrate the machine as it ‘learns’ the nuances of online dangerous speech, but eventually the tool should be able to run with very little human effort. This will result in a low-cost system that can be scaled to other countries and contexts.
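As a rough illustration of the human-in-the-loop idea, a monitor could review a simple text classifier's predictions and feed each correction back as new training data. The sketch below is hypothetical, not the Phase 2 system itself; the class, the seed examples, and the labels are all invented for the example:

```python
# Hypothetical sketch of human-in-the-loop calibration: a tiny naive Bayes
# text classifier whose predictions a human monitor confirms or corrects,
# with each reviewed item fed back in as training data.
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"dangerous": Counter(), "benign": Counter()}
        self.doc_counts = {"dangerous": 0, "benign": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        vocab = len(set().union(*[set(c) for c in self.word_counts.values()]))
        scores = {}
        for label in self.doc_counts:
            total = sum(self.word_counts[label].values())
            score = math.log((self.doc_counts[label] + 1) /
                             (sum(self.doc_counts.values()) + 2))
            for word in text.lower().split():
                # Laplace smoothing so unseen words do not zero out the score.
                score += math.log((self.word_counts[label][word] + 1) /
                                  (total + vocab + 1))
            scores[label] = score
        return max(scores, key=scores.get)

# Seed the classifier with a few human-labelled examples (invented here).
clf = NaiveBayes()
clf.train("they must all be driven out", "dangerous")
clf.train("attack them before they attack us", "dangerous")
clf.train("the election results were announced today", "benign")
clf.train("turnout was high across the county", "benign")

# The machine proposes a label; a monitor reviews it, and the reviewed
# item becomes new training data, so the machine 'learns' over time.
text = "drive them out before the election"
prediction = clf.predict(text)
clf.train(text, prediction)
```

As the labelled pool grows, the monitors' role shifts from labelling everything to spot-checking the machine's output, which is the "very little human effort" end state described above.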