By Sidney Ochieng
As part of the ongoing Umati Project, we took time to examine how Twitter users responded to the unfortunate attack on Garissa University last year. In a previous post, we looked at the sentiment of tweets around the attack, some of the users that were driving conversation, the nature and content of the conversations, and made some inferences about the audiences engaging on the topic. However, when looking at data from Facebook, which was collected from pages and groups, we had to apply different methods and approaches to analyzing it.
Communication on Twitter differs from Facebook in various ways. Key among them is how conversations occur around a particular topic. On Twitter, a new tweet can be in reaction to any previous tweet on the timeline, so the conversation around a hashtag is largely homogeneous and continuous. On Facebook, however, conversations manifest around posts in the form of comments and replies. Individual posts are independent of one another, so comments and replies on one post are rarely in reaction to a different post. Where they are, this can only be inferred from context or from a new post being tagged to a previous one, and these links are not accessible on one continuous timeline of events as they are on Twitter.
The graph above shows the moving average of sentiment for Facebook (compare with the earlier Twitter graph). The many sudden changes in sentiment indicate that a time series analysis of Facebook data to find dangerous speech is difficult, complex and subject to a lot of noise, while the Twitter graph has a distinct dip that makes it easy to analyze.
We used a modified bag-of-words model to analyse the data and create a tool called “Umatex”, which acts as a filter for dangerous speech. The bag-of-words model is a simplified representation used in Natural Language Processing: a text is represented as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity (Wikipedia). Multiplicity is, in simpler terms, the number of times a word appears in a set of documents. Umatex, therefore, is a module within the Intelligent Umati Monitor that removes noise from collected hate speech statements and ranks them accordingly.
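As a minimal sketch of the plain bag-of-words idea described above (not the modified model used in Umatex, whose details are not given here), a text can be reduced to a multiset of its words. The function name `bag_of_words` and the simple tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count), ignoring grammar and word order."""
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bag = bag_of_words("The dog chased the other dog")
# Word order is lost, but multiplicity is kept.
print(bag["the"], bag["dog"], bag["chased"])
```

Two texts with the same words in a different order produce the same bag, which is why the model cannot capture context on its own.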
To create Umatex, hate speech statements collected from the first phase of the Umati project were analyzed, based on the hypothesis that there were certain features common to all the statements collected. These common features were then examined vis-a-vis the Umati framework for categorizing online inflammatory speech. According to the Umati framework, a dangerous speech statement:
- targets a group of people based on their common affiliation and not a single person
- may contain one of the hallmarks/pillars of dangerous speech
- contains a call to action
Targets a group of people and not a single person
Dangerous speech is harmful speech that calls on the audience to condone or take part in violent acts against a group of people. Such speech is directed at a group, or at a person as part of a group: a tribe, religion, etc.
It is important to note that an ugly or critical comment about an individual - a politician, for example - is not hate or dangerous speech unless it targets that person as a member of a group. As noted in our previous reports, during emotive periods, it is not uncommon for negative statements to be made against politicians and other influential personalities.
With this in mind, a bag-of-words was created for each of several categories under which people are grouped in Kenya: tribe, political affiliation, religion, region of origin and sexual orientation. The ‘tribe’ bag, for example, looks for words (and their variations) like Kamba, Kikuyu, Luhya, Luo, Kisii, Kalenjin, Giriama, Somali, etc., while the ‘political affiliation’ bag has words (and their variations) like ‘Jubilee’, ‘ODM’, ‘PNU’, ‘CORDian’, ‘CORDed’, ‘Chupilee’, etc. Each bag has a particular weighting. This means that any tweet or Facebook post which contains a word from the tribe bag as well as words from one or more other bags could be considered inflammatory, provided their combined weight passes a certain threshold.
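The category bags described above could be sketched roughly as follows. The bags here contain only the example words quoted in the text. The real Umatex bags, how they handle word variations, and the names `BAGS` and `matched_categories` are not from the original and are assumptions for illustration.

```python
import re

# Illustrative bags built only from the example words quoted in the text.
# The actual Umatex bags and their variations are assumed to be larger.
BAGS = {
    "tribe": {"kamba", "kikuyu", "luhya", "luo", "kisii",
              "kalenjin", "giriama", "somali"},
    "political_affiliation": {"jubilee", "odm", "pnu",
                              "cordian", "corded", "chupilee"},
}

def matched_categories(text):
    """Map each category to the words from its bag that occur in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = {}
    for category, words in BAGS.items():
        found = [token for token in tokens if token in words]
        if found:
            hits[category] = found
    return hits

hits = matched_categories("ODM and Jubilee supporters met in Kisii")
print(hits)
```

Matching a category is only a first signal: a text that merely names a group, as in this example, is not inflammatory by itself, which is why the bags are weighted and combined.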
May contain one of the hallmarks/pillars of dangerous speech
Three hallmarks common in dangerous speech statements are:
- Comparing a group of people with animals, insects or vermin
- Suggesting that the audience faces a serious threat or violence from another group (“accusation in a mirror”)
- Suggesting that some people from another group are spoiling the purity or integrity of the speakers’ group.
Of these three, it was easiest to build a bag-of-words for the first hallmark: comparing a group with animals, insects or vermin. Given the highly contextual nature of the other two hallmarks, it would be difficult to use the same model for them. It is, however, not impossible, and it is an avenue that will be explored in future to improve the algorithm.
Contains a call to action
Dangerous speech often encourages the audience to condone or commit violent acts on the targeted group. The six calls to action common in dangerous speech are calls to:
- forcefully evict
While building a bag-of-words for this category would seem straightforward, there are various nuances of language and context to consider. There are numerous ways to show discrimination or to make a call to loot, to beat or kill in the various languages used in Kenya. For example, taking the word kill, you could have destroy, rid, massacre, execute, terminate and, one that is sometimes used in Kenya, finish.
Each bag-of-words category created is assigned a weight using data from the first phase. If a word from a bag appears in a sentence, the weight of that bag is added to the overall weight of the sentence; for example, if the bag has a weight of 0.5 and a sentence contains 3 words from that bag the weight of that sentence will be 0.5*3 = 1.5. This is done for each bag and the total weight of the sentence will be the sum of the weights from all bags.
If the sentence meets a certain threshold weight, currently set at 4, it is considered to be potentially dangerous speech. Everything else is dropped as noise.
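The weighting and threshold steps described above can be sketched as below. The threshold of 4 and the 0.5 × 3 = 1.5 arithmetic come from the text. The bag names, placeholder words, the second bag's weight and the function names are hypothetical, and counting every occurrence of a word (rather than each distinct word once) is an assumption.

```python
import re

THRESHOLD = 4  # threshold weight quoted in the text

# Hypothetical bags: (weight, words). Placeholder words stand in for real terms.
BAGS = {
    "bag_a": (0.5, {"alpha", "beta", "gamma"}),
    "bag_b": (2.0, {"delta", "epsilon"}),
}

def sentence_weight(sentence):
    """Sum, over all bags, the bag weight times the number of matching words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    total = 0.0
    for weight, words in BAGS.values():
        # Assumption: every occurrence of a bag word counts once.
        matches = sum(1 for token in tokens if token in words)
        total += weight * matches
    return total

def is_potentially_dangerous(sentence):
    """Keep the sentence for human review if it meets the threshold."""
    return sentence_weight(sentence) >= THRESHOLD

print(sentence_weight("alpha beta gamma"))                        # 0.5 * 3
print(is_potentially_dangerous("alpha beta gamma delta epsilon")) # 1.5 + 4.0
```

A sentence with three words from a 0.5-weight bag scores 1.5 and is dropped as noise; it is only retained once matches from other bags push the total to the threshold.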
The method described above is used to filter out noise. It is not a method that will automatically lead to identifying dangerous speech texts. Rather, it is a method to be used along with human input; a human would still have to manually go through the text omitted by Umatex, for instance, to ensure that significant data is not filtered out altogether. Speech, in general, is highly contextual, and context is something that is difficult to teach a computer. As we have found and previously noted, dangerous speech is both an art and a science.
The purpose of Umatex is to help reduce the workload of human data coders. In Umati Phase II, we have collected several gigabytes of data: millions of individual pieces of text and related metadata. It would be fairly expensive, not to mention inefficient, to hire annotators to go through it all. Umatex is able to quickly and efficiently sort through this data, reducing its size by a factor of more than 10, while guaranteeing that a certain percentage of dangerous speech in this use case remains in the filtered text. (This percentage is currently 70%, based on tests against data collected in Umati Phase I.) Human coders will then be able to go through this reduced dataset to monitor for dangerous speech.
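The two figures quoted above, the reduction factor and the share of dangerous speech that survives filtering, could be measured against a human-labelled dataset roughly as follows. This is only a sketch: the function `evaluate_filter`, the toy `keep` rule and the ten toy items are invented for illustration and are not Umati data or results.

```python
def evaluate_filter(items, keep):
    """items: list of (text, is_dangerous) pairs labelled by human coders.
    keep: function returning True when the filter retains a text.
    Returns (reduction_factor, retained_share_of_dangerous_speech)."""
    kept = [(text, label) for text, label in items if keep(text)]
    total_dangerous = sum(1 for _, label in items if label)
    kept_dangerous = sum(1 for _, label in kept if label)
    reduction_factor = len(items) / len(kept) if kept else float("inf")
    retained = kept_dangerous / total_dangerous if total_dangerous else 0.0
    return reduction_factor, retained

# Toy labelled data (invented): two dangerous items, eight harmless ones.
toy_items = [("flag a", True), ("flag b", False), ("c", True)]
toy_items += [("noise", False)] * 7

reduction, retained = evaluate_filter(toy_items, keep=lambda text: "flag" in text)
print(reduction, retained)
```

The two numbers pull against each other: a stricter filter shrinks the dataset further but lets more dangerous speech slip into the discarded text, which is why the discarded text still needs human spot checks.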
In the Facebook data collected around the Garissa attack, there were statements of discrimination against people of the Islamic faith and what may be considered a call to evict them by destroying their places of worship. There was a direct call to evict and a comparison of a group of people to animals. Some of the statements took a tribal tilt, more specifically “accusations in a mirror”, that is, the suggestion that one group faces a threat from another.
Umatex was also run on Twitter data and, in line with the conversational analysis from part 1, some of the dangerous speech found there was from accounts not associated with Kenyans. Unlike the Facebook conversation in this dataset, the dangerous speech on Twitter mostly focused on discrimination against the Muslim and Somali communities. Also unlike Facebook, all the dangerous speech on Twitter was in English.
Most of the tweets got no reactions in the form of retweets (amplification) or replies. The exception was one tweet that got an outsized reaction of 800+ retweets and several replies, including what we define as KOT (Kenyans on Twitter) cuffing.
Umatex is proving to be a valuable tool in reducing the workload of data coders. However, additional work to improve it is needed, and ongoing. We are working on expanding the bags-of-words and creating new ones. There are also other techniques that can be used to increase accuracy. All this needs to be benchmarked against human coders and tested on multiple datasets, which is an ongoing process.