The Apriori Algorithm

By Chris Orwa
Data Science Lab
Published 14 Dec 2015

The Apriori algorithm is a simple procedure for generating rules from a data set. These rules take the form IF – THEN; for example, IF dark clouds appear THEN it will rain. To reach such a conclusion you have to observe how many times dark clouds resulted in rain, then assign a ratio to the observation: out of 100 observations where there were dark clouds, it rained 80 times (80/100).

How it Works

First, the algorithm identifies the most frequently occurring values in a data set and then finds other values that occur alongside them. It then calculates the ratio of these co-occurrences to the occurrences of the first value; this ratio is referred to as the confidence limit, and in the example above it is 80/100 = 0.8. The confidence limit indicates how reliable a rule is, and in practice the Apriori algorithm will produce several rules and rank them by confidence limit.

Now, let's expand the example above to take into account all types of cloud conditions (nimbus clouds, clear sky, cirrus clouds and dark clouds).

Assume 365 days in a year, of which 100 had dark clouds, 50 clear sky, 70 cirrus clouds and 145 nimbus clouds. We calculate another ratio: the number of occurrences of a value divided by the size of the whole data set. For dark clouds this is 100/365 = 0.27. We refer to this ratio as the support; it shows how popular an observation is within the whole data set. It is worth noting that the name Apriori alludes to deducing rules from prior observations and is supposed to mimic the way we reason when making assumptions.
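Written out in R, the two ratios from the cloud example above are simply:

```r
# Numbers from the example above: out of 365 days, dark clouds were
# observed on 100 days, and on 80 of those days it rained.
days_total         <- 365
days_dark_clouds   <- 100
days_dark_and_rain <- 80

# Support: how often dark clouds appear in the whole data set
support <- days_dark_clouds / days_total            # 100/365 = 0.27

# Confidence: how often rain follows dark clouds
confidence <- days_dark_and_rain / days_dark_clouds # 80/100 = 0.8

support
confidence
```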

One of the main uses of the algorithm is in Market Basket Analysis. By finding the support and confidence for items that are bought together, retailers can cross-sell and up-sell to improve revenue. We had other ideas for the algorithm: how do we apply it to textual data?
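As an aside, for readers who want to try this on transaction data, a minimal market-basket sketch using the arules R package might look like the following (the baskets below are made up purely for illustration):

```r
library(arules)

# A few made-up shopping baskets, one per customer
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "crisps"),
  c("bread", "butter"),
  c("milk", "butter", "bread")
)

trans <- as(baskets, "transactions")

# Mine rules that meet minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.6, minlen = 2))

# Rank the resulting rules by confidence
inspect(sort(rules, by = "confidence"))
```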

The Application

Since the Apriori algorithm forms the basis of association mining, we figured it could be useful for finding highly relevant pieces of information in a large text corpus. So, during an event on Twitter, we tracked a keyword or hashtag, broke down all the tweets into individual words and applied the Apriori algorithm. The results were words related to the keyword/hashtag, ranked by confidence limit.
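A rough sketch of that pipeline, again using the arules package (the tweets and hashtag below are invented; the point is only to show the shape of the approach):

```r
library(arules)

# Invented tweets tracked for a hypothetical hashtag (illustration only)
tweets <- c(
  "#budget2015 treasury raises fuel tax",
  "fuel tax goes up #budget2015",
  "#budget2015 treasury raises fuel levy",
  "treasury announces #budget2015 fuel tax"
)

# Break each tweet into its individual words: one "basket" of unique words per tweet
words <- lapply(strsplit(tolower(tweets), "\\s+"), unique)
trans <- as(words, "transactions")

# Keep the hashtag on the left-hand side so the right-hand side
# gives the words associated with it
rules <- apriori(
  trans,
  parameter  = list(supp = 0.25, conf = 0.5, minlen = 2),
  appearance = list(lhs = "#budget2015", default = "rhs")
)

# Words related to the hashtag, ranked by confidence limit
inspect(sort(rules, by = "confidence"))
```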

Next, what confidence limit threshold should be used for choosing highly relevant words? We settled on the average distribution of a word in the corpus, obtained by dividing the number of unique words by the total number of words. Eyeballing the results, this worked quite well. To turn it into a tool, we added a dynamic SQL query that runs over the database and returns only tweets that mention the keyword/hashtag AND the highly correlated words, and we christened this tool 'The Slicer'.
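Continuing from the sketch above (it reuses the words and rules objects), here is roughly how the threshold and the dynamic SQL query could be wired up. The tweets table and text column in the query are hypothetical names used for illustration, not the actual schema behind The Slicer:

```r
# Threshold: the average distribution of a word in the corpus,
# i.e. the number of unique words divided by the total number of words
all_words <- unlist(words)
threshold <- length(unique(all_words)) / length(all_words)

# Keep only the rules whose confidence exceeds the threshold
relevant <- rules[quality(rules)$confidence > threshold]

# Pull the highly correlated words out of the rule right-hand sides
related_words <- unique(unlist(as(rhs(relevant), "list")))

# Build a dynamic SQL query that returns only tweets mentioning the hashtag
# AND at least one of the correlated words
# (the 'tweets' table and 'text' column are hypothetical names)
word_filter <- paste(sprintf("text LIKE '%%%s%%'", related_words), collapse = " OR ")
query <- sprintf(
  "SELECT * FROM tweets WHERE text LIKE '%%#budget2015%%' AND (%s)",
  word_filter
)
query
```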

The Slicer

The Slicer helps us quickly analyze high-volume data sets, especially from events on Twitter. Instead of going through hundreds of thousands of data points, we analyze only the highly relevant ones. Benchmark results show that this technique and tool compresses (slices!) the data by 90 percent.

(Image courtesy of Cesara Harada)

When running analysis via external APIs, most vendors enforce a rate limit on API calls or charge per call. The Slicer came in handy when we had to perform sentiment analysis over an external API: instead of sending 100,000 data points for analysis, we sliced the data and ran 10,000 data points through the sentiment analysis API, staying well within our rate limit of 50,000 API calls per day. One concern was whether the results would still be representative, so we ran sentiment analysis on both the whole data set and the sliced one.

(Chart: sentiment analysis results for the full data set and the sliced data set)

It's a near match for the analysis on both data sets.

The tool is available under an open source license here:

https://github.com/iHub/snapeshot/blob/master/slice.R
