By Guest blogger, Elvis Bando
A few weeks ago, @chrisorwa showed me some datasets that he had been working on. I got an adrenaline kick just from looking at the data, partly because it was challenging and partly because the prospect of cracking it was even more motivating. In a previous blog post, Chris wrote about trash sourcing, basically extracting information from trash, which was the basis of his project #Saiclique. When he invited me to join the project, I realized that some of his data was in a format I could not manipulate (we use different core analysis software: I use Rapid Miner, he uses Weka), so we had to start the data entry process again. We started here:
We ended up getting the following from the cards (we did slightly over 1000 cards):
From this, I generated a beautiful dataset:
I thought the serial must be a concatenation of 4 sets of 4 digits, so I split the data that way. Running a DBSCAN clustering algorithm in Rapid Miner gave me the following:
The tall column is 0526, which was the second set in my 4-4-4-x data split. No other configuration showed such strength. What does 0526 even mean? To confirm that 0526 meant something, I ran a frequency analysis of each digit in the entire serial and, using Benford’s Law (the leading digit is 1 about 30% of the time), narrowed the cluster to a 2-6-3-x configuration.
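The splitting and counting step can be sketched in a few lines of Python. This is only an illustration, not the Rapid Miner process used above: it replaces DBSCAN with a plain frequency count, the function names are hypothetical, and the serials are made up rather than taken from the real dataset. The Benford's Law helper uses the standard formula for the expected share of a leading digit d, log10(1 + 1/d).

```python
import math
from collections import Counter


def split_serial(serial, widths):
    """Split a digit string into fixed-width chunks; the remainder is the last chunk."""
    chunks, start = [], 0
    for width in widths:
        chunks.append(serial[start:start + width])
        start += width
    chunks.append(serial[start:])  # the trailing "x" part of the split
    return chunks


def chunk_counts(serials, widths, index):
    """Count how often each value appears at one chunk position."""
    return Counter(split_serial(s, widths)[index] for s in serials)


def benford_expected(digit):
    """Expected share of leading digit d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)


def leading_digit_shares(values):
    """Observed share of each leading digit 1-9, skipping leading zeros."""
    leads = [v.lstrip("0")[0] for v in values if v.strip("0")]
    counts = Counter(leads)
    total = len(leads) or 1  # avoid dividing by zero on empty input
    return {d: counts.get(str(d), 0) / total for d in range(1, 10)}


# Made-up 17-digit serials for illustration only; not real card numbers.
serials = [
    "12110526030000001",
    "47110526120000417",
    "03110527010000002",
]

# Which value dominates the second chunk of a 4-4-4-x split?
print(chunk_counts(serials, (4, 4, 4), 1).most_common(1))

# Compare observed leading digits of the last chunk with Benford's expectation.
tails = [split_serial(s, (4, 4, 4))[3] for s in serials]
observed = leading_digit_shares(tails)
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

Trying other width tuples, such as `(2, 6, 3)`, and comparing how concentrated the counts are at each position is one simple way to compare candidate configurations.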
Not to bore you with my train of thought: after much more modelling and analysis, the data finally spoke. Here is the transcript:
Safaricom's serializing system seems to be similar to, or based on, descriptions of a patented system found at http://www.freepatentsonline.com/5504808.html
If true, then the card serial number contains information about the card: the date and time it was produced and a unique identifier. The rest of the information is called up from the system by the code (the called info could be the amount of talk time, the expiry date, etc.). The serial is therefore the only unique identifier of a particular card and shows whether or not it has been used.
An analysis of cards produced in 2010 and earlier indicates that they were sequential for the most part. The initial two digits were 10 throughout the year, probably indicating the year of production. The remaining digits were sequential. The system was probably changed because it would have run out of state space. At that time, the serial was a 13-digit number, as opposed to the current 17 digits.
The Safaricom prepaid card serial number is organized into four parts: Batch#, ManDate, Time and a sequential serial.
The batch number is a two-digit number running from 01 to 99. It splits the cards produced each hour into batches of approximately 10,000 cards. This ensures that they are easily identifiable in case of theft or a problem.
ManDate is the date of production of the cards. It is written in the format yy-mm-dd and falls exactly 2 years before the expiry date.
Time is the approximate hour in which the cards were produced. It runs from 000 to 220 in increments of 10 (010, 020, ... 100, 110, ...). In each hour, 99 batches of cards are produced (see Batch#).
Finally, there is the serial part which is a sequential number. The data I have may be inconclusive but it shows that each day (Time), about 1 million cards are serialized. All cards are serialized the same way, so there is no telling the value of a card from the serial (damn!).
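To make the layout concrete, here is a minimal Python sketch that parses a serial under the 2-6-3-x reading described above. Several things are assumptions rather than confirmed facts: the sequential part is taken as the remaining 6 digits of a 17-digit serial, two-digit years are read as 20yy, and the names `CardSerial`, `parse_serial` and `expiry_date` are hypothetical. The sample serial is made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CardSerial:
    batch: int       # Batch#, 01-99
    made_on: date    # ManDate, read as yy-mm-dd
    time_code: int   # Time, 000-220 in steps of 10
    sequence: int    # sequential part (assumed to be the last 6 digits)


def parse_serial(serial):
    """Parse a 17-digit serial using the inferred 2-6-3-x layout."""
    if len(serial) != 17 or not (serial.isascii() and serial.isdigit()):
        raise ValueError("expected a 17-digit serial")
    batch = int(serial[0:2])
    yy, mm, dd = int(serial[2:4]), int(serial[4:6]), int(serial[6:8])
    time_code = int(serial[8:11])
    sequence = int(serial[11:])
    made_on = date(2000 + yy, mm, dd)  # assumption: years are 20yy
    return CardSerial(batch, made_on, time_code, sequence)


def expiry_date(made_on):
    """ManDate plus two years, per the observation in the text."""
    try:
        return made_on.replace(year=made_on.year + 2)
    except ValueError:
        # 29 February has no counterpart two years later; fall back to 28 Feb.
        return made_on.replace(year=made_on.year + 2, day=28)


# Made-up serial for illustration only; not a real card number.
card = parse_serial("12110526030000001")
print(card)
print(expiry_date(card.made_on))
```

`date()` raises `ValueError` on an impossible month or day, so running the parser over the whole dataset would also be a quick way to check how well the yy-mm-dd reading holds up.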
The dataset could possibly hold more information. The current analysis is limited by missing variables such as the location and the date of collection of the cards. These could give a good picture of economic indicators, customer spending and possibly spending by zone or region.
The writer is the team leader at Doban Africa Ltd. For more information or access to the raw data, contact Chris Orwa @chrisorwa