[Part 1] Architecture for Scale: How Africa's Talking Did It

By Joy Kendi
  Published 14 Dec 2016

Africa’s Talking is a mobile technology company headquartered in Nairobi, Kenya. It provides communication APIs that let developers reach mobile phone subscribers through telcos, using services such as USSD, SMS, airtime and voice calls. The company currently operates in six markets in Africa (Kenya, Uganda, Tanzania, Rwanda, Malawi and Nigeria).

Started by Sam Gikandi and Eston Kimani in 2010, Africa’s Talking has grown in leaps and bounds to become one of the largest PRSPs in the region. The company operates in six African countries, has more than 50 connections with 16 telcos, and serves more than 1,000 active developer accounts pushing requests to its API. It handles more than 5 million client API calls per day, which translate to more than 20 million API calls from the telcos.

Sam Gikandi, Africa’s Talking CEO, was one of the speakers at DevCraft 2016 where he shared their journey scaling their API to match the market needs.

The beginning:

When Africa’s Talking started, the founders were focusing on e-commerce, which they thought would be huge in Africa. Initially, they built client-facing applications such as SMS Leopard. Then, during a visit to a hub (which remains unnamed) in 2012, they were advised to build communication APIs for developers, as there was a bigger opportunity in that market. They had only three months to do this, and at that point they had no product-market fit, zero active clients, zero API calls per day, a small team (2 devs, 1 intern), and no VC investment.

Given all these constraints, how did they manage to build a functional API and turn it into a business? With so little time, Africa’s Talking decided to keep it simple. The initial setup consisted of a single Rackspace server with 1 GB of RAM and 1 CPU core. Like many startups choosing a technology stack, they built the first version of the platform on Zend because one of the co-founders knew Zend. The only telcos they were connected to then were Safaricom and Yu (since closed). This architecture was enough to get them their first batch of clients.


Fast forward to 2013

In 2013, Africa’s Talking finally had product/market fit and people were using their APIs. By then they were also profitable, as their numbers were getting better. Some 2,000 users had signed up, and 400 of these were active accounts. They averaged 500 API calls a day and were connected to all the telcos in Kenya (Safaricom, Airtel, Yu, and Orange). Additionally, a UI/UX developer had joined the team.

Some problems are good problems

At this point, Africa’s Talking was still running on a 1GB instance and as a result, there was an increased strain on resources. To resolve this they upgraded to three 4 GB instances and separated the web server from the database and website. Each now ran on its own instance. Client requests (API calls from developers) were separated from telco requests (Africa’s Talking to telcos). The new architecture looked as depicted in the diagram below.

[Diagram: the architecture after the separation]

Unfortunately, this fix didn’t last long. The good problems quickly turned into challenges that threatened to sink the young startup.

The tricky thing with RAM

How do you run into challenges with RAM? The explanation might be a bit confusing but I will try anyway. First, every request to the server took about 40 MB of RAM. On a 4 GB instance, that leaves room for roughly 100 concurrent requests. Beyond that, the server has to start swapping memory to disk, and as a result everything comes to a halt. Since PHP had poor threading support, the only option was to add more servers behind load balancers. This seemed like a short-term solution.
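The arithmetic behind that limit can be sketched in a few lines. This is only a back-of-the-envelope estimate: it assumes all 4 GB is available to requests, whereas the operating system and web server would take their own share, so the real ceiling would be lower. The function name is made up for illustration.

```python
def max_concurrent_requests(ram_mb: int, per_request_mb: int) -> int:
    """Estimate how many requests fit in RAM before the server must swap."""
    return ram_mb // per_request_mb


ram_mb = 4 * 1024        # one 4 GB instance, as described in the article
per_request_mb = 40      # approximate memory used by each request

limit = max_concurrent_requests(ram_mb, per_request_mb)
print(f"About {limit} concurrent requests before swapping starts")
```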

Second, long-running requests, such as a client sending a message to 1,000 users at once, would eat up a lot of memory and bring the API to a halt. To solve this, Africa’s Talking introduced an enqueue parameter that clients could set on their requests. Using it was encouraged rather than enforced. Fortunately, the largest clients were internal, so they could enforce it to a large extent.

The third challenge with RAM came from long-running response callbacks. A client, for example, would want incoming messages forwarded to their server, but that server could be slow or even unavailable. To solve this, Africa’s Talking decided to enqueue all telco requests and run cron jobs to clean up.
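The idea behind enqueueing, in both cases, is that the request handler hands the slow work to a queue and returns right away, so it does not hold memory while waiting. The sketch below illustrates that pattern with Python's standard library. It is not Africa's Talking's implementation (their stack at the time was PHP), and every name in it (`handle_bulk_send`, `deliver`, the recipient strings) is hypothetical.

```python
import queue
import threading

outbox = queue.Queue()   # work waiting to be delivered
delivered = []           # stand-in for messages handed to a telco


def deliver(recipient, text):
    """Placeholder for the slow call to a telco or a client's server."""
    delivered.append((recipient, text))


def handle_bulk_send(recipients, text, enqueue=True):
    """Hypothetical API handler: queue the work and return immediately."""
    if enqueue:
        for recipient in recipients:
            outbox.put((recipient, text))
        return {"status": "queued", "count": len(recipients)}
    # Without enqueue, the request blocks until every message is sent.
    for recipient in recipients:
        deliver(recipient, text)
    return {"status": "sent", "count": len(recipients)}


def worker():
    """Drain the queue in the background until a None sentinel arrives."""
    while True:
        item = outbox.get()
        if item is None:
            outbox.task_done()
            break
        deliver(*item)
        outbox.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

recipients = [f"user-{i}" for i in range(1000)]
response = handle_bulk_send(recipients, "hello")

outbox.put(None)   # tell the worker to stop once the queue is empty
outbox.join()
thread.join()
```

The handler's response is ready before any message has been delivered, which is what keeps a 1,000-recipient request from tying up a web process.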

MySQL’s turn

RAM was no longer a nightmare once the fixes mentioned above were implemented. However, this wasn’t enough, as MySQL was also acting up. At some point, Africa’s Talking had 300 million rows in the database! This was because with PHP they had to store everything in the database, including records they didn’t necessarily need to keep. There was also no TTL (Time to Live) on records, so nothing ever expired on its own. Their attempt at a solution was to write scripts, using Twisted, that would periodically clean the database. However, with 300 million entries, a cleanup run would at times take a full week, and by then there would be an equal amount of new data to clean. Analytics on the dashboard couldn’t be run either, and this affected fundraising because the numbers couldn’t be provided to investors.
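To make the TTL idea concrete, here is a minimal sketch of a cleanup job that deletes expired rows in small batches, so that no single statement has to touch the whole table. It is an illustration, not the scripts described above: it uses SQLite from Python's standard library instead of MySQL so that it runs anywhere, and the `sms` table, its columns, and `purge_expired` are all assumed names.

```python
import sqlite3


def purge_expired(conn, ttl_seconds, now, batch_size=1000):
    """Delete rows older than ttl_seconds, one batch at a time.

    Returns the total number of rows deleted.
    """
    cutoff = now - ttl_seconds
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM sms WHERE id IN ("
            "SELECT id FROM sms WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # short transactions keep locks brief
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sms (id INTEGER PRIMARY KEY, created_at REAL, body TEXT)"
)
# An index on the timestamp lets each batch find expired rows cheaply.
conn.execute("CREATE INDEX idx_sms_created_at ON sms (created_at)")

now = 1_000_000.0  # fixed clock value so the example is repeatable
rows = [(now - age, "msg") for age in range(5000)]  # ages 0..4999 seconds
conn.executemany("INSERT INTO sms (created_at, body) VALUES (?, ?)", rows)
conn.commit()

deleted = purge_expired(conn, ttl_seconds=3600, now=now, batch_size=500)
remaining = conn.execute("SELECT COUNT(*) FROM sms").fetchone()[0]
print(deleted, remaining)
```

Batching does not make the total work smaller, but it lets cleanup run continuously alongside normal traffic instead of as one week-long job.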

When it rains, it pours

PHP might have helped Africa’s Talking get started quickly, but it became a source of problems once they began to scale.

PHP has poor threading support. If a client asks you to send 10 messages across 2 networks, you have to send the batch for network 1, wait for it to finish, and only then send the batch for network 2. There’s no way to send to both networks at the same time.
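The cost of sending one network after the other is easy to see in a small simulation. The sketch below uses Python threads purely for illustration (it says nothing about what Africa's Talking eventually chose), and `send_to_network` is a hypothetical stand-in whose `time.sleep` simulates telco latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_to_network(network, messages, latency=0.2):
    """Hypothetical telco call; the sleep stands in for network latency."""
    time.sleep(latency)
    return network, len(messages)


def send_sequential(batches):
    """Send one network at a time, as described for the PHP setup."""
    return [send_to_network(net, msgs) for net, msgs in batches.items()]


def send_concurrent(batches):
    """Send to every network at once, one thread per network."""
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        futures = [
            pool.submit(send_to_network, net, msgs)
            for net, msgs in batches.items()
        ]
        return [future.result() for future in futures]


# 10 messages split across 2 networks, as in the example above.
batches = {"network-1": ["msg"] * 5, "network-2": ["msg"] * 5}

start = time.perf_counter()
sequential_result = send_sequential(batches)
sequential_elapsed = time.perf_counter() - start

start = time.perf_counter()
concurrent_result = send_concurrent(batches)
concurrent_elapsed = time.perf_counter() - start

print(f"sequential: {sequential_elapsed:.2f}s, concurrent: {concurrent_elapsed:.2f}s")
```

With two networks the sequential version waits for both latencies back to back, while the concurrent version waits for roughly one.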

Additionally, testing became more difficult as the system grew more complex. Something was always bound to break. Because the team was small, it was easy to enforce standards. This solved a lot of the testing challenges, but it just wasn’t enough.

The decision point

By mid-2014, everything was grinding to a halt. Everyone thought Africa’s Talking was doing well, but that wasn’t the case. What used to be the fastest API in the market was becoming unusable. The SMS table had 300 million records and, as a result, cleanup jobs would take a week, updates would take minutes, and dashboard queries would hang. Because of the deluge of client and telco requests, there was a lot of downtime and many dropped requests. Servers were also swapping all the time. To make matters worse, fundraising was not going well.

At this point, they had to make a decision. One option was to raise money and scale vertically. This would involve getting more servers, more RAM, a huge MySQL instance, and a farm of Apache servers.

The other option was to rewrite the entire application and engineer it so that they could scale horizontally. With this choice, Africa’s Talking would have to reevaluate every technical decision they had made. The programming language, the storage layer, the web server, the queuing algorithm, analytics, and monitoring would all have to be reviewed. It would be a hard reset!

What would you have done? To find out what Africa’s Talking did, watch out for the second installment of this article. Can’t wait? Check out Sam Gikandi’s talk at DevCraft 2016 below.


