You know those SMSes and emails you get from your favourite brands, the ones that let you know about exclusive deals and amazing offers hand-picked or customized just for you? Well, at Capillary, the technology behind the scenes is our Bulk Messaging System, which enables the brands, i.e. our clients, to send highly personalized, exciting and irresistible offers to you! These SMS/email campaigns are not only high in volume, they must also reach you at a specific time.


What follows is the first post of a three-part series detailing the evolution of our in-house bulk messaging system. In this post we introduce our use case and share our experience of using MySQL as a queue. Needless to say, this version of the module required a complete overhaul once we hit scalability issues.

Campaign Management Tool

Fig 1: The workflow of a campaign.

First, a marketing manager creates a template like “Hi {{first_name}}! Stop by our store {{store_name}} and redeem this {{voucher}} to get 5% off on your bill!”, and specifies the list of customers to be targeted. This template is used to create a personalized message by replacing each of the tags with values specific to every customer; for example, {{first_name}} has the value “John” for user 1, “Susan” for user 2, and so on. These messages are finally delivered to the customers by our communication engine, which routes them to third-party gateways. In this whole process, each message goes through various state changes before reaching the customer. The lifecycle of every message is tracked so that our clients can monitor the campaign. Each state change is handled by an independent module (or part of a module) that processes messages in first-in-first-out order, and the subsequent module picks up from that point. This is illustrated in Fig. 1 above.
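To make the tag-replacement step concrete, here is a minimal sketch of what de-templatization amounts to, assuming the tag values have already been fetched from the relevant services into a map (the class, method and sample values are illustrative, not our actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class DeTemplatizer {

    // Replace every {{tag}} in the template with the customer's value.
    static String deTemplatize(String template, Map<String, String> tagValues) {
        String message = template;
        for (Map.Entry<String, String> tag : tagValues.entrySet()) {
            message = message.replace("{{" + tag.getKey() + "}}", tag.getValue());
        }
        return message;
    }

    public static void main(String[] args) {
        String template = "Hi {{first_name}}! Stop by our store {{store_name}} and "
                + "redeem this {{voucher}} to get 5% off on your bill!";

        Map<String, String> user1 = new HashMap<>();
        user1.put("first_name", "John");
        user1.put("store_name", "Indiranagar");
        user1.put("voucher", "JOHN5OFF");

        // Prints the personalized message for user 1.
        System.out.println(deTemplatize(template, user1));
    }
}
```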

This resembles a multi-queue processing system, where we first used MySQL as our queueing solution when the scale was small, and later evolved into a much more sophisticated publisher-subscriber model.

 

Server Configurations:

  • Daemon service: EC2 M2-Large machine with an 8-core system.
  • Database: highly optimized InnoDB configuration on an M2-XLarge machine.

 

MySQL as Queue

It’s a known fact that using MySQL as a queue is an anti-pattern [1], and it does not scale well. In any queueing system, three steps are performed to process a message:

  • INSERT.
  • READ.
  • DELETE.

This means that any queue implementation has to spend roughly a third of its resources on each of the above steps.
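On MySQL, that cycle typically looks something like the following JDBC sketch; the table, columns and batch handling here are illustrative, not our actual schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class MySqlQueueSketch {

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/campaigns", "user", "password")) {
            conn.setAutoCommit(false);

            // 1. INSERT: the producer enqueues a message row.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO message_queue (recipient, payload, status) "
                            + "VALUES (?, ?, 'READY')")) {
                ps.setString(1, "+919800000000");
                ps.setString(2, "Hi John! Stop by our store ...");
                ps.executeUpdate();
            }

            // 2. READ: the consumer claims a batch, locking the rows so that
            //    parallel consumers don't pick up the same messages.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT id, payload FROM message_queue "
                                 + "WHERE status = 'READY' ORDER BY id LIMIT 1000 FOR UPDATE")) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    // ... hand the payload to the next processing stage ...

                    // 3. DELETE: remove the row once the message is processed.
                    try (PreparedStatement del = conn.prepareStatement(
                            "DELETE FROM message_queue WHERE id = ?")) {
                        del.setLong(1, id);
                        del.executeUpdate();
                    }
                }
            }
            conn.commit();
        }
    }
}
```

Every message pays for all three statements, plus the index maintenance they trigger, which is exactly the three-way split described above.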

To the best of our knowledge, no RDBMS implementation, including MySQL, is optimized for this kind of workload. SQL servers are generally very efficient at inserts, updates and selects, but rarely all three on the same table. Selects are faster if the fields are indexed, but the trade-off is slower, less efficient inserts and deletes.

We decided to go with MySQL even though we were aware of the above limitations. MySQL has been our primary datastore since the beginning, and we wanted to keep the number of technologies we adopted small to limit our DevOps effort.

The first version of our messaging system consisted of two components:

  • De-Templatization Engine – performed tag replacement by polling different services for tag values and substituting them into the template.
  • Delivery Server – forwarded the de-templatized messages to the communication engine for delivery to the customer.

Both components ran in a single JVM, and messages were transferred between the de-templatization engine and the delivery server using MySQL as a queue. The delivery server also used MySQL as a state machine to hold temporary states (PICKED, SENT, FAILED, DELIVERED, etc.).
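The state-machine part boiled down to guarded status updates on those same rows. A minimal sketch of the idea, with an illustrative table and column names rather than our real schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DeliveryStateTracker {

    // Move a message from one lifecycle state to the next, guarding on the
    // expected current state so concurrent workers don't overwrite each other.
    static boolean transition(Connection conn, long messageId,
                              String fromState, String toState) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE message_queue SET status = ? WHERE id = ? AND status = ?")) {
            ps.setString(1, toState);    // e.g. SENT
            ps.setLong(2, messageId);
            ps.setString(3, fromState);  // e.g. PICKED
            return ps.executeUpdate() == 1;  // false => another worker moved it first
        }
    }
}
```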

During the initial years, most of our customers were from the Asia-Pacific region, where SMS is the preferred communication channel. Since the SMS payload is pretty small (typically 140 bytes), the system scaled reasonably well. Performance tuning was done by configuring different batch sizes for the SQL queries. With small batches (up to 1000) and servers with the configurations mentioned above, the system could process up to 500 messages (SMS) per second at peak.

As we expanded into other markets, the focus shifted towards email, and we started facing severe performance issues as the payload size increased from 140 bytes to 40 KB (HTML-formatted emails). This caused an I/O bottleneck on the MySQL servers, as a single query containing 1000 messages would result in 40 MB (1000 recipients × 40 KB) of data transfer from disk. This number grew to 400 MB when multiple campaigns ran in parallel. To reduce data transfer, we tried different compression algorithms native to MySQL, but that led to heavy CPU usage. We had to archive the tables very frequently to minimize the performance degradation.

Over time, a significant gap emerged between the throughputs we achieved for the different channels (~500 mps for SMS vs. ~127 mps for email).

Of course, it failed to scale.

As the number of campaigns increased and we had to process millions of messages for each template, it was evident that MySQL queues would no longer work for us and we would have to move to a dedicated queue implementation.

Apart from the points mentioned above, a few more bottlenecks arose over time:

  • The delivery service slowed down as the tables grew in size and SQL queries became slower.
  • We ran into many table-locking issues [2] when multiple campaigns were executed in parallel.
  • Sharing a database between applications (De-Templatizer & Delivery) was not a great idea. It’s just too tempting to put an amorphous shared state in between, and before you know it, you’ll have a hugely coupled monster.
  • Adding to our maintainability issues, our huge monolithic module was a mix of CPU-heavy processes (string replacement) and I/O-intensive processes (SQL queries). The I/O-intensive processes had to deal with different I/O rates for different tasks, e.g. databases or the communication engine. It was impossible to tune the system to optimize for both CPU-intensive and I/O-intensive processes simultaneously.

This list can go on but the need of the hour was clear – to set this beast aflame. And, out from the ashes, we set forth to pull out a more elegant creature, keeping in mind all the lessons learned as we built it. 🙂

Key Takeaways:

Even though MySQL queues don’t work well for a high-scale system, they are not a bad option if the expected throughput is low. They allow for rapid development if engineers are conversant with SQL, and the learning curve is low. Maintenance costs also stay low, as the team has to manage only a single technology.

Modelling the system as a DFG:

After a careful analysis of the module’s workings, a few independent parts stood out:

  • To create a new template and persist it.
  • To fetch the users to be targeted.
  • To de-templatize the template based on each user’s corresponding values.
  • To deliver the message to the in-house notification module.
  • To accrue the final delivery states of the message.

This system overcomes many of the drawbacks of the earlier version. Unlike the earlier one, its functioning can be compared to a data flow graph, i.e. a DFG. Each independent part of the system can be modelled as an Actor within the DFG, and the rate at which an Actor passes data to the next Actor can be modelled as Tokens in DFG notation. Each Actor can roughly be classified as either a CPU-intensive or an I/O-intensive task and tuned appropriately. Separating the tasks in this way allows them to run on different machines suited to their individual needs; for example, we can use a higher-core machine for a compute-intensive task, and machines with low I/O latency for an I/O-intensive one. Now, for each unit of data taken in by the first actor in the DFG, the output can be tuned to extract maximum efficiency from the system, and the subsequent inputs and outputs can be adjusted in keeping with the processing capabilities of each of the actors.
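To make the Actor/Token framing concrete, here is a minimal, self-contained sketch of two stages wired together through bounded in-memory buffers. It is not the real Veneno code (in the real system the channels are durable queues and the stages are separate services), but it shows how each actor's parallelism and buffer size can be tuned independently:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DfgSketch {

    public static void main(String[] args) throws InterruptedException {
        // Channels between actors; the bounded buffers play the role of tokens
        // and provide back-pressure between stages.
        BlockingQueue<String> templates = new ArrayBlockingQueue<>(1_000);
        BlockingQueue<String> messages  = new ArrayBlockingQueue<>(1_000);

        // CPU-intensive actor (de-templatization): sized to the number of cores.
        ExecutorService deTemplatizer =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        // I/O-intensive actor (delivery): sized to tolerate slow downstream I/O.
        ExecutorService delivery = Executors.newFixedThreadPool(64);

        deTemplatizer.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                String template = templates.take();                        // consume a token
                messages.put(template.replace("{{first_name}}", "John"));  // emit a token
            }
            return null;
        });

        delivery.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                // Hand off to the communication engine (stubbed out here).
                System.out.println("delivering: " + messages.take());
            }
            return null;
        });

        // Feed one unit of data into the first actor.
        templates.put("Hi {{first_name}}! Stop by our store and redeem this voucher!");
    }
}
```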


Fig 2. A simple DFG for the system.

Queue configurations

If the throughput of the nodes is determined beforehand, we can always opt for a bounded buffer on the DFG channel. But in this case some of the Actors are I/O-intensive (they depend on external systems, for example) and their rate of consumption is not constant, so an unbounded queue is required at peak load. The catch is to ensure that we don’t lose messages in transit, which necessitates a persistent queue.

To this end, the whole module is now divided into three different services, which communicate with each other only through a queue (specifically RabbitMQ). The modules now completely follow the publisher-subscriber model: when a process completes a task assigned to it, it publishes its results to a queue in RabbitMQ for the next process to consume.
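A condensed sketch of that hand-off using the RabbitMQ Java client is shown below; the queue name is illustrative, and the durable queue plus persistent delivery mode are what give us the "don't lose messages in transit" guarantee discussed above. In the real services, the publisher and the consumer are separate processes that keep their connections open; they are shown together here only for brevity.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import com.rabbitmq.client.MessageProperties;

import java.nio.charset.StandardCharsets;

public class VenenoHandoffSketch {

    private static final String QUEUE = "detemplatized.messages";  // illustrative name

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Durable queue: survives a broker restart.
            channel.queueDeclare(QUEUE, true, false, false, null);

            // Publisher side (e.g. the de-templatization service): persistent
            // delivery mode, so the broker writes the message to disk.
            String message = "Hi John! Stop by our store and redeem this voucher ...";
            channel.basicPublish("", QUEUE,
                    MessageProperties.PERSISTENT_TEXT_PLAIN,
                    message.getBytes(StandardCharsets.UTF_8));

            // Subscriber side (e.g. the delivery service): acknowledge only after
            // the message has been handed to the communication engine.
            DeliverCallback onDeliver = (consumerTag, delivery) -> {
                String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
                System.out.println("delivering: " + body);
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            };
            channel.basicConsume(QUEUE, false, onDeliver, consumerTag -> { });
        }
    }
}
```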

We christened this system ‘Veneno’, aptly named for the speed it achieves in processing messages targeting more than a million customers at a time. Here at Capillary, we like cars, we like speed, and we like high efficiency, so Veneno seemed like the most obvious choice for a name!

In the posts following this one, we will go in depth into the highly scalable module that has now replaced the earlier version. So stay tuned to learn more about the inner workings of Veneno!

[1] – This is a very good resource to read up more on using MySQL as a queue.
[2] – This explains the MySQL transaction lock workings in detail.