SignalVine

Challenge

SignalVine is a two-way text messaging platform that makes it easy for students and instructors to communicate about appointments and deadlines and to engage the campus community. SignalVine's messaging is more than a communication channel: it is a data-driven approach to messaging. Based on data from a client's CRM or SIS, SignalVine can target the right students with the right personalized text messages at scale.

To keep delivering the best quality to its clients, the company wanted to investigate new ways to ingest large amounts of data in a performant and monitorable manner.

Because SignalVine is a messaging service, it needs to process a lot of data before serving it back to users, and it did so with a home-grown data processing pipeline. Our responsibility was to research the problem and find a better-suited solution for the pipeline component, one that could keep up with the constantly growing user database. We also had to consider a few future-proofing upgrades to error handling compared with the existing solution.

Solution

We started with an analysis of the existing proof of concept, which was implemented using Kafka Streams. The client had chosen this framework because of its out-of-the-box support for stateful stream processing, which seemed like a good fit for this kind of project. Although it appeared to work well at first, we decided to look at alternatives.

The main reason was that Kafka Streams is not always the best fit for functional programming and Scala in general. So we looked into the two most prominent streaming libraries in the community: Akka Streams and fs2. The former is part of Akka, the well-known actor model framework, and the latter is a library built atop the Typelevel stack. Both looked like a better fit for our needs, but after careful consideration we decided to go with fs2, mostly because it:

is purely functional
is fully flexible
seamlessly integrates with the Typelevel stack, most importantly with cats-effect

Most of the data came from either Kafka or RabbitMQ, both of which could be handled easily with existing libraries. Eventually, it was decided that everything would be moved to Kafka, which simplified our work, so we used the fs2-kafka library for this task.

SQL based data pipeline

The existing solution was based on a set of SQL queries and predefined stored procedures that executed the different data aggregations. The main reason it did not perform as expected was that each record consumed from a Kafka topic was first stored in a separate SQL database and later triggered an expensive set of database operations, usually over a fairly large subset of the whole database.

Stateful streaming

As mentioned above, before we joined the project the client had already settled on a fairly straightforward approach: stateful stream processing. When working with data streaming there is usually no intermediate state, meaning that you simply take some data, process it, and then store it somewhere or forward it for further processing. Stateful streaming adds a step that updates an auxiliary state, which is essentially a cache for the values required to calculate the output data. This cache needs to be reliable, persistent and fast, which led us to incorporate Redis. The final result of the processing must also be cached, since it is updated every time new records arrive.

On top of that, to ensure data consistency, each record must be processed exactly once and in a specific order. In Kafka, ordering can be achieved by setting the partition keys properly and then grouping data by this key. To avoid reprocessing records, we introduced metadata containing the most recently processed offset for each topic and partition, and used it to filter out records that had already been processed.
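The offset-based filtering described above can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: the `Record` and `State` types, the `value` field and the summing aggregation are hypothetical and do not come from the project, and an immutable value stands in for the state that was kept in Redis.

```scala
object StatefulSketch {
  // Hypothetical record shape: Kafka coordinates plus a numeric payload.
  final case class Record(partition: Int, offset: Long, value: Int)

  // Aggregated result plus metadata: the last processed offset per partition.
  final case class State(total: Int, lastOffsets: Map[Int, Long])

  val empty: State = State(0, Map.empty)

  // A record is new if its offset is beyond the last one seen for its partition.
  def isNew(state: State, r: Record): Boolean =
    state.lastOffsets.get(r.partition).forall(r.offset > _)

  // Fold records into the state in offset order, skipping already processed ones.
  def update(state: State, records: Seq[Record]): State =
    records.sortBy(r => (r.partition, r.offset)).foldLeft(state) { (s, r) =>
      if (isNew(s, r))
        State(s.total + r.value, s.lastOffsets.updated(r.partition, r.offset))
      else s
    }
}
```

Because the metadata travels with the state, applying the same batch twice leaves the aggregate unchanged, which is what makes redelivered records harmless.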

Here’s a summary of how our solution worked under the hood:

We merged data coming from multiple Kafka topics into a single stream
While processing consecutive batches, we grouped each batch by a common id
Each group was then processed in parallel by:
1. Loading the previous aggregation state from the cache
2. Filtering out the already processed records based on the metadata
3. Calculating updates to the state based on the specific records and caches
4. Caching the latest state and publishing it to the Kafka topic
Finally, the entire batch was committed once it had been fully processed
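The steps above can be sketched roughly as follows. To stay dependency-free, this sketch swaps in plain Scala collections for fs2 streams and processes groups sequentially rather than in parallel. All names are hypothetical: the mutable `cache` map stands in for Redis, the `published` buffer stands in for the output Kafka topic, and the summing aggregation is a placeholder.

```scala
import scala.collection.mutable

object PipelineSketch {
  final case class Record(topic: String, partition: Int, offset: Long, id: String, value: Int)
  final case class State(total: Int, lastOffsets: Map[(String, Int), Long])

  // In-memory stand-ins (assumptions): Redis cache and output topic.
  val cache: mutable.Map[String, State] = mutable.Map.empty
  val published: mutable.ArrayBuffer[(String, State)] = mutable.ArrayBuffer.empty

  def processGroup(id: String, records: Seq[Record]): Unit = {
    // 1. Load the previous aggregation state from the cache.
    val previous = cache.getOrElse(id, State(0, Map.empty))
    // 2. Filter out records at or below the last processed offset.
    val fresh = records.filter { r =>
      previous.lastOffsets.get((r.topic, r.partition)).forall(r.offset > _)
    }
    // 3. Calculate the updated state from the remaining records.
    val next = fresh.foldLeft(previous) { (s, r) =>
      State(s.total + r.value, s.lastOffsets.updated((r.topic, r.partition), r.offset))
    }
    // 4. Cache and publish the latest state, only if something changed.
    if (fresh.nonEmpty) {
      cache.update(id, next)
      published += (id -> next)
    }
  }

  // Processes one batch and returns the highest offset per topic and
  // partition, i.e. what would be committed once the batch is done.
  def processBatch(batch: Seq[Record]): Map[(String, Int), Long] = {
    batch.groupBy(_.id).foreach { case (id, rs) => processGroup(id, rs.sortBy(_.offset)) }
    batch.groupBy(r => (r.topic, r.partition)).map { case (tp, rs) => tp -> rs.map(_.offset).max }
  }
}
```

Groups share no state with each other, since each one reads and writes only its own cache entry. That independence is what allows them to be processed in parallel in a real streaming setup.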

Result

The results of the delivered POC were exactly as expected. The service performed very well, and we achieved near real-time processing in most cases.

Our solution delivered the following key benefits:

Simplified data processing
Significantly improved performance, as there is no longer any need to work on a large subset of the data every time
Consistent data, thanks to better-managed offset processing
Less error-prone code and improved error detection

