Battle Tested, Event-Driven Patterns for your Microservices Architecture by Natan Silnitsky
Scalac invited Natan Silnitsky, who works as a senior backend infrastructure developer at Wix.com, to speak at its “Between Business & Tech” conference.
In his excellent presentation, Natan spoke about how Wix leverages tailored models to build highly efficient event-driven, distributed microservice environments. Wix is a website-building platform with more than two hundred million registered users. Consequently, it operates an extensive software architecture consisting of two thousand individual services managing more than 2.5 billion Kafka messages per day.
Natan also outlined how companies can move from monoliths to event-driven microservice frameworks to handle increases in traffic and storage requirements.
He then defined numerous models that can be used for microservices implementation, including “consume and project,” “event-driven from end to end,” and “zero-latency key-value store.”
Microservices and Databases: An Ongoing Problem
Traditionally, software engineers utilized monoliths that contained all application functionality in a single unit. In recent years, however, developers have recognized the benefits of distributed microservices systems that enable individual services to be scaled separately and optimized to fit specific requirements.
Microservices often utilize one extensive database like monolithic applications. A single database requires a mutable state shared by every service, which causes competition between services, increasing latency and diminishing performance. Additionally, teams often use relational databases that aren’t optimal for microservices.
However, development teams can use encapsulation to better manage microservice databases. Encapsulation is the combining of linked data and processes into one isolated system. In this approach, each microservice has its own database. Natan believes this offers excellent flexibility, reduces conflicts, and eliminates the need for a mutable shared state.
Improving on Microservices Communication
In the classic request-reply model favored by many organizations, one service sends a Hypertext Transfer Protocol (HTTP) or Remote Procedure Call (RPC) request to collaborating services.
Diagram of a request-reply microservice model.
Each request requires the collaborating service to fetch the needed information and return the result from its encapsulated database separately. As a result, this model can be slow and vulnerable if the producer service goes down.
However, if development teams use a message broker between microservices, requests become more reliable. This is because the brokers store replications of all requested data.
This system produces and consumes messages, completely changing the nature of the workflow. Control shifts from the sender to the receiver, or “consumer side”.
Natan believes this is a more robust approach because even when the producer is down, consumers can still intake prior events, and the consumer functions at its own pace and doesn’t have to call an online server.
Diagram of an event-driven microservices model.
Natan used the example of an Apache Kafka message broker – a distributed log broker that divides messages into partitions – to further illustrate his point. Partitions scale in a linear fashion, allowing developers to increase a system’s processing rate.
Representation of a distributed log broker system.
Partitions function as committed log files that store events or messages at the end of the file, which constitutes a much faster system than alternative types of data storage.
Here’s how a log broker system works:
- The message producer creates the message at the end of the partition. Data can remain there indefinitely, with a typical default of one week.
- Consumers pass through each partition sequentially.
- Once consumers finish processing a message, they commit it. This increases efficiency and throughput while maintaining knowledge of which index to start from.
This data model is immutable, resilient, and easy to debug because developers only append data. What’s more, processing always succeeds eventually because the system sequentially retries until a particular message is processed and ready to commit.
Diagram of an event-driven microservice model with partitions.
This kind of event-driven model executes communication between services through an intermediary Kafka broker. Each service is in control of its own topic. The topics are essentially storage units that connected services can access easily.
These services can consume and produce messages, thus reducing coupling and increasing flexibility. What’s more, developers are able to cache some information in the connected service, allowing for the full decoupling of parts of the microservice infrastructure and further boosting performance.
Interested in microservices?
See other articles from the Between Business & Tech: Episode 1: Microservices
Event-Driven Microservice Patterns
After covering the ways in which event-driven models and log brokers overcome traditional issues associated with microservices, Natan turned his attention to event-driven patterns.
Consume and Project
At Wix, a high-throughput MetaSite service stores all website metadata and installed applications.
Many microservices at Wix, including those responsible for managing user ecommerce stores and booking systems, require information from MetaSite.
Initially, the MetaSite service utilized a traditional database with a large MetaSite object that was serialized for the database schema. This database processed many read and write requests, often reaching up to one million requests per minute.
A diagram of the original MetaSite service model.
Wix’s team wanted to reduce the request load on the service to increase reliability and performance and thus avoid risking database failure.
Here’s how they did it:
- Developers took a portion of the data requests (generated by site installed applications) to see how they could remove them from the MetaSite service.
- Once site applications had been successfully modified, the corresponding metadata was also updated. The team then produced the MetaSite object to the Kafka broker with the changes.
- Wix’s team created a new service, called the “reverse lookup writer”, that only consumes events and messages regarding updates to the MetaSite. However, a filtering mechanism means that the reverse lookup writer only receives updates about site installed apps.
- Finally, the team used another service, a “reverse lookup reader,” that was responsible for drawing data from the database. This service made it possible to use optimized queries to fetch information about installed applications.
A representation of the consume and project system.
Consume and project architecture has the following advantages:
- Messages are produced instead of served, significantly reducing the load on the original database table.
- Projected information allows for highly-optimized queries like “materialized views” to be applied to specific data.
- Splitting tasks between a writer and reader service isolates scaling for a read-only database. The writes can then be stored in an original master database, creating a more resilient and scalable system.
Natan believes that by adding more services and utilizing Kafka, the MetaSite service can be converted from a monolithic architecture to a more distributed microservices system.
Event-Driven From End to End
After outlining the way a “consume and project” architecture functions, Natan described an event-driven end-to-end microservice model that allows events to have “browser-to-service interaction.”
Natan illustrated a development use case to help depict a typical event-driven model. Specifically, that of creating a long-running asynchronous process for a service that imports contacts from various platforms, like Gmail and Outlook, into the Wix customer relationship management (CRM) system.
This service could be used, for example, by a business that wishes to store contact information in one place so that it can send emails via the Wix platform.
To create this architecture, developers begin by implementing a “contacts jobs service.” This service is responsible for creating import jobs. It creates separate jobs for importing data from platforms like Gmail and Outlook and for segmenting large contact lists.
In addition, a “contacts importer service” would be set up to consume requests and implement import actions.
A long-running async business process model.
In the example above, the services are producing a request to Kafka. But requests can’t be sent back because the overall operation is asynchronous.
Because of this, the browser isn’t aware of when an import is completed. To overcome this issue, the browser must subscribe to receive WebSocket notifications from Wix’s WebSockets service, delivered when the import process is at a particular stage of completion.
For example, when a browser requests to import comma-separated values (CSVs) from the front-end service, it will also provide a WebSocket channel ID which makes it possible to send results back to the browser. The import job requests coming into Kafka will then include the channel ID.
When the consumer finishes processing a job, it updates the WebSocket service. The WebSocket service then sends the notification back to the browser in a decoupled format. The WebSocket service won’t notify the browser that multiple services are part of the importing business logic. For the user, this process is seamless.
Diagram of an “event-driven from end to end” process.
Notably, you can use Kafka to take job import requests and replicate them across different data centers internationally. This ensures that the best data center for the requirements can handle import operations. Consequently, it’s a wholly distributed global operation that is event-driven from end to end.
There are several advantages to using an event-driven system:
- This system doesn’t require polling. Statuses aren’t kept in a database with a browser repeatedly asking what’s happening. A complete event-driven flow returns notifications when jobs are completed.
- It’s scalable because it’s easy to create more import job workers as required.
- Kafka topics are easily replicable, simplifying cross-DC requests and processing.
Zero-Latency Key-Value Store
The zero-latency key-value store is a method that makes designated information readily available.
Development teams use a Kafka broker feature called “compact logs” to implement this system. The logs create a key and a value for each record, but Kafka won’t delete stored messages after one week. Instead, it will remove messages when there is an outdated value for a key.
A zero-latency key-value store representation with topics.
In the diagram above, “K1” (key one) has been updated to the value of “V2″, meaning that Kafka can remove the obsolete value of “V1” for “K1”.
The development team at Wix created a data structure called a “key-value store reader” that has an in-memory map. On service startup, it reads all the partitions from a compacted topic sequentially and creates an in-memory representation of the information.
In the diagram below, the “K1” value, “V1”, has changed to “V2” in the map. As processing progresses, keys are added incrementally.
A zero-latency key-value store model with an in-memory map.
When the startup phase finishes, there is a complete representation of every key value stored in memory. This process is best suited to data sets that are limited in size.
Memory is updated using the “key-value store writer” data structure, which has a simple producer that creates new key values. Other services with “key-value store readers” have consumers to take in information.
There are two main benefits of using the zero-latency key-value store method:
- There is zero-latency key-value storage for dynamic configurations like time zones.
- Compacted topics are still topics so other services can consume their updates regularly.
Nata emphasized that this system is not suited to large general-purpose datasets because it will take up too much memory.
Events in Transactions
“Events in transactions” is a model that guarantees that no matter how many times Kafka may accidentally process the same message, it won’t alter the state.
Natan explained how developers can automatically implement this process using new Kafka features.
Development teams occasionally encounter situations where messages are consumed multiple times because of a system restart or failure, which can lead to the same event being processed more than once.
Making each part of the system workflow more idempotent is one way to prevent accidental duplicates. Idempotent delivery ensures that the producer only sends a message once to each destination. For instance, the producer at the beginning of a flow can attach an offset to a message that alerts Kafka that it already has it.
An “events in transactions” model.
Kafka creates a transaction between the consumer and producer at the start. The consumers are downstream of the producer. Therefore, the consumer side in this system only views messages after the service in the middle successfully produces and consumes them.
The challenges of implementing an “events in transactions” system are as follows:
- This system requires larger quantities of code. A greater number of messages in a transaction leads to more latency because downstream consumers have to wait for the producer to process all the messages for a transaction.
- Kafka’s idempotence does not cover database updates and third-party calls. However, it’s possible to deduplicate writes to the database using the Kafka record offset.
Natan was clear that he wouldn’t recommend turning to an “events in transactions” model on a regular basis because of the overheads and complexity. Instead, its best use is for critical flows where exactly-once processing is mandatory, such as payments processing.
Natan shared some further resources on the subject of microservices:
- Natan has Twitter, Medium, Slideshare, and a website with previous talks, blog posts, and more.
- “The Data Dichotomy” by Ben Stopford
- “Shared database” by Chris Richardson.
- Many of the patterns Natan described have been partially implemented by Wix’s Apache Kafka SDK client for the Java Virtual Machine, Greyhound. It’s open-source and offers various features that enable parallel processing and consumption and resilient retries.
Wix developers have employed event-driven patterns to significantly improve their microservices’ decoupling processes, resilience, and scalability.
Picking the best possible model for your application will reduce the risk of system malfunctions, optimize processing speed and efficiency, and save substantial amounts of time and resources over the long term.
Contact Scalac today for assistance in creating an effective microservice framework. We would love to help you find the ideal software solution for your business.