
Building a High-level SDK for Kafka: Greyhound Unleashed



Over the past 5 years, the Wix backend services group has become increasingly reliant on Apache Kafka for inter-service communication. In fact, over 300 of our developers use Kafka every day.


My team uses Kafka to provide abstractions for patterns that are common in Wix microservices. We achieve that via Greyhound - our high-level Scala/Java SDK for Kafka, which we developed in-house and recently released as open source. In this post we’ll discuss the motivation behind creating Greyhound and look into some of its main features (internal and public) and how they help us with our daily operations - from writing code to monitoring in production.



How it all started


The story of adopting Kafka at Wix.com started out quite modestly, back in 2015, as an experimental alternative for services using ActiveMQ as a message queue.


One of the first decisions we made when adopting Kafka was to wrap the common producer and consumer functionality in a declarative, lightweight library, and to allow access to Kafka only via that library. Our reasoning was that developers shouldn’t be dealing with `while(true)` loops and offset commits - instead they should simply provide a callback to handle events.
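The difference is easy to see in a sketch. With the raw consumer API, every service owns its own poll loop and offset bookkeeping; with a callback-based wrapper, the library owns both. The code below is a hypothetical in-memory illustration of that pattern (a queue stands in for a Kafka topic) - it is not Greyhound's actual API.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// The wrapper owns the poll loop and the "offset commits"; callers only
// supply a callback with their business logic.
public class CallbackConsumer {
    private final BlockingQueue<String> topic = new LinkedBlockingQueue<>();
    private volatile long committedOffset = 0; // stands in for Kafka offset commits

    public void produce(String record) { topic.add(record); }

    // The consumer side: developers register a handler, not a loop.
    public void consumeAll(Consumer<String> handler) {
        String record;
        while ((record = topic.poll()) != null) {
            handler.accept(record); // user code: business logic only
            committedOffset++;      // "commit" is handled by the wrapper
        }
    }

    public long committedOffset() { return committedOffset; }

    public static void main(String[] args) {
        CallbackConsumer consumer = new CallbackConsumer();
        List<String> handled = new CopyOnWriteArrayList<>();
        consumer.produce("order-1");
        consumer.produce("order-2");
        consumer.consumeAll(handled::add); // no loop, no offsets in user code
        System.out.println(handled + " committed=" + consumer.committedOffset());
    }
}
```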


We named that new library "Greyhound" - back then its first versions provided mostly boilerplate reduction and a nice opinionated DSL for consuming and producing events.



Getting deeper into the weeds


More and more common use cases quickly became apparent, and work on Greyhound steadily continued. Inspired by what the folks at Uber had done, we implemented a common retry mechanism. Next, type safety was introduced, giving developers an interface that expects typed values instead of byte arrays and strings.


Right from the beginning, Greyhound set the message protocol to a JSON object containing two fields: {payload: String, headers: Map[String -> String]}. It automatically added metadata to events, such as server name, timestamps, etc. Remember, at that point Kafka didn’t have built-in headers or timestamps, and providing these features in a common infrastructure layer proved helpful, as developers could focus on logic and worry less about mechanics.
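Schematically, assembling such an envelope looks like this. This is a hand-rolled sketch: the header names ("serverName", "timestamp") are illustrative, and a real implementation would use a proper JSON library rather than string concatenation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Envelope {
    // Wraps a payload in the {payload, headers} protocol described above.
    static String wrap(String payload, Map<String, String> headers) {
        StringBuilder sb = new StringBuilder("{\"payload\":\"").append(payload)
                .append("\",\"headers\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"")
              .append(e.getValue()).append("\"");
            first = false;
        }
        return sb.append("}}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        // Metadata the library added automatically, before Kafka had
        // built-in headers and timestamps (names are illustrative):
        headers.put("serverName", "orders-service-1");
        headers.put("timestamp", "1596021600000");
        System.out.println(wrap("order-completed", headers));
    }
}
```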



Present day - handling billions of events


It’s now nearly impossible to find a Wix backend service that doesn’t use Kafka - whether directly or via middleware libraries. Kafka has become the backbone for hundreds of services, and we use our own Greyhound back office to view topics and consumer lags, set up replication, configure topics, and much more.

Our Kafka clusters process over 1 billion events per day, and we’re still growing rapidly - beyond Wix’s overall traffic growth. In fact, that billion mostly comprises business events; we plan to migrate analytical events to Kafka soon, which will increase these numbers by an order of magnitude.



Peeking under the hood


Greyhound is special in many regards - it features a range of consumption strategies and capabilities:


  • Concurrent message processing

  • Consumer per-message

  • Retries

  • Batch processing

  • Automatic request context propagation (over async boundaries)

  • Declarative APIs

  • Common boilerplate reduction

  • and much more...


We’re going into greater detail about Greyhound’s inner workings in a separate article, so stay tuned. For now though, let’s talk about just a few things that make it special.


First, Kafka’s consumer API expects developers to deal with polling records and committing offsets, which has nothing to do with our business logic. This is where Greyhound comes in - a consumer API that takes care of messy wiring and invokes custom code when receiving records, thus allowing developers to focus on writing the actual business logic instead of wasting time dealing with boilerplate.



Why not just use Kafka Streams?


We know that Kafka Streams provides high-level streaming APIs and is great for apps that require stream processing. Greyhound, however, does not address the same problem - it offers a simple abstraction for the “edge consumer”: a process that executes actions against 3rd-party or unreliable services.


Those include things like email-sending services, webhook dispatchers, legacy system updaters, etc. These are typically unreliable - the IO they perform can be slow or can fail outright. This is why we want these tasks to be parallelized efficiently.

Now, while Kafka Streams requires a separate consumer for each “parallelism unit”, Greyhound uses a single Kafka consumer to subscribe to multiple partitions and process them in parallel using threads or non-blocking I/O.
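The idea can be sketched without Kafka at all: a single “consumer” loop fans records out to per-partition workers, so partitions proceed in parallel while ordering within each partition is preserved. This is a simplified model of the approach under those assumptions, not Greyhound’s actual internals.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionedDispatcher {
    // One single-threaded executor per partition keeps ordering within a
    // partition while letting different partitions run concurrently.
    private final Map<Integer, ExecutorService> workers = new ConcurrentHashMap<>();

    void dispatch(int partition, Runnable handler) {
        workers.computeIfAbsent(partition, p -> Executors.newSingleThreadExecutor())
               .execute(handler);
    }

    void shutdown() throws InterruptedException {
        for (ExecutorService w : workers.values()) {
            w.shutdown();
            w.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PartitionedDispatcher dispatcher = new PartitionedDispatcher();
        Map<Integer, List<Integer>> seen = new ConcurrentHashMap<>();
        // A single "poll loop" fanning out records from 4 partitions:
        for (int offset = 0; offset < 100; offset++) {
            final int partition = offset % 4, o = offset;
            dispatcher.dispatch(partition, () ->
                seen.computeIfAbsent(partition, p -> new CopyOnWriteArrayList<>()).add(o));
        }
        dispatcher.shutdown();
        // Partition 0 saw offsets 0, 4, ..., 96, strictly in order:
        System.out.println(seen.get(0).get(0) + "," + seen.get(0).get(24));
    }
}
```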


A quick example: there’s a consumer group we’ll call “mainframe-updater”. It consumes a topic called “orders-completed” and updates some legacy system. That legacy system (for reasons no one understands) requires strict ordering. In this setup, the average latency for each update is around 250 milliseconds, and the topic’s throughput is around 1,000 messages per second.


You can probably guess where this is headed - to process this workload effectively we’ll need at least 250 partitions (each partition can handle 4 messages a second, which multiplied by 250 gets us to 1,000), and they all need to process records simultaneously.
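The arithmetic behind that number is simple: strict ordering means at most one in-flight message per partition, so a partition’s throughput is capped at 1 / latency.

```java
public class PartitionMath {
    public static void main(String[] args) {
        double latencySeconds = 0.250;  // per-message processing latency
        double targetThroughput = 1000; // messages per second for the topic
        // One in-flight message per partition caps each partition at
        // 1/latency messages per second:
        double perPartitionThroughput = 1.0 / latencySeconds; // 4 msg/s
        long partitionsNeeded = Math.round(targetThroughput / perPartitionThroughput);
        System.out.println(perPartitionThroughput + " msg/s per partition, "
            + partitionsNeeded + " partitions");
    }
}
```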


In Kafka Streams this would require 250 consumers (configuring num.stream.threads actually creates a separate Kafka consumer per stream thread). This puts a lot of burden on the app, as well as on the brokers.


With Greyhound, we can use a single consumer process to handle all of that workload. It won’t spawn unnecessary threads, and it can leverage non-blocking user code where possible (all APIs in Greyhound are non-blocking).



So what is Greyhound's claim to fame?


While Kafka is often used for handling analytics data, Greyhound was built to let Kafka serve as a “classic” message broker: one that handles business events you are far less willing to lose, with stronger guarantees that messages will be delivered and processed correctly.



Check it out on GitHub


Did we mention that there’s a beta open source version of Greyhound available now? We’ll soon publish another article dedicated to Greyhound that dives much deeper into its inner workings and ways to use it. But if you want to check it out right now, see its GitHub page or the artifacts on Maven Central.


 

This post was written by Noam Berman

 
