Velo by Wix enables website creators to extend their sites by writing JavaScript code that adds custom functionality and interactions. Velo provides a full server-side runtime based on Node.js. The runtime is hosted on Wix's cloud services and provides a serverless experience - users do not need to worry about any server setup or maintenance.
The requirements for such a runtime are rather unique: it needs to support Wix’s scale of hundreds of millions of websites and must respond to user requests in under 100 ms at the 50th percentile, while also keeping resource consumption and cost in check. Another crucial requirement is complete isolation between the runtimes of different sites - since the platform executes arbitrary code written by a website creator, that code must run in an isolated, sandboxed environment and must not affect the runtimes of other websites or the overall system.
Initial Implementation
The initial implementation consisted of a Kubernetes cluster holding a pool of workers (pods in a deployment) that are up and running, waiting for user traffic. When a request comes in, a Broker service selects an arbitrary pod from the pool and prepares it to handle traffic for that site (a sketch of the label switch follows the list):
* The pod is excluded from the pool (deployment) by changing one of its labels
* The site’s code and dependencies are made accessible to the pod
* The request (and following requests to the same site) is forwarded to that pod
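The label switch is what detaches the pod: once the label no longer matches the deployment’s selector, the pod leaves the pool and is dedicated to the site. A minimal sketch of how a broker might claim a pod this way, written against the Kubernetes REST API - the label names, namespace handling, and error handling are illustrative assumptions, not Wix’s actual implementation:

```typescript
import { readFileSync } from "fs";
import { Agent } from "https";
import fetch from "node-fetch";

// Standard in-cluster service account credentials.
const token = readFileSync("/var/run/secrets/kubernetes.io/serviceaccount/token", "utf8");
const ca = readFileSync("/var/run/secrets/kubernetes.io/serviceaccount/ca.crt");
const apiServer = "https://kubernetes.default.svc";

// Claim a worker pod for a site by flipping its label so that the
// deployment's selector (assumed here to be `pool=ready`) no longer matches it.
async function claimPod(namespace: string, podName: string, siteId: string): Promise<void> {
  const res = await fetch(`${apiServer}/api/v1/namespaces/${namespace}/pods/${podName}`, {
    method: "PATCH",
    agent: new Agent({ ca }),
    headers: {
      Authorization: `Bearer ${token}`,
      // JSON merge patch: only the fields sent here are changed.
      "Content-Type": "application/merge-patch+json",
    },
    body: JSON.stringify({
      metadata: { labels: { pool: "claimed", "site-id": siteId } },
    }),
  });
  if (!res.ok) {
    throw new Error(`Failed to claim pod ${podName}: ${res.status} ${await res.text()}`);
  }
}
```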
This allows for a very short “cold start”, in which the first request for a site gets a response in less than 100 milliseconds. Subsequent requests hit the “warmed” pod and get a response even faster.
The pod serving the requests is automatically killed after a certain number of minutes so that it doesn’t consume resources for too long. Also, once a pod leaves the deployment, Kubernetes schedules a new pod to reconcile the configured number of replicas, so the worker pool size is maintained.
The pain
This architecture worked well for us. Almost.
We discovered that there are peak times, mostly at round hours, when there is a spike of requests - each to a different site and thus each requiring a new pod from the pool. During these spikes the pool of waiting workers quickly becomes exhausted, as Kubernetes is not able to schedule new pods fast enough. Also, calls against the Kubernetes API server become increasingly slow.
Increasing the size of the worker pool (deployment replica count) did not have a positive impact. On the contrary, the Kubernetes APIs seemed to perform even worse. We suspected this was due to increased load on the master nodes, but since we use Google-managed Kubernetes (GKE), we don’t have control over the master nodes, which would have allowed us to troubleshoot and tune them.
The culprit
A short investigation made it clear that the source of this traffic is recurring/scheduled jobs, one of the features offered by Velo. A user can define a job (a function) to be invoked periodically, based on a Cron expression or a specific configured time. Most of our users conveniently use a Cron expression specifying that a job should run at a particular round hour of the day, or sometimes every round hour.
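As a hypothetical example of such a job, a site owner might schedule a backend function like the one below with the cron expression “0 * * * *”, i.e. at minute zero of every hour. The function, collection, and external API are made up for illustration:

```typescript
// Velo backend module (sketch only). The site's jobs configuration maps this
// function to the cron expression "0 * * * *" - minute zero of every hour.
import wixData from "wix-data";
import { fetch } from "wix-fetch";

export async function syncPrices() {
  // A typical periodic task: iterate over a collection and call an external API per item.
  const results = await wixData.query("Products").find();
  for (const product of results.items) {
    await fetch(`https://api.example.com/prices/${product.sku}`, { method: "get" });
  }
}
```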
This usage pattern generated momentary bursts of requests going to many different sites, most of them not warmed up by user interaction traffic. Our worker grid could not produce new pods fast enough (Kubernetes autoscaling takes time to react), causing many jobs to fail because no worker pod was available to serve them. It also affected non-job-related user traffic during peak times - a classic noisy neighbor problem.
Once we understood the nature of these requests, we realized they do not actually require a quick response - there is no user waiting for them to complete. Rather, these requests often invoke functions that take longer to run, as they execute periodic tasks that may involve iterating over a collection of elements and invoking APIs on each. We then understood that the pattern of this traffic is very different from what we designed our grid for, especially since a low “cold start” time plays no role here at all.
We decided to handle this traffic separately and have two grids: the one we had already built, for synchronous (user) traffic, and a new one for asynchronous traffic, where the triggering request does not expect an immediate response.
Available technologies
Searching for a solution, we explored relevant technologies in the Kubernetes ecosystem. We first considered Kubernetes Jobs. A Kubernetes Job creates one or more pods and continues to retry their execution until a specified number of them terminate successfully.
The first idea was to spawn a Kubernetes Job for each job execution (request). However, this approach could suffer from the same problem of too many pods needing to be created at peak times, overloading the cluster and making the Kubernetes API server unresponsive.
Another approach was to configure a single Kubernetes Job with an infinite number of pods that would each read a message (job) from a queue, as suggested here. However, this approach has several drawbacks that made it unfit for our case, such as a static configuration of job parallelism rather than parallelism that adapts dynamically to load.
We also considered KEDA, the Kubernetes event-driven autoscaler, to autoscale Kubernetes Jobs based on queue length (the number of pending messages). However, we thought the same issue of overwhelming Kubernetes during spikes could occur here as well. Finally, we came to the conclusion that existing Kubernetes constructs simply do not precisely fit our needs and we would need to create our own solution.
An asynchronous grid
We looked at how to design our system around a task queue, where each task in the queue would be a job invocation request. We wanted to consume work from the queue whenever resources in the grid were available to handle it.
Additionally, we wanted to monitor that work consumption to make sure it happened fast enough to maintain our SLA. Even though our SLA for jobs is quite loose, we still wanted jobs to start processing close enough to their scheduled time.
Implementation
We built a system composed of a “Jobs Dispatcher” service that is invoked whenever a job is scheduled to execute, and a pool of pods (a Kubernetes deployment) that handles these jobs asynchronously. Google Pub/Sub is the messaging system that enables their indirect interaction.
When a job is scheduled to execute, the Jobs Dispatcher encapsulates it in a message and publishes it to a Google Pub/Sub topic. At the other end, the pods in the deployment each consume a single message from the Google Pub/Sub subscription and process it.
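A sketch of the dispatcher side, using the Node.js Pub/Sub client - the topic name and message fields are illustrative, not the actual Wix schema:

```typescript
import { PubSub } from "@google-cloud/pubsub";

const pubsub = new PubSub();
// Hypothetical topic for job invocation requests.
const topic = pubsub.topic("velo-job-invocations");

// Called whenever a job is scheduled to execute: wrap the invocation
// details in a message and hand it off to Pub/Sub.
export async function dispatchJob(siteId: string, jobId: string, functionName: string): Promise<string> {
  const messageId = await topic.publishMessage({
    json: { siteId, jobId, functionName, scheduledAt: new Date().toISOString() },
  });
  return messageId;
}
```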
Let’s dive a little deeper into the structure of a pod in the deployment: a pod consists of a web server application container, whose role is to execute the custom code written for the job, and a sidecar container called the “Job supervisor”. The Job supervisor (a sketch of this loop follows the list):
* Waits for messages to be available on the Pub/Sub subscription
* Once messages are available, pulls a single one
* Detaches the pod from the deployment (by changing one of its labels)
* Downloads the custom code associated with the job’s site and makes it available to the application container
* Triggers the job execution on the application container via a REST call
* Acknowledges the message consumption on the Pub/Sub subscription
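A rough sketch of that supervisor loop, using the Pub/Sub synchronous pull API so that exactly one message is taken at a time. The subscription name, the application container’s local endpoint, and the message payload shape are assumptions for illustration; the label patch and code download steps are omitted (see the broker sketch above for the label change):

```typescript
import { v1 } from "@google-cloud/pubsub";

const subClient = new v1.SubscriberClient();
// Assumed project and subscription names.
const subscription = subClient.subscriptionPath("my-project", "velo-job-invocations-sub");

async function superviseOneJob(): Promise<void> {
  // 1. Pull a single message (may return empty; the caller simply retries).
  const [response] = await subClient.pull({ subscription, maxMessages: 1 });
  const received = response.receivedMessages?.[0];
  if (!received?.message?.data || !received.ackId) return;

  const job = JSON.parse(Buffer.from(received.message.data as Uint8Array).toString());

  // 2. Detach the pod from the deployment and fetch the site's code (omitted here).

  // 3. Trigger the job on the application container over localhost (Node 18+ global fetch).
  await fetch("http://localhost:8080/invoke", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(job),
  });

  // 4. Ack the message so Pub/Sub doesn't redeliver it.
  await subClient.acknowledge({ subscription, ackIds: [received.ackId] });
}
```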
A pod is killed after 5 minutes so that it doesn’t consume resources for too long; if the job isn’t done by then, it fails. The 5 minutes serve as a soft limit - for sites with a specific need for longer execution times, the limit may be raised. Once the pod is detached from the deployment, a new pod is created and scheduled by Kubernetes to maintain the deployment’s configured replica count.
Google Pub/Sub associates an acknowledgement deadline with a subscription. If a pulled message is not acked within the deadline, it is made available for consumption again (following a configured exponential backoff policy), up to a certain number of retries. Google Pub/Sub also allows configuring a dead letter queue, to which messages that weren’t acknowledged after several attempts are forwarded.
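These behaviors are all properties of the subscription. A hedged sketch of what creating such a subscription might look like with the Node.js client - the names, deadline, backoff bounds, and delivery attempts below are illustrative, not Wix’s actual settings:

```typescript
import { PubSub } from "@google-cloud/pubsub";

const pubsub = new PubSub();

async function createJobsSubscription(): Promise<void> {
  await pubsub.topic("velo-job-invocations").createSubscription("velo-job-invocations-sub", {
    // If the supervisor doesn't ack within this window, the message is redelivered.
    ackDeadlineSeconds: 600,
    // Redeliveries back off exponentially between these bounds.
    retryPolicy: {
      minimumBackoff: { seconds: 10 },
      maximumBackoff: { seconds: 600 },
    },
    // After several failed delivery attempts, forward the message to a dead letter topic.
    deadLetterPolicy: {
      deadLetterTopic: "projects/my-project/topics/velo-job-invocations-dlq",
      maxDeliveryAttempts: 5,
    },
  });
}
```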
These features helped us build a robust system that is resilient to temporary failures. The monitoring tools also help us get alerted quickly whenever something goes unrecoverably wrong (like messages ending up in the dead letter queue).
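One useful signal here is how stale the oldest unacknowledged message in the subscription is - if it grows, jobs are not being consumed fast enough to meet the SLA mentioned earlier. A sketch of reading that metric from Cloud Monitoring; in practice this would be an alerting policy rather than ad hoc code, and the project and subscription names are placeholders:

```typescript
import { MetricServiceClient } from "@google-cloud/monitoring";

const monitoring = new MetricServiceClient();

// Returns the latest "oldest unacked message age" (in seconds) for a subscription.
async function oldestUnackedMessageAge(projectId: string, subscriptionId: string): Promise<number> {
  const now = Math.floor(Date.now() / 1000);
  const [series] = await monitoring.listTimeSeries({
    name: `projects/${projectId}`,
    filter:
      'metric.type="pubsub.googleapis.com/subscription/oldest_unacked_message_age" ' +
      `AND resource.labels.subscription_id="${subscriptionId}"`,
    // Look at the last five minutes of data points.
    interval: { startTime: { seconds: now - 300 }, endTime: { seconds: now } },
  });
  const latestPoint = series[0]?.points?.[0];
  return Number(latestPoint?.value?.int64Value ?? 0);
}
```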
Overall, the system we built solved the issues we had with the async jobs traffic. When the grid is overloaded, requests are not dropped; rather, they are stored in Google Pub/Sub until subscriber pods come up and are available to handle the work. We also avoid putting too much load on the Kubernetes API server: a smaller deployment size bounds the number of pods being scheduled by Kubernetes at any given time.
Additionally, the new grid is deployed on a different node pool, so it no longer causes noisy neighbor issues that affect other types of traffic.
Conclusion and future improvements
We discovered that our initial architecture was not the best fit for all of our traffic patterns. We were able to identify the problematic type of traffic and move it to a separate system purposely designed for it.
We built a grid that is inherently different from the original synchronous grid. One of the key differences is that an async grid pod handles a single request (job) whereas a sync grid pod handles many requests for the same site.
The reasoning stems from the nature of Wix sites’ traffic: many sites are active only for short time intervals. While a site is active, it requires a running pod - only the first request allocates a new pod from the pool, and subsequent requests reuse that pod. Jobs, on the other hand, execute even when the site has no active traffic. Therefore each job invocation requires a pod from the pool with no pod reuse, making the grid distributed and stateless.
As a next phase we would like to grow and shrink our Kubernetes cluster ahead of time to prepare for the coming load. Since we know the job schedules in advance, we can predict the expected load.
A possible approach is to use an application-specific metric reporting the number of jobs scheduled for the coming minutes, and a Horizontal Pod Autoscaler (HPA) driven by that metric.
The HPA will grow or shrink our worker pool (deployment) and adjust its capacity. In turn, it may indirectly trigger the Cluster Autoscaler to grow the cluster horizontally, adding new cluster nodes before the actual load hits.
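A sketch of how such a metric could be exposed, using the prom-client library so a custom/external metrics pipeline can feed it to the HPA. The metric name, the lookahead window, and the countJobsScheduledBetween helper are hypothetical:

```typescript
import express from "express";
import { Gauge, register } from "prom-client";

// Hypothetical placeholder: a real implementation would inspect the jobs schedule store
// and count jobs whose next cron firing falls inside the window.
async function countJobsScheduledBetween(from: Date, to: Date): Promise<number> {
  return 0;
}

const upcomingJobs = new Gauge({
  name: "velo_jobs_scheduled_next_5m",
  help: "Number of jobs scheduled to fire within the next 5 minutes",
});

// Refresh the gauge periodically from the jobs schedule.
setInterval(async () => {
  const now = new Date();
  const inFiveMinutes = new Date(now.getTime() + 5 * 60 * 1000);
  upcomingJobs.set(await countJobsScheduledBetween(now, inFiveMinutes));
}, 30_000);

// Expose the metric for a metrics adapter to scrape and feed to the HPA.
const app = express();
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});
app.listen(9090);
```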
This post was written by Vered Rosen