From Chaos to Control: How We Rebuilt Reliability in a World of Microservices (Wix Engineering’s Battle-Tested Playbook)

Microservices Reliability Playbook, 2025: You can find the link to download the full PDF at the bottom of this post

The Microservices Trade-Off

“When you have a monolith, you’ve got one big problem to solve. Switch to microservices, and now you’ve got 99 smaller problems - plus a distributed system.” - Yoav Abrahami

At Wix, we’ve spent over a decade building and scaling one of the most complex and high-traffic platforms on the web. As our user base and product offering expanded, the pressure on our systems grew as well - forcing us to rethink our architecture.

Our monolithic architecture, which served us well in the early days, began showing its limits. Development slowed down. Teams became blocked by one another. A change in one area could have unintended consequences in another. To keep up with our scale and velocity, we embraced microservices.

Empowering Teams Through Architecture

This transition wasn’t just a technical decision - it was an organizational one. Microservices enabled us to break apart our platform into modular, independently deployable units. Each team could now take full ownership of a specific domain: develop it, deploy it, and monitor it, without needing to coordinate with the entire engineering org.

This autonomy gave us speed, flexibility, and the ability to innovate at scale. But it also introduced a new dimension of complexity: how do you ensure system reliability when everything is distributed and constantly changing?

When Complexity Becomes Risk

As more services came online, we began to experience a new class of challenges. These weren’t massive, visible failures - but subtle, cascading ones.

A small bug in one service could silently break a critical user flow elsewhere. A slight increase in latency might ripple through dozens of systems before anyone noticed. Deployments became harder to manage, incidents became harder to resolve, and observability gaps grew.

What we realized is this: microservices solve the problem of organizational scale - but introduce operational risk as the new bottleneck.

Building Our Microservices Reliability Framework

To regain control, we focused first on live Wix sites. Serving a live site means delivering the site's HTML, JS, and CSS and the media files it uses, as well as running key services such as ecommerce checkout and user login.

The other big area of Wix is the system used to build websites: the Wix Editor(s), the Business Manager for managing users, products, payments, and domains, and many other concerns that are important for running a company but not directly related to serving a live site.

In 2008, we decided to implement the Split by SLO pattern, defining two segments: the Wix Public Segment and the Wix Editor Segment. The Wix Public Segment has higher SLO requirements, and its key services ended up running on two different clouds, on two different stacks.

Learning from this first implementation of a reliability architectural pattern, we went on to apply other reliability patterns across different systems, complemented by a framework that allows us to measure, predict, and improve the reliability of every component in our architecture.

This framework became the foundation for how we design, monitor, and evolve our services today. After years of refining it through hands-on experience, we’re making it available to the broader engineering community as the Microservices Reliability Playbook.

What’s Inside the Playbook

A breakdown of the four key software risks: malfunctions, security, change, and load/latency
A formula to predict system reliability using network hops and number of artifacts
An explanation of how to define and measure SLOs: availability, latency, error rate, and correctness
Alerting strategies that minimize noise while catching real reliability issues
A catalog of architectural patterns including Reader-Writer, CQRS, Fallbacks, Circuit Breakers, Multi-Writers, and more
Real-world examples, like the commerce cart use case, that illustrate how reliability breaks down - and how to build it back up
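
To give a feel for why the number of network hops matters, here is a minimal sketch of how availability compounds along a request path. It is not the playbook's formula, which this post does not reproduce. It relies on the standard simplifying assumption that hops fail independently and that a request succeeds only if every hop succeeds. The function name `serial_availability` and the 99.9% per-hop figure are illustrative assumptions.

```python
def serial_availability(per_hop_availability: float, hops: int) -> float:
    """Estimate end-to-end availability of a request that crosses `hops` hops.

    Assumption (not taken from the playbook): hops fail independently, and
    the request succeeds only if every hop succeeds, so the per-hop
    availabilities multiply.
    """
    if not 0.0 <= per_hop_availability <= 1.0:
        raise ValueError("per_hop_availability must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_availability ** hops


if __name__ == "__main__":
    # Illustrative figure only: every hop is assumed to be 99.9% available.
    for hops in (1, 10, 50, 100):
        estimate = serial_availability(0.999, hops)
        print(f"{hops:>3} hops -> estimated availability {estimate:.4%}")
```

Under these assumptions, a chain of individually reliable services is noticeably less reliable than any single one of them, which is why patterns that shorten the critical path or add fallbacks can have an outsized effect.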

This guide, written by Yoav Abrahami, is for engineers, architects, and team leads who want to build systems that don’t just scale - but that keep working under pressure. If you're building microservices today or planning to make the move, this is the kind of deep, practical thinking that will help you avoid costly mistakes and improve your systems' reliability over time.

We’ve seen firsthand that the shift to microservices brings real benefits - but only when backed by solid architecture and thoughtful design. This playbook shares the mindset, tools, and patterns that have helped us scale Wix’s infrastructure while staying resilient.

Download the full playbook here:

Microservices Reliability Playbook - Full PDF

This guide was written by Yoav Abrahami

You can follow him on X
