The Great Rewrite - How Wix is Preparing to Rewrite 100s of Systems

Abstract

Whether you are breaking down a monolith or rewriting a legacy backend service, all companies need to handle the challenge of rewriting systems. At Wix, we found ourselves in the process of rewriting hundreds of services and needed to plan accordingly.

If you missed the first part of this article, where we …., you can find it here: "The Great Rewrite - How Wix is Preparing to Rewrite 100s of Systems - Part 1".

In our previous post, we presented the reasons why companies rewrite their services and described the challenges involved, both in general and specifically within Wix. We then described what led us to write guidelines, as well as how we decided which general approach to use for data synchronization between the systems.

Photo by Hanson Lu on Unsplash

In this post, we will deep dive into each step of the guidelines we created:

Defining the scope
System remodeling
Code rewrite
Ensuring backwards compatibility with the old API
Rollout to new tenants
Data migration
Compare
Rollout to existing tenants

Defining the scope

We first need to define the scope of the migration.

Choosing which services to rewrite

Creating guidelines for defining the project scope proved to be the most challenging aspect for us as guideline writers. There is no magic formula; you need to know your domain.

A scope that is too big leads to a large and time-consuming project, making it harder to generate business value. Every new feature needs to be added to both systems (sometimes managed by two different teams), which is challenging because the team working on V2 constantly needs to catch up with the additions to V1. As a result, a large scope clashes with the principle of getting to production as fast as possible in order to face the issues that arise from real traffic.

On the other hand, a scope that is too small can lead to the wrong design and involves more throwaway work because of the integrations between the old system and the new one. Should you choose to split the rewrite into phases, we suggest doing the following:

Treat each phase as a standalone rewrite and go through all the stages explained here.
Choose a whole business flow or entities at the edges to reduce the amount of integrations between the old system and the new one.

The diagram below demonstrates the concept of rewriting an entire system. We decided to use a properly modeled microservices system as an example, however the same approach would work for a monolith or any other system.

The diagram depicts a simplified version of an eCommerce system. Rewriting the entire system would mean creating a V2 for every service. These services would then only need to integrate between themselves.

Alternatively, one might decide to rewrite only the Orders service, as shown in the following diagram:

Above we can see the pros and cons of each option.

If we only rewrite the Orders service, the project scope becomes smaller, but it comes at a cost. The Cart service, which is an older service that we would rather not change, now has to integrate with OrdersV2. Assuming that the goal is to eventually rewrite the Cart service as well, such an integration is considered throwaway work.

By rewriting the entire system we avoid additional work, but the scope of the project becomes much larger, extending the period of time until it can get production traffic.

Parity only vs adding new features

Once you decide which services to rewrite, an important consideration is whether you are aiming for feature parity or planning to include new features in the rewrite. You may even choose to initially exclude features from the new system; we will explain later how this can be done.

Generally speaking, rewriting a system is a good time to remodel and therefore a good opportunity to add new features that would otherwise require data migration.

Tension with product management

The above choices around scope can cause tension between product management and developers:

Developers prefer to take as much time as needed in order to design and implement the perfect system, while product managers often want to finish the rewrite as fast as possible so they can continue to deliver new features.
Developers often want to rewrite a larger part of the system to ensure a cleaner final product. However, product managers would rather do the rewrite in smaller segments.
Developers prefer feature parity, whereas product managers would rather have features get added to the new system “along the way”.

As always, the ideal outcome lies somewhere in between. Reaching it means evaluating the cost against the potential gains and making informed decisions.

System remodeling

Proper modeling of the system in terms of entities and services is a key principle in microservice design. This is especially true at Wix, where we use the API-first paradigm. As a result, the output of the modeling phase at Wix includes the complete API of the entities, with their exact properties and endpoints.

Code rewrite

Write the code for the new services as a “green field”. When first thinking about a rewrite, we often just consider the code rewrite phase. However, this phase is actually just a small part of a much bigger picture. As mentioned earlier, Wix has invested in server infrastructure over the last 4 years in order to simplify this process, making it possible to set up a new service within a few hours and focus only on business logic, rather than writing boilerplate code.

Thanks to this infra, we’ve seen effort estimations dropping significantly from weeks to days and even hours.

Ensuring backwards compatibility with the old API

Once you have the new system written, it is time to roll it out, that is, to route production traffic to it. Before this can be done, the existing API must be considered. The assumption we made (which applies to the vast majority of services at Wix) is that the service usually has multiple clients, and transitioning these clients to work with the new API isn’t easy.

In many cases, these clients are external, making it that much harder to get them to update their code. We therefore decided that the correct approach is to create a dedicated proxy service that ensures backward compatibility. This proxy:

Receives all traffic that was previously sent to V1.
Contains the routing logic that determines whether to route each request to V1 or V2.
Contains the logic for converting V1 requests into V2 requests and V2 responses into V1 responses.
Accepts events from V2 and publishes them as V1 events.
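The responsibilities listed above can be illustrated with a minimal sketch. It is not Wix's proxy infrastructure: the Orders request and response shapes, the field names, the event names and the OrdersProxy class are all hypothetical, and the V1 and V2 clients are passed in as plain callables so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical request shapes for an Orders API; all field names are invented.
@dataclass
class V1GetOrderRequest:
    site_id: str
    order_id: str

@dataclass
class V2GetOrderRequest:
    tenant_id: str
    id: str

# Hypothetical mapping from V2 event types to the names V1 consumers expect.
V1_EVENT_NAMES = {"order.created": "OrderCreated", "order.updated": "OrderUpdated"}

def to_v2_request(req: V1GetOrderRequest) -> V2GetOrderRequest:
    """Convert a V1 request into its V2 equivalent."""
    return V2GetOrderRequest(tenant_id=req.site_id, id=req.order_id)

def to_v1_response(v2_order: Dict) -> Dict:
    """Convert a V2 response back into the shape V1 clients expect."""
    return {"orderId": v2_order["id"], "total": v2_order["price"]["amount"]}

def to_v1_event(v2_event: Dict) -> Dict:
    """Convert a V2 business event into a V1 event before publishing it."""
    return {
        "eventType": V1_EVENT_NAMES[v2_event["type"]],
        "order": to_v1_response(v2_event["entity"]),
    }

class OrdersProxy:
    """Keeps the V1 API stable while deciding per request which system serves it."""

    def __init__(
        self,
        v1_get_order: Callable[[V1GetOrderRequest], Dict],
        v2_get_order: Callable[[V2GetOrderRequest], Dict],
        use_v2: Callable[[str], bool],
    ):
        self.v1_get_order = v1_get_order
        self.v2_get_order = v2_get_order
        self.use_v2 = use_v2  # routing rule, keyed by site id

    def get_order(self, req: V1GetOrderRequest) -> Dict:
        if self.use_v2(req.site_id):
            v2_order = self.v2_get_order(to_v2_request(req))
            return to_v1_response(v2_order)
        return self.v1_get_order(req)
```

Keeping the routing rule as an injected function means the same proxy can serve every rollout stage: only the rule changes, while the conversions stay untouched.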

Why create a separate service for the proxy?

Technically speaking, it’s possible to use the old system as the API proxy. However, we have several reasons to favor setting up a separate service:

Once the migration is done, you will want to shut down the old system, while the proxy service remains operational until all clients have moved to the new API.
The old system’s code is often outdated, not maintained or based on old technology. As a result, you would probably prefer to avoid making changes to it.
Although it may seem like the proxy doesn’t involve too much code, in reality it manages multiple aspects:
Routing logic, which determines whether specific requests are routed to the old or new system.
Conversion of API calls between V1 and V2.
Conversion of business events between V2 and V1.
Comparison between the systems. During rollout, the proxy can respond to a read call by invoking the read on both systems, returning the result from the old system, and then comparing the two results and reporting any differences. This helps identify issues without affecting real users.
Since the proxy is built using new technology, it is much easier to provide infrastructure for it. At Wix, we’ve developed a ready-to-use proxy service infrastructure, where the only requirement is the implementation of the conversion between V1 and V2.

Rollout to new tenants

We highly recommend opening the new system to new tenants first whenever possible. For Wix, this means new sites will use the new system, while the existing sites will continue to use the old system. The advantages of such an approach include the following:

It allows getting to production faster by deferring the concern of data migration to a second phase. This means getting real user traffic sooner, which helps surface issues at an earlier stage.
While the new system will most likely have some bugs, these bugs will not impact existing users. For Wix, this means that existing users, who could be businesses with millions of daily visitors, won’t be affected.
It provides the possibility to introduce new features exclusively for new users before the full data migration is completed.
By deciding to scope down and not support certain V1 features (at least initially), the time to production can be reduced even further.
It provides better team morale as the team gets to see their hard work in production sooner.

This is done by the API proxy routing all requests for new users to V2. Note that since rollback isn’t possible in this case, any bugs found at this stage need to be resolved quickly.

Data migration

Once the system has been operational for a while for all new tenants, the next step is to migrate the existing tenants' data. This means ensuring that all data present in the old database also exists in the new one.

This migration process typically includes two stages.

Lazy migration - A process where, upon writing to the old system, the same data is simultaneously written to the new one. This is often done by a consumer that listens to V1 change events.
Eager migration - A script that copies all existing data.

It’s important to note that the above method assumes that your system can’t afford any downtime; otherwise, an eager-only migration would suffice. For Wix, this means we can’t afford to shut the systems down even for a few minutes, let alone for the days or weeks required to copy over the huge amounts of data that we have.

The diagram below demonstrates these stages:

We decided to take a different approach and use change data capture (CDC), building our infrastructure on top of Debezium, in order to simplify the process above.

Instead of developers having to handle eager and lazy migrations separately (increasing the risk of corruption), they only have to write a single migration function that converts between the data structure of the old DB and the structure of the new DB.

The CDC connector first streams all the existing data (eager migration) into Kafka where the consumer picks it up and writes it to the new DB. After streaming all the existing data, the connector continues to stream ongoing changes (lazy migration) via the same function.
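To make the "single migration function" idea concrete, here is a minimal sketch under stated assumptions. The event shape is a simplified version of Debezium's change event envelope (an "op" field with "r" for snapshot reads, "c" for creates, "u" for updates and "d" for deletes, plus "before" and "after" row images). The order fields, the migrate_order function and the in-memory dictionary standing in for the new DB are all hypothetical, and the Kafka consumer loop is left out.

```python
from typing import Dict

def migrate_order(v1_row: Dict) -> Dict:
    """The single conversion from an old DB row to the new DB's entity."""
    return {
        "id": str(v1_row["order_id"]),
        "tenant_id": v1_row["site_id"],
        "total_cents": v1_row["total_cents"],
    }

def apply_change_event(event: Dict, new_db: Dict[str, Dict]) -> None:
    """Apply one change event to the new DB.

    Snapshot reads ("r") and ongoing creates/updates ("c", "u") go through
    the same migrate_order function, so there is only one code path to test.
    """
    if event["op"] == "d":
        # Deletes carry the old row in "before" and no "after" image.
        new_db.pop(str(event["before"]["order_id"]), None)
        return
    entity = migrate_order(event["after"])
    # Upsert by id, so replaying an event leaves the new DB unchanged.
    new_db[entity["id"]] = entity
```

Writing the consumer as an upsert keyed by id keeps it idempotent, which matters because a stream consumer may see the same event more than once.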

Compare

To boost confidence in the new system prior to rolling it out to existing users, we recommend comparing the read results between the old and the new systems.

During this “compare” phase, for every read call from an existing client, we do the following:

Make a call to V1.
Return the result.
Simultaneously, in the background, make a call to V2 and compare the results. As explained previously, we already have the logic needed to convert V1 requests to V2, as well as V2 responses to V1.
Report any discrepancies to a central dashboard, allowing developers to detect and fix any bugs.
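The steps above can be sketched as follows. This is an illustration rather than Wix's implementation: ComparingReader and its parameters are hypothetical names, the V2 call is assumed to already include the request and response conversions, and the report callback stands in for whatever feeds the central dashboard. In this sketch the V2 call is scheduled on a thread pool right after V1 answers, so it never delays the response.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

class ComparingReader:
    """Serves reads from V1 and compares them against V2 in the background."""

    def __init__(
        self,
        read_v1: Callable[[Any], Any],
        read_v2_as_v1: Callable[[Any], Any],  # V2 call wrapped in the V1<->V2 conversions
        report: Callable[[Any, Any, Any], None],  # (request, v1_result, v2_result_or_error)
        executor: ThreadPoolExecutor,
    ):
        self.read_v1 = read_v1
        self.read_v2_as_v1 = read_v2_as_v1
        self.report = report
        self.executor = executor

    def read(self, request: Any) -> Any:
        v1_result = self.read_v1(request)
        # Fire and forget: the caller always gets the V1 result.
        self.executor.submit(self._compare, request, v1_result)
        return v1_result

    def _compare(self, request: Any, v1_result: Any) -> None:
        try:
            v2_result = self.read_v2_as_v1(request)
        except Exception as exc:
            # A failing V2 call is also a discrepancy worth reporting.
            self.report(request, v1_result, repr(exc))
            return
        if v2_result != v1_result:
            self.report(request, v1_result, v2_result)
```

Catching exceptions inside the background task is deliberate: a V2 failure should show up as a reported discrepancy and must never reach the user.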

Rollout to existing tenants

Once the data is fully copied to the new system, you can start rolling out the new system to existing tenants gradually.
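The article does not specify how the gradual rollout is decided, so the following is only one common way to do it, offered as a sketch: hash each tenant id into a stable bucket and route it to V2 once the rollout percentage passes that bucket. The function names and the is_new_tenant flag are hypothetical.

```python
import hashlib

def rollout_bucket(tenant_id: str) -> int:
    """Map a tenant to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def use_v2(tenant_id: str, is_new_tenant: bool, rollout_percent: int) -> bool:
    """Routing rule for the proxy: new tenants always use V2,
    existing tenants move over as rollout_percent grows from 0 to 100."""
    if is_new_tenant:
        return True
    return rollout_bucket(tenant_id) < rollout_percent
```

Because the bucket depends only on the tenant id, a tenant that has been moved to V2 stays there as the percentage increases, and lowering the percentage sends the most recently added tenants back to V1 first.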

Conclusion

While migration is a complex process, at Wix we were able to come up with guidelines to make it easier. Following these guidelines has already allowed many services to successfully migrate to our new infrastructure. These guidelines include syncing data only in one direction, building a proxy service to ensure backward compatibility, and rolling out to new tenants first in order to gather real user feedback as soon as possible.

You can review part 1 of this post here "The Great Rewrite - How Wix is Preparing to Rewrite 100s of Systems - Part 1"