This article is the third part in a series of articles. You can find the previous posts in this series here:
6 Challenges We Faced While Building a Super CI Pipeline - Part I by Shay Sofer
Auto Scaling CI Agents At Wix - Part II by Etamar Joseph Weinberg
In previous installments of this series, we talked about how our build server struggled to keep up with the growth of our company. When our CI system was under load, builds would not start, negatively impacting the productivity of our developers.
We described why we decided to migrate to Buildkite and discussed how we built a scalable CI system on top of Buildkite that improved our system performance.
We showed how we reduced the time builds spend in the queue when the system is under load, from ~40 minutes to just seconds.
We also explored the bits and bytes of the autoscaler.
In the final chapter of our series, it’s time to talk about migration. How did we safely and seamlessly migrate ~10K builds/day to the new system with no downtime?
Photo by Javier Allegue Barros on Unsplash
Our work is now complete. We have built a very promising and scalable system. So let's roll it out to everyone and see how it goes, right? Wrong.
Wix treats its DevEx (Developer Experience) tools as products that require the same level of care, attention, and resilience as our user-facing products. We do not want a “YOLO rollout” of a new CI system.
The importance of planning your rollout phases and knowing the goals for every phase cannot be overstated.
Phase 0: Plan your migration
Our migration was successful due to two key decisions:
Adding and removing repositories from the new system should be a one-click process.
We need the ability to run repositories in parallel mode, where builds run on both the existing and the new system simultaneously.
The first decision is self-explanatory. You opt a repository in and something goes wrong. You need time to investigate, so you opt the repository out and investigate without blocking anyone.
The second decision allows us to run our new CI system in parallel with our legacy system.
Think about that for a moment.
We will have the ability to opt everyone in to parallel mode! As a result, we can see how our new system works at real "Wix scale", without affecting anyone.
The system receives traffic, builds are triggered, agents are scaling in and out, and water is flowing down the pipes.
What can be done to achieve both goals? We created a mechanism so that every repository could be in one of three states: running on the legacy system, running on the new system, or running in parallel mode.
Let’s do a quick breakdown of what it really means.
Given a repository is in:
Legacy mode => A single build will start on the existing CI system.
New System mode => A single build will start on the new system.
Parallel mode => Two builds will start.
A build on the existing CI system.
A build on the new system, with an additional flag that specifies it as a dry-run build.
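The three states above can be sketched as a small lookup from mode to the builds that get triggered. This is a minimal illustration, not Wix's actual implementation: the names `Mode`, `BuildRequest`, and `plan_builds`, and the `"legacy"` / `"new"` labels, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Mode(Enum):
    LEGACY = "legacy"
    NEW = "new"
    PARALLEL = "parallel"


@dataclass(frozen=True)
class BuildRequest:
    system: str    # which CI system runs the build (hypothetical labels)
    dry_run: bool  # True means the build must not cause side effects


def plan_builds(mode: Mode) -> List[BuildRequest]:
    """Return the builds to trigger for a repository in the given mode."""
    if mode is Mode.LEGACY:
        return [BuildRequest("legacy", dry_run=False)]
    if mode is Mode.NEW:
        return [BuildRequest("new", dry_run=False)]
    # Parallel mode: the legacy build stays authoritative, and the
    # build on the new system is flagged as a dry run.
    return [
        BuildRequest("legacy", dry_run=False),
        BuildRequest("new", dry_run=True),
    ]
```

Keeping the decision in one function means a repository's behavior depends only on its mode, which is what makes switching modes cheap.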
Why a `dry-run` build?
Because we must make sure we ignore the side effects of the build that we triggered on the new system.
For example, we should not report GitHub status checks or release an RC based on those builds.
Marking a build as dry-run can be easily done by adding additional metadata or an environment variable to the build, and reading that value when we process the build result. If that value is true, we do not perform any side effects.
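As a rough sketch of that check, the snippet below reads a flag from the build's environment and skips side effects when it is set. The variable name `CI_DRY_RUN` and the `report_status` callback are hypothetical; the article does not say which name or mechanism Wix uses.

```python
from typing import Callable, Mapping

DRY_RUN_ENV = "CI_DRY_RUN"  # hypothetical environment variable name


def is_dry_run(env: Mapping[str, str]) -> bool:
    """A build is a dry run only when the flag is explicitly 'true'."""
    return env.get(DRY_RUN_ENV, "false").strip().lower() == "true"


def process_build_result(
    passed: bool,
    env: Mapping[str, str],
    report_status: Callable[[bool], None],
) -> bool:
    """Run side effects unless the build is a dry run.

    Returns True when side effects were performed, False when skipped.
    """
    if is_dry_run(env):
        return False  # dry run: no status checks, no releases
    report_status(passed)
    return True
```

In a real agent, `env` could be `os.environ`, and `report_status` would wrap whatever reports the status check or cuts the release.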
So the configuration of our repositories is a JSON document that records the mode of each repository.
It can be stored in GitHub, so we can’t “lock ourselves out” and we’re always able to roll back to the previous configuration.
The last thing you want to happen is to opt a repository in to the new system, see that things do not work as expected, and then not have the ability to roll back because the new system broke the configuration as well.
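To make the idea concrete, here is a sketch of reading such a configuration and changing a single repository's mode. The JSON shape (a flat object mapping repository name to mode), the repository names, and the fallback to legacy mode for unlisted repositories are all assumptions for illustration; the article does not specify them.

```python
import json
from typing import Dict

VALID_MODES = {"legacy", "new", "parallel"}
DEFAULT_MODE = "legacy"  # assumption: unlisted repositories stay on legacy


def load_modes(raw: str) -> Dict[str, str]:
    """Parse a JSON object of {repository: mode} and validate each mode."""
    config = json.loads(raw)
    for repo, mode in config.items():
        if mode not in VALID_MODES:
            raise ValueError(f"{repo}: unknown mode {mode!r}")
    return dict(config)


def mode_for(modes: Dict[str, str], repo: str) -> str:
    """Look up a repository's mode, falling back to the default."""
    return modes.get(repo, DEFAULT_MODE)


def with_mode(modes: Dict[str, str], repo: str, mode: str) -> Dict[str, str]:
    """Return a copy with one repository changed; the input is untouched,
    so the previous configuration remains available for rollback."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}")
    updated = dict(modes)
    updated[repo] = mode
    return updated
```

For example, `with_mode(load_modes('{"repo-a": "parallel"}'), "repo-a", "new")` opts the hypothetical `repo-a` in to the new system, and reverting is a matter of going back to the earlier mapping.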
Phase 1: Dogfooding
We are ready to flip the switch for our first repository! Can we just pick a random repository and add it to the new system? Which metrics would we like to collect?
The backend git repositories of Wix are generally divided based on their business domains.
We could opt in the CRM team, the eCommerce team, or any other team, but dogfooding is probably a better idea. Wix uses dogfooding a lot, and the current migration is no different.
We first migrate the repositories of our own teams (DevEx) and then start collecting data.
Let's consider what metrics we want to measure:
How many build agents are running in total?
How many total/busy/idle build agents are there per queue?
How many jobs are waiting in the queue? For how long?
Total build time.
How many jobs failed with invalid exit codes?
In addition, we grouped most of those metrics by queue name. Why? We need to be able to see usage patterns, set up alerts, and observe performance. We might want to raise an alert if the queue time in the main queue is greater than X seconds (where X is a very small number), since those builds are critical.
In contrast, we can be more tolerant when it comes to our background-queue and alert only when the queue is longer than a few minutes.
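A per-queue alert rule of this kind might look like the sketch below. The queue names follow the text, but the threshold values are placeholders, since the article leaves X unspecified, and the names `QueueSample` and `queues_to_alert` are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class QueueSample:
    queue: str
    wait_seconds: float  # how long jobs are currently waiting in this queue


# Placeholder thresholds: strict for the critical queue, lenient for
# background work. The real values are not given in the article.
THRESHOLDS_SECONDS: Dict[str, float] = {
    "main": 30.0,
    "background": 300.0,
}


def queues_to_alert(
    samples: List[QueueSample], thresholds: Dict[str, float]
) -> List[str]:
    """Return the queues whose wait time exceeds their own threshold."""
    alerts = []
    for sample in samples:
        limit = thresholds.get(sample.queue)
        if limit is not None and sample.wait_seconds > limit:
            alerts.append(sample.queue)
    return alerts
```

With these placeholder values, a 45-second wait alerts on the main queue, while a two-minute wait on the background queue does not.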
So after the first phase, only our internal repositories (let’s denote them as `bazel-macros` and `ci`) are running on the new system. All of the other repositories are unaffected.
Phase 2: Run everything in parallel!
Our teams’ repositories are using the new CI system we’ve built. It's time to increase traffic.
Do you remember how we created the ability for repositories to run builds on both build servers? Let’s do that! As we slowly add repositories, we watch how our services react and how agents handle the increased load, and we closely monitor the queue-time metric (actually, we closely monitor all metrics).
We created the dream of every engineer.
We run our real traffic on a secondary system without affecting anyone. This is massive! We have the luxury of testing the load, seeing how everything will actually work with minimal risk, without any side effects.
That’s real load, real traffic where we can test the system side-by-side without the users even knowing that additional builds are running on a new system, as that’s completely hidden from them. (No side effects, remember?)
You can’t always afford to do that. Due to various reasons, you may not be able to run both systems simultaneously. However, if you can, the extra work is definitely worth it!
We slowly move the rest of the repositories to parallel mode, and with that the second phase of our migration is complete.
Phase 3: Our first users
There is no better feeling than seeing the system you worked so hard on in action, with real users.
Make sure you do the following before onboarding your first users:
Communicate in advance what's going to happen. This involves moving someone's cheese. Users will need time to adjust to a new system. Make sure they understand the benefits and gains the new system will bring them. It's more than "just a CI system". It's a tool the developers use every day, multiple times a day.
Meet with team leaders and engineering leads. Make sure they know you are expecting everything to run smoothly, and assure them that if there are any hiccups, rolling back is simple and easy.
Prepare an easy-to-follow onboarding document. Be as clear as possible about what is changing and why it is changing. Create a FAQ page that is updated regularly.
Have a dedicated Slack channel for feedback.
Be proactive in seeking feedback. Schedule a meeting and invite power users. People won't complain if there are no blockers, but there may be other issues that can be addressed.
Gradually move the repositories from parallel mode to run on the new system exclusively. Keep monitoring, keep measuring.
And eventually, everyone will be onboarded to the new system. Since we made sure rolling back is a breeze, we can sleep quite well at night.
Wrapping up
This was the tale of Wix’s migration to a new build server.
In this post, we discussed our migration plan.
We began by migrating our own teams' internal repositories (dogfooding).
Afterward, we gradually moved the rest of Wix's repositories to parallel mode, so that their builds ran on the new system side by side with the old one. We designed a mechanism so that this phase would not create side effects, allowing us to test and monitor things without affecting production.
Finally, repositories began running exclusively on the new system, but not before properly communicating the changes, ensuring the ability to roll back quickly so no one would be blocked, and actively gathering feedback.
This post was written by Shay Sofer