As applications grow in complexity, their schemas evolve to support new business logic. Updating a MySQL schema is hard: it can be a lengthy process that requires monitoring, and if you are not sure what you are doing, it can cause downtime.
Why is updating a MySQL schema so hard?
When updating a table's structure, e.g., adding or removing columns, MySQL supports three online Data Definition Language (DDL) algorithms. Not every change supports all three algorithms, and they differ in how fast they run and how much impact they have on a production environment.
MySQL's online DDL algorithms:
COPY - Creates a new shadow table with the desired alterations, copies all the data into it, and then swaps the tables before dropping the old one.
INPLACE - Tries to avoid rebuilding the table while allowing concurrent Data Manipulation Language (DML) statements. Concurrent DMLs are written to a temporary log, which is applied once the ALTER operation finishes. Note that if the cluster has replicas, there may be replication lag, because the ALTER command and the logged DMLs need to run on them as well.
INSTANT - Does not rebuild the table; it only updates the table's metadata, leaving the data untouched. Concurrent DMLs are allowed. This algorithm is available from MySQL 8 and is the new default.
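The algorithm (and the locking behavior) can also be requested explicitly in the ALTER statement, which makes MySQL fail fast instead of silently falling back to a heavier algorithm. A minimal sketch, on a hypothetical orders table:

```sql
-- Hypothetical table. Requesting ALGORITHM explicitly makes MySQL return
-- an error if the operation does not support it, instead of silently
-- falling back to a heavier algorithm.
ALTER TABLE orders
  ADD COLUMN notes VARCHAR(255),
  ALGORITHM=INSTANT;

-- LOCK can be requested as well; LOCK=NONE insists that concurrent DMLs
-- remain allowed for the whole duration of the ALTER.
ALTER TABLE orders
  ADD INDEX idx_customer (customer_id),
  ALGORITHM=INPLACE, LOCK=NONE;
```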
When updating a schema, we would obviously prefer to run every command with the INSTANT algorithm, as it is the least impactful of the three. However, each algorithm supports only a certain range of changes. For example, starting from version 8.0.29, INSTANT allows adding a column at any position within a table, not just at the end, whereas altering a column's data type requires the COPY algorithm.
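A sketch of both cases, again on the same hypothetical table:

```sql
-- Since MySQL 8.0.29, a column can be added INSTANTly at any position,
-- not only as the last column.
ALTER TABLE orders
  ADD COLUMN status TINYINT NOT NULL DEFAULT 0 AFTER customer_id,
  ALGORITHM=INSTANT;

-- Changing a column's data type rebuilds the table, so INSTANT (and
-- INPLACE) are rejected here; only ALGORITHM=COPY will succeed.
ALTER TABLE orders
  MODIFY COLUMN notes TEXT,
  ALGORITHM=COPY;
```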
Let’s say we need to update the schema of a production table that is relatively small, a couple of MB in size. Adding a new column will have no impact on the application at all. On the other hand, changing a column’s data type can block DMLs to the table for 1-2 seconds, causing a small lag on its replicas.
What if we ran the same changes on a considerably larger table, tens or hundreds of gigabytes, or even terabytes, in size? How long do you think MySQL would block DMLs? What would the replication lag be? How would production cope with these issues?
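These are exactly the symptoms worth watching during a long-running ALTER. A rough sketch of how to observe them with standard MySQL commands (SHOW REPLICA STATUS is the 8.0.22+ syntax; older versions use SHOW SLAVE STATUS):

```sql
-- On a replica: Seconds_Behind_Source reports the current replication lag
-- (MySQL 8.0.22+; older versions report Seconds_Behind_Master via
-- SHOW SLAVE STATUS).
SHOW REPLICA STATUS\G

-- On the primary: DMLs blocked behind the ALTER's metadata lock show up
-- in the process list.
SELECT id, state, time, info
FROM information_schema.processlist
WHERE state = 'Waiting for table metadata lock'
   OR info LIKE 'ALTER TABLE%';
```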
We conducted extensive research on this topic at Wix. Our primary goal was to determine the thresholds for these algorithms and identify the point at which we start seeing degradation in our applications.
We benchmarked the COPY and INPLACE algorithms in a 5-node MySQL cluster, where each node had 4 CPU cores and 32 GB of RAM, and ran a dummy application that worked against the cluster at 1000 RPM. The results showed that as table size grows, both the time DMLs are blocked and the replication lag increase.
How companies handle schema migrations
Each company has its own way of handling schema migrations. Some companies create tables up front with numerous columns of various data types, so that the majority of schema change requests can be handled without altering the table. Other companies have downtime windows in which they can run schema changes on the cluster without disrupting production.
Companies like Wix, which don’t have downtime windows, use open-source tools such as gh-ost and pt-online-schema-change to run migrations without interrupting production.
Wix has an amazing Database Administrator (DBA) team that handles all schema change requests. Each request is analyzed to ensure an optimal structure without causing degradation in production. As a result, some requests involve a lot of back-and-forth and can take a couple of days until the final structure is agreed upon. On average, the DBA team handles around 25 change requests per week, occasionally resulting in a substantial backlog. This directly affects developers’ velocity in releasing new features.
Automations!
We realized that we needed to change our approach. As our services expand, we anticipate a surge in schema change requests, and the only way to support changes at that scale is to automate schema migration. So we developed an automated workflow: it receives the desired schema, identifies the differences between the production schema and the desired schema, and then applies these changes to the cluster.
There are two ways to initiate the automated workflow. One is through a dedicated UI, where developers can view the current table structure and request a change. The other is fully automated and triggered by code changes: when developers merge code into master, the schema is extracted from the code and the automatic migration process begins.
Once the workflow is initiated, the schema modification process involves two essential steps: the Diff Job and the Validation Job.
The Diff Job uses Skeema, an open-source CLI tool, to generate the ALTER command: it compares the two schemas and identifies the specific alterations needed to transition from the current production schema to the desired one. The Diff Job then analyzes these alterations, such as creating new tables, adding indexes, or removing columns, and determines whether the Skeema-generated command can safely run directly on the MySQL cluster or whether a migration is required.
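Skeema works from declarative desired-state .sql files, so the diff is computed between those files and the live schema. A hedged sketch of what this looks like (the table, its columns, and the generated ALTER are hypothetical, purely illustrative of skeema diff output):

```sql
-- Hypothetical desired-state file (e.g. tables/users.sql) checked into
-- the repo; Skeema compares this CREATE against the live production table.
CREATE TABLE users (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  email VARCHAR(255) NOT NULL,
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
);

-- If production's users table lacks created_at, `skeema diff` emits an
-- ALTER along these lines, which the Diff Job then classifies as safe to
-- run directly or as requiring a migration:
-- ALTER TABLE users ADD COLUMN created_at DATETIME NOT NULL
--   DEFAULT CURRENT_TIMESTAMP;
```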
The Validation Job thoroughly reviews the proposed schema modifications, ensuring they comply with the rules established by Wix’s DBA team. These rules serve as best-practice validation for developers, helping to identify problematic aspects of a schema and to reduce back-and-forth discussions.
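The specific rules are internal, but as a hypothetical example of the kind of check such a job might run, here is a query that finds tables missing a primary key (the schema name my_app is made up):

```sql
-- Hypothetical validation rule: every table must have a primary key.
-- Lists base tables in the my_app schema with no PRIMARY KEY constraint.
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
 AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'my_app'
  AND t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL;
```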
Changes that are considered safe and meet the predefined standards are automatically applied to the MySQL database. Any detected violations, or changes requiring further review, are escalated to the DBA team and handled manually. Violations are categorized as Fatal or non-Fatal validation errors: Fatal errors force developers to revise the schema before resubmitting, while non-Fatal errors are documented for future evaluation.
As of today, this process automatically handles the majority of schema change requests. We’ve successfully increased developers’ velocity and lightened the repetitive manual workload for the DBA team. Our future plans involve integrating this process with a third-party schema migration tool, such as gh-ost or pt-online-schema-change, aiming to handle all schema changes at Wix.
Summary
Updating a MySQL schema presents challenges, especially considering its potential impact on production environments. With a growing number of services relying on database support, the DBA team inevitably became a bottleneck, as all changes had to pass through them. Recognizing the need for scalability, we investigated in depth and identified the thresholds at which MySQL's online DDL algorithms (COPY, INPLACE, and INSTANT) start causing performance degradation in our applications.
To address these challenges, we've implemented an automated schema migration workflow built around the open-source tool Skeema, which generates the ALTER commands, combined with validation rules that keep schemas aligned with our standards.
Our research showed that a substantial number of migrations can be executed without the need for additional tools. This process, accessible through a dedicated UI or triggered by code merges, has significantly boosted development speed and reduced manual effort for the DBA team.
Future efforts involve implementing additional tools to support schema migration, particularly for changes with potential production impact, thereby automating the entire schema lifecycle at Wix.
This post was written by Lior Altarescu