When you need to store millions of databases with multiple collections efficiently, what do you do? This is exactly the question we asked ourselves a year and a half after having launched Velo (then called Wix Code). Velo is a serverless code platform for site builders, which lets developers add frontend and backend code to their Wix sites.
Part of Velo is a product called Wix Data. It allows anyone with a Wix website to own a document database. You can use it in multiple ways - via our visual tools to bind content in your collections to forms, tables, and other components on your site, or just access it via API from code running within your site. Think of it as “Database as a Service”, not unlike Google’s Firestore.
Behind the scenes, user data is stored in MongoDB. That works really well, but as we were soon to find out, everything has its limits.
Back in the Wix Code days, every user collection was stored in a dedicated MongoDB collection in one of our server clusters. Each site had (and still has) mapping information that tells us which cluster and which database the site's data lives in. The collections themselves were named according to the following pattern:
{site}@{dev/public}@{collection name}.
Each user effectively had (and in fact still has) two databases: DEV, used during development, and PUBLIC, used by the live site. This naming pattern allowed us to query all of a site's collections simply by issuing a listCollections call with a regex filter.
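To make this concrete, here is a minimal sketch of such a lookup using the Node.js MongoDB driver. The connection string, database name and site id below are placeholders made up for the example; in practice they would come from the per-site mapping information.

import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017"); // placeholder URI

// List every PUBLIC collection belonging to a single site by filtering
// collection names with a regex built from the naming pattern above.
async function listSiteCollections(siteId: string): Promise<string[]> {
  await client.connect();
  const db = client.db("userdata_3"); // placeholder database name

  const collections = await db
    .listCollections({ name: { $regex: `^${siteId}@public@` } })
    .toArray();

  await client.close();
  return collections.map((c) => c.name);
}

listSiteCollections("a1b2c3d4").then(console.log).catch(console.error);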
But then our product took off and we started growing. The first problems appeared at around 100k collections.
Even some low-level operations required us to list all the physical collections, which meant querying all of the user collections with a regex. Naturally, as the number of collections in a single database grew, listing them all became slower and slower. At the time all data was stored in a single cluster, with all collections split across a total of 10 databases. The fix here was quite easy: we needed to split the collections across more databases. And that is what we did.
But at the same time another problem had been manifesting: our backup instance was struggling. It was getting really slow at creating backups and at one point even crashed. See, in MongoDB every collection gets its own file handle. So as the system made backups, it would open as many file handles as we had collections.
This meant we had to:
Increase the process file handle limit, just to stop the OS from killing an instance once the limit was exhausted;
Tune the file handle garbage collection, as Mongo was still having trouble and slowing down once the number of open file handles grew into the tens of thousands.
Soon after the backup issue was resolved, we saw the same problem manifest in other places, like spinning up a new replica, which didn't always succeed on the first or even the second try. Needless to say, neither the engineering team nor the DBAs enjoyed the rush of cortisol induced by simply trying to add more capacity to the system.
The obvious easy solution was to scale out with more instances, which is exactly what we did at the time. But we were lucky: our product kept growing in popularity, which meant we had to keep adding new instances every time we accumulated another 100k collections. And a typical website had way more than one collection.
A year and a half after the launch, we had lots of MongoDB instances that were sitting mostly idle. Some sites are really big and receive lots of visitors, but many others follow a different pattern. A new artist portfolio typically has a collection with around 30 works, which is also similar to the number of items a boat rental company keeps in its boat inventory collection. Given this, and probably more importantly the enthusiasm of our DBAs, who had to keep up with our success by spinning up more and more MongoDB clusters (did I mention that Wix Data is global and each cluster has quite a few instances all over the world?), we had to do something.
We gathered in the DBA room in one of our Tel Aviv offices, and rather quickly a simple and elegant solution sprang into existence. We came home, did the design, ran some tests, and the results were more than encouraging.
Introducing “Multi-tenant Storage”
In the next few months we implemented what we called “multi-tenant storage”.
While MongoDB wasn't great at storing a million collections (it has actually made some progress in this area since then), it is great at storing a few collections with millions of documents each. And that was the basis for our solution: storing data from multiple sites in a single collection. Thus the document primary key became:
{
  site           <- site identification
  database       <- sandbox or live
  collectionName <- the name of the user collection
  _id            <- the id of the item
}
That meant our queries by _id were still fast, we could store multiple tenants in a single collection without overloading MongoDB, and we could easily add indexes for everyone on the common part of the schema. For example, we added an index on our default sort key (item creation date). On top of that we implemented a tenant allocation system, which places new tenants in the least used collections.
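To illustrate, here is a minimal sketch of how reading and indexing against that compound key might look from the Node.js MongoDB driver. The physical database and collection names, and the createdDate field name, are assumptions made for the example, not our actual schema.

import { MongoClient } from "mongodb";

interface StoredItem {
  _id: {
    site: string;           // site identification
    database: string;       // sandbox or live
    collectionName: string; // the name of the user collection
    _id: string;            // the id of the item
  };
  createdDate: Date;        // default sort key (assumed field name)
}

const client = new MongoClient("mongodb://localhost:27017"); // placeholder URI

async function example() {
  await client.connect();
  const storage = client.db("multi_tenant_3").collection<StoredItem>("storage_17");

  // A point read is still an _id lookup served by the built-in _id index,
  // because the whole compound key is the document's _id.
  const item = await storage.findOne({
    _id: {
      site: "a1b2c3d4",
      database: "public",
      collectionName: "boats",
      _id: "item-42",
    },
  });

  // One shared index on the common part of the key plus the default sort key
  // (item creation date) serves every tenant stored in this collection.
  await storage.createIndex({
    "_id.site": 1,
    "_id.database": 1,
    "_id.collectionName": 1,
    createdDate: -1,
  });

  console.log(item);
  await client.close();
}

example().catch(console.error);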
By keeping collections fairly small (<50GB) we are also able to avoid increased read and write latency.
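As a rough sketch of how an allocation policy under that cap could look (the storage_ collection naming and the exact threshold here are assumptions for the example, not our production logic):

import { Db } from "mongodb";

const MAX_COLLECTION_BYTES = 50 * 1024 * 1024 * 1024; // keep collections under ~50GB

// Pick the least used physical collection that still has room for a new tenant.
async function pickCollectionForNewTenant(db: Db): Promise<string> {
  const candidates = await db
    .listCollections({ name: { $regex: "^storage_" } })
    .toArray();

  let best: { name: string; bytes: number } | undefined;
  for (const { name } of candidates) {
    // The $collStats aggregation stage reports per-collection storage statistics.
    const [stats] = await db
      .collection(name)
      .aggregate([{ $collStats: { storageStats: {} } }])
      .toArray();

    const bytes = stats.storageStats.storageSize as number;
    if (bytes < MAX_COLLECTION_BYTES && (!best || bytes < best.bytes)) {
      best = { name, bytes };
    }
  }

  if (!best) {
    throw new Error("All physical collections are at capacity; provision a new one");
  }
  return best.name;
}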
We also kept the ability to relocate tenants to dedicated storage to accommodate larger sites that could benefit from such isolation. For example, we can apply custom indexes or even move a tenant to a dedicated cluster.
This will sound quite obvious, but it's important to know your tools and their strengths and weaknesses. By adjusting our storage to play to MongoDB's strengths, we are now able to host 1 million sites, and many more collections, per MongoDB cluster, with replicas running around the globe (on quite a few r4.2xlarge AWS instances).
This post was written by Giedrius Graževičius