At Doctrine, we started with a single PostgreSQL database 3 years ago. With more data, we faced the usual challenges of scale and high availability with a relational database. This article shows how you can scale your favorite relational database to several terabytes of data.
A single database
Following a “release early” logic, we first started with a simple architecture: a single production PostgreSQL database (on AWS RDS), used both for the application (our Node.js backend) in real-time and for all non-real-time usage such as data loading and data analysis.
Here I want to talk about issues related to the production environment only, but everything is a mirror of this in development, so we also had a development database following the same logic.
We used this simple architecture when there was less than 10 engineers in the company. It worked because we had little data so any manipulation would use few resources.
However, some problems started to appear. It often happened that data scientists, who wanted to load new data or perform analyses, slowed down significantly the production database, making the product almost unusable. When this happened, we had an emergency procedure which basically consisted in listing all running SQL queries and killing as fast as possible those which we thought were the cause of the heavy resource usage.
As a first workaround, we fine-tuned the loading script to add things like “wait 1 second” in order to slow down data loading/analysis and leave free resources for the application.
However, this “solution” was just a quick fix and could not scale as more engineers joined the company, as we loaded more data and performed more intensive analyses.
Analysis of the problem
To analyze the problem, we thought about the different use cases of our database and identified 2 main use case categories:
These are all requests made by the application backend in real-time when users use the product. For this use case, the important properties are:
- Low latency (we want the autocomplete to run in less than 100 ms for instance)
- High availability (we want more than 99.99% of user requests to succeed)
Non-real-time usage includes different use cases:
- Scripts written by data scientists which load and analyze the new legal data every day
- New scripts written by data scientists which need to run once on a complete data set (heavy one-time computation)
- Experiments of new algorithms on data by data scientists
- Analysis of product usage by product managers
- Non-real time computation of custom recommendations for users, based on their product usage (for instance to send them the newsletter)
- Sales operations (to know which customers to call for instance)
The needs here are different:
- Heavy I/O usage (around 10 x the application)
- High CPU usage (around 15 x the application)
What resources are we fighting for? There are 3 main scarce resources:
- I/O to disk
For each of these resources, we had experienced a situation in which a non-real-time query (data loading script for instance) had depleted the resource, which had made the application too slow compared to the quality of service we want to provide for the user, or even completely unusable (for example in the case of an “out of memory” error).
Why not use priorities?
After analyzing the problem, we thought that we could introduce priorities. If all real-time queries were given priority access to CPU and I/O, it could solve the problem without adding much complexity to the infrastructure. However, I/O priorities are simply not available in Postgres, and CPU priorities are available through an extension that cannot be installed in the PostgreSQL provided by AWS RDS.
Introducing a read-only replica
Our first attempt to implement a solution was to introduce a read-only replica for our database (using PostgreSQL physical replication proposed by AWS RDS).
The idea was that the replica machine would be used only for the application, which means it would not be affected by heavy loads by data scientists for instance.
This approach only partially solved the problem, because the application was still dependant on the master for a lot of queries:
- Writes have to be performed on the master, which involves a significant number of cases in the application (update the users’ search history, session, etc.).
- Some reads also need to be performed on the master because of replication delay: Replication is asynchronous, so when you INSERT some data in the Master DB and the DB tells you that the query has succeeded, you may not see the new data if you query the replica. This is annoying because it means you have to perform a lot of read queries on the master when you want the modification to be visible immediately. In practice this is a pain because developers need to think a lot about the DB infra to know in each case if they have to use the master or replica when performing read queries.
The split in 4 servers
For the reasons explained above, we were not happy with the previous solution. To solve the problem once and for all, I thought that we should actually use specific servers for the real-time usage. But since both the application and data scientists need to write data in some tables, and since we cannot have more than one master, the only solution was to introduce another database. So we basically divided the tables in 2 categories:
- Tables written by data scripts and read by the application (basically the legal content our users can read on Doctrine.fr)
- Tables written by the application (like sessions, user history, user account settings, events, performance measures, etc.)
The idea is then to move all tables in category 2 to a new database (which we call “Webonly”), to which only the application has access. Since data scientists also need to access this data, we created a replica of this database for them. So in the end, each “side” has its own database, and the “other side” has a replica of it:
To facilitate usage by data scientists, we added a last trick to the architecture shown in the above schema: We added foreign tables using PostgreSQL foreign data wrapper in the Main master, linked to tables in Webonly replica, so that data scientists can continue to query the same tables in Main transparently, and the requests are automatically forwarded to the Webonly replica.
To switch to this architecture, we moved all tables for which the application needs write access from the Main to the Webonly DB (see appendix for how we did that).
This time, this new architecture really did improve our reliability massively. It eliminated conflict when using the following resources:
- I/O reads
However, one problem remained: heavy writes on Main DB. When writing data to the Main master, all the bytes written to disk are transferred through the network and then written by the replica to its own disk. Therefore, a massive write to the master hard disk results in an identical massive write in the replica hard disk. In practice, when data scientists loaded a lot of data and saturated the I/O bandwidth of the master with writes, it resulted in the replica I/O bandwidth to be saturated too, which slowed down reads to “Main replica” too.
The switch to Aurora
To solve this remaining problem, we decided to switch to the AWS Aurora PostgreSQL-compatible edition. This is basically a version of PostgreSQL modified by AWS where they replaced the data access layer to use a distributed storage infrastructure they created. Instead of each server using its own disk, it uses a distributed file system that is shared between the master and replicas. This does the magic trick, because it means that writes to the master do not need to be replayed by the replicas. The replicas will simply use the updated data from the shared file system. In practice there still needs to be some propagation of information to guarantee transactional consistency and to “invalidate” the disk cache (the replica needs to know that the disk content has changed). More information about how this technology works can be found in the videos here and there.
This new architecture solved the heavy writes problem. It then looked like we had finally reached a perfect infrastructure and problem was solved. However, we discovered a new issue: When a table is being vacuumed on the master, there is a lock in the replica which prevents it from being read. The lock is “stronger” than on the master, in which it does not prevent reads. Vacuums are rare, but we want 99.99% availability, so we cannot tolerate even a few ~ 30 seconds lock every week.
The Multiquery system
We were able to solve this final problem with a system we created: Multiquery. The idea is very simple: When we want to read data from the Main DB, we send the query to both the master and replica, and we use the first response we receive. This is great because:
- If data scientists are heavily querying the master, the replica will still respond rapidly. The master will respond long after, but the user will get the fast response from the replica.
- If the replica is locked because of a vacuum, the master will give an answer even if the replica is still locked, and we will use the response from the master.
- If any of the two is down, we will be able to have no downtime even before the AWS systems detects the failure and reroutes the query to the working DB.
Here is a drawing showing this updated architecture:
We have also set up some monitoring to see which database “wins the race” between the master and replica:
This shows the proportion of queries for which each DB was faster, during the last 24 hours. The bottom part (blue) is when the master is faster and the top part (green) is when the replica is faster. As can be seen, the replica is often faster, which is expected since the replica is “reserved” for the application, while the master is used by everyone else. But it is not 100% green which means that our system results in faster response times than just always using the replica.
This Multiquery system was the last nail in the coffin of our problems for a long time and we did not make any significant DB infrastructure changes between March 2018 (deployment of Multiquery) and May 2019.
The write-only data optimization
However, with scaling our user base, we had a new problem. It started to be painful to query the data in the “Webonly replica” database. For instance, interactions with the product (called events) are recorded to make statistical analysis of user behavior. This is the kind of data that you typically sent to product analytics software (like Mixpanel or Amplitude). This is a lot of data and there are several pains associated to it:
1. When accessing the data indirectly, ie. by requesting Main DB which forwards queries to Webonly replica though Foreign Data Wrapper (FDW), it is very slow because:
- The data needs to be transferred through the network.
- The query planner is less smart when using FDW, and sometimes it does not use an available index (which an access without FDW would use), making the query several orders of magnitude slower.
2. An alternative is to read directly from the Webonly replica, however:
- It makes it impossible to store the results in another table (since it is a read-only DB).
- It makes it impossible to use joins with data tables in Main.
3. It is impossible to change the structure of the data because the tables are very big, so rewriting entire columns would cause heavy I/O on Webonly Master, recreating one of the problems we wanted to solve in the first place (I/O bandwidth saturation).
So we thought again, and we realized that we could refine our classification as 2 kinds of data by distinguishing 2 sub-cases for data accessed in write mode by the application:
- Read/Write data: data which is written by the application for its own needs, like for example sessions, user history, user folders, account settings. The data is read back, updated and sometimes deleted by the application itself.
- Write-only data: This data is written once and for all by the application, which does not care about it afterwards. This data can for instance be user behavior data and performance measures like loading speed. This is rather big compared to the first category (like at least 10x bigger).
Our problems are with the second type of data. So we decided to move it back to the Main DB. But to avoid going back to our initial problems of resource conflict, we decided to perform all writes through a queue system and execute the queries from an independent daemon:
We did not go back to our initial problems because, if we heavily use the Main master and slow it down massively, it will just make the daemon process requests more slowly, but it won’t slow down the app which will continue to push the queries to the queue at the same speed.
Because the data is never read back by the application, we do not care that it is written asynchronously. This is the key point which makes such a solution possible.
So to summarize, here is what we do with each kind of data:
- Data only read by the application (and created by data scripts) => Stored in Main, read from the application from both master and replica with Multiquery.
- Read/Write working data of the application => Stored in Webonly DB.
- Data written but not read any more by the application => Sent to a Redis queue, then written asynchronously to the Main master DB.
We started with a simple and typical architecture and we gradually scaled it to integrate more data. It was an interesting discovery for us that we don’t necessarily need to change of database engine as we scale. By splitting the data on specific servers and developing the right tools for our needs (Multiquery, background query runner deamon), we were able to build an architecture fitted to our current scale.
- Breaking PostgreSQL at Scale (FOSDEM 2019 video)
- Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
- AWS re:Invent 2018: Deep Dive on Amazon Aurora with PostgreSQL Compatibility
How we migrated tables from one DB to another without downtime
To move a table from one database to another without any downtime, even though the application needs read/write access to it in real-time, we used the following procedure:
- Create a table with the same structure in the new database
- Deploy a new version of the application which continues to read the table from the old database, but writes in both the old and the new database (ie. we change UPDATE/INSERT/DELETE but not SELECTs).
- Dump the content of the old table.
- Load the dumped content in the new table (along the content added since step 2 by the application).
- If the app UPDATEs (or DELETEs) in the table, it is a bit trickier because some data inserted before step 2 might have been updated after 3 in the old table (but not in the new one, since there was no row to update). In that case, we need to compare them and propagate differences (we used foreign data wrappers to be able to JOIN the tables and compare them).
- Deploy a version of the application which now reads from the new table (but continues to write to both).
- Deploy a version of the application which read and writes only from the new table.
* This step could be combined with the previous one if we deploy the new version of the app “all at once”. But in fact we use rolling deployments, ie. we deploy our app to some production servers by batches, so at some point both version n and n+1 can be served to users, and we don’t want inconsistencies caused by a new version of the app which does not write any more in the old table, and an old version which still reads from it.
- Drop the old table.