Optimising relational databases with zero downtime

These are the metrics that will help you estimate the optimisation success.

You hardly need optimisation that will slightly reduce CPU usage on the database server, given that it will multiply the app response time tenfold.

Once we started collecting additional application-level metrics for UDB, we managed to better understand which tasks make up 80% of the workload and therefore need to be examined first of all, and which of them are rarely used or not used at all.

An in-depth study of the most frequently performed task (retrieving users in accordance with the specific criteria) showed that even though the system retrieves all available user data, in 95% of cases it only needs user_id.

Extracting this case into a separate API method, that retrieves only one column from the table, helped us benefit from using the covering index, and reduce the CPU load by 5%.

Analysing another frequent task we found that even though it was executed with every HTTP request, the retrieved data was very rarely used in reality.

We switched this request to the lazy loading mode.

The main purpose of the metrics in case of optimisation is to better understand your database and to find its dead weight.

It doesn’t make sense to spend a lot of time and effort optimising requests that make up less than 1% of your load profile.

If you don’t have the metrics that let you understand your load profile, you need to collect them.

Such optimisations on the code side helped us eliminate about 15% of CPU usage out of 80% consumed by the database.

Testing the ideasIf you are planning to optimise an overloaded database by changing its structure, you should start by checking your ideas in the test bed: even those of them that look very promising in theory can fail to bring positive results in reality (and sometimes can even lead to a negative result).

I don’t think it’s something that you want to find out after the time-consuming data migration on production.

The closer your test bed configuration is to the prod configuration, the more reliable the result will be.

Another important thing is to provide the right workload for the test bed.

Processing random or identical requests may lead to an unreliable result.

The best option is to use real requests from production.

For UDB, we were logging every 10th API read request (including the properties) as a JSON-log in a file.

After 24 hours we had collected 65 Gb log containing 700m requests.

We didn’t add write requests to the test, as compared to the read requests, they are insignificant and do not influence our workload much.

However, it may be different in your case.

If you want to load your test bed with the write requests, you will have to collect every single request, as skipping some of them may lead to consistency errors in the test bed.

The next step is to reproduce the log in the test bed correctly.

We used 400 workers on PHP, run in our script cloud.

They were reading logs from the fast queue, and consistently handling requests, whilst the queue was filled with the new script at a fully controlled pace.

We were using the х10 speed for testing: that, multiplied by the number of collected requests (every 10th) resulted in the same number of RPS as on production.

Such rates allow you to run 24 hours of prod load within 2.

5 hours in the test bed, with all the loading drops.

This is how our first test in the request log looked after 12 hours, being launched at the x5 speed (50% of the prod load):You can use the same tools for the stress test, increasing the speed (and RPS, as a result) until the database in the test bed starts to lapse.

This will give you a clear understanding of the workload amount your database can manage.

Another important thing to do after testing the new scheme is to run a benchmark test in the original database structure.

If its results differ from the current metrics on production too much, I recommend finding out the reasons in the first place.

Probably your test server was misconfigured, and the testing results are unreliable.

You should also make sure that the new code is working fine.

It doesn’t make much sense to test the performance of the requests that don’t work correctly.

For this purpose you can use integration tests, checking if the new and old API return the same values for the same API requests.

After getting all the results you just need to choose the variants with the best price-value balance and launch the new scheme on production.

Changing the schemeFirst of all, I want to point out that changing the database scheme without DB server downtime is always complicated, expensive and risky.

Thus if you can stop the load while changing the structure — just do it.

Unfortunately, we could not allow this with our UDB case.

The second aspect defining the complexity of changes is the scale.

If all changes required for the tables don’t go beyond mere alterations (for example, adding a couple of new indices or columns), they can be made via standard processes, such as pt-online-schema-change and gh-ost, or by slave alter with the following switching.

We managed to achieve great results by means of vertical sharding, splitting a huge table into ten smaller ones with different columns and indices, as well as the different data format.

Such a transformation cannot be made with typical tools.

What can be done in this case?We have applied the following algorithm:Make both an old and a new scheme with the valid data available at the same time.

Make sure that both schemes write the data, and that the data is consistent in both versions.

We will discuss this point in detail later.

Gradually switch all reading to the new scheme, controlling the load.

Turn off writing to the old scheme and delete it.

Main advantages of this approach:Safety: you can instantly roll it back, up to the very final stage (just switch reading back to the old scheme if something goes wrong);Total load control during data migration;No need for the heavy alteration of a huge table in the old scheme.

But there are some disadvantages too:You have to store both schemes on the disk in the course of migration (this can be a problem if your capacity is small, and the migrated table is very big);A lot of temporary code to support migration, that will be removed afterwards;Reading two schemes simultaneously may lead to internal DB cache overflow; we had some worries that the old and the new versions will start competing for the RAM, which could lead to service degradation (in reality it did create some additional workload, but since migration was only carried out during off-peak hours, it didn’t cause any problems for us).

As the main challenge of this algorithm is point one, let’s discuss it in more detail.

SyncMigrating static data is not much of a problem.

However, what do you do if you can’t just pause all writing while migrating the database?There are several ways to sync the new scheme: migration with playing the log and migration with the idempotent write.

Migrating data snapshot and playing the log of the following changesAny transaction that updates the database needs to be logged in a specific table through the triggers or at the app level, also, you can use replication binlog for this purpose.

Once you have such a log, you can open the transaction and migrate data snapshot, remembering its position in the log.

The only thing left is to start applying the collected log to the new scheme.

This is how the popular MySQL backup tool, Percona XtraBackup, works.

Once the new scheme loads the log up to the current entry, you get to the crucial point: you need to pause writing to the old scheme for a little while, and once you make sure that all available log is applied to the new scheme and all data between the schemes is consistent, you can start write to both sources at the app level.

Here are the main drawbacks of this approach: you will need to store the operations log which may create an additional load, the switching process itself is complicated and also there is a risk of breaking the writing if the schemes prove to be inconsistent for some reason.

Idempotent writingThe main idea of this approach is to start writing to both schemes simultaneously before all changes are fully synchronized and to finish migration of the remaining data later on.

This method is normally used for filling new columns in the large tables.

Synchronous writing can be done either in database triggers or in the source code.

I recommend doing it in the source code, as you will have to make a code that will be writing data to the new scheme anyway, and migration on the code side will give you more control.

An important thing to remember is that unless the migration is over, the new scheme will be inconsistent.

This may lead to a situation where the update of a new table leads to the breach in the database constraint (foreign keys or unique index), whereas from the current scheme perspective, the transaction is absolutely correct and must be completed.

Such situations may lead to the rollback of some good transactions due to migration.

The easiest way to avoid it is to either add an IGNORE modificator to all writing requests in the new scheme or to catch the rollback of such transactions and to launch the version without writing to the new scheme.

Synchronization by means of idempotent write is done in the following way:Start writing to the new scheme in parallel with the old scheme in the compatibility mode (IGNORE).

Run the script that gradually examines the new scheme and fixes the inconsistencies.

After that, the data in both tables must be synchronized.

However this may be disrupted by potential conflicts in paragraph 1.

Run the data consistency checker — open the transaction and read the entries in the new and the old scheme sequentially, comparing their compliance.

If there are any conflicts left, complete synchronization and get back to paragraph 3.

Once the checker has shown that the data is synchronized in both schemes, there should be no further divergence, unless you missed something.

Wait for some time (a week, for example) and run the control check.

If it shows that everything is ok, the task has been completed and you can switch reading.

ResultsAs a result of changes in the data format, we have managed to downscale the main table from 544 Gb to 226 Gb, which reduced the load and increased the amount of useful data in the core memory.

Totalled up, with all the mentioned approaches in place, we have managed to reduce CPU usage of the database server from 80% to 35% at peak time, from the beginning of the project.

Judging by the results of a later load test, with the current growth rate, we can keep using our hardware for at least three more years.

Splitting a table into smaller ones has simplified the process of any further database alteration and significantly sped up the scripts that collect data for the BI team.


. More details

Leave a Reply