Question

We are using AWS RDS and we upgraded from MySQL 5.6.41 to MySQL 8.0.21 last December.

Since mid-January, we noticed some problems causing our replica to be stuck after running migrations adding foreign keys to new tables referencing our main table (in the sense that it is queried for almost all API calls).

Typically, we can have the following Django migration:

    migrations.AddField(
        model_name='artifactpresenter',
        name='event',
        field=models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='pittsburgh.Event'),
    ),

This would translate to some kind of ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY ... REFERENCES ... DDL query (verified by checking the SQL generated by the Django ORM).

When running this migration on production, it would execute fine on our master that mostly processes write requests and a small portion of our read queries. However, our replica will suddenly hang on the referenced table (pittsburgh.Event in the above migration).

More specifically, here is what we can see from RDS Performance Insights:

enter image description here

As you can see, all the read requests to this pittsburgh.Event model would just hang on the MDL_context lock (a.k.a., the table metadata lock).

When this happens, our entire production pretty much goes down and our only solution at the moment is to restart the replica.

We first thought this was coming from multi-threaded replication that we inadvertently enabled. However, the issue persisted after we disabled multithreaded replication.

After digging more, I was actually able to reproduce the issue with 3 sessions:

  • In one session, I start a transaction and read something from the pittsburgh.Event table, but I do not commit or rollback the transaction yet: I just leave it opened
  • In another session, I run a CREATE TABLE statement adding a foreign key constraint referencing pittsburgh.Event (autocommit enabled): this statement will just hang, waiting to acquire the metadata lock on the pittsburgh.Event, but it is already acquired by the transaction in the first session
  • In a third session, I try to read something from the pittsburgh.Event table: this one will also hang because MySQL will apparently queue the query until the second session can acquire the lock on pittsburgh.Event (write prioritized over reads)

I tried to reproduce on MySQL 5.6, 5.7, and 8.0:

  • on MySQL 5.6 and 5.7, the second and third session never hangs, even if the first session did not commit yet
  • on MySQL 8.0, what I described above occurs

Clearly, this seems to be a change in how MySQL 8.0 implements metadata locks. When digging through the MySQL change log, I found this in the MySQL 8.0.3 release changelog

MySQL now extends metadata locks, as necessary, to tables that are related by a foreign key constraint. Extending metadata locks prevents conflicting DML and DDL operations from executing concurrently on related tables. This feature also enables updates to foreign key metadata when a parent table is modified. Previously, foreign key metadata, which is owned by the child table, could not be updated safely.

If a table is locked explicitly with LOCK TABLES, any tables related by a foreign key constraint are now opened and locked implicitly. For foreign key checks, a shared read-only lock (LOCK TABLES READ) is taken on related tables. For cascading updates, a shared-nothing write lock (LOCK TABLES WRITE) is taken on related tables that are involved in the operation.

If LOCK TABLES is active for a table in a foreign key relationship, ALTER TABLE ... RENAME is not permitted for that table. This is a temporary restriction, lifted in MySQL 8.0.4 by the patch for Bug #26647340.

I'm not 100% certain this is the cause of our problem, but the description seems to clearly match what I'm observing. If this is the case, it seems that this behavior is apparently expected for MySQL 8.0.

From there, I'm not sure what to do and how to handle this problem. We can't really downgrade to MySQL 5.7 (RDS only allows you to upgrade, and manually trying to re-build the database at an older version using some backup would incur way too much downtime and sounds quite risky):

  • Are there some configuration options we can use to disable this new behavior? We have been running on MySQL 5.6 for years without issues, and we run migrations infrequently: it does not appear as if this new mechanism is solving any of our problem (in fact, it is causing us a lot of troubles)
  • If we can't disable, what are some appropriate way to mitigate this? I saw some people playing with the lock timeout, but it does not seem to solve anything and I'm not even clear whether it would work on a replica
  • Some recommendations are to track down the long-running transaction that can cause this hang. But how to prevent this to be reintroduced later on with a mid-size engineering team where you can't oversee all code changes? Again, we used 5.6 for years without having any issue with this, so it's suddenly a lot of burdens to oversee any database query change or implement some monitoring to track this down
  • We are also planning to improve our database requests routing to detect some problems with the replica and stop routing requests there. That would prevent our application to go down completely for too long and will improve our overall reliability, but it's again just going around the problem without solving it
Was it helpful?

Solution

You have two alternatives to resolve this issue:

  • COMMIT your transactions promptly. Even if the transaction only does read-only SELECT queries, it holds a metadata lock until the transaction is committed. This blocks DDL queries including the "indirect" DDL when a CREATE TABLE makes reference to the Event table.

    Note that there are other similar cases, for example CREATE TRIGGER for a given table doesn't change that table, but still acquires a metadata lock. This case of metadata lock has been implemented in several past MySQL 5.x versions.

    Keep your transactions as short as possible.

  • DROP the foreign key constraints. The locking caused by foreign keys often hinders concurrent access to the tables. Not only for the metadata locking issue that you're dealing with today, but another example is when you UPDATE a child table, it places a shared row lock on the parent table's row referenced by the child rows you updated. This blocks statements that require exclusive locks, such as UPDATE/DELETE on those rows in the parent table.

    It's very common for sites that use MySQL to eventually give up using foreign key constraints, even though they are so useful for ensuring data integrity. They cause too many problems when you have to handle concurrent queries.

    This does mean that the database will have no data integrity enforcement. That's unfortunate, and I feel bad for giving this suggestion because I am in favor of data integrity, and there's just no reliable substitute for enforcing data integrity in the database.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top