Question

I'm running PostgreSQL 11.5. I have a simple table of registers.

create table register(
  id serial primary key,
  name text not null
);

This table is small (~6000 rows), relatively heavy read (100s queries/s), almost no write. I wanted to add a self-referential foreign key:

alter table register
  add column leader_id integer references register(id);

Given the size of the table I assumed this would be a rather mundane change. It ran fine locally and on the staging server. However, when I ran it against production, the table locked up. Logs show ALTER and various SELECT statements taking ~10m to complete.

2020-08-04 00:01:15 UTC:10.0.2.101(59588):postgres@db:[21609]:LOG: could not receive data from client: Connection reset by peer
2020-08-04 00:02:39 UTC:10.0.2.101(59558):postgres@db:[1795]:LOG: could not receive data from client: Connection reset by peer
2020-08-04 00:02:39 UTC:10.0.2.101(59558):postgres@db:[1795]:LOG: unexpected EOF on client connection with an open transaction
2020-08-04 00:02:39 UTC:10.0.2.101(59578):postgres@db:[18376]:LOG: duration: 456741.453 ms execute <unnamed>: alter table register
add column leader_id integer references register(id);

2020-08-04 00:02:39 UTC:10.0.2.101(59578):postgres@db:[18376]:LOG: could not send data to client: Broken pipe
2020-08-04 00:02:39 UTC:10.0.2.101(59578):postgres@db:[18376]:FATAL: connection to client lost
2020-08-04 00:02:39 UTC:10.0.1.227(52906):db@db:[3365]:LOG: duration: 456635.839 ms statement: SELECT register.deleted_at IS NOT NULL AS deleted, register.client_id AS register_client_id
FROM register
WHERE register.id = 123 AND register.account_id = '22781BD1-F37A-4ACE-9A3D-CBF3464AFB43'::uuid
2020-08-04 00:02:39 UTC:10.0.1.227(52906):db@db:[3365]:LOG: could not send data to client: Connection timed out
2020-08-04 00:02:39 UTC:10.0.1.227(52906):db@db:[3365]:FATAL: connection to client lost
2020-08-04 00:02:39 UTC:10.0.1.227(52904):db@db:[3364]:LOG: duration: 456656.956 ms statement: SELECT register.deleted_at IS NOT NULL AS deleted, register.client_id AS register_client_id
FROM register
WHERE register.id = 234 AND register.account_id = 'A6D8395C-63E8-40A8-A0AE-4F19B1DA5509'::uuid
2020-08-04 00:02:39 UTC:10.0.1.227(52904):db@db:[3364]:LOG: could not send data to client: Connection timed out
2020-08-04 00:02:39 UTC:10.0.1.227(52904):db@db:[3364]:FATAL: connection to client lost

What's going on here that is causing the table to lock up? How can I safely work with a self-referential foreign key?

Était-ce utile?

La solution

The quick thing to try is set lock_timeout=1000;, then try your ALTER a bunch of times. Most likely you'll hit the timeout each time, but you might get lucky and that'll save you from having to hunt down the actual fix, which is, I'm sorry to say, not nearly as easy.

Almost certainly what is going on is that the other processes keep their transaction open after performing their SELECT on register. After that first SELECT is performed, the transaction will have an ACCESS SHARE lock on that table until the transaction commits or rolls back.

Unfortunately, ADD COLUMN requires an ACCESS EXCLUSIVE lock - regardless of whether there's a foreign key, self-referential or not, involved. And, per the table of conflicting lock modes, ACCESS EXCLUSIVE does conflict with ACCESS SHARE.

You can poke into which locks specifically are involved with pg_locks, and it's useful to join pg_stat_activity to see the actual queries. What follows is 3 psql sessions, interleaved, so you can see what's going on.

-- psql session 1: what the always-reading application code is doing
testdb=# BEGIN; SELECT * from register where id=10;
BEGIN
 id | name 
----+------
 10 | bar
(1 row)

testdb=# -- now going to psql session #2.


-- psql session 2: look at the locks from the SELECT's still-open transaction.

testdb=# select relation::regclass, case when granted then mode else concat(mode, '*') end as mode, query from pg_locks join pg_stat_activity on pg_locks.pid=pg_stat_activity.pid and pg_locks.pid<>pg_backend_pid() where pg_locks.locktype='relation';
   relation    |      mode       |                query                
---------------+-----------------+-------------------------------------
 register      | AccessShareLock | SELECT * from register where id=10;
 register_pkey | AccessShareLock | SELECT * from register where id=10;
(2 rows)

-- psql session #3: perform the ALTER
testdb=# begin; ALTER TABLE register ADD COLUMN leader_id integer;
BEGIN
-- is now hanging on the ALTER - back to session #2

testdb=# select relation::regclass, case when granted then mode else concat(mode, '*') end as mode, query from pg_locks join pg_stat_activity on pg_locks.pid=pg_stat_activity.pid and pg_locks.pid<>pg_backend_pid() where pg_locks.locktype='relation';
   relation    |         mode         |                       query                        
---------------+----------------------+----------------------------------------------------
 register      | AccessShareLock      | SELECT * from register where id=10;
 register_pkey | AccessShareLock      | SELECT * from register where id=10;
 register      | AccessExclusiveLock* | ALTER TABLE register ADD COLUMN leader_id integer;
(3 rows)


As you can see, the ALTER TABLE is trying, and failing (I add an asterisk when pg_locks.granted is false) to get an AccessExclusiveLock on the register table.

At this point, your application is going to start having problems. Let's open up a 4th psql and try another SELECT:

testdb=#  BEGIN; SELECT * from register where id=12;
BEGIN
--- aaaand... the SELECT is hanging now too! Let's go back to #2 and check the locks again

testdb=# select relation::regclass, case when granted then mode else concat(mode, '*') end as mode, query from pg_locks join pg_stat_activity on pg_locks.pid=pg_stat_activity.pid and pg_locks.pid<>pg_backend_pid() where pg_locks.locktype='relation';
   relation    |         mode         |                       query                        
---------------+----------------------+----------------------------------------------------
 register      | AccessShareLock      | SELECT * from register where id=10;
 register_pkey | AccessShareLock      | SELECT * from register where id=10;
 register      | AccessExclusiveLock* | ALTER TABLE register ADD COLUMN leader_id integer;
 register      | AccessShareLock*     | SELECT * from register where id=12;
(4 rows)



PG, seeing that there's an already-waiting ACCESS EXCLUSIVE lock on the table, will no longer grant additional ACCESS SHARE locks on the table until the ACCESS EXCLUSIVE has finished it's thing and released the lock. So now your SELECTs are piling up and everything is stuck until all of the transactions that had selected from register as of when the ALTER TABLE first tried to grab the lock either commit or rollback.

The fix here, alas, is "don't do that"; the application code needs to issue a COMMIT or ROLLBACK (as long as the SELECT is the only thing that has occurred in the transaction, it doesn't matter which) right after the SELECT, so that the ACCESS SHARE lock is released immediately. You could also split the ADD COLUMN and ADD FOREIGN KEY steps since only the ADD COLUMN requires ACCESS EXCLUSIVE (ADD FOREIGN KEY requires only SHARE ROW EXCLUSIVE, which does not conflict with ACCESS EXCLUSIVE), but I'd be surprised if that helps at all; it's the locking sequence that is causing issues, not the bit of extra work that is involved in doing the ADD FOREIGN KEY.

Licencié sous: CC-BY-SA avec attribution
Non affilié à dba.stackexchange
scroll top