Question

Environment: PostgreSQL 9.6; a cluster of 3 servers with Patroni and etcd

Scenario: Indexing of tables was started with 16 parallel requests on a 16-CPU machine with 124 GB of RAM, and Postgres on the master was killed by the Linux OOM killer. We understand that spawning so many parallel requests needs more memory, and we have addressed that.
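
For reference, on 9.6 each concurrent CREATE INDEX / REINDEX session can use up to maintenance_work_mem on top of shared_buffers, which is the kind of setting involved; a rough sketch (values are illustrative only, not necessarily the change we made):

    -- Each of the 16 concurrent index builds may use up to maintenance_work_mem,
    -- so roughly shared_buffers + 16 * maintenance_work_mem has to fit in RAM.
    SHOW maintenance_work_mem;

    -- Illustrative value only: lower the per-session cap cluster-wide.
    ALTER SYSTEM SET maintenance_work_mem = '2GB';
    SELECT pg_reload_conf();

(With Patroni, memory parameters are usually managed through Patroni's configuration rather than ALTER SYSTEM.)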

Problem: The concern, however, is that when the master crashed due to OOM, all the replicas crashed too. This is not what we expected and puts the high availability of the cluster in question. We can easily reproduce this, and the behaviour of the replicas is exactly the same every time.

The logs of the master when the crash happened:

2020-12-16 09:54:44 UTC [11619]: [9-1] user=,db=LOG:  checkpointer process (PID 30834) was terminated by signal 9: Killed
2020-12-16 09:54:44 UTC [11619]: [10-1] user=,db=LOG:  terminating any other active server processes
2020-12-16 09:54:44 UTC [30838]: [1-1] user=,db=FATAL:  archive command was terminated by signal 3: Quit
2020-12-16 09:54:44 UTC [16870]: [1-1] user=postgres,db=mydbWARNING:  terminating connection because of crash of another server process
2020-12-16 09:54:44 UTC [16870]: [2-1] user=postgres,db=mydbDETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-12-16 09:54:44 UTC [16870]: [3-1] user=postgres,db=mydbHINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
2020-12-16 09:54:59 UTC [24609]: [1-1] user=postgres,db=mydbFATAL:  the database system is in recovery mode
...
2020-12-16 09:55:04 UTC [22780]: [4-1] user=,db=LOG:  redo done at 52712/BAFFD3C0
2020-12-16 09:55:07 UTC [11619]: [13-1] user=,db=LOG:  database system is ready to accept connections

The logs of the replica, when the crash happened:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
...
2020-12-16 09:54:55 UTC [12843]: [19675-1] user=,db=LOG:  restored log file "00000007000526C70000005D" from archive
...
2020-12-16 11:19:18 UTC [12843]: [38972-1] user=,db=LOG:  restored log file "0000000700052712000000C1" from archive
2020-12-16 11:19:20 UTC [18104]: [1-1] user=,db=LOG:  started streaming WAL from primary at 52712/C2000000 on timeline 7

Question: Will a Postgres crash on the master (due to OOM/corrupted shared memory) necessarily cause a similar crash on the replicas too? Is there a way to circumvent this?


Solution

This is harmless.

That message is sent by the server to the client when a process has crashed and the backend is about to die (quickdie in postgres.c) so that crash recovery can begin.

You are only seeing the message sent by the primary server to the WAL receiver. Note the WARNING: this is not even an error. This only gets logged because log_min_messages on the standby is set to warning or lower.
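
This is easy to confirm on the standby; a minimal check (log_min_messages defaults to warning, which is why the message shows up at all):

    -- On the standby: the WARNING is written to the log only because
    -- log_min_messages admits warning-level messages (the default).
    SHOW log_min_messages;

Raising it above warning would merely hide the message; it does not change the behaviour.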

The standby server keeps running – as you see, it catches up from the archive while the primary server recovers. As soon as it has read all archives and the primary is up again, it will reconnect and continue streaming.
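
If you want to verify that from the standby itself rather than from its log, queries along these lines work on 9.6 (pg_stat_wal_receiver exists since that release):

    -- Still in recovery, i.e. still a standby applying WAL?
    SELECT pg_is_in_recovery();

    -- Last WAL location received over streaming replication
    -- (9.6 naming; pg_last_wal_receive_lsn() in version 10 and later).
    SELECT pg_last_xlog_receive_location();

    -- Once the primary is back up, status should read 'streaming' again.
    SELECT pid, status, conninfo FROM pg_stat_wal_receiver;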
