PostgreSQL slave failed after master failure; unable to promote

https://dba.stackexchange.com/questions/189089

09-10-2020
|

سؤال

I just encountered a problem with a PostgreSQL 9.6 warm standby architecture: by mistake, we stopped the master and destroyed its data, so, some tens of minutes later, when we noticed the problem, we tried to promote the standby server. It stopped after the master failed, and we weren't able to start it, so we also failed to promote it. I don't understand why, but I assume I'm missing something in the replication process which would explain that. The logs at master failure are the following:

2017-10-17 10:51:39.182 CEST,,,32236,,59dcdafa.7dec,2,,2017-10-10    16:36:42 CEST,,0,FATAL,XX000,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,""
2017-10-17 10:51:39.540 CEST,,,32142,,59dcdaf4.7d8e,13,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"invalid resource manager ID 32 at 28/B0163E8",,,,,,,,,""
2017-10-17 10:51:39.642 CEST,,,8532,,59e5c49b.2154,1,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"started streaming WAL from primary at 28/B000000 on timeline 1",,,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,2,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 1 at 28/C000098",,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,3,,2017-10-17 10:51:39 CEST,,0,FATAL,XX000,"could not send end-of-streaming message to primary: no COPY in progress",,,,,,,,,""
2017-10-17 10:51:54.088 CEST,,,32142,,59dcdaf4.7d8e,14,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"record with incorrect prev-link 3F136/36 at 28/C000098",,,,,,,,,""
2017-10-17 10:51:54.113 CEST,,,8607,,59e5c4aa.219f,1,,2017-10-17 10:51:54 CEST,,0,FATAL,XX000,"could not connect to the primary server: 
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting
        TCP/IP connections on port 5433?
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (137.xx.xx.xx) and
accepting
        TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:51:59.133 CEST,,,8610,,59e5c4af.21a2,1,,2017-10-17 10:51:59 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting
        TCP/IP connections on port 5433 ?
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (137.xx.xx.xx) and accepting
        TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10 16:36:36 CEST,,0,FATAL,XX000,"could not restore file « 00000001000000280000000C » from archive: child process exited with exit code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with exit code 1",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,3,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2017-10-17 10:52:03.990 CEST,,,32139,,59dcdaf4.7d8b,4,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

As I understand these lines, the standby server lost the WAL stream to the master and tried to reconnect; on failure, it tried to retrieve the last archive and, failing again, stopped. First problem: in such a case, I thought the standby would stay online, waiting for instructions, instead of shutting itself down. Am I wrong on that? If not, what did I missed, which prompted the standby to shut itself down?

Then, when we tried to restart the slave to promote it, we got these lines:

2017-10-17 10:51:39.182 CEST,,,32236,,59dcdafa.7dec,2,,2017-10-10 16:36:42 CEST,,0,FATAL,XX000,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,""
2017-10-17 10:51:39.540 CEST,,,32142,,59dcdaf4.7d8e,13,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"invalid resource manager ID 32 at 28/B0163E8",,,,,,,,,""
2017-10-17 10:51:39.642 CEST,,,8532,,59e5c49b.2154,1,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"started streaming WAL from primary at 28/B000000 on timeline 1",,,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,2,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 1 at 28/C000098",,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,3,,2017-10-17 10:51:39 CEST,,0,FATAL,XX000,"could not send end-of-streaming message to primary: no COPY in progress",,,,,,,,,""
2017-10-17 10:51:54.088 CEST,,,32142,,59dcdaf4.7d8e,14,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"record with incorrect prev-link 3F136/36 at 28/C000098",,,,,,,,,""
2017-10-17 10:51:54.113 CEST,,,8607,,59e5c4aa.219f,1,,2017-10-17 10:51:54 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting
        TCP/IP connections on port 5433?
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (137.xx.xx.xx) and
accepting
        TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:51:59.133 CEST,,,8610,,59e5c4af.21a2,1,,2017-10-17 10:51:59 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting
        TCP/IP connections on port 5433 ?
could not connect to server: Connection refused
        Is the server running on host « xxxxxx » (137.xx.xx.xx) and accepting
        TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10 16:36:36 CEST,,0,FATAL,XX000,"could not restore file « 00000001000000280000000C » from archive: child process exited with exit code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with exit code 1",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,3,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2017-10-17 10:52:03.990 CEST,,,32139,,59dcdaf4.7d8b,4,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

It seems that the server complains that the WAL stream was abruptly stopped and that, as it fails to reconnect to the master, it can't check if its (the standby) data are fresh, so it refuses to start. If so, is it consistant? I can't see why the standby, which is supposed to preserve database service on master failure, would not start precisely because of the master failure, so I assume I did something wrong, but what? Otherwise, what should I do to allow the standby to restart and be promoted?

المحلول

First, make a full copy of what you have left, before you do something irreversible to it.

Did you really stop the master first, and then destroy its data, or destroy its data while it was still running, and then stop it? I wonder if the last WAL was in the process of getting destroyed while it was still being copied.

You don't need to start a standby in order to promote. Promotion is how you go from a hot standby to a master, without interruption. Since your standby is already interrupted, you would just rename the recovery.conf out of the way, and start it.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى dba.stackexchange