Post Mortem Analysis: PostgreSQL replication failed

https://dba.stackexchange.com/questions/180062

08-10-2020
|

Pergunta

We have a PostgreSQL 9.4.9 production server which was replicating to a slave instance, but today I discovered that the instance is out of sync!

The obvious actions would be to recreate the slave, setup metrics and proper alarms for the replication activity, so we can effectively monitor the sync status between master and slave nodes.

But, as the sync failed, I'd like to first diagnose the problem and try to identify the root cause of it, as this would be the second time this happens in about 6 months.

Question: How to diagnose what failed in the replication process so it can be done in a better way this time?

Version specifics:

PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

From slave node, at /var/log/postgresql/postgresql-9.4-main.log I can see:

2017-07-18 19:43:55 UTC [12816-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:43:55 UTC [12816-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:00 UTC [12817-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:00 UTC [12817-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:05 UTC [12821-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:05 UTC [12821-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:10 UTC [12825-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:10 UTC [12825-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:15 UTC [12826-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:15 UTC [12826-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

New Question: How can I see backward on where the actual problem appeared?

Master postgresql.conf: https://pastebin.com/NJX5ku6m

Slave postgresql.conf: https://pastebin.com/CUZcyazC

Slave recovery.conf:

standby_mode = on
primary_conninfo = 'host=10.1.1.65 port=5432 user=replicador password=replicador'

Solução

Based on this I'd say you didn't have enough wal_keep_segments on the master, weren't using a replication slot, and either had hot_standby_feedback off or had the connection drop for long enough for the master to remove needed WAL.

And you are presumably not using WAL archiving (archive_command on master, restore_command on replica) as fallback.

So the master removed transaction logs the standby needed.

You'll need to re-create the standby. Then either:

Set the standby to use a replication slot and enable hot_standby_feedback; or
enable archive_command and restore_command

Outras dicas

First thing: look at the logs. You will find warnings, error, fatal, and panic messages.

You can find where your logs are in your postgresql.conf file.

Look for the logging_collector setting, if it's on, you will find your server logs in the directory specified in the log_directory setting.

If logging_collector is set to off, look at the log_destination setting. If it's syslog you need to look at your syslog settings to find where your logs are. If it's stderr you might find something under /proc/<PID>/fd/2 where <PID> is the PID of your running PostgreSQL server.

You might find this page of documentation usefull.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a dba.stackexchange