Post Mortem Analysis: PostgreSQL replication failed
Pergunta
We have a PostgreSQL 9.4.9 production server which was replicating to a slave instance, but today I discovered that the instance is out of sync!
The obvious actions would be to recreate the slave, setup metrics and proper alarms for the replication activity, so we can effectively monitor the sync status between master and slave nodes.
But, as the sync failed, I'd like to first diagnose the problem and try to identify the root cause of it, as this would be the second time this happens in about 6 months.
Question: How to diagnose what failed in the replication process so it can be done in a better way this time?
Version specifics:
PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
From slave node, at /var/log/postgresql/postgresql-9.4-main.log
I can see:
2017-07-18 19:43:55 UTC [12816-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:43:55 UTC [12816-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
2017-07-18 19:44:00 UTC [12817-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:00 UTC [12817-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
2017-07-18 19:44:05 UTC [12821-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:05 UTC [12821-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
2017-07-18 19:44:10 UTC [12825-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:10 UTC [12825-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
2017-07-18 19:44:15 UTC [12826-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:15 UTC [12826-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
New Question: How can I see backward on where the actual problem appeared?
Master postgresql.conf
: https://pastebin.com/NJX5ku6m
Slave postgresql.conf
: https://pastebin.com/CUZcyazC
Slave recovery.conf
:
standby_mode = on
primary_conninfo = 'host=10.1.1.65 port=5432 user=replicador password=replicador'
Solução
Based on this I'd say you didn't have enough wal_keep_segments
on the master, weren't using a replication slot, and either had hot_standby_feedback
off or had the connection drop for long enough for the master to remove needed WAL.
And you are presumably not using WAL archiving (archive_command
on master, restore_command
on replica) as fallback.
So the master removed transaction logs the standby needed.
You'll need to re-create the standby. Then either:
Set the standby to use a replication slot and enable
hot_standby_feedback
; orenable
archive_command
andrestore_command
Outras dicas
First thing: look at the logs. You will find warnings, error, fatal, and panic messages.
You can find where your logs are in your postgresql.conf
file.
Look for the logging_collector
setting, if it's on
, you will find your server logs in the directory specified in the log_directory
setting.
If logging_collector
is set to off
, look at the log_destination
setting. If it's syslog
you need to look at your syslog settings to find where your logs are.
If it's stderr
you might find something under /proc/<PID>/fd/2
where <PID>
is the PID of your running PostgreSQL server.
You might find this page of documentation usefull.