PostgreSQL 9.6 log shipping replication stuck on startup
-
07-10-2020 - |
Question
Overview
We have been trying to restore replication between a Master and Replication server for a while, but the individual who had set up has left and we've been working at it for a while. In brief, the issue seems to be that the Replication server starts up and within a few minutes reports postgres: startup process recovering [FILENAME]
and never advances beyond that same FILENAME.
Configuration
Master
pg_hba.conf
local replication replicator trust
host replication replicator 10.1.20.181/32 trust
postgresql.conf
(only active settings related to replication)
wal_level = hot_standby # minimal, archive, hot_standby, or logical
fsync = on # turns forced synchronization on or off
wal_sync_method = fdatasync # the default is the first option
archive_mode = on # allows archiving to be done
archive_command = '/bin/true' # Temporary until replication is working
archive_timeout = 300 # force a logfile segment switch after this
max_wal_senders = 5 # max number of walsender processes
wal_keep_segments = 600 # in logfile segments, 16MB each; 0 disables
max_replication_slots = 3 # max number of replication slots
Replication Server
postgresql.conf
hot_standby = on
recovery.conf
standby_mode = 'on' # enables stand-by (readonly) mode
# Connect to the master postgres server using the replicator user we created.
primary_conninfo = 'host=10.1.20.180 port=5432 user=replicator'
primary_slot_name = 'oh_slot'
# Specifies a trigger file whose presence should cause streaming replication to
# end (i.e., failover).
trigger_file = '/tmp/pg_failover_trigger'
Replication Set up Process
On the Master, we clear out all old WAL files by using pg_controldata
to find the current WAL file and delete anything older
On the Replication Server:
- Stop the database
- Unmount
pg_xlog
directory - Delete contents of data directory
- Run
pg_basebackup -Xs -U replicator -h #{SOURCE_DB} -D /var/lib/postgresql/9.6/main -v
- Copy
recovery.conf
into new data directory - Move the
pg_xlog
directory, create a new one and mount it, then copy the archive files over to the new directory and fix permissions - Start up the database
At this point, I've seen the startup process process a number of WAL files, but it always (and rather quickly -- like within a few minutes) stops and continually reports that it is recovering the same file indefinitely. Checking tables between the two databases confirms that the replication database hasn't added new records that are in the master database.
I get the feeling that I am missing something fundamental and would greatly appreciate questions and guidance.
Updates
Luan Huynh: I had the false impression that there were no errors upon startup, but the log has the following 2017-04-16 06:38:32 EDT [13029-1] FATAL: could not start WAL streaming: ERROR: replication slot "oh_slot" does not exist
; so clearly I did not create the replication slot on the Master as I thought I had.
KookieMonster: I had found advice to set the archive_command to true as a debugging technique. $FILENAME is the last of the files in the pg_xlog
directory as it had been copied during the pg_basebackup
. The Master indeed has additional WAL files that were created after $FILENAME, but the error in the log seems to clearly indicate why the system has not been been moving over the WAL files created after the pg_basebackup
.
I'll verify the replication slot on the master and update.
Solution
Thanks to Luan Huynh, I checked the logs on the Slave and saw that it was not able to start WAL streaming due to the missing replication slot on the Master (I really was certain it had been created! :-) )
Creating the replication slot with pg_create_physical_replication_slot
was all that was needed. Verifying the replication slot with SELECT * FROM pg_replication_slots
showed the slot had been created and WAL streaming was active.
I was able to verify that the latest records on highly active tables on the Master are being successfully replicated on the Slave.