PostgreSQL 9.6 pg_rewind - takes a long time to replicate

https://dba.stackexchange.com/questions/256813

21-02-2021

Question

I use streaming replication, and it works normally. Two servers form an HA pair, and the PostgreSQL database is replicated between them. The database holds about 40 GB of data. When a failover occurs, the slave is successfully promoted to master, and the old master becomes a slave and tries to resynchronize with the new master. However, resynchronizing from the new master takes a long time (pg_rewind succeeds, but copying 38 GB takes about 6 minutes).

Please let me know if there are other ways to save time.

This is the command I ran:

pg_rewind --target-pgdata="targetdir" --source-server="sourceserver"

This is the output:

connected to server
servers diverged at WAL position 35/DD0D2260 on timeline 37
rewinding from last common checkpoint at 35/DC4E94F8 on timeline 37
reading source file list
reading target file list
reading WAL in target
need to copy 39193 MB (total source directory size is 77268 MB)
698400/40134372 kB (1%) copied

Solution

pg_rewind connects to the new master and locates the latest checkpoint it shares with the old master. Then it examines local WAL to find all blocks that have been modified since and copies these blocks from the new master.

So the procedure is slow if:

promotion happened a long time ago, so pg_rewind has to dig through many WAL files, or

many blocks have changed since the promotion.
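One way to gauge both factors before committing to a rewind is a dry run. The sketch below wraps pg_rewind's --dry-run and --progress options in a small helper; the function name rewind_preview and the example path and connection string are placeholders, not values from the question. With --progress, pg_rewind should report the "need to copy ... MB" figure seen in the output above without modifying the target directory.

```shell
#!/bin/sh
# Sketch: preview how much data pg_rewind would copy, without changing the target.
# rewind_preview is a hypothetical helper name; both arguments are placeholders.
rewind_preview() {
    target_dir="$1"        # data directory of the old master (server must be stopped)
    source_conninfo="$2"   # libpq connection string for the new master
    pg_rewind --dry-run --progress \
        --target-pgdata="$target_dir" \
        --source-server="$source_conninfo"
}

# Example call (hypothetical path and host):
# rewind_preview /var/lib/pgsql/9.6/data "host=newmaster port=5432 user=postgres dbname=postgres"
```

If the reported amount is already large, the time will be dominated by copying, and a dry run at least makes that visible up front.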

The pg_rewind output shows that about half of the cluster (39193 MB out of 77268 MB) has to be copied, which suggests that roughly half of the blocks in the database cluster were modified after the slave was promoted. So the problem is most likely that you waited too long after promotion. Run immediately after failover, pg_rewind would be much faster.
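The arithmetic behind that reading can be checked with the figures from the question. This is only a rough sketch: the 360 seconds stands in for the "6 minutes" the asker reported, and rewind_stats is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: share of the source that must be copied, and the average copy rate.
# rewind_stats is a hypothetical helper; arguments are MB to copy, total MB, seconds.
rewind_stats() {
    awk -v c="$1" -v t="$2" -v s="$3" 'BEGIN {
        printf "share of source to copy: %.1f%%\n", 100 * c / t
        printf "average copy rate: %.1f MB/s\n", c / s
    }'
}

# 39193 MB to copy, 77268 MB total (from the pg_rewind output), about 6 minutes.
rewind_stats 39193 77268 360
# share of source to copy: 50.7%
# average copy rate: 108.9 MB/s
```

At that rate the copy itself accounts for the elapsed time, so the practical lever is reducing how much has diverged rather than speeding up pg_rewind.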

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange