Question

I have four PostgreSQL 9.5 instances running on EC2 m4.8xlarge instances, each with five PIOPS SSDs in a RAID 0 array and a separate XLOG drive. Until this morning, I never saw more than a minute or two of replication lag, but now replication fails completely on all instances after about 30 minutes.

Restarting Postgres fixes the problem for another half hour.

There's no CPU contention, and iowait is usually below 1%. Blocking reads to the server, on the theory that it might be overwhelmed, does nothing. I can't figure out what the problem is, other than something on Amazon's end.

Can anyone give me some idea of how to troubleshoot this problem? There's nothing in the logs, and replay_location (from pg_stat_replication) simply stops updating until I restart the slaves.
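For context, replication progress can be checked from the primary with a query like the following. This is a sketch for 9.5, where the columns are named `*_location`; from PostgreSQL 10 onward they are `*_lsn` and the diff function is `pg_wal_lsn_diff()`:

```sql
-- Run on the primary: one row per connected standby.
-- pg_xlog_location_diff() returns the gap in bytes between
-- what the primary has sent and what the standby has replayed.
SELECT client_addr,
       state,
       sent_location,
       replay_location,
       pg_xlog_location_diff(sent_location, replay_location) AS replay_lag_bytes
FROM pg_stat_replication;
```

A replay_lag_bytes value that keeps growing while sent_location advances is the same symptom described above: the standby is receiving WAL but not replaying it.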


Solution

TL;DR: Do not disable max_standby_streaming_delay in the config.

For the sake of others searching for answers on this topic, I thought I would answer my own question, since the information available on it via Google is very slim.

Digging around, I managed to discover a book called Troubleshooting PostgreSQL: https://www.packtpub.com/big-data-and-business-intelligence/troubleshooting-postgresql

It comes in ebook format and helped me track down the issue I was experiencing. It describes a tool called pg_view, which displays a wealth of useful real-time metrics on what's happening in your PostgreSQL server.

Running it on one of my servers that had halted replication showed the problem occurring during long-running queries:

db.server.com up 21:34:25 36 cores Linux 4.4.5-15.26.amzn1.x86_64 load average 1.79 0.94 0.47                                                                                                                                               08:58:29
 sys: utime 5.7   stime 0.1   idle 94.3  iowait 0.0   ctxt 580 run 4   block 0
 mem: total 59.0GB free 2.3GB  buffers 41.0MB cached 55.0GB dirty 692KB  limit 29.5GB as 14.7GB left 14.8GB
 /var/lib/pgsql/9.5/data 9.5.2 standby connections: 5 of 500 allocated, 2 active
 type dev     fill   total    left   read  write    await path_size path
 data md127    0.0 499.7GB 382.8GB    0.0    0.1      0.0   116.6GB /var/lib/pgsql/9.5/data
 xlog xvdg     0.0 100.0GB  84.0GB    0.0    0.6     15.7    15.9GB /var/lib/pgsql/9.5/data/pg_xlog
   pid type    s utime stime guest read write   age  uss db   user     query
 26644 backend R 100.2   0.0   0.0  0.0   0.0 03:21 93.4 data postgres WITH query AS ( WITH original_transaction_id AS( SELECT original_transaction_id FROM table WHERE...
 26645 backend R  99.3   0.0   0.0  0.0   0.0 00:01 93.8 data postgres WITH query AS( SELECT COUNT(id) AS total, original_transaction_id FROM table GROUP BY original_tr...

The key there is the age column showing the long-running query. The trick, however, is to show all the processes on the server by pressing the 's' hotkey, which isn't documented on the pg_view GitHub page. This reveals the system processes, including the startup process and the WAL receiver:

   pid type            s utime stime guest read write   age  uss db   user     query
 12169 startup         S   0.0   0.0   0.0  0.0   0.0        1.2               recovering 000000010000070A0000006D...
 12170 checkpointer    S   0.0   0.0   0.0  0.0   1.9        1.2               ...
 12171 writer          S   0.0   0.0   0.0  0.0   0.0        1.0               ...
 12172 stats collector S   0.0   0.0   0.0  0.0   0.0        1.2               ...
 12178 wal receiver    S   0.0   1.0   0.0  0.0   1.1        1.2               streaming 70A/6D998000...

The useful info there is on the startup line, where it says recovering. This is normal: a read-only slave is technically always in recovery, replaying new data as it arrives via the WAL receiver.

The problem is when waiting appears at the end of that line. It tells you that recovery is stuck waiting for a long-running query to release a lock on a relation it needs to update; replication stays halted until the query finishes or the max_standby_streaming_delay limit is reached.
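These recovery conflicts are also tallied per database in the pg_stat_database_conflicts view on the standby, which can confirm the diagnosis independently of pg_view (a sketch; the counters are cumulative since the last stats reset):

```sql
-- Run on the standby: counts of queries cancelled due to
-- conflicts with recovery, broken down by cause.
SELECT datname,
       confl_lock,      -- conflicts due to relation locks (this case)
       confl_snapshot,  -- rows vacuumed away that a query still needed
       confl_bufferpin,
       confl_deadlock
FROM pg_stat_database_conflicts;
```

With max_standby_streaming_delay = -1 these counters may stay at zero, since queries are never cancelled; recovery simply waits forever instead.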

By default, max_standby_streaming_delay is set to 30 seconds. However, it had been disabled on my servers, perhaps for some perceived benefit to replication reliability:

max_standby_streaming_delay = -1

Removing that line from the config file and reloading fixed the issue going forward.
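The reload does not require a restart, since max_standby_streaming_delay is a reloadable setting. A sketch of verifying the change from psql on the standby:

```sql
-- Run on the standby after editing postgresql.conf:
SELECT pg_reload_conf();

-- Confirm the effective value (should report 30s once the -1 line is gone):
SHOW max_standby_streaming_delay;
```

With the default back in place, a conflicting standby query is cancelled after 30 seconds rather than blocking replay indefinitely.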

Another outcome was that we moved all long-running queries to a separate replica used only by these backend processes, to improve long-term reliability and avoid lock conflicts on customer-facing databases.
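To identify which queries to move, something like the following against pg_stat_activity can help (a sketch; on 9.5, active backends expose their current statement in the query column):

```sql
-- Run on the standby: list the longest-running active queries.
SELECT pid,
       now() - query_start AS duration,
       usename,
       left(query, 60) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
```

Anything with a duration approaching max_standby_streaming_delay is a candidate for the dedicated reporting replica.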

Most troubleshooting guides say that replication issues are network-related, but in this case the cause was a misconfiguration combined with poorly formed, long-running SQL queries.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange