PostgreSQL replication: WAL rotated before segment could be archived?

https://dba.stackexchange.com/questions/280664

11-03-2021

Question

I'm using PostgreSQL 10.7.

Checking my logs this morning, I'm seeing this error a lot:

archive command failed with exit code 1
The failed archive command was: test ! -f /opt/bitnami/postgresql/archive/00000001000005D30000008B && cp pg_wal/00000001000005D30000008B /opt/bitnami/postgresql/archive/00000001000005D30000008B

I'm just now seeing that the command in the documentation isn't recommended, and I'm wondering if anyone can point me in the direction of a better archiving command.

In the meantime, this segment isn't in pg_wal, archive_status or the archive directory. I've noticed some checkpoint warnings in the logs.

We previously had a similar issue with the replication server, and I thought turning on archiving would help, since for some reason it wasn't enabled (it is now). It now seems the issue is elsewhere.

The databases are under heavy load every day from a cron job that restores them to match the production databases; this seems to be when the checkpoint warnings occur.

Is my only option now to create a new base backup, since I'm missing at least one WAL segment?

Here's the relevant part of postgresql.conf

#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -

wal_level = 'hot_standby'
                                        # (change requires restart)
#fsync = on                             # flush data to disk for crash safety
                                        # (turning this off can cause
                                        # unrecoverable data corruption)
#synchronous_commit = on                # synchronization level;
                                        # off, local, remote_write, remote_apply, or on
#wal_sync_method = fsync                # the default is the first option
                                        # supported by the operating system:
                                        #   open_datasync
                                        #   fdatasync (default on Linux)
                                        #   fsync
                                        #   fsync_writethrough
                                        #   open_sync
#full_page_writes = on                  # recover from partial page writes
#wal_compression = off                  # enable compression of full-page writes
#wal_log_hints = off                    # also do full page writes of non-critical updates
                                        # (change requires restart)
#wal_buffers = -1                       # min 32kB, -1 sets based on shared_buffers
                                        # (change requires restart)
#wal_writer_delay = 200ms               # 1-10000 milliseconds
#wal_writer_flush_after = 1MB           # measured in pages, 0 disables

#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000

# - Checkpoints -

#checkpoint_timeout = 5min              # range 30s-1d
max_wal_size = '400MB'
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5     # checkpoint target duration, 0.0 - 1.0
#checkpoint_flush_after = 0             # measured in pages, 0 disables
#checkpoint_warning = 30s               # 0 disables

# - Archiving -

archive_mode = on               # enables archiving; off, on, or always
                                # (change requires restart)
archive_command = 'test ! -f /opt/bitnami/postgresql/archive/%f && cp %p /opt/bitnami/postgresql/archive/%f'      # command to use to archive a logfile segment
                                # placeholders: %p = path of file to archive
                                #               %f = file name only
                                # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
#archive_timeout = 0            # force a logfile segment switch after this
                                # number of seconds; 0 disables

I'd appreciate any guidance for what I can do to prevent this from happening again.

Solution

The file should remain in pg_wal until the archive command succeeds. If archiving never succeeded, the file should still be there. Maybe your filesystem is corrupt and losing files, or someone on your staff is messing around with things they shouldn't be.
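
One way to check this on the server is to look at the status files PostgreSQL keeps under pg_wal/archive_status: a segment still waiting to be archived has a .ready file there, and one that was archived successfully has a .done file. The following is only a minimal sketch, assuming a POSIX shell and read access to the data directory; the function name wal_segment_state and its output words are made up for illustration.

```shell
# Hypothetical helper (not part of PostgreSQL): report where a WAL segment
# stands in the archiving cycle.
# Usage: wal_segment_state PGDATA SEGMENT_NAME
wal_segment_state() {
    pgdata=$1
    seg=$2
    status_dir="$pgdata/pg_wal/archive_status"

    if [ -e "$status_dir/$seg.ready" ]; then
        echo "ready"     # waiting for archive_command to succeed
    elif [ -e "$status_dir/$seg.done" ]; then
        echo "done"      # archived; the segment may be recycled or removed
    elif [ -e "$pgdata/pg_wal/$seg" ]; then
        echo "present"   # segment file exists, no status file yet
    else
        echo "missing"   # neither the segment nor a status file is there
    fi
}
```

For example, wal_segment_state "$PGDATA" 00000001000005D30000008B would classify the segment from the error message. The pg_stat_archiver view reports the last archived and last failed segment from inside the database.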

What could go wrong with the command you are using is this: it reports success, then the system crashes, and when it comes back up the archived file is missing or doesn't have the correct data because it was never synced to disk. But that would not leave archive failure messages in your log.
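
An archive script can narrow that window by copying to a temporary name, flushing the data to disk, and only then renaming the file into place. The following is a sketch under stated assumptions, not a command taken from the documentation: it assumes a dd that supports conv=fsync (GNU coreutils does), and the function name archive_wal and the temporary-file naming are hypothetical.

```shell
# Hypothetical helper: copy one WAL segment into an archive directory and
# report success only after the data has been flushed to disk.
# Usage: archive_wal SOURCE_PATH FILE_NAME ARCHIVE_DIR
archive_wal() {
    src=$1
    name=$2
    dir=$3
    dest="$dir/$name"
    tmp="$dir/.$name.tmp"

    # Never overwrite an archived segment. If an identical copy is already
    # there (for example after a retry), treat that as success.
    if [ -e "$dest" ]; then
        cmp -s "$src" "$dest" && return 0
        echo "archive_wal: $dest exists with different contents" >&2
        return 1
    fi

    # Copy under a temporary name; conv=fsync makes dd flush the file data
    # to disk before it exits (assumes GNU dd or compatible).
    dd if="$src" of="$tmp" bs=1M conv=fsync 2>/dev/null || { rm -f "$tmp"; return 1; }

    # Rename within the same directory so the segment appears all at once.
    mv "$tmp" "$dest" || { rm -f "$tmp"; return 1; }

    # Flush pending writes again so the rename itself reaches disk.
    sync
}
```

A wrapper script that ends with archive_wal "$1" "$2" /opt/bitnami/postgresql/archive could then be used as archive_command = '/path/to/archive_wal.sh %p %f', where the script path is a placeholder. A non-zero exit status makes PostgreSQL keep the segment and retry later.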

Licensed under: CC-BY-SA with attribution