Question

I am trying to set up Postgres and WAL-E.

So far I have managed to install WAL-E and it appears to be working: I can take backups, and WAL logs seem to be sent to S3. Here is what the bucket looks like:

Root directory:

[screenshot: S3 bucket root]

basebackups_005 folder:

[screenshot: basebackups_005 folder]

wal_005 folder:

[screenshot: wal_005 folder]

As a test, I killed the Postgres instance and spun up another one to make sure I was starting with zero data, just Postgres and WAL-E installed.

I added my S3 credentials, fetched the latest backup from S3, and added a recovery.conf file so that Postgres would replay the WAL segments:

#!/usr/bin/env bash
echo "Fetching latest backup..."
rm -rf /var/lib/postgresql/9.3/main
envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-fetch /var/lib/postgresql/9.3/main LATEST
chown -R postgres:postgres /var/lib/postgresql/9.3/main
chmod 0700 /var/lib/postgresql/9.3/main

echo "Creating restore file..."
touch /var/lib/postgresql/9.3/main/recovery.conf
RESTORE_COMMAND='/usr/bin/envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch %f %p'
echo "restore_command = '$RESTORE_COMMAND'" > /var/lib/postgresql/9.3/main/recovery.conf

echo "Backup fetched and restore file created."

Cool. When I restart Postgres, it seems to replay all the WAL files correctly and I'm back in action. But now I want to start pushing WAL files to S3 again from this 2nd instance, just like I did with the 1st instance.

Okay, so I configure WAL logging in postgresql.conf:

#!/usr/bin/env bash
echo "Configuring WAL push to S3..."

echo "wal_level = hot_standby" >> /etc/postgresql/9.3/main/postgresql.conf
echo "archive_mode = yes" >> /etc/postgresql/9.3/main/postgresql.conf
ARCHIVE_COMMAND="archive_command = 'envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-push %p'"
echo ${ARCHIVE_COMMAND} >> /etc/postgresql/9.3/main/postgresql.conf
echo "archive_timeout = 60" >> /etc/postgresql/9.3/main/postgresql.conf

echo "WAL push to S3 configured."

Then I restart Postgres. WAL logs are being sent to S3 again, but they appear to use a different numbering scheme:

New WAL logs after restoring and restarting on the new (2nd) instance:

[screenshot: new WAL segments in wal_005 after the restart]

Notice how the segment names now begin with a 2 instead of a 1. I then inserted a few rows on the 2nd instance and repeated the new-instance/restore/restart process above, to make sure the backups from the 2nd instance were working.

I killed the instance and made a new one, set up the restore with WAL-E just like above, and so on. The 3rd instance is now ready to replay the data created and backed up by the first 2 instances.

Here is the problem. The 3rd instance replays the WAL logs up to the point where the segments numbered with a 2 (the ones created by the 2nd instance) begin, and then stops. It never replays the logs from the 2nd instance, so the 3rd instance is restored to the state at the end of the 1st instance / beginning of the 2nd instance. It seems like WAL-E can't figure out that those additional WAL files are there at the end.

Here are the last few lines of the log from /var/log/postgresql/postgresql-9.3-main.log on the 3rd instance:

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-03-27T19:31:26.410542-00 pid=426
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-03-27T19:31:26.662740-00 pid=426 action=wal-fetch key=s3://pickle-wal/wal_005/000000010000000000000018.lzo prefix= seg=000000010000000000000018 state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://pickle-wal/wal_005/000000010000000000000018.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2016-03-27T19:31:28.268220-00 pid=426
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2016-03-27T19:31:28.369974-00 pid=426 action=wal-fetch key=s3://pickle-wal/wal_005/000000010000000000000018.lzo prefix= seg=000000010000000000000018 state=complete
2016-03-27 19:31:28 UTC LOG:  redo done at 0/170000C8
2016-03-27 19:31:28 UTC LOG:  last completed transaction was at log time 2016-03-27 16:49:09.568954+00
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-03-27T19:31:29.608778-00 pid=442
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-03-27T19:31:29.827687-00 pid=442 action=wal-fetch key=s3://pickle-wal/wal_005/000000010000000000000017.lzo prefix= seg=000000010000000000000017 state=begin
wal_e.blobstore.s3.s3_util INFO     MSG: completed download and decompression
        DETAIL: Downloaded and decompressed "s3://pickle-wal/wal_005/000000010000000000000017.lzo" to "pg_xlog/RECOVERYXLOG"
        STRUCTURED: time=2016-03-27T19:31:31.850344-00 pid=442
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2016-03-27T19:31:31.851721-00 pid=442 action=wal-fetch key=s3://pickle-wal/wal_005/000000010000000000000017.lzo prefix= seg=000000010000000000000017 state=complete
2016-03-27 19:31:32 UTC LOG:  restored log file "000000010000000000000017" from archive
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-03-27T19:31:32.979935-00 pid=465
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-03-27T19:31:33.165372-00 pid=465 action=wal-fetch key=s3://pickle-wal/wal_005/00000002.history.lzo prefix= seg=00000002.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://pickle-wal/wal_005/00000002.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2016-03-27T19:31:34.355366-00 pid=465
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2016-03-27T19:31:34.457207-00 pid=465 action=wal-fetch key=s3://pickle-wal/wal_005/00000002.history.lzo prefix= seg=00000002.history state=complete
2016-03-27 19:31:34 UTC LOG:  selected new timeline ID: 2
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-03-27T19:31:35.207091-00 pid=483
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-03-27T19:31:35.322521-00 pid=483 action=wal-fetch key=s3://pickle-wal/wal_005/00000001.history.lzo prefix= seg=00000001.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://pickle-wal/wal_005/00000001.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2016-03-27T19:31:36.228126-00 pid=483
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2016-03-27T19:31:36.329787-00 pid=483 action=wal-fetch key=s3://pickle-wal/wal_005/00000001.history.lzo prefix= seg=00000001.history state=complete
2016-03-27 19:31:36 UTC LOG:  archive recovery complete
2016-03-27 19:31:36 UTC LOG:  MultiXact member wraparound protections are now enabled
2016-03-27 19:31:36 UTC LOG:  database system is ready to accept connections
2016-03-27 19:31:36 UTC LOG:  autovacuum launcher started

I have tried this multiple times and gotten the same result. It's really frustrating, and I can't seem to find other people having the same sort of problem.

Does anyone have an idea as to what I'm doing wrong?


Solution

The different numbers you see are timeline IDs. When you completed recovery of your database the first time, a new timeline was started with ID 2.
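
The timeline ID is the first eight hex digits of a WAL segment file name, which is why the segments archived by the 2nd instance start with a 2. Roughly (only the first name below actually appears in your log; the second is illustrative):

000000010000000000000017   timeline 1, archived by the 1st instance
000000020000000000000017   timeline 2, archived by the 2nd instance

You can confirm which timeline an instance is on with pg_controldata (data directory path taken from your scripts; on Debian/Ubuntu the binary lives under /usr/lib/postgresql/9.3/bin):

/usr/lib/postgresql/9.3/bin/pg_controldata /var/lib/postgresql/9.3/main | grep TimeLineID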

However, when you restore that backup again, this time into the third database, only WAL from timeline 1 (the timeline that was active when the backup was taken) will be used for recovery by default. To also replay the logs from timeline 2 you will probably need to add recovery_target_timeline = 'latest' to recovery.conf. Alternatively, after completing a recovery you can take a fresh base backup to establish a new recovery baseline.
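
A minimal sketch of that change, reusing the paths from the restore script in the question (adjust if yours differ):

echo "recovery_target_timeline = 'latest'" >> /var/lib/postgresql/9.3/main/recovery.conf

The recovery.conf on the 3rd instance then contains both lines:

restore_command = '/usr/bin/envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch %f %p'
recovery_target_timeline = 'latest'

With that setting, recovery should follow the timeline history files in the archive and keep replaying WAL across the switch to timeline 2, instead of stopping where timeline 1 ends. For the fresh-baseline alternative, push a new base backup from the recovered instance using the same WAL-E environment:

envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-push /var/lib/postgresql/9.3/main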

Licensed under: CC-BY-SA with attribution