Question

We use Postgres 9.6 with a 10+ TB database in a multi-node cluster managed by Patroni. WAL archives and backups are managed by a home-grown tool, pgrsync.

The archive_command was initially set to cp %p /archives/%f. A background job (pgrsync) pushes the archives to S3 periodically. The volume of WAL archives is high (around 200 WAL files/min on average, peaking at 500/min). The cp also adds to the disk I/O bandwidth, which is precious for us in a cloud environment.
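For reference, the relevant configuration looked roughly like this (a sketch; the /archives path comes from the description above, the rest is illustrative):

    # postgresql.conf (sketch of the original setup)
    archive_mode    = on
    # %p is the path of the WAL segment to archive, %f is its file name
    archive_command = 'cp %p /archives/%f'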

We are looking to optimise this in the application. I also noticed that in the pg_xlog folder several files were hard links to other WAL files. (I don't fully understand this part: how could Postgres internally have one WAL file be a link to another? It is unlikely that so many transactions could be repeated exactly after some time.)

Anyway, as an optimisation exercise, we set the archive_command to ln %p /archives/%f. This reduces the disk I/O; we are just adding one more link to the same file. When we are done copying to S3, the link is removed and the OS deletes the actual file once Postgres has also released it. Looks good on paper. Except for one problem: if Postgres writes to the same file (with the same inode) after completing the archive_command, then we are in a mess. Please refer to postgres: WAL ends before end of online backup, where we are seeing random WAL corruption and we don't know whether using ln caused this.
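The hard-link variant is only a small change to the same setting (again a sketch):

    # postgresql.conf (sketch of the ln-based variant we tried)
    archive_command = 'ln %p /archives/%f'   # adds a second name for the same inode, no data copy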

Question: Is it safe to use ln (hardlink) instead of cp in archive_command?


WAL files are 16 MB by default. We have a plan to increase that, but I suspect it will only reduce the number of files; the total size and I/O will likely remain the same. Not likely to help, is it?
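For completeness, a hedged note on how the segment size is changed: in PostgreSQL 9.6 it is a compile-time option, while from PostgreSQL 11 onward it can be chosen per cluster at initdb time (the paths below are placeholders):

    # PostgreSQL 9.6: segment size is fixed when the server is built (sketch)
    ./configure --with-wal-segsize=64        # value in MB

    # PostgreSQL 11+: chosen when the cluster is created (sketch)
    initdb --wal-segsize=64 -D /path/to/new/data/dir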

Backing up WAL from a standby node is a good idea - as is the home-grown consumer for archiving, or even simply scp to another server, to reduce disk write I/O on the Postgres server. Since any node could be standby or master at any time, managing which one actually pushes files could be slightly complex. So I would go either for the archiving consumer or for scp to a non-Postgres server, as sketched below.
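A hedged sketch of archiving straight to another server (archive-host and its /archives path are placeholder names, not part of the original setup):

    # ship the segment to a non-Postgres archive server instead of writing locally (sketch)
    # a non-zero exit status makes PostgreSQL retry the segment later
    archive_command = 'rsync -a %p archive-host:/archives/%f'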

Regarding not invoking pgrsync directly from archive_command: we did this initially. pgrsync would then have to handle each file individually, in series (compress and push to S3), which was too slow to keep up with 500 WAL files (or 500 × 16 MB of WAL bytes) per minute. Copying to another location first lets pgrsync push files in parallel and thus keep up with 500 WAL files per minute. I was trying to avoid the need for another server (to rsync/scp to), but it looks like that is the better approach now.
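A rough sketch of the parallel push that copying to /archives enables (the bucket name, concurrency, and clean-up policy are placeholders, not the real pgrsync logic):

    # compress and upload archived segments with 8 parallel workers,
    # removing the local copy only after a successful upload (sketch)
    cd /archives &&
    ls | xargs -P 8 -I{} sh -c \
        'gzip -c "{}" | aws s3 cp - "s3://example-wal-bucket/wal/{}.gz" && rm "{}"'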


Solution

No, that is not an option and will lead to trouble.

The first and most obvious reason is that the WAL segment exists only once, even if there is a hard link to it in another directory. If anything happens to the disk holding the file, you have no copy and cannot recover.

The second reason is more subtle and will lead to data corruption. When a WAL segment is archived and older than the latest checkpoint, it is not necessarily deleted. If PostgreSQL determines that it needs new WAL segments (max_wal_size or – in older versions – checkpoint_segments has not been exceeded and there is much data modification activity going on), it will recycle the WAL segment rather than deleting it.

If that happens, and PostgreSQL starts writing to the recycled WAL segment before you have managed to copy it to safety, the original WAL entries will be overwritten and lost. If you recover using such a WAL segment, PostgreSQL will notice the unexpected data in the WAL and stop recovering.
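For contrast, the pattern shown in the PostgreSQL documentation makes a genuine copy and refuses to overwrite an existing archive file (adapted here to the /archives path from the question; a durable off-host copy still has to happen before the local file is cleaned up):

    # safer local archiving: a real copy, never overwriting an existing file (sketch)
    archive_command = 'test ! -f /archives/%f && cp %p /archives/%f'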

Licensed under: CC-BY-SA with attribution