WAL Archiving Best practices

Matthew Gwillam
Matthew Gwillam
  • Updated
  • The core premise of Postgres DR/ PITR scenarios is combining physical base backups with WAL archiving.

  • When archiving WAL logs for potential DR/ PITR scenarios, there are many solutions available, which we will discuss in this article.

  • We highly recommend using a supported tool for a comprehensive physical backup solution. We recommend Barman or pgBackRest for their reliability and features.

  • While the core behavior of base backups and WAL archives is achieved in both Barman & pgBackRest, they each have various additional functionalities, discussed in later sections.

  • See our comparison of backup method features.

WAL archiving options

| Solution | WAL Archiving Durability | Complexity Level | Method |
| 1) Barman backup tool | High | Medium | Push and/or Pull |
| 2) pgBackRest backup tool | High | Medium | Push |
| 3) Native `archive_library` (pg15+) | High | Medium | Push |
| 4) Native `pg_receivewal`, replication slot, and `systemd` scheduler | High | High | Pull |
| 5) Native archive_command using `rsync` | Medium | Easy | Push |
| 6) Native archive_command using `cp` | Low | Easy | Push |
| 7) Native archive_command using `scp` | Low | Easy | Push |
  • Solution column lists the different solutions that can be used for WAL archiving.

  • WAL archiving durability column describes how durable or safe the procedure of archiving the WALs is. Factors affecting this which are elaborated in the article include whether the WAL files are fsyncd to disk, and how networking issues are dealt with.

  • Complexity level describes how difficult the solution is to set up, and maintain a production system on. Even if a method is safer theoretically, if it is too complex for the needs of the user in practice and performed incorrectly, it can become unsafe.

  • Method column describes the broad principle of how the WAL logs are archived.

  • Push method means that postgres is sending WAL files to a location after it has created them.

  • Pull method means that postgres replication is being used to retrieve the WAL data from postgres when it creates it, and store it in a location.

Note - all commands mentioned in this article can have different nuances on different OSs, and should be tested in your testing environment thoroughly before applying to production.

1) Barman

Barman background

Barman is one of our recommended physical backup tools. Its barman cron function is very effective at managing the backup processes.

Barman includes lots of features such as:

Barman usage is documented on its website.

archive_command & barman-wal-archive

  • Barman improves PostgreSQL's archive_command parameter through its barman-wal-archive command.
  • barman-wal-archive guarantees that the WAL file content is fsynced to disk. This avoids production issues created by the fragility of rsync, cp or scp.
  • Additionally, barman-wal-archive reduces the risk of copying a WAL file in to the wrong directory on the Barman host, as the only parameter to be used in the archive_command is the server's ID.

Example command:

archive_command = 'barman-wal-archive barmanserver db01 %p'
  • barmanserver is the Barman hostname
  • db01 is the database server db01 being backed up, specified as [db01] in a barman configuration file, such as /etc/barman.d/db01.conf.
  • %p here represents the pathname of each WAL file that postgres will create ($PGDATA/pg_wal/wal_file_name). For each file created, Postgres re-runs the command with the new file name.

Archiving to multiple servers:

  • Some use cases involve archiving to multiple locations. Barman georedundancy should be used to achieve this.

  • With this, multiple Barman servers can be used, where Barman servers can use another Barman's backups as the primary data source.

  • Do not combine rsync/cp/scp with barman-wal-archive!

We have seen customers experience production issues due to archive_commands such as:

#DO NOT DO THIS!

archive_command = test ! -f /archive_location/%f && cp %p /archive_location/%f && barman-wal-archive barmanserver db01 %p

The use of rsync/cp/scp negates the durability that barman-wal-archive provides.

In this particular case, if the cp process fails due to the lack of durability, barman-wal-archive is also prevented from running due to use of &&.

Although in this example && could be replaced with ; this is still inferior to Barman cascading backups, as the first archival would be using cp and would not be as reliable.

Barman pg_receivewal

Barman also supports streaming WAL archives through pg_receivewal, which minimises the RPO. pg_receivewal is discussed further in later section pg_receivewal background.

2) pgBackRest

pgBackRest background

pgBackRest is also one of our recommended physical backup tools. pgBackRest also uses file and directory level fsync to ensure durability.

pgBackRest includes lots of features, such as:

pgBackRest usage is documented on its website and in EDB docs.

archive_command & pgBackRest

If you are using this solution, you should use pgbackRest archive-push in your archive_command.

Example command:

archive_command = 'pgbackrest --stanza=demo archive-push %p'

This is demonstrated in the quickstart page.

3) archive_library & basic_archive (pg15+, native)

  • If you are on PG15+, and are unable to use Barman or pgBackRest, we recommend use of native archive_library and basic_archive.

  • This new postgres functionality fsyncs the WAL file, ensuring WAL archiving durability.

  • Additionally, the server will avoid recycling or removing these WAL files until the module indicates that the files were successfully archived.

  • Please note that after server crashes, temporary archtemp files should be deleted before starting the server. Please see the documentation notes.

Example setup (EPAS 15, but applies to Postgres 15):

edb=# show shared_preload_libraries;

             shared_preload_libraries              
$libdir/dbms_pipe,$libdir/edb_gen,$libdir/dbms_aq,$libdir/basic_archive

edb=# alter system set shared_preload_libraries='$libdir/dbms_pipe','$libdir/edb_gen','$libdir/dbms_aq','$libdir/basic_archive';

$ /usr/local/enterprisedb/bin/pg_ctl -D /var/lib/edb/as15 -l /var/lib/edb/as15/log/logfile restart

edb=# show archive_library;

edb=# show basic_archive.archive_directory

edb=# alter system set archive_library = 'basic_archive';
edb=# alter system set basic_archive.archive_directory = '/var/lib/edb/archive_test';

edb=# SELECT pg_reload_conf();

edb=# show archive_library;
 basic_archive
 
edb=# show basic_archive.archive_directory
 /var/lib/edb/archive_test
 
edb=# create table test( i int);
edb=# insert into test generate_series(1,1000);
edb=# select pg_switch_wal();
edb=# insert into test generate_series(1,1000);
edb=# select pg_switch_wal();

edb=# select * from pg_stat_archiver;

archived_count|last_archived_wal|last_archived_time| failed_count|last_failed_wal|last_failed_time|stats_reset            
23|000000010000000000000019|09-FEB-23 17:26:38.722981 +00:00|0 ||| 07-FEB-23 16:28:15.290586 +00:00

$ ls -l /var/lib/edb/archive_test

-rw_______ 1 enterprisedb enterprisedb 16777216 Feb  9 17:26 000000010000000000000018
-rw_______ 1 enterprisedb enterprisedb 16777216 Feb  9 17:26 000000010000000000000019
 
$ ps waux | grep archive 
postgres: archiver last was 000000010000000000000019

4) Native pg_receivewal, replication slot, and systemd scheduler

pg_receivewal background

  • If you are on PG14 or older, and you have some constraint of being unable to use Barman or pgBackRest, archive_command can be replaced by the native postgres tool pg_receivewal which streams the WAL logs using replication to an archive location.

  • pg_receivewal WAL archiving is more durable than use of rsync, cp, or scp within archive_command

    • When writing WALs to pg_wal, and sending via replication, postgres uses fsync to ensure that the updates are physically written to disk.
  • Due to this, pg_receivewal lowers the RPO, since pg_receivewal streams the write-ahead log in real time instead of waiting for an entire WAL file to be populated.

  • Synchronous archiving can be used using pg_receivewal --synchronous, synchronous_standby_names, and synchronous_commit, to ensure each transaction is archived and have the RPO at 0, but this creates deteriorated performance. As noted by the documentation, this should not be done if you are running synchronous_commit=remote_apply.

standalone pg_receivewal (without Barman) complexity

  • Due to needing an external scheduling system to maintain it, this is more complicated than using Barman, and would require a professional services engagement to evaluate whether this solution fits your use case, fully test and implement the solution. This is effectively replacing some of the functionality that barman cron offers.

  • Examples of maintenance includes restarting the program when it is offline (e.g if postgres is restarted, pg_receivewal is killed, the archive server host is restarted). Therefore, a custom process e.g a systemd process needs to be written and used.

  • If using pg_receivewal as your only WAL archiving solution, it is highly recommended to use a physical replication slot, otherwise WAL data can be lost. Due to the replication slot, if the pg_receivewal process stops (e.g networking issues, the program failing, the archive OS being down), the replication slot will be inactive, the primary will not recycle the WALs until the process resumes (assuming max_slot_wal_keep_size=1).

  • However, if the primary continually holds the WAL logs, the server will eventually reach 100% disk usage and take the primary down. So in the event of a longer unavailability. A judgement call needs to be made, what is more important:

    • reduce chance of backup data being lost but have the risk of the primary crashing with no disk space.
    • or
    • maintain the primary availability, but increase the risk losing WAL segments (with impact in backups and standbys).
  • Monitoring of the archiving is achieved through:

    • Database replication - select * from pg_stat_replication, select * from pg_replication_slots.
    • The scheduling process - systemctl status systemd_name.

5) archive_command using rsync

If you can not use a backup tool, an archive_command using rsync can be used.

  • rsync first creates a temporary copy of the file and then, only if the copy is successful, it renames it as the target filename.

  • In the future, when customer environments widely support rsync 3.2.4+, we will discuss use of rsync --fsync.

Example command:

archive_command = 'test ! -f enter_backup_location/%f && rsync -a %p enter_backup_location/%f'
  • Here,

    • -a runs rsync in archive mode (recursive, and preserving all but hard links).
    • %p here represents the pathname of each WAL file that postgres will create ($PGDATA/pg_wal/wal_file_name). For each file created, Postgres re-runs the command with the new file name.
    • %f is a postgres parameter representing the name of the WAL segment file.
    • test ! -f enter_backup_location location/%f checks that the filename does not already exist, before writing. If existing, the WAL file will not copy and the command will fail to prevent potential loss of data.
  • test ! -f is implemented in case a problem has occurred, and file with the same name already exists, and needs to be renamed with manual intervention, as the previous file may contain data that is needed.

    • At this point, postgres will retain it's WALs, until the archive location is fixed.
    • If this test is not implemented, in this situation, the previous file would be overwritten.
  • rsync flag --ignore-existing, should not be used, as if a file with the same name already existed, Postgres will skip archival (even if the file has different contents), as the Linux exit status is 0. This would result in silently losing the backup data.

6) archive_command using cp

  • cp is prone to networking issues.

  • Additionally, cp is not ENOSPC safe.

    • If the disk becomes full, and postmaster hits ENOSPC, even after freeing disk space, the cp process will become stuck, leaving a partial file, requiring manual removal. rsync prevents this issue.
  • rsync is recommended over cp as it is more robust during networking issues, and prevents the ENOSPC issue. Therefore, it is less likely to hit a WAL archiving issue.

  • If using cp in archive_command, changing to rsync is an easy change to test in your test environment, with lots of benefits.

7) archive_command using scp

scp is also not recommended due to being inferior to barman-wal-archive and rsync, but also becoming deprecated.

  • If using scp in archive_command, changing to rsync is an easy change to test in your test environment, with lots of benefits.

Related to

Was this article helpful?

0 out of 0 found this helpful