Speed up getting WAL files from Barman with barman-wal-restore

Gabriele Bartolini

This article guides you through the installation of the barman-wal-restore application in your PostgreSQL High Availability cluster using the barman-cli package that is publicly available and maintained by 2ndQuadrant.

PostgreSQL standby servers can then rely on an “infinite” basin of WAL files and pre-fetch batches of them in parallel from Barman. This speeds up the restoration process and makes the disaster recovery solution more resilient as a whole.

The master, the backup and the standby

Before we start, let’s define our playground. We have:

  1. our PostgreSQL primary server, called angus
  2. a server with Barman, called barman, and
  3. a third server with a reliable PostgreSQL standby, called chris

NOTE: for different reasons, unfortunately, we had to rule out the following names bon, brian, malcolm, phil, cliff and obviously axl. ;)

angus is a high workload server and is continuously backed up on barman, while chris is a hot standby server with streaming replication from angus enabled. This is a very simple, robust and cheap business continuity cluster that you can easily create with pure open source PostgreSQL, yet capable of reaching over 99.99% uptime in a year (according to our experience with several customers at 2ndQuadrant).

We are going to instruct chris (the standby) to fetch WAL files from barman, as a fallback method, whenever streaming replication with angus is not working. This makes the entire system more resilient and robust. The most typical examples of these problems are:

  1. temporary network failure between chris and angus;
  2. prolonged downtime for chris which causes the standby to go out of sync with angus.

Technically, we will be configuring the standby server chris to remotely fetch WAL files from barman as part of the restore_command option in the recovery.conf file. We can also take advantage of Barman's parallel pre-fetching of WAL files, which exploits network bandwidth and reduces recovery time of the standby.

Requirements

This scenario requires:

  1. Barman >= 1.6.1 on the barman server
  2. barman-cli package on chris (and, for high availability symmetry, on angus)
  3. Public SSH key of the postgres@chris user in the ~/.ssh/authorized_keys file of the barman@barman user (a procedure known as SSH public key exchange)
  4. You are advised to repeat step 3 for the postgres@angus user: the reason for this is that in a HA cluster, after a switchover, angus might become a standby itself.
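
The key exchange in steps 3 and 4 above can be sketched as follows. This is a minimal sketch, not a prescribed procedure: it assumes the OpenSSH client tools (ssh-keygen, ssh-copy-id) are installed, that the Barman host is reachable under the name barman, and that password login to barman@barman is possible while the key is copied. The key path and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: give the current user passwordless SSH access to barman@barman.
# Assumptions: OpenSSH client tools are installed; the key path below is
# illustrative; the Barman host resolves as "barman".

BARMAN_USER="barman"
BARMAN_HOST="barman"
KEY_FILE="$HOME/.ssh/id_rsa"

# Print the SSH destination used throughout (user@host).
barman_target() {
    printf '%s@%s\n' "$BARMAN_USER" "$BARMAN_HOST"
}

exchange_key_with_barman() {
    # Make sure the .ssh directory exists with safe permissions.
    mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
    # Create a key pair without a passphrase only if none exists yet.
    if [ ! -f "$KEY_FILE" ]; then
        ssh-keygen -t rsa -N '' -f "$KEY_FILE"
    fi
    # Append the public key to ~/.ssh/authorized_keys of barman@barman.
    ssh-copy-id -i "$KEY_FILE.pub" "$(barman_target)"
    # Verify that a non-interactive login now works.
    ssh -o BatchMode=yes "$(barman_target)" true
}

# Usage (as the postgres user): exchange_key_with_barman
```

Run it as postgres on chris first, then as postgres on angus.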

Installation

As the root user on chris (and then on angus), install the barman-cli package from 2ndQuadrant's public repository, using apt or yum.
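
As a rough sketch, the installation step could look like the following. It assumes the 2ndQuadrant repository has already been configured on the host; the function names are hypothetical, and only the package name barman-cli comes from this article.

```shell
#!/bin/sh
# Sketch: install barman-cli with the package manager found on the host.
# Assumption: the 2ndQuadrant repository is already configured.

# Print "apt", "yum" or "none" depending on what is available.
detect_pkg_manager() {
    if command -v apt-get >/dev/null 2>&1; then
        echo apt
    elif command -v yum >/dev/null 2>&1; then
        echo yum
    else
        echo none
    fi
}

install_barman_cli() {
    case "$(detect_pkg_manager)" in
        apt) apt-get install -y barman-cli ;;
        yum) yum install -y barman-cli ;;
        *)   echo "neither apt-get nor yum found" >&2; return 1 ;;
    esac
}

# Usage (as root): install_barman_cli
```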

Then, as the postgres user, verify that it is working:

$ barman-wal-restore -h
usage: barman-wal-restore [-h] [-V] [-U USER] [-s SECONDS] [-p JOBS]
                          [--spool-dir SPOOL_DIR] [-z] [-j] [-c CONFIG] [-t]
                          BARMAN_HOST SERVER_NAME WAL_NAME WAL_DEST

This script will be used as a 'restore_command' based on the get-wal feature
of Barman. A ssh connection will be opened to the Barman host.

positional arguments:
  BARMAN_HOST           The host of the Barman server.
  SERVER_NAME           The server name configured in Barman from which WALs
                        are taken.
  WAL_NAME              The value of the '%f' keyword (according to
                        'restore_command').
  WAL_DEST              The value of the '%p' keyword (according to
                        'restore_command').

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -U USER, --user USER  The user used for the ssh connection to the Barman
                        server. Defaults to 'barman'.
  -s SECONDS, --sleep SECONDS
                        Sleep for SECONDS after a failure of get-wal request.
                        Defaults to 0 (nowait).
  -p JOBS, --parallel JOBS
                        Specifies the number of files to peek and transfer in
                        parallel. Defaults to 0 (disabled).
  --spool-dir SPOOL_DIR
                        Specifies spool directory for WAL files. Defaults to
                        '/var/tmp/walrestore'.
  -z, --gzip            Transfer the WAL files compressed with gzip
  -j, --bzip2           Transfer the WAL files compressed with bzip2
  -c CONFIG, --config CONFIG
                        configuration file on the Barman server
  -t, --test            test both the connection and the configuration of the
                        requested PostgreSQL server in Barman to make sure it
                        is ready to receive WAL files. With this option, the
                        'wal_name' and 'wal_dest' mandatory arguments are
                        ignored.

If you get this output, the script has been installed correctly.

You can test that the SSH connection from the PostgreSQL server to Barman works correctly as follows:

barman-wal-restore --test barman angus DUMMY DUMMY

(where DUMMY is just a placeholder)

Configuration and setup

Locate the recovery.conf file on chris and set the restore_command option:

restore_command = 'barman-wal-restore -U barman -p 8 -j -s 60 barman angus %f %p'

The above example connects to barman as the barman user via SSH and transparently executes the get-wal command for the angus PostgreSQL server backed up in Barman.

The script pre-fetches up to 8 WAL files at a time (-p 8), transfers them compressed with bzip2 (-j) and stores them in a temporary spool folder (by default /var/tmp/walrestore, as reported in the help page, but it can be changed with the --spool-dir option).

If a get-wal request fails, the script sleeps for 60 seconds (-s 60). The help page describes the remaining options, which you can tune to best fit your environment.
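
To make the mapping between options and behaviour explicit, the following shell sketch assembles the same restore_command line from named settings. The variable and function names are hypothetical; only the flags and the positional arguments come from the barman-wal-restore help page.

```shell
#!/bin/sh
# Sketch: assemble the restore_command line from named settings.
# All variable names are illustrative; only the flags come from the
# barman-wal-restore help output.

BARMAN_SSH_USER="barman"   # -U: user for the SSH connection
PARALLEL_JOBS=8            # -p: WAL files to pre-fetch in parallel
SLEEP_ON_FAILURE=60        # -s: seconds to wait after a failed get-wal
BARMAN_HOST="barman"       # host running Barman
SERVER_NAME="angus"        # server name as configured in Barman

restore_command_line() {
    # -j transfers the WAL files compressed with bzip2.
    # %f and %p are expanded by PostgreSQL, not by the shell.
    printf "restore_command = 'barman-wal-restore -U %s -p %s -j -s %s %s %s %%f %%p'\n" \
        "$BARMAN_SSH_USER" "$PARALLEL_JOBS" "$SLEEP_ON_FAILURE" \
        "$BARMAN_HOST" "$SERVER_NAME"
}

restore_command_line
```

The printed line can then be placed in recovery.conf on chris; adjust the values at the top to your environment.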

Alternatively, you can use the man page:

man barman-wal-restore

Further information is also available in the get-wal section of the Barman documentation.

Verification

All you have to do now is restart the standby server on chris and check from the PostgreSQL log that WALs are being fetched from Barman and restored:

Jul 15 15:57:21 chris postgres[30058]: [23-1] LOG: restored log file "00000001000019EA0000008A" from archive

You can also peek in the spool directory (by default /var/tmp/walrestore, as reported in the help page) and verify that the script has been executed.

The Barman logs also contain traces of this activity.
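
The log check can also be scripted. The sketch below counts the "restored log file" messages in a PostgreSQL log. The function name is hypothetical and the location of the log file is an assumption that depends on your logging setup, so the path is passed as an argument.

```shell
#!/bin/sh
# Sketch: count WAL segments that PostgreSQL reports as restored from
# the archive. The log file path is passed as the first argument.
count_restored_wals() {
    grep -c 'restored log file ".*" from archive' "$1"
}

# Example run against the log line shown above, stored in a temporary file.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
Jul 15 15:57:21 chris postgres[30058]: [23-1] LOG: restored log file "00000001000019EA0000008A" from archive
EOF
count_restored_wals "$sample_log"
rm -f "$sample_log"
```

A count that keeps growing while streaming replication is down suggests that the fallback is doing its job.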

Conclusions

The barman-wal-restore script, part of the barman-cli package, is open source software that we have written and that is available under GNU GPL 3. It makes the PostgreSQL cluster more resilient, thanks to the tight cooperation with Barman.

It not only provides a stable fallback method for WAL fetching, but also protects PostgreSQL standby servers from the infamous 255 error that SSH returns in case of network problems. PostgreSQL handles that exit status like a termination by a signal other than SIGTERM, that is, as an exception that causes the recovery process to abort (see the “Archive Recovery Settings” section in the PostgreSQL documentation).
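
The idea behind this protection can be illustrated with a small wrapper that turns SSH's exit status 255 into an ordinary failure. This is only a sketch of the concept under that assumption: it is not how barman-wal-restore is implemented, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: run a command and report SSH's 255 as a plain failure (1).
# Illustration only; not the implementation used by barman-wal-restore.
mask_ssh_255() {
    status=0
    "$@" || status=$?
    if [ "$status" -eq 255 ]; then
        # Connection-level SSH failure: return an ordinary error so that
        # the caller can retry later instead of aborting.
        return 1
    fi
    return "$status"
}

# Hypothetical usage with a hand-written SSH call to Barman:
#   mask_ssh_255 ssh barman@barman barman get-wal angus 00000001000019EA0000008A
```

Every other exit status is passed through unchanged, so genuine errors of the wrapped command remain visible.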
