Duplicate WAL file error in Barman

Florin Irion

Issue

Barman may occasionally receive a WAL file that it has already archived, which raises a duplicate WAL file error. This is a rare event that Barman does not (yet) handle directly and that requires manual intervention.

IMPORTANT: throughout this document, we assume that the affected server is named angus. Replace it with the actual name of your server.

Usually, the monitoring platform that Barman is integrated with (e.g. Icinga) promptly sends an alert when this issue occurs, such as:

BARMAN CRITICAL - 1 server out of xx have issues * angus FAILED: archiver errors

Alternatively, you can manually check for this error by executing the following command on the Barman server:

barman check angus

In the output, the archiver errors line reports the details of the problem.

Resolution

Once the diagnostic steps below have shown which copy of the WAL file is valid, proceed as follows.

If the archived WAL file is valid, no further action is needed apart from cleaning up the work directory.

If the archived file is the corrupted one, you must replace it by injecting the correct WAL file into the proper WALs directory of the server in Barman. This operation needs to be performed with extreme care.

First, verify the compression algorithm used with the server:

# barman show-server angus | grep compression
compression: gzip
custom_compression_filter: None
custom_decompression_filter: None
network_compression: False

In this case, Barman uses gzip for WAL compression.

Then, enter the work directory and type:

# If you have followed the diagnostic steps, the file now called '00000001000007C9000000CF'
# was previously '00000001000007C9000000CF.20181209T212501Z.duplicate'

# Compress the WAL with the algorithm used by the server (gzip in this example),
# then restore the original name, as the archived file is stored without the .gz extension
gzip 00000001000007C9000000CF
mv 00000001000007C9000000CF.gz 00000001000007C9000000CF
# Copy the compressed duplicate WAL into the correct directory of the archive
cp 00000001000007C9000000CF \
$(barman show-server angus | grep -P '^\twals_directory' | cut -f 2 -d ':')/00000001000007C9/

The last operation injects the compressed WAL file into the archive.
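Because the copy overwrites the archived file, it can be worth keeping a backup of the file being replaced. The following is a minimal sketch of that idea, not a Barman feature: the function name inject_wal and the .replaced suffix are hypothetical, and it assumes gzip compression as in the example above and an archive layout where WAL files are grouped by the first 16 characters of their name (00000001000007C9 in the example).

```shell
# Hypothetical helper: compress a WAL file with gzip and copy it into the
# archive, keeping a backup of any archived file that it replaces.
#   $1: uncompressed WAL file in the work directory
#   $2: wals_directory of the server
#   $3: directory that receives the backup and the compressed copy
inject_wal() {
    wal_file="$1"
    wals_dir="$2"
    backup_dir="$3"
    wal_name=$(basename "$wal_file")
    # Assumption: archived WALs are grouped by the first 16 characters of
    # their name, e.g. 00000001000007C9 for 00000001000007C9000000CF.
    segment_dir="$wals_dir/$(printf '%s' "$wal_name" | cut -c 1-16)"
    target="$segment_dir/$wal_name"
    # Refuse to guess: the destination directory must already exist.
    [ -d "$segment_dir" ] || return 1
    mkdir -p "$backup_dir" || return 1
    # Keep the file that is about to be overwritten.
    if [ -e "$target" ]; then
        cp -p -- "$target" "$backup_dir/$wal_name.replaced" || return 1
    fi
    # Compress to a separate file so that the source stays untouched.
    gzip -c "$wal_file" > "$backup_dir/$wal_name.gz" || return 1
    cp -- "$backup_dir/$wal_name.gz" "$target"
}

# Usage (paths taken from the example output in this article):
# inject_wal ~/wal-investigation/00000001000007C9000000CF \
#     /srv/barman/angus/wals ~/wal-investigation/backup
```

If the server uses a different compression algorithm, the gzip call would have to be replaced accordingly.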

IMPORTANT: Based on the feedback received during support operations, 2ndQuadrant will add a command to Barman itself for the management of this (rare) incident.

Diagnostic Steps

Preliminaries

On the Barman server, prepare a work directory that you will use for investigation during the incident resolution. For example:

mkdir ~/wal-investigation

Make sure that the server where you run the investigation has pg_waldump installed (pg_xlogdump on PostgreSQL 9.6 and earlier), with the same major version as the PostgreSQL server that produced the WAL file to check.

Disable the alert

Before anything else, stop the alerts by moving the duplicate file out of the errors directory (errors_directory in the Barman configuration values).

Locate the full path of errors_directory for the involved server (angus in the example below), with the following command:

# barman show-server angus | grep directory

backup_directory: /srv/barman/angus
barman_lock_directory: /srv/barman
basebackups_directory: /srv/barman/angus/base
data_directory: /srv/postgresql/data
errors_directory: /srv/barman/angus/errors
incoming_wals_directory: /srv/barman/angus/incoming
streaming_wals_directory: /srv/barman/angus/streaming
wals_directory: /srv/barman/angus/wals
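The grep and cut pipelines used in this article keep the space that follows the colon, which is harmless only as long as the result is left unquoted. As a minimal sketch, a small helper can return a single trimmed value instead. The function name barman_value is hypothetical, and the sketch assumes that show-server prints one tab-indented "key: value" pair per line, as in the listing above.

```shell
# Hypothetical helper: print the value of one key from `barman show-server`
# output read on standard input. Assumes indented "key: value" lines.
barman_value() {
    key="$1"
    # Match the exact key after the indentation, drop it together with the
    # colon and the following blanks, then trim trailing blanks.
    sed -n "s/^[[:space:]]*${key}:[[:space:]]*//p" | sed 's/[[:space:]]*$//'
}

# Usage (requires Barman, so it is left commented out here):
# barman show-server angus | barman_value errors_directory
```

Anchoring the key right after the indentation keeps wals_directory from also matching incoming_wals_directory and streaming_wals_directory.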

Barman automatically renames the WAL files that it places in the errors_directory, appending the archival timestamp and the .duplicate suffix.

You can get the list of files with errors for the angus server with the following command:

find $(barman show-server angus | grep -P '^\terrors_directory' | cut -f 2 -d ':') -type f -name '*.duplicate'

A duplicate file name looks like this:

00000001000007C9000000CF.20181209T212501Z.duplicate

This means that a WAL file named 00000001000007C9000000CF had previously been archived.

Move the files into the ~/wal-investigation directory for later analysis; this also stops the alert. This can be automated with:
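When several duplicates are involved, the original WAL name can be derived from each file name with shell parameter expansion. This is a minimal sketch; the function name original_wal_name is hypothetical, and it assumes the NAME.TIMESTAMP.duplicate pattern shown in the example above.

```shell
# Hypothetical helper: strip the ".<timestamp>.duplicate" suffix from the
# name of a file found in the errors directory, leaving the WAL file name.
original_wal_name() {
    name=$(basename "$1")
    name="${name%.duplicate}"    # drop the ".duplicate" suffix
    printf '%s\n' "${name%.*}"   # drop the ".<timestamp>" component
}

# Usage:
# original_wal_name 00000001000007C9000000CF.20181209T212501Z.duplicate
```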

mv $( \
find $( \
barman show-server angus | grep -P '^\terrors_directory' | cut -f 2 -d ':'
) -type f -name '*.duplicate' \
) ~/wal-investigation
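The command above fails with a usage error from mv when no duplicate file is found, because mv is then left with a single argument. A slightly more defensive variant, given here as a sketch, lets find run mv once per file. The function name move_duplicates is hypothetical; the errors directory is passed explicitly so that the function does not depend on Barman being installed.

```shell
# Hypothetical helper: move every *.duplicate file from an errors directory
# into a work directory. It does nothing when no file matches and it
# tolerates unusual characters in paths.
#   $1: errors_directory of the server
#   $2: work directory
move_duplicates() {
    errors_dir="$1"
    work_dir="$2"
    mkdir -p "$work_dir" || return 1
    find "$errors_dir" -type f -name '*.duplicate' \
        -exec mv -- {} "$work_dir"/ \;
}

# Usage (path taken from the example output in this article):
# move_duplicates /srv/barman/angus/errors ~/wal-investigation
```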

Retrieve the archived WAL file

Then, retrieve the archived WAL file from Barman using the get-wal command:

barman get-wal --output-directory ~/wal-investigation angus 00000001000007C9000000CF

The get-wal command will transparently manage decompression of the WAL file, where applicable.

Verify the files with pg_waldump

Now you can run pg_waldump on both the archived file and the duplicates, inside the ~/wal-investigation directory.

We suggest that you save the output to a file for further analysis.

Start with the archived one:

pg_waldump 00000001000007C9000000CF &> 00000001000007C9000000CF.archived.txt
mv 00000001000007C9000000CF 00000001000007C9000000CF.archived

Then with the duplicate one (in case you have multiple files with the same name, you need to repeat the operation for each of them):

mv 00000001000007C9000000CF.20181209T212501Z.duplicate 00000001000007C9000000CF
pg_waldump 00000001000007C9000000CF &> 00000001000007C9000000CF.duplicate.txt

Now search the text files for errors (e.g. pg_waldump: FATAL: error in WAL record at 7C9/CFFFF858: invalid record length at 7C9/CFFFF8C8: wanted 24, got 0).

For a line-by-line comparison, you can run the diff command on them, e.g.:

diff 00000001000007C9000000CF.*.txt
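The search for errors in the saved dumps can be scripted with grep. This is a minimal sketch: the function name dumps_with_errors is hypothetical, and it only looks for the phrase "error in WAL record" seen in the example message above, so other kinds of failure would go unnoticed.

```shell
# Hypothetical helper: print the names of the dump files that contain a
# WAL record error. Returns non-zero when no file contains one.
dumps_with_errors() {
    grep -l 'error in WAL record' "$@"
}

# Usage, inside the work directory:
# dumps_with_errors 00000001000007C9000000CF.*.txt
```

A dump usually ends with such a message at the point where valid WAL records stop, so the position reported in each file is worth comparing, not only the presence of the message.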

Root Cause

When Barman receives a WAL file (in either the streaming or the incoming subdirectory) that has the same name as an already archived file but a different MD5 checksum, it moves the new file into the errors subdirectory for that server.

IMPORTANT: The file that Barman moves into the errors directory is simply the one that arrived last, not necessarily the broken one.

A duplicate WAL file with different content usually has one of these causes:
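The checksum comparison can be reproduced by hand on the two uncompressed files in the work directory. This is only an illustrative sketch, not Barman's own code: the function name same_md5 is hypothetical, and it assumes that md5sum from GNU coreutils is available.

```shell
# Hypothetical helper: succeed when two files have the same MD5 checksum.
# Assumes md5sum (GNU coreutils) is installed.
same_md5() {
    sum_a=$(md5sum < "$1" | cut -d ' ' -f 1)
    sum_b=$(md5sum < "$2" | cut -d ' ' -f 1)
    [ "$sum_a" = "$sum_b" ]
}

# Usage, inside the work directory after the diagnostic steps:
# same_md5 00000001000007C9000000CF.archived 00000001000007C9000000CF \
#     || echo "the two copies differ"
```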

  • abnormal crash of the originating PostgreSQL server, which caused the first WAL file to become corrupted
  • injection of the WAL from the wrong PostgreSQL server (typically by defining the wrong incoming directory in the archive_command script)
