It is possible that, due to unexpected behavior, Barman receives for a second time a WAL file that has already been archived and raises a duplicate WAL file error. It is a rare event that is not (yet) directly managed by Barman and that requires manual intervention.
IMPORTANT: throughout this document, we will assume that the name of the involved server is angus. Replace it with the actual name of your server.
Usually, the monitoring platform that Barman is integrated with (e.g. Icinga) promptly sends an alert when this issue is present, with:
BARMAN CRITICAL - 1 server out of xx have issues * angus FAILED: archiver errors
Alternatively, you can manually check for this error by executing the following command on the Barman server:
barman check angus
In the output, the archiver errors section will display the required information on the matter.
You must inject the correct WAL file in the proper WALs directory of the server in Barman. This operation needs to be performed with extreme care.
If the archived WAL file is intact, you do not need to do anything (apart from cleaning up the work directory).
Otherwise, if the archived file is the one with problems, you need to replace it with the correct one.
First, verify the compression algorithm used with the server:
# barman show-server angus | grep compression
compression: gzip
custom_compression_filter: None
custom_decompression_filter: None
network_compression: False
In this case, Barman uses gzip for WAL compression.
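The commands in this procedure assume gzip. As a minimal sketch for servers configured differently, the helper below picks the compressor from the compression value shown above. The function name compress_like_barman is hypothetical and not part of Barman, and only gzip, bzip2 and None are handled here; any other value (including custom filters) is refused rather than guessed.

```shell
# Hypothetical helper (not part of Barman): compress a WAL file in place so
# that it matches the 'compression' value reported by 'barman show-server'.
# The file keeps its original name, as in the manual gzip + mv steps.
# Only gzip, bzip2 and None are handled; anything else is refused.
compress_like_barman() (
    method=$1
    wal=$2
    case "$method" in
        gzip)  gzip -c "$wal" > "$wal.tmp" && mv "$wal.tmp" "$wal" ;;
        bzip2) bzip2 -c "$wal" > "$wal.tmp" && mv "$wal.tmp" "$wal" ;;
        None|none|"") : ;;   # no compression configured: leave the file as is
        *) echo "unsupported compression: $method" >&2; exit 1 ;;
    esac
)
```

For the example server this would be called as: compress_like_barman gzip 00000001000007C9000000CF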
Then, enter the work directory and type:
# If you have followed the diagnostic steps, the file now called '00000001000007C9000000CF'
# was previously '00000001000007C9000000CF.20181209T212501Z.duplicate'
gzip 00000001000007C9000000CF
mv 00000001000007C9000000CF.gz 00000001000007C9000000CF
# Copy the duplicate WAL in the correct directory
cp 00000001000007C9000000CF \
$(barman show-server angus | grep -P '^\twals_directory' | cut -f 2 -d ':')/00000001000007C9/
The last operation injects the compressed WAL file into the archive.
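In the example above, the destination subdirectory (00000001000007C9) is the first 16 characters of the WAL file name. The sketch below derives it instead of typing it by hand, which may help avoid copying the file into the wrong subdirectory. Both function names are hypothetical, and the 16-character rule is taken from the example in this document.

```shell
# Hypothetical helper: print the archive subdirectory for a WAL segment,
# i.e. the first 16 characters of its 24-character file name.
wal_subdir() {
    printf '%.16s\n' "$1"
}

# Hypothetical helper: build the destination path from the wals_directory
# value (first argument) and the WAL file name (second argument).
wal_destination() {
    printf '%s/%s/\n' "$1" "$(wal_subdir "$2")"
}
```

For example: cp 00000001000007C9000000CF "$(wal_destination /srv/barman/angus/wals 00000001000007C9000000CF)"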
IMPORTANT: Based on the feedback received during support operations, 2ndQuadrant will add a command to Barman itself for the management of this (rare) incident.
On the Barman server, prepare a work directory that you will use for investigation during the incident resolution. For example:
mkdir ~/wal-investigation
Make sure that the server where you run the investigation has pg_waldump (or pg_xlogdump, as the tool was called before PostgreSQL 10) installed, with the same major version as the PostgreSQL server that produced the WAL to check.
Before anything else, you should stop the alerts, simply by moving the duplicate file out of the errors directory (errors_directory in the Barman configuration values).
Locate the full path of errors_directory for the involved server (angus in the example below), with the following command:
# barman show-server angus | grep directory
backup_directory: /srv/barman/angus
barman_lock_directory: /srv/barman
basebackups_directory: /srv/barman/angus/base
data_directory: /srv/postgresql/data
errors_directory: /srv/barman/angus/errors
incoming_wals_directory: /srv/barman/angus/incoming
streaming_wals_directory: /srv/barman/angus/streaming
wals_directory: /srv/barman/angus/wals
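Since grep directory matches several settings at once, a small filter that returns exactly one value can be handy in scripts. The sketch below is a hypothetical helper (show_server_value is not a Barman command); it assumes the "name: value" layout shown above, with optional leading whitespace.

```shell
# Hypothetical helper: read 'barman show-server' output on stdin and print
# the value of a single setting, without the name and leading whitespace.
# The match is anchored at the start of the line, so 'wals_directory' does
# not also match 'incoming_wals_directory'.
show_server_value() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p"
}
```

Usage: barman show-server angus | show_server_value errors_directory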
WAL files in the errors_directory are automatically renamed by Barman, adding the timestamp of archive and the .duplicate suffix.
You can get the list of files with errors for the angus server with the following command:
find $(barman show-server angus | grep -P '^\terrors_directory' | cut -f 2 -d ':') -type f -name '*.duplicate'
A duplicate file name looks like this:
00000001000007C9000000CF.20181209T212501Z.duplicate
This means that a WAL file named 00000001000007C9000000CF had previously been archived.
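When several duplicates are present, it can help to split each file name into the original WAL name and the timestamp. The following sketch uses plain shell parameter expansion; both function names are hypothetical, and the name layout (WAL name, timestamp, .duplicate suffix, separated by dots) is the one described above.

```shell
# Hypothetical helpers: split a name such as
# 00000001000007C9000000CF.20181209T212501Z.duplicate into its parts.
duplicate_wal_name() (
    base=${1##*/}                 # drop any leading directory
    printf '%s\n' "${base%%.*}"   # keep everything before the first dot
)

duplicate_timestamp() (
    base=${1##*/}
    rest=${base#*.}               # e.g. 20181209T212501Z.duplicate
    printf '%s\n' "${rest%%.*}"   # keep everything before the next dot
)
```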
Move the files into the ~/wal-investigation directory for later analysis; this stops the alert. This can be automated with:
mv $( \
find $( \
barman show-server angus | grep -P '^\terrors_directory' | cut -f 2 -d ':'
) -type f -name '*.duplicate' \
) ~/wal-investigation
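The one-liner above fails if no duplicate file is found (mv is left without a source) and splits paths on whitespace. As a possible alternative, the sketch below lets find move each file itself. The function name collect_duplicates is hypothetical, and the errors directory is passed in as an argument instead of being looked up.

```shell
# Hypothetical helper: move every *.duplicate file found under the errors
# directory (first argument) into the work directory (second argument).
# Does nothing, and still succeeds, when there are no duplicates.
collect_duplicates() (
    errors_dir=$1
    dest=$2
    mkdir -p "$dest" || exit 1
    find "$errors_dir" -type f -name '*.duplicate' -exec mv {} "$dest"/ \;
)
```

Usage, with the path from the example: collect_duplicates /srv/barman/angus/errors ~/wal-investigation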
Then, retrieve the archived WAL file from Barman using the get-wal command:
barman get-wal --output-directory ~/wal-investigation angus 00000001000007C9000000CF
The get-wal command will transparently manage decompression of the WAL file, where applicable.
Now you can finally run pg_waldump on both the archived file and the duplicate ones, inside the ~/wal-investigation directory.
We suggest that you save the output to a file for further analysis.
Start with the archived one:
pg_waldump 00000001000007C9000000CF &> 00000001000007C9000000CF.archived.txt
mv 00000001000007C9000000CF 00000001000007C9000000CF.archived
Then with the duplicate one (in case you have multiple files with the same name, you need to repeat the operation for each of them):
mv 00000001000007C9000000CF.20181209T212501Z.duplicate 00000001000007C9000000CF
pg_waldump 00000001000007C9000000CF &> 00000001000007C9000000CF.duplicate.txt
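When there are several duplicates of the same WAL file, the rename-and-dump step can be repeated in a loop. The sketch below is one way to do that, to be run inside the work directory. The function name dump_duplicates and the WALDUMP variable are hypothetical; WALDUMP is only a convenience for selecting pg_xlogdump on older installations. Instead of renaming, each duplicate is copied under the plain WAL name into a scratch directory, so the originals are left untouched.

```shell
# Hypothetical helper: dump every duplicate of one WAL file (first argument)
# found in the current directory. Output goes to <duplicate name>.txt.
# WALDUMP is an assumed override for the dump binary (default: pg_waldump).
dump_duplicates() (
    wal=$1
    for dup in "$wal".*.duplicate; do
        [ -e "$dup" ] || continue        # glob matched nothing
        scratch=$(mktemp -d) || exit 1
        # Work on a copy that carries the plain WAL file name, mirroring
        # the manual rename in the steps above
        cp "$dup" "$scratch/$wal" || exit 1
        "${WALDUMP:-pg_waldump}" "$scratch/$wal" > "$dup.txt" 2>&1
        rm -rf "$scratch"
    done
)
```

Usage: cd ~/wal-investigation && dump_duplicates 00000001000007C9000000CF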
Now search the text files for errors (e.g. pg_waldump: FATAL: error in WAL record at 7C9/CFFFF858: invalid record length at 7C9/CFFFF8C8: wanted 24, got 0).
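The search can be done with grep. The sketch below lists only the dump files that contain an error line. The function name dumps_with_errors is hypothetical, and the pattern is an assumption modeled on the sample message above; the exact wording may differ between PostgreSQL versions, so the match is case-insensitive and accepts both "fatal" and "error".

```shell
# Hypothetical helper: print the names of the dump files (arguments) that
# contain an error line emitted by the dump tool itself.
# The pattern is an assumption based on the sample message in this document.
dumps_with_errors() {
    grep -l -i -E '^(pg_waldump|pg_xlogdump): (fatal|error)' "$@"
}
```

Usage: dumps_with_errors 00000001000007C9000000CF.*.txt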
For a line-by-line comparison, you can use the diff command on them, e.g.
diff 00000001000007C9000000CF.*.txt
When Barman receives a WAL file (in either the streaming or the incoming subdirectory) that has the same name as one already archived but a different MD5 checksum, it moves the file into the errors subdirectory for that server.
IMPORTANT: The file moved by Barman into the errors directory is just the one that arrived last, not necessarily the broken one.
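To confirm by hand that two uncompressed copies of a WAL file really differ, a byte-for-byte comparison is enough. Note that the sketch below uses cmp rather than the MD5 checksum that Barman itself relies on; for this purpose the answer is the same. The function name same_wal_content is hypothetical, and both files must be uncompressed (for instance, the archived copy as returned by get-wal).

```shell
# Hypothetical helper: succeed (exit status 0) only when the two files given
# as arguments have identical content. Uses cmp, not MD5 checksums.
same_wal_content() {
    cmp -s "$1" "$2"
}
```

Usage: same_wal_content archived_copy duplicate_copy && echo "identical content"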
The reason for this normally falls into one of these categories:
- abnormal crash of the originating PostgreSQL server, which caused the first WAL file to become corrupted
- injection of the WAL from the wrong PostgreSQL server (typically by defining the wrong incoming directory in the archive_command script)