pg_xlog (pg_wal) files start to pile up

Craig Ringer

Issue

Sometimes, files start to pile up in the pg_wal (or pg_xlog for PostgreSQL versions older than 10) directory.

This may be due to a sudden, temporary increase in transactional workload - in that case, all you have to do is monitor the situation and wait for the burst to end.

In other cases, there may be an underlying problem causing this phenomenon, and it is better to investigate. If not fixed promptly, files will continue accumulating until the disk fills up; from that point onwards no further transactions can be executed, resulting in downtime.

Never delete files directly from pg_wal, pg_xlog, pg_clog, etc. You will corrupt your database.

The most common cases for this to happen are:

  1. an inactive replication slot (important: do not just drop it! See below.)
  2. a failing archive_command for continuous WAL archiving

You can also have problems due to

  1. excessive max_wal_size/min_wal_size (9.5+), checkpoint_segments (9.4 and older), or wal_keep_segments settings
  2. checkpoints failing and preventing WAL removal
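
Whatever the cause, it helps to first gauge how much WAL has actually accumulated. A minimal check, assuming PostgreSQL 10 or newer and sufficient privileges for pg_ls_waldir() (on older versions, check the size of the pg_xlog directory on disk instead):

SELECT count(*) AS wal_files,
       pg_size_pretty(sum(size)) AS total_size
  FROM pg_ls_waldir();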

Resolution

Replication slots

Important: Never drop a replication slot without identifying what it is for, even if it is not shown as "active". See the related articles linked below.

Replication slots ensure that PostgreSQL does not delete a WAL file until the clients attached to the slot have consumed the transactional data in it. If no streaming replication client is attached to a slot, PostgreSQL preserves WAL files until the slot becomes active again or is dropped.

To determine if a replication slot is not currently being used (inactive), use this query:

SELECT * FROM pg_replication_slots;

  slot_name   | plugin | slot_type | datoid | database | active |    xmin    | catalog_xmin |  restart_lsn
--------------+--------+-----------+--------+----------+--------+------------+--------------+---------------
 slot_node30  |        | physical  |        |          | t      | 3557567510 |              | 3705/2A440EC8
 slot_node209 |        | physical  |        |          | t      | 3557569748 |              | 3705/2A440EC8
 slot_node21  |        | physical  |        |          | t      | 3557562449 |              | 3705/2A440EC8
 slot_node53  |        | physical  |        |          | f      | 3545071332 |              | 36DF/7C7E0000
 slot_node32  |        | physical  |        |          | t      | 3545071332 |              | 3705/2A440EC8
(5 rows)

Note that one slot is marked as inactive; also, its restart_lsn value is far behind the others. This may mean that the replication slot is no longer required, but you should verify this conclusively before you drop the slot. If the slot is still needed by a node or service that is temporarily offline, encountering an error, etc., then dropping the slot may unrecoverably damage that node or service's operation. See the linked article for guidance on identifying what a replication slot is used for and whether it is safe to remove. Especially on a BDR system, it is vital that you do not drop slots without very careful checking.
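
To see how much WAL each slot is holding back, a query along these lines can help (function names assume PostgreSQL 10 or newer; on 9.x the equivalents are pg_current_xlog_location() and pg_xlog_location_diff()):

SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots
 ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;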

If you are absolutely certain the slot is no longer required by any existing service or node you may drop it with:

SELECT pg_drop_replication_slot('slot_node53');

If it is Barman that requested the replication slot, you can either resume WAL streaming or drop the slot. In the first case, if you installed Barman using 2ndQuadrant packages, you normally just need to set streaming_archiver = on for that server and barman cron will take care of the rest. If you need to drop the slot instead, you can use the barman receive-wal --drop-slot [SERVER_ID] command.

Continuous WAL archiving with archive_command

To verify that an archive_command is working properly, you need to examine the server logs. A failing archive_command would report an error like this:

FATAL: archive command failed with exit code 255

DETAIL: The failed archive command was: /usr/bin/scp -B /Volumes/DataDrive/data/pg_xlog/000000010000007400000086 postgres@10.0.204.55:/Volumes/DataDrive/wals_from_master/000000010000007400000086

LOG: archiver process (PID 17771) exited with exit code 1

The following grep command may make it easier to find archive_command errors in a larger log file:

grep -C10 'FATAL: archive command failed with'

An alternative method is to check the status of the archiver process through the pg_stat_archiver system view which, in PostgreSQL 9.4 and newer, contains the current status and statistics of the archiver process:

\d pg_stat_archiver
                  View "pg_catalog.pg_stat_archiver"
       Column       |           Type           | Collation | Nullable | Default
--------------------+--------------------------+-----------+----------+---------
 archived_count     | bigint                   |           |          |
 last_archived_wal  | text                     |           |          |
 last_archived_time | timestamp with time zone |           |          |
 failed_count       | bigint                   |           |          |
 last_failed_wal    | text                     |           |          |
 last_failed_time   | timestamp with time zone |           |          |
 stats_reset        | timestamp with time zone |           |          |
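
For example, a quick check using these columns: if last_failed_time is more recent than last_archived_time, or failed_count keeps growing between runs, archiving is currently failing.

SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
  FROM pg_stat_archiver;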

If you use Barman for backups, make sure that there is enough available space on that system and that the barman check command returns SUCCESS.

There are two options:

  1. Fix the condition that is causing the command to fail (for instance, mount a network partition, free some space on the remote partition or the backup server, install SSH keys to allow non-interactive access, etc.)
  2. Remove the archive command and set it to something that always succeeds, such as 'exit 0' (see the sketch below). Note that this stops archiving, so any non-streaming replica or backup solution that depends on the archive will no longer work. Even if the change is only temporary, it creates a gap in the recoverability window of your disaster recovery solution.

Do not delete the WAL files directly.
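
A minimal sketch of option 2, assuming superuser access and PostgreSQL 9.4 or newer for ALTER SYSTEM (on older versions, edit postgresql.conf instead). Changing archive_command only requires a configuration reload, not a restart:

ALTER SYSTEM SET archive_command = 'exit 0';
SELECT pg_reload_conf();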

Excessive WAL retention configuration

PostgreSQL may be configured to retain extra WAL in pg_wal (pg_xlog on older versions). This allows physical streaming replicas that are not using physical replication slots to continue functioning if they fall behind the master or their connections are interrupted. With no physical slot telling the master to keep WAL for them, the master deletes WAL as soon as it no longer needs it itself.

The option wal_keep_segments controls this extra WAL retention. If you can't figure out why WAL is being retained and it isn't down to a severely lagging slot or failing archiving, check whether wal_keep_segments is set to a value that is reasonable for the space available in pg_wal.

If you only use replicas with replication slots, or if you have an archive_command and your replicas have a restore_command that uses the WAL archive, you do not need any wal_keep_segments.
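
To check the current value (wal_keep_segments exists up to PostgreSQL 12; from version 13 onwards the equivalent setting is wal_keep_size):

SHOW wal_keep_segments;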

Checkpoint configuration problems

PostgreSQL offers control over when it checkpoints the system. A checkpoint ensures all WAL prior to the checkpoint is fully applied, flushes various caches and buffers, then deletes or recycles the old WAL.

Checkpoints that run too frequently can cause performance issues such as I/O stalls, so they are often tuned to be spread out. However, if checkpoints are spread too far apart, the server may not remove old WAL as fast as new WAL is generated, especially under very heavy writes.

The WAL size that triggers a forced checkpoint is set by max_wal_size in PostgreSQL 9.5 and higher, and the amount of WAL kept for recycling is set by min_wal_size. On 9.4 and older the checkpoint limit is controlled by checkpoint_segments.

If these values are too high, your system may not clear its backlog of WAL properly.
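
To review the checkpoint spacing and WAL sizing currently in effect (parameter names assume 9.5 or newer; on 9.4 and older look at checkpoint_segments instead):

SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('max_wal_size', 'min_wal_size',
                'checkpoint_timeout', 'checkpoint_completion_target');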

Failing checkpoints

It's also possible for checkpoints to fail when something is wrong in PostgreSQL. If checkpoints fail, PostgreSQL cannot remove WAL and it will pile up in pg_wal.

Check for errors in your PostgreSQL logs relating to checkpoints.

You can also enable the log_checkpoints configuration option.
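
For example (assuming superuser access and PostgreSQL 9.4 or newer for ALTER SYSTEM; log_checkpoints takes effect after a configuration reload):

ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();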

It can also be informative to force a checkpoint with the CHECKPOINT; SQL command and see what happens.

You can see when the last successful checkpoint was by checking pg_controldata (unless you're on a sealed system where you have no data directory access).
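
On PostgreSQL 9.6 and newer you can also read the same checkpoint information from SQL, without needing data directory access:

SELECT checkpoint_time FROM pg_control_checkpoint();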

Diagnostic Steps

You can diagnose problems with continuous archiving with this query, taken from Barman's check command internals:

SELECT *,
       current_setting('archive_mode')::BOOLEAN
           AND (last_failed_wal IS NULL
                OR last_failed_wal <= last_archived_wal)
           AS is_archiving
  FROM pg_stat_archiver;

Verify that the is_archiving column calculated by the query above is true.

Root Cause

Inactive replication slots, a failing archive_command, excessive checkpoint or WAL retention configuration, or failing checkpoints.
