pg_wal files start to pile up

Co-authors: Craig Ringer (original) and Jakub Wartak (maintenance of this KB article)

Issue

Sometimes, files start to pile up in the WAL directory (pg_wal in Postgres 10+, pg_xlog in Postgres 9.6 and earlier).

This may be due to a sudden and temporary increase in transactional workload; in that case, all you have to do is monitor the situation and wait for the burst to end.

In other cases, an underlying problem is causing this behaviour, and it is better to investigate. If not fixed promptly and correctly, files will continue to accumulate until the disk fills up, at which point no further transactions can be executed, resulting in downtime.

We recommend setting up monitoring and alerting on disk space for Postgres servers, so that problems can be fixed proactively before the disk fills up. Recovering from a full disk is difficult, often results in an extended outage, and can lead to loss of production data if the correct procedures are not followed.
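To see how much space WAL is currently using from inside Postgres, version 10+ can list the WAL directory via pg_ls_waldir(); a minimal sketch (run as a superuser or a member of pg_monitor):

-- Count WAL segments and report their total size (Postgres 10+).
SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS total_size
  FROM pg_ls_waldir();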

Never delete files directly from pg_wal, pg_xlog, pg_clog, etc. You will corrupt your database.

The most common cases for this to happen are:

  1. An inactive replication slot (important: do not just drop it! See below.)
  2. A failing archive_command for continuous archiving.
  3. Excessive WAL retention settings (wal_keep_size in Postgres 13+, wal_keep_segments in Postgres 12 and earlier).
  4. Checkpoints failing and preventing WAL removal.
  5. Workloads causing WAL spikes.

Replication slots

Important: Never drop a replication slot without identifying what it's for, even if it's not shown as "active".

Replication slots make sure that PostgreSQL does not delete a WAL file until the attached clients have consumed transactional data from it. If streaming replication clients are not attached to the slot, PostgreSQL preserves WAL files until the slot becomes active again or it is dropped.

To determine if a replication slot is not currently being used (inactive), use this query:

SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;

  slot_name   | slot_type | active | restart_lsn
--------------+-----------+--------+--------------
 slot_node30  | physical  | t      | 3705/2A440EC8
 slot_node209 | physical  | t      | 3705/2A440EC8
 slot_node21  | physical  | t      | 3705/2A440EC8
 slot_node53  | physical  | f      | 36DF/7C7E0000
 slot_node32  | physical  | t      | 3705/2A440EC8

Note that one slot is marked as inactive; its restart_lsn value is also well behind the others. This may mean that the replication slot is no longer required, but you should verify this conclusively before you drop the slot. If the slot is still needed by a node or service that is temporarily offline, encountering an error, etc., then dropping the slot may unrecoverably damage that node or service's operation. See the linked article for guidance on identifying what a replication slot is used for and whether it is safe to remove. On a PGD system in particular it is vital that you do not drop slots without very careful checking (this applies especially to slots whose names start with prefixes like bdr_bdrdb_).
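To gauge how much WAL each slot is currently holding back, a minimal sketch (Postgres 10+, run on the primary) compares each slot's restart_lsn against the current WAL position:

-- Slots at the top of this list are retaining the most WAL.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots
 ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;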

If you are absolutely certain the slot is no longer required by any existing service or node, you may drop it with:

SELECT pg_drop_replication_slot('slot_node53');

If that replication slot was requested by Barman, you can either resume WAL streaming or drop the replication slot:

  • If you want to resume WAL streaming, you normally need to set streaming_archiver = on in the Barman configuration for that server, and barman cron will perform the needed actions.
  • If you need to drop the slot instead, you can use the barman receive-wal --drop-slot [SERVER_ID] command.

If the slot is active and in use by a physical standby, but the standby's lag keeps growing (meaning the standby is replicating data slowly, due to slow WAL replay and/or slow data transfer), it is best to investigate the root cause of the slowness.

Often the primary is able to keep up with the WAL workload, but due to TCP bandwidth limitations (which can vary over time, especially on WAN links) the network path may not have enough bandwidth to transfer the data. This prevents the replication slot from advancing on the primary and also causes high WAL disk space consumption there. Such an issue can be diagnosed by tracking the following long term (e.g. by plotting graphs):

  1. replication lag of standbys (see the example query after this list),
  2. bandwidth used by TCP replication on the standby (tricky, but assuming the standby is NOT used for queries, it is simply the network ingress on the standby host).
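For point 1, a minimal sketch (Postgres 10+, run on the primary) that reads the per-standby lag columns from pg_stat_replication:

-- Byte lag and time-based lag for each attached standby.
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_byte_lag,
       write_lag,
       flush_lag,
       replay_lag
  FROM pg_stat_replication;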

Continuous WAL archiving with archive_command

To verify that an archive_command is working properly, you need to examine the server logs. A failing archive_command would report an error like this:

FATAL: archive command failed with exit code 255

DETAIL: The failed archive command was: /usr/bin/scp -B /Volumes/DataDrive/data/pg_xlog/000000010000007400000086 postgres@10.0.204.55:/Volumes/DataDrive/wals_from_master/000000010000007400000086

LOG: archiver process (PID 17771) exited with exit code 1

The following grep command may make it easier to find archive_command errors in a larger log file:

grep -C10 'FATAL: archive command failed with'

An alternative method is to check the status of the archiver process through the pg_stat_archiver system view, which (since PostgreSQL 9.4) contains information about the current status and statistics of the archiver process, as shown in the example query below the view definition:

                       View "pg_catalog.pg_stat_archiver"
       Column       |           Type           | Collation | Nullable | Default
--------------------+--------------------------+-----------+----------+---------
 archived_count     | bigint                   |           |          |
 last_archived_wal  | text                     |           |          |
 last_archived_time | timestamp with time zone |           |          |
 failed_count       | bigint                   |           |          |
 last_failed_wal    | text                     |           |          |
 last_failed_time   | timestamp with time zone |           |          |
 stats_reset        | timestamp with time zone |           |          |
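
For example, a quick query comparing the most recent successful archive with the most recent failure:

SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time
  FROM pg_stat_archiver;

If failed_count keeps increasing, or last_failed_time is more recent than last_archived_time, the archive_command is failing.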

If you use Barman for backups, make sure that there is enough available space on that system and that the barman check command returns SUCCESS.

There are two options:

  1. Fix the condition that is causing the command to fail (for instance, mount a network partition, free some space on a remote partition or the backup server, install SSH keys to allow non-interactive access, etc.).
  2. Remove the archive command and set it to something that always succeeds, such as 'exit 0'. Note that this stops archiving, so any non-streaming replica or backup solution that depends on the archive will no longer work. Even if the change is only temporary, it will cause a gap in the recoverability window of your disaster recovery solution.

Do not delete the WAL files directly.

Please see our knowledge base article WAL Archiving Best practices for more information.

Excessive WAL retention configuration

PostgreSQL may be configured to retain extra WAL in the WAL directory (pg_wal in Postgres 10+, pg_xlog in Postgres 9.6 and earlier).

Keeping WAL for replicas without replication slots:

One use case is physical streaming replicas that do not use replication slots: the primary can be configured to hold some additional WAL in case the replicas temporarily fall behind the master or their connections are interrupted, allowing them to catch up again once the connection is re-established. Without this, and with no physical slot to tell the master to keep WAL for them, the master deletes WAL files as soon as it no longer needs them.

The option that controls this extra WAL retention is wal_keep_size in Postgres 13+ and wal_keep_segments in Postgres 12 and earlier.

If you can't figure out why WAL is being retained, and it's not down to a severely lagging slot or failing archiving, check whether this setting is reasonable for your available pg_wal space.

If you only use replicas with replication slots, or if you have an archive_command and your replicas have a restore_command that uses the WAL archive, you don't need any wal_keep_size.
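To check what is currently configured (only one of these two parameters exists, depending on the server version):

SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('wal_keep_size', 'wal_keep_segments');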

Keeping WALs for replication slots:

By default, the primary keeps WAL indefinitely for an inactive replication slot. Since Postgres 13 this retention can be capped with max_slot_wal_keep_size; it is not configurable in Postgres 12 and earlier.

If a replication slot stays permanently inactive and this retention is not capped, the primary will eventually run out of disk space and suffer a major outage.
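On Postgres 13+ you can check how close each slot is to that limit; a minimal sketch:

-- wal_status reports whether a slot still has all the WAL it needs
-- (reserved, extended, unreserved or lost); safe_wal_size shows how many
-- bytes of WAL can still be written before the slot risks losing required WAL.
SELECT slot_name, active, wal_status, safe_wal_size
  FROM pg_replication_slots;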

Checkpoint configuration problems

PostgreSQL offers control over when it checkpoints the system. A checkpoint ensures all WAL prior to the checkpoint is fully applied, flushes various caches and buffers, and then deletes or recycles the old WAL. Since PostgreSQL 11, WAL is retained only back to the last checkpoint rather than the last two.

Checkpoints that run too frequently can cause performance issues like I/O stalls, so they are often tuned to be spread out. But if checkpoints are spaced too far apart, the server may fail to keep up with the rate at which new WAL is generated, especially during very heavy writes.

The WAL size limit that triggers a forced checkpoint is max_wal_size in PostgreSQL 9.5 and higher, and the amount of WAL to keep for recycling is min_wal_size. On 9.4 and older the checkpoint limit is controlled by checkpoint_segments.

Postgres also runs a scheduled checkpoint whenever checkpoint_timeout has elapsed since the last checkpoint.

If these values are set too high, your system may not clear its backlog of WAL promptly.
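To review the current values of the settings discussed above:

SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('max_wal_size', 'min_wal_size', 'checkpoint_timeout');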

Failing checkpoints

It's also possible for checkpoints to fail when something is wrong in PostgreSQL. If checkpoints fail, PostgreSQL cannot remove WAL and it will pile up in pg_wal.

Check for errors in your PostgreSQL logs relating to checkpoints.

We recommend enabling the log_checkpoints configuration option.

It can also be informative to force a checkpoint with the CHECKPOINT; SQL command and see what happens.

You can see when the last successful checkpoint was by running SELECT * FROM pg_control_checkpoint();
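For example, to see how long ago the last checkpoint completed (a value much larger than checkpoint_timeout suggests checkpoints are failing or not completing):

SELECT checkpoint_time,
       now() - checkpoint_time AS time_since_last_checkpoint
  FROM pg_control_checkpoint();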

Diagnostic Steps

You can diagnose problems with continuous archiving with this query, taken from barman check command internals:

SELECT *,
       current_setting('archive_mode')::BOOLEAN
       AND (last_failed_wal IS NULL
            OR last_failed_wal <= last_archived_wal) AS is_archiving
  FROM pg_stat_archiver;

Verify that the calculated is_archiving column from the above query is true.

Workloads causing WAL spikes

In our experience, heavy batch jobs, CREATE TABLE AS SELECT, VACUUM FULL and REFRESH MATERIALIZED VIEW are generally the major causes of high WAL generation, especially when executed on large relations with several jobs at once.

Most of the time pg_wal write performance is adequate; however, one needs to be sure that archive_command performance and standby replication performance can keep up with the demand.

How to measure total WAL throughput

If there is a dedicated filesystem hosting pg_wal, you can check the write rate there with iostat -dxm 1 XXX (where XXX is the block device for that filesystem).

We also recommend graphing filesystem space consumption (or its growth rate), especially for the pg_wal filesystem.

By running the following query (ideally combined with psql's \watch 1 functionality), you can get the real-time WAL generation rate on PostgreSQL 14+:

WITH
        b as (SELECT wal_bytes as s, pg_sleep(1), pg_stat_clear_snapshot() FROM pg_stat_wal ),
        e as (SELECT wal_bytes as s, pg_sleep(0), pg_stat_clear_snapshot() FROM pg_stat_wal )
SELECT round((e.s - b.s)/1) as wal_bytes_per_sec FROM b CROSS JOIN e;

SQL statements generating the most WAL

Since PostgreSQL 13, one can track WAL generation per statement using pg_stat_statements. Please see the documentation for more about this, especially the wal_bytes column.
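A minimal sketch, assuming the pg_stat_statements extension is installed, listing the top 10 statements by WAL volume generated:

-- wal_fpi counts full-page images, which often dominate WAL volume
-- for writes that occur shortly after a checkpoint.
SELECT substr(query, 1, 60) AS query,
       calls,
       wal_bytes,
       wal_records,
       wal_fpi
  FROM pg_stat_statements
 ORDER BY wal_bytes DESC
 LIMIT 10;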

Identifying the top WAL-generating relations from a WAL segment

It is possible to get a glimpse of the most active relations by inspecting the WAL segment(s) themselves. WAL segments are 16MB by default, and a segment's last modification time indicates when the WAL traffic it contains was generated.

  1. Get the current WAL segment name using SELECT pg_walfile_name(pg_current_wal_lsn());, or simply pick a WAL segment written during the peak.

  2. Run pg_waldump on that WAL segment: pg_waldump /path/to/pg_wal/000000010000000000000071 > /tmp/walanalysis.txt

  3. Analyze the result for the 10 most active relations: grep -Po ' rel [0-9\/]+' /tmp/walanalysis.txt | awk '{print $2}' | sort | uniq -c | sort -r -n -k 1 | head -10

Sample output:

  70923 1663/5/28243
   5576 1663/5/28228
   2800 1663/5/28231
   2797 1663/5/28229
[...]

The format is: tablespace OID, database OID, and relfilenode. In this case, the top consumer of WAL volume was relfilenode 28243 in the default tablespace, in the database with OID 5. These values can be looked up with the following SQL queries (run the last one while connected to the identified database; note that a relation's relfilenode only matches its OID until the relation is rewritten, e.g. by VACUUM FULL or TRUNCATE):

SELECT oid, spcname FROM pg_tablespace WHERE oid = 1663;

SELECT oid, datname FROM pg_database WHERE oid = 5;

SELECT c.oid, nspname AS schema, relname FROM pg_class c JOIN pg_namespace n ON (c.relnamespace = n.oid) WHERE c.relfilenode = 28243;
