A bug present in non-current PostgreSQL versions can cause the pg_replslot directory to grow without bound. If undetected, this can lead to "no space left on device" errors and an unreachable server, although the preconditions required to trigger it make this rare.
A bug in non-current PostgreSQL versions prevents the contents of the pg_replslot directory from being cleaned up properly during a logical decoding session. This can result in excessive disk space use.
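To spot this kind of growth early, the space used under pg_replslot can be measured per slot. The following is a minimal sketch rather than an official tool: the function name replslot_usage is hypothetical, and it assumes each replication slot has its own subdirectory under the pg_replslot path passed in and that the script can read the data directory.

```python
import os


def replslot_usage(replslot_dir):
    """Return {slot_name: total_bytes} for each subdirectory of replslot_dir.

    replslot_dir is assumed to be the pg_replslot directory inside the
    PostgreSQL data directory (hypothetical helper, not part of PostgreSQL).
    """
    usage = {}
    for entry in os.scandir(replslot_dir):
        # Only per-slot subdirectories are of interest here.
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    # The file was removed between listing and stat; skip it.
                    pass
        usage[entry.name] = total
    return usage
```

Running this periodically and alerting when a slot's total keeps increasing is one way to detect the problem before the disk fills up.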
WARNING: Do not drop replication slots to free space, even if you think they are unused because active=f.
The bug only occurs when the transaction being decoded is big enough to be written to a reorder buffer file in pg_replslot, and the server crashed or had an unclean shutdown while the transaction was in progress. The reorder buffer logic fails to recognise that the transaction is aborted, and retains it until the logical decoding session is restarted or the whole server is restarted.
Restarting the provider/upstream PostgreSQL server where the reorder buffer accumulation occurs is the simplest and safest solution. It will only be effective if confirmed_flush_lsn in all logical replication slots has advanced past the xlog position at which the server crash(es) occurred.
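The precondition for the restart can be checked by comparing positions numerically. The sketch below assumes the confirmed_flush_lsn values and the crash position are already available as text in PostgreSQL's usual "hi/lo" hexadecimal form (for example, read from the pg_replication_slots view on versions that expose the column). The helper names parse_lsn and restart_would_help are hypothetical, and how the crash position is determined is left to the operator.

```python
def parse_lsn(lsn):
    """Convert an LSN such as '16/B374D848' to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def restart_would_help(confirmed_flush_lsns, crash_lsn):
    """Return True if every slot has advanced past the crash position.

    confirmed_flush_lsns: iterable of LSN strings, one per logical slot.
    crash_lsn: LSN string of the latest crash (assumed to be known).
    """
    crash = parse_lsn(crash_lsn)
    return all(parse_lsn(lsn) > crash for lsn in confirmed_flush_lsns)
```

For instance, restart_would_help(["2/0", "1/FF"], "1/A0") returns True, while a single slot still at "1/10" makes it return False.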
In the wild, this bug has mainly been observed on servers that are unstable for other reasons and crash repeatedly: for example, undersized servers configured with VM overcommit that hit out-of-memory errors, or hosts with hardware faults.
Fixed in:
- 10.2
- 9.6.7
- 9.5.11
- 9.4.16
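
The list above can be turned into a quick check of whether a given server version already contains the fix. This is an illustrative sketch: the function name has_fix is hypothetical, it expects a plain numeric version string such as "9.6.7" or "10.2", and it assumes that major releases after 10 include the fix. Versions before 9.4 are rejected because logical decoding was introduced in 9.4.

```python
# First minor release containing the fix, keyed by major version.
FIXED_MINOR = {(10,): 2, (9, 6): 7, (9, 5): 11, (9, 4): 16}


def has_fix(version):
    """Return True if the given PostgreSQL version contains the fix."""
    parts = tuple(int(p) for p in version.split("."))
    if parts[0] >= 10:
        # From version 10 on, the major version is a single number.
        major = parts[:1]
        minor = parts[1] if len(parts) > 1 else 0
    else:
        # Before version 10, the major version has two components.
        major = parts[:2]
        minor = parts[2] if len(parts) > 2 else 0
    if major in FIXED_MINOR:
        return minor >= FIXED_MINOR[major]
    if major < (9, 4):
        raise ValueError("logical decoding is not available before 9.4")
    # Assumption: any major release newer than 10 includes the fix.
    return True
```

For example, has_fix("9.5.10") returns False and has_fix("9.5.11") returns True.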
See commit, mailing list