A customer running on EPAS 14.6 experienced a rare edge case where a stuck/broken autovacuum process with no wait event was preventing checkpoints. This caused a build-up of WAL files, with a risk of running out of disk space and a critical incident.
During investigation, we observed that automatic checkpoints were not executing.
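One way to confirm this (a sketch of the kind of check that can be used; the exact queries run at the time may have differed) is to look at the time of the last completed checkpoint recorded in the control file, and at the checkpoint counters:
select checkpoint_time, redo_lsn from pg_control_checkpoint();
select checkpoints_timed, checkpoints_req from pg_stat_bgwriter;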
- A manual checkpoint was hanging with wait_event_type=IPC, wait_event=CheckpointStart.
- The checkpointer postgres process was hanging with wait_event_type=LWLock, wait_event=BufferContent.
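These wait events can be seen in pg_stat_activity. A query along the following lines (a sketch, not necessarily the exact command used during the incident) shows the checkpointer and the session issuing the manual CHECKPOINT:
select pid, backend_type, state, wait_event_type, wait_event, query
from pg_stat_activity
where backend_type = 'checkpointer'
   or query ilike 'checkpoint%';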
The checkpointer was stuck on the BufferContent wait event, i.e. waiting to access a data page in memory. This wait event belongs to wait_event_type=LWLock (light-weight locks): internal postgres locks that are meant to be held only briefly while working with a shared data structure. LWLocks do not appear in pg_locks, which only displays heavy-weight locks (wait_event_type=Lock).
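As an illustration, querying pg_locks for the checkpointer's pid (the pid value below is hypothetical) returns nothing that explains the wait, because LWLock waits are only visible through the wait_event columns of pg_stat_activity:
select locktype, mode, granted from pg_locks where pid = 12345;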
Using the following query, we saw that an autovacuum worker process had been broken for 7 days:
select now(), backend_start, backend_type from pg_stat_activity;

              now              |        backend_start         |   backend_type
-------------------------------+------------------------------+-------------------
 2024-10-30 16:32:11.132339+08 | 2024-10-23 00:29:35.74479+08 | autovacuum worker
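A variation that makes long-running backends stand out more clearly (sorting by start time and computing each backend's age) might look like this:
select pid, backend_type, backend_start, now() - backend_start as backend_age
from pg_stat_activity
order by backend_start
limit 10;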
Examining further, we saw that it was a stuck vacuum analyze process, with no wait event:
select pg_blocking_pids(pid),* from pg_stat_activity where backend_type = 'autovacuum worker';
-[ RECORD 1 ]----+---------------------------------------------------
pg_blocking_pids | {}
datid            | 20000
datname          | affected_database_name
pid              | 8000000
leader_pid       | (NULL)
usesysid         | (NULL)
usename          | (NULL)
application_name |
client_addr      | (NULL)
client_hostname  | (NULL)
client_port      | (NULL)
backend_start    | 2024-10-23 00:29:35.74479+08
xact_start       | 2024-10-23 02:26:20.392731+08
query_start      | 2024-10-23 02:26:20.392731+08
state_change     | 2024-10-23 02:26:20.392731+08
wait_event_type  | (NULL)
wait_event       | (NULL)
state            | active
backend_xid      | (NULL)
backend_xmin     | 490000000
query_id         | (NULL)
query            | autovacuum: VACUUM ANALYZE schema_name.table_name
backend_type     | autovacuum worker
Given that this process was stuck, we suspected it was related to the issue. It was running endlessly, with one CPU core 100% occupied. Note that pg_blocking_pids() returned an empty array because the worker was not waiting on any heavy-weight lock, which is the only kind of blocking that function reports.
kill / pg_terminate_backend did not work on this autovacuum worker process. Restarting postgres was not an option, as the customer hoped to close this session gracefully and avoid postgres crash recovery and WAL replay.
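For reference, the standard graceful-termination attempts look something like the following (using the pid from the output above); in this case the worker did not exit:
select pg_terminate_backend(8000000);
-- the OS-level equivalent is a plain SIGTERM, e.g.: kill 8000000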
For further context, the database size was in the TBs and a checkpoint had not been able to run for days, meaning that multiple days of WAL would have needed to be replayed; a production outage of up to a day was expected for the WAL replay.
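To quantify the WAL build-up, a query along these lines can be used (pg_ls_waldir() requires superuser or pg_monitor privileges by default):
select count(*) as wal_files, pg_size_pretty(sum(size)) as total_wal_size
from pg_ls_waldir();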
Note that kill -9 on the process would also have resulted in postgres performing crash recovery.
At this point, EDB development became involved in the issue, aiding EDB support.
EDB provided some commands to obtain diagnostic data on the processes.
EDB then provided a command to try to close the broken autovacuum process gracefully. The command had been tested successfully in local environments on standard healthy postgres processes, but, given the rare edge case being experienced, it was not guaranteed to close this broken process gracefully and could still have resulted in crash recovery.
This command worked: the autovacuum process closed gracefully, the checkpoint was then able to run, and an outage was avoided for the customer. In hindsight, the broken vacuum had been permanently occupying, and not releasing, data within shared memory that the checkpointer was waiting for.
Analysis of the diagnostic data collected, dmesg -T, /var/log/messages, and the postgres logs, did not give any additional insight into why this autovacuum command became stuck.
If you experience a similar issue, where postgres internals are not completing their work, please contact EDB support.
The commands for diagnosing the broken processes and closing the broken process have not been provided in this article. In these kinds of situations, where postgres behaviour is broken in an unexpected way, an EDB developer needs to review the situation and provide actions on a case-by-case basis. For EDB development use, use 42298 to review this case.