Calling the function pg_cancel_backend() or pg_terminate_backend() returns true, but does not actually stop the postgres backend. Similarly, sending SIGTERM (signal 15 on Linux) with the kill command does not terminate the backend.
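For example, the termination request reports success but the backend stays visible. This is only a sketch of the symptom: 12345 is a placeholder pid and the psql connection options are assumptions.
psql -U postgres -c "select pg_terminate_backend(12345)"                        # reports t
psql -U postgres -c "select pid, state, query from pg_stat_activity where pid = 12345"   # backend still listed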
WARNING: Sending a SIGKILL (kill -9, kill -KILL) will cause the postmaster to do a crash/recovery/restart cycle where all other connections are terminated and all open transactions are aborted. So while it'll usually kill the stuck backend, it's often very disruptive to production operations. The postmaster may take a while to recover and accept connections again, as it has to redo WAL from the last checkpoint.
As a short-term operator intervention, the stuck postgres process can usually be forced to process interrupts with gdb if gdb and debug symbols are available.
IMPORTANT: If at all possible, please follow the steps listed in the "Diagnostic Steps" section first, as it may be difficult or impossible to determine why the process got stuck once it has been terminated.
You will need debug info to be available for the running PostgreSQL for these instructions to work. See "Notes on debug info" at the end.
Check if you have debuginfo by starting psql and running select pg_backend_pid(). Then, in a regular unix shell as the same user that runs postgres (or as root, if system security settings disallow doing it as postgres), run gdb -p $the_pid_you_got, then in gdb run:
set logging on
set pagination off
set print pretty on
p debug_query_string
If the p debug_query_string succeeds without error, you have debug symbols.
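A one-shot, non-interactive variant of the same check is sketched below. $PID is the pid returned by pg_backend_pid() in a psql session that is still connected, and sudo may not be needed depending on which user runs gdb.
sudo gdb -p "$PID" -batch -ex 'p debug_query_string' -ex 'detach'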
First use gdb to attach to the postgres process:
gdb -p <pid>
Next, capture a backtrace showing what postgres is doing, for later analysis and debugging:
thread apply all bt full
If possible get more than one by continuing execution (cont) and then interrupting again with control-C, as this will help diagnose infinite loops.
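A non-interactive way to capture the same backtraces (run it a couple of times and compare), assuming $PID is the stuck backend:
sudo gdb -p "$PID" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    -ex 'detach' > "stuck_backend_bt_$(date +%s).txt" 2>&1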
If there is enough disk space somewhere on the system, preferably outside the partition/file system the PGDATA directory is on, you should also run gcore -o /path/to/somewhere/with/space/postgres PROCESS_ID, which writes a core file named postgres.PROCESS_ID for the problem postgres backend. You'll need at least shared_buffers worth of space or the RSS of the postgres backend, whichever is greater, but it's hard to say exactly how much more. Note that you cannot usefully copy it to another machine for analysis (you'd have to copy all the library binaries PostgreSQL is linked to and all their debuginfos too). Just keep the core for later examination.
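As a rough sketch of the space check and the core dump (the path and $PID are placeholders):
ps -o rss= -p "$PID"                                     # backend resident set size, in kB
df -h /path/to/somewhere/with/space                      # free space on the target filesystem
gcore -o /path/to/somewhere/with/space/postgres "$PID"   # writes postgres.$PID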
To try to get the stuck postgres backend to exit cleanly(ish), force the process to process interrupts by running, at the gdb prompt:
p ProcessInterrupts()
Note that this runs the risk of crashing the PostgreSQL backend. The process might terminate uncleanly with a SIGABRT (signal 6), SIGSEGV (signal 11, segmentation fault) etc. if it wasn't in a safe state to exit cleanly. This will force a restart, at which point PostgreSQL will begin crash recovery from the last checkpoint, causing an outage until recovery is complete.
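The same interrupt forcing can be done non-interactively, and it carries the same crash risk described above. A sketch, with $PID as the stuck backend:
sudo gdb -p "$PID" -batch -ex 'p ProcessInterrupts()' -ex 'detach'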
If this continues to fail to stop the postgres process, then all graceful recovery methods have been exhausted. All related applications should be gracefully stopped before ungracefully stopping the postgres process with a kill -9 on the unresponsive pid. After the stuck postgres backend has been killed, PostgreSQL should self-restart, enter recovery, then come up for service.
If it fails to, pg_ctl -m immediate stop it or kill -QUIT the postmaster. If that fails, kill -KILL all postgres processes. Once postgres is fully shut down, restart it using normal methods and let it finish recovery, then come up normally.
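A sketch of that escalation path, assuming $PID is the stuck backend and $PGDATA points at the data directory:
kill -9 "$PID"                                       # last resort: SIGKILL the stuck backend only
pg_ctl -D "$PGDATA" -m immediate stop                # if the cluster does not restart and recover on its own
kill -QUIT "$(head -1 "$PGDATA/postmaster.pid")"     # equivalent immediate shutdown sent straight to the postmaster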
In most cases, having debug info available means you need to install debuginfo packages. These are usually packages named with the -debug, -debuginfo or -dbg suffixes, depending on distro and version. For source builds of PostgreSQL you must have built with --enable-debug. Building PostgreSQL with --enable-debug has no effect on performance and should always be done for all production or test deployments; there is no good reason to ever leave it out.
You cannot rebuild PostgreSQL with --enable-debug and use debuginfo from those binaries to examine non-debug binaries. Even if it's exactly the same git revision with all the other build flags the same, it will not work. Similarly, "same minor version" is not good enough for debuginfo packages; the match must be exact. And unfortunately yum.postgresql.org deletes debuginfo for old minor versions quite quickly...
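For example (package names here are assumptions and vary by distro and PostgreSQL major version; match them to your installed server packages):
sudo dnf debuginfo-install postgresql16-server    # RHEL/Fedora family with PGDG packages
sudo apt-get install postgresql-16-dbgsym         # Debian/Ubuntu, requires the dbgsym/ddebs repository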
To cover a reasonable set of possible causes, you should try to do the following (a small shell sketch that gathers several of these appears below):
- Capture pg_stat_activity, pg_locks and pg_stat_replication for the stuck backend or preferably all backends:
  \copy pg_catalog.pg_stat_activity to 'pg_stat_activity.csv' with (format csv, header)
  \copy pg_catalog.pg_locks to 'pg_locks.csv' with (format csv, header)
  \copy pg_catalog.pg_stat_replication to 'pg_stat_replication.csv' with (format csv, header)
- Check the state and wchan fields in ps for the process(es). See related KB for details. You may need root, or at least to run as the same user as postgres, to see the wchan fields, and supported ps arguments can vary a bit. Save the output for later and run it a few times if possible. In short, assuming postgres runs as user postgres:
  sudo ps -u postgres -H --forest -o pid,ppid,state,wchan:40,cmd
- Check dmesg to see if there are signs of I/O errors, filesystem errors, CPU or memory errors etc. that may result in system calls not returning or spending excessively long times blocked in the kernel. dmesg arguments vary, but dmesg -T is preferable, and plain dmesg if that doesn't work.
- Get a backtrace in gdb if possible, or even better a repeated set of them. Instructions are included in the workaround steps above.
- Save a core file from the problem backend with gcore if possible. This works even if you don't have debuginfo; you can analyse it later if you can find debuginfo for the binaries afterwards. Details are in the workaround steps above.
- Check if the database contains procedural languages other than pl/pgsql. If it does, look at the function definitions and see if there are calls to external libraries. Check pg_stat_activity to see if there's any sign the stuck backend might be running in one now.
- Check the PostgreSQL logs for errors relating to stuck spinlocks, I/O errors, etc.
- Save journalctl output relating to postgres if on a systemd system, with something like:
  - systemctl | grep postgres to get the postgres service name(s); then
  - sudo journalctl -u $the_service_name for basic messages; and
  - sudo journalctl -o json-pretty -u $the_service_name > $the_service_name-journal.json for detailed output
- Obviously capture all available PostgreSQL logs, preferably in both csv and text formats if both are available.
The data collector can capture some, but not all, of the above.
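A small shell sketch that gathers several of the items above in one go. It assumes postgres runs as user postgres, that psql can connect as that user, and that a timestamped directory under /tmp is an acceptable place to write; adjust to taste.
OUT=/tmp/stuck-backend-$(date +%Y%m%d%H%M%S)
mkdir -p "$OUT" && cd "$OUT"
psql -U postgres -c "\copy pg_catalog.pg_stat_activity to 'pg_stat_activity.csv' with (format csv, header)"
psql -U postgres -c "\copy pg_catalog.pg_locks to 'pg_locks.csv' with (format csv, header)"
psql -U postgres -c "\copy pg_catalog.pg_stat_replication to 'pg_stat_replication.csv' with (format csv, header)"
sudo ps -u postgres -H --forest -o pid,ppid,state,wchan:40,cmd > ps.txt
sudo dmesg -T > dmesg.txt 2>/dev/null || sudo dmesg > dmesg.txt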
In many cases you can only tell exactly what's happening by attaching gdb (with debug symbols for postgres and preferably libc and other libs linked into the program) then running bt. Where this isn't possible, ps's wchan and state fields can sometimes offer some insight, as can strace. In some cases systemtap, perf or ebpf-tools may also shed some light, but these require more expertise to use and interpret.
It's uncommon for a simple perf record and perf report or a perf top to shed much light, as in default operation it'll only show CPU activity. But it can help spot some kinds of busy-loops.
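Some example probes, assuming $PID is the stuck backend; strace adds overhead while attached, and the perf commands need appropriate privileges.
sudo strace -f -tt -T -p "$PID"                                  # shows the current or looping system calls, if any
sudo perf top -p "$PID"                                          # live CPU profile; useful for spotting busy-loops
sudo perf record -g -p "$PID" -- sleep 30 && sudo perf report    # 30-second call-graph sample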
See also tools for performance insight and tracing.
Note: If you're going to file a report for a developer, please capture bt full and, if available disk space permits, also save a core file for the problem postgres process with gcore.
The postgres backend process is waiting within a routine that is not trapping interrupt signals. Which routine that is can vary, and it may not necessarily be in the PostgreSQL code itself.
This may be library code, procedural language code, operating system / C library code, or system calls into the kernel.
Possible causes include:
- Loops within PostgreSQL or PostgreSQL extension code that fail to respect PostgreSQL's interrupt handling mechanisms and signal handlers:
  - PostgreSQL native C code running in a tight busy-loop that lacks any CHECK_FOR_INTERRUPTS() call, so it fails to notice the variables set by PostgreSQL's signal handlers when they're run. In this case you'll see a backend with very high CPU use.
  - PostgreSQL native C code running in a loop over WaitLatchOrSocket, WaitLatch or WaitEventSetWait where the loop lacks any CHECK_FOR_INTERRUPTS() call. Again it'll fail to notice that it's been killed, but in this case it'll typically have very low CPU use and spend most of its time in the sleep state. You might see a syscall like epoll_wait in strace. The ps wchan field will probably be - as it doesn't seem to show epoll etc.
  - PostgreSQL native C code that invokes a system call in a retry loop that checks for EINTR and retries without calling CHECK_FOR_INTERRUPTS() first. You might see the system call in ps's wchan field.
  - For backends that fail to exit even when the postmaster exits, the backend might be running in a wait loop that fails to pass the WL_POSTMASTER_DEATH flag to WaitLatch, WaitLatchOrSocket or WaitEventSetWait, and/or fails to handle it in the loop with if (rc & WL_POSTMASTER_DEATH) { proc_exit(1); } like it should.
- A PostgreSQL backend that is stuck in a LWLockAcquire(...) call on a PostgreSQL LWLock that another backend is failing to release, or in some cases a LWLock already held by the same backend. (LWLocks are not re-entrant, so re-acquiring an already held lock will deadlock as the backend blocks on itself.) This is easy to diagnose from gdb output, and if the issue is repeatable you may be able to diagnose it with perf dynamic user-space probes placed on LWLockAcquire and LWLockRelease, if you have debuginfo and your perf is new enough to be able to capture function call arguments; see the sketch after this list. (TODO Craig: verify, I'm not 100% sure LWLockAcquire fails to respond to signals.)
- A backend stuck in an uninterruptible system call (by PostgreSQL itself, or a library used by PostgreSQL) where control does not return to PostgreSQL code in order to check for interrupts. If a backend stays in the D state in ps for extended periods this may be the cause - but it could also just be doing lots of I/O. Check dmesg output for possible I/O errors, and check the wchan field in ps to see which system call the process is waiting in, or use strace or gdb.
- A backend that has called into a non-PostgreSQL library or language runtime that does not promptly return control to PostgreSQL-aware code. For example, a procedural language function that invokes some external library's own event loop and sets up its own signal handling that isn't aware of the PostgreSQL and PL/whatever signal handling.
- A PostgreSQL extension, or a library used by an extension, that has overwritten PostgreSQL's signal handlers. That can happen when running as a background worker or as a regular function in a user backend. If the InterruptPending global is not set to true when you send a SIGTERM, this might be the issue; you can observe the signal handler in gdb to be sure (see the sketch after this list).
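Two quick checks related to the causes above, as a sketch rather than a recipe. They assume $PID is the stuck backend, that debuginfo is installed, and (for the perf probes) that /usr/pgsql-16/bin/postgres is the postgres binary your cluster actually runs; adjust paths to your installation.
# Has the backend even noticed the termination request? InterruptPending and
# ProcDiePending are globals set by PostgreSQL's signal handlers.
sudo gdb -p "$PID" -batch \
    -ex 'p InterruptPending' \
    -ex 'p ProcDiePending' \
    -ex 'detach'
# For a repeatable LWLock hang, place dynamic user-space probes on the lock
# routines and record them for a short period.
sudo perf probe -x /usr/pgsql-16/bin/postgres LWLockAcquire
sudo perf probe -x /usr/pgsql-16/bin/postgres LWLockRelease
sudo perf record -e probe_postgres:LWLockAcquire -e probe_postgres:LWLockRelease -p "$PID" -- sleep 30
sudo perf report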