Postgres process not responding to pg_cancel_backend() or pg_terminate_backend()

Mark Wong

Issue

Calling the function pg_cancel_backend() or pg_terminate_backend() returns true, but the target postgres backend does not actually stop. Similarly, sending SIGTERM (signal 15 on Linux) with the kill command does not terminate the backend.
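
For reference, the failing pattern typically looks something like this from a shell; 12345 is a placeholder for the stuck backend's pid, and psql is assumed to connect without extra options (the wait_event columns exist on PostgreSQL 9.6 and later):

psql -c "select pid, state, wait_event_type, wait_event, query from pg_stat_activity where pid = 12345"
psql -c "select pg_cancel_backend(12345)"      # returns t, but the backend keeps running
psql -c "select pg_terminate_backend(12345)"   # also returns t, backend still does not exit
kill -TERM 12345                               # equivalent to pg_terminate_backend(); still no effect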

WARNING: Sending a SIGKILL (kill -9, kill -KILL) will cause the postmaster to do a crash/recovery/restart cycle where all other connections are terminated and all open transactions are aborted. So while it'll usually kill the stuck backend, it's often very disruptive to production operations. The postmaster may take a while to recover and accept connections again as it has to redo WAL from the last checkpoint.

Resolution

As a short-term operator intervention, the stuck postgres process can usually be forced to process interrupts with gdb if gdb and debug symbols are available.

IMPORTANT: Please follow the steps listed in the "Diagnostic Steps" section first, if at all possible, as it may be difficult or impossible to determine why the process got stuck once it has been terminated.

Checking for the presence of debug symbols

You will need debug info to be available for the running PostgreSQL for these instructions to work. See "Notes on debug info" at the end.

Check whether you have debuginfo by starting psql and running select pg_backend_pid(). Then, in a regular unix shell, as the same user that runs postgres (or as root, if system security settings disallow doing it as postgres), run gdb -p $the_pid_you_got and enter the following at the gdb prompt:

set logging on
set pagination off
set print pretty on
p debug_query_string

If the p debug_query_string succeeds without error, you have debug symbols.
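
If you already know the pid of the stuck backend, the same check can be run non-interactively; a sketch, with 12345 as a placeholder pid:

gdb -p 12345 -batch -ex 'p debug_query_string'
# output like  $1 = 0x... "select ..."  means debug symbols are present;
# "No symbol "debug_query_string" in current context." means they are missing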

Examining the problem process

First use gdb to attach to the postgres process:

gdb -p <pid>

Next, capture a backtrace showing what postgres is doing, for later analysis and debugging:

thread apply all bt full

If possible, get more than one backtrace by continuing execution (cont) and then interrupting again with control-C; this will help diagnose infinite loops.
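
If an interactive gdb session is awkward, repeated backtraces can also be captured in batch mode from a shell; a sketch, again with 12345 as a placeholder pid:

# take three full backtraces a few seconds apart for later comparison
for i in 1 2 3; do
    gdb -p 12345 -batch \
        -ex 'set pagination off' \
        -ex 'thread apply all bt full' > "backtrace.$i.txt" 2>&1
    sleep 5
done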

If there is enough disk space somewhere on the system, preferably outside the partition/file system that the PGDATA directory is on, you should also run gcore /path/to/somewhere/with/space/postgres.PROCESS_ID to write a core file for the problem postgres backend. You'll need at least shared_buffers worth of space or the RSS of the postgres backend, whichever is greater, but it's hard to say exactly how much more. Note that you cannot usefully copy the core to another machine for analysis (you'd have to copy all the library binaries PostgreSQL is linked to, and all their debuginfo, too). Just keep the core for later examination.
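
A sketch of that, assuming /var/tmp is on a file system with enough free space and 12345 is the pid of the stuck backend:

df -h /var/tmp                      # check free space first
gcore -o /var/tmp/postgres 12345    # writes /var/tmp/postgres.12345

Alternatively, from a gdb session that is already attached to the backend, gcore /var/tmp/postgres.12345 at the gdb prompt does the same thing.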

To try to get the stuck postgres backend to exit cleanly(ish), force it to process interrupts by running, at the gdb prompt:

p ProcessInterrupts()

Note that this runs the risk of crashing the PostgreSQL backend. The process might terminate uncleanly with a SIGABRT (signal 6), SIGSEGV (signal 11, segmentation fault), etc., if it wasn't in a safe state to exit cleanly. This will force a restart, at which point PostgreSQL will begin crash recovery from the last checkpoint, causing an outage until recovery is complete.
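
If the call returns rather than the backend exiting or crashing, one way to finish the gdb session so the backend resumes running is, at the gdb prompt (a sketch):

detach
quit

detach resumes the backend and leaves it running; quit then exits gdb.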

If this also fails to stop the postgres process, then all graceful recovery methods have been exhausted. All related applications should be gracefully stopped before ungracefully stopping the postgres process with a kill -9 on the unresponsive pid. After the stuck postgres backend has been killed, PostgreSQL should restart itself, enter recovery, and then come back up for service.

If it fails to do so, shut it down with pg_ctl -m immediate stop or by sending kill -QUIT to the postmaster. If that fails, kill -KILL all postgres processes. Once postgres is fully shut down, restart it using normal methods and let it finish recovery, then come up normally.
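
A sketch of that last-resort sequence, with 12345 as the stuck backend's pid and assuming PGDATA is set in the postgres user's environment (the systemd unit name is only an example):

# last resort for the single stuck backend; the postmaster will then restart and run crash recovery
kill -9 12345
# only if postgres does not recover on its own afterwards:
pg_ctl -D "$PGDATA" -m immediate stop
# and only if that also fails:
pkill -KILL -u postgres postgres
# once fully shut down, start postgres again by whatever method is normal for the system, e.g.:
sudo systemctl start postgresql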

Notes on debug info

In most cases that means you need to install debuginfo packages. These are usually packages named with the -debug, -debuginfo or -dbg suffixes, depending on distro and version. For source builds of PostgreSQL you must have built with --enable-debug. Building PostgreSQL with --enable-debug has no effect on performance and should always be done for all production or test deployments; there is no good reason to ever leave it out.
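
For example, for a source build (the --prefix shown here is only a placeholder; keep whatever configure options you already use and just add --enable-debug):

./configure --enable-debug --prefix=/usr/local/pgsql
make && make install

# packaged builds: install the matching debuginfo package instead, e.g. on RHEL-family systems
# (the exact package name varies by distro, PostgreSQL version and packaging):
sudo dnf debuginfo-install postgresql15-server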

You cannot rebuild PostgreSQL with --enable-debug and use debuginfo from those binaries to examine non-debug binaries. Even if it's exactly the same git revision with all the other build flags the same, it will not work. Similarly, "same minor version" is not good enough for debuginfo packages; the match must be exact. And unfortunately yum.postgresql.org deletes debuginfo for old minor versions quite quickly...

Diagnostic Steps

To cover a reasonable set of possible causes, you should try to do the following (a consolidated collection sketch follows this list):

  • Capture pg_stat_activity, pg_locks and pg_stat_replication for the stuck backend or preferably all backends:
  • \copy pg_catalog.pg_stat_activity to 'pg_stat_activity.csv' with (format csv, header)
  • \copy pg_catalog.pg_locks to 'pg_locks.csv' with (format csv, header)
  • \copy pg_catalog.pg_stat_replication to 'pg_stat_replication.csv' with (format csv, header)
  • Check the state and wchan fields in ps for the process(es). See related KB for details. You may need root, or at least to run as the same user as postgres, to see the wchan fields, and supported ps arguments can vary a bit. Save the output for later and run it a few times if possible. In short, assuming postgres runs as user postgres:
  • sudo ps -u postgres -H --forest -o pid,ppid,state,wchan:40,cmd
  • Check dmesg to see if there are signs of I/O errors, filesystem errors, CPU or memory errors etc. that may result in system calls not returning or spending excessively long times blocked in the kernel. dmesg arguments vary, but dmesg -T is preferable; use plain dmesg if that doesn't work.
  • Get a backtrace in gdb if possible, or even better a repeated set of them. Instructions included in workaround steps above.
  • Save a core file from the problem backend with gcore if possible. This works even if you don't have debuginfo; you can analyse it later if you can find debuginfo for the binaries afterwards. Details in workaround steps above.
  • Check if the database contains procedural languages other than pl/pgsql. If it does, look at the function definitions and see if there are calls to external libraries. Check pg_stat_activity to see if there's any sign the stuck backend might be running in one now.
  • Check the PostgreSQL logs for errors relating to stuck spinlocks, I/O errors, etc
  • Save journalctl output relating to postgres if on a systemd system with something like:
  • systemctl |grep postgres to get the postgres service name(s); then
  • sudo journalctl -u $the_service_name for basic messages; and
  • sudo journalctl -o json-pretty -u $the_service_name > $the_services_name-journal.json for detailed output
  • Obviously capture all available PostgreSQL logs, preferably in both csv and text formats if both are available
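
As referenced above, a rough consolidation of those collection steps as a shell sketch. The output directory, the systemd unit name and the connection assumptions (psql connecting as a superuser without extra options) are placeholders; adjust them to your environment:

#!/bin/bash
# rough diagnostic collection sketch; run as a user that can query postgres and read /proc
OUT=/var/tmp/pg-stuck-$(date +%Y%m%d%H%M%S)
mkdir -p "$OUT"
psql -c "\copy pg_catalog.pg_stat_activity to '$OUT/pg_stat_activity.csv' with (format csv, header)"
psql -c "\copy pg_catalog.pg_locks to '$OUT/pg_locks.csv' with (format csv, header)"
psql -c "\copy pg_catalog.pg_stat_replication to '$OUT/pg_stat_replication.csv' with (format csv, header)"
sudo ps -u postgres -H --forest -o pid,ppid,state,wchan:40,cmd > "$OUT/ps.txt"
dmesg -T > "$OUT/dmesg.txt" 2>&1 || dmesg > "$OUT/dmesg.txt"
# adjust the unit name to whatever systemctl | grep postgres reports
sudo journalctl -u postgresql > "$OUT/journal.txt"
sudo journalctl -o json-pretty -u postgresql > "$OUT/journal.json"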

The data collector can capture some, but not all, of the above.

In many cases you can only tell exactly what's happening by attaching gdb (with debug symbols for postgres and preferably libc and other libs linked into the program) and then running bt. Where this isn't possible, ps's wchan and state fields can sometimes offer some insight, as can strace. In some cases systemtap, perf or eBPF tools may also shed some light, but these require more expertise to use and interpret.

It's uncommon for a simple perf record and perf report or a perf top to shed much light, as in default operation it'll only show CPU activity. But it can help spot some kinds of busy-loops.
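
If you do want to try it, a minimal sketch (12345 is a placeholder pid; perf must be installed and typically needs root):

sudo perf top -p 12345                       # live view; only useful if the backend is burning CPU
sudo perf record -g -p 12345 -- sleep 30     # sample the backend for 30 seconds
sudo perf report                             # inspect where the on-CPU time went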

See also tools for performance insight and tracing.

Note: If you're going to file a report for a developer, please capture bt full and, if disk space permits, also save a core file for the problem postgres process with gcore.

Root Cause

The postgres backend process is waiting within a routine that is not responding to interrupt signals. What that routine is can vary, and it is not necessarily directly in the PostgreSQL code.

This may be library code, procedural language code, operating system / C library code, or system calls into the kernel.

Possible causes include:

  • Loops within PostgreSQL or PostgreSQL extension code that fail to respect PostgreSQL's interrupt handling mechanisms and signal handlers:
  • PostgreSQL native C code running in a tight busy-loop that lacks any CHECK_FOR_INTERRUPTS() call, so it fails to notice the variables set by PostgreSQL's signal handlers when they're run. In this case you'll see a backend with very high CPU use.
  • PostgreSQL native C code running in a loop over WaitLatchOrSocket, WaitLatch or WaitEventSetWait where the loop lacks any CHECK_FOR_INTERRUPTS() call. Again it'll fail to notice that it's been killed, but in this case it'll typically have very low CPU use and spend most of its time in the sleep state. You might see a syscall like epoll_wait in strace. The ps wchan field will probably just show "-", as it doesn't seem to report epoll waits.
  • PostgreSQL native C code that invokes a system call in a retry loop that checks for EINTR and retries without calling CHECK_FOR_INTERRUPTS() first. You might see the system call in ps's wchan field.
  • For backends that fail to exit even when the postmaster exits, the backend might be running in a wait loop that fails to pass the WL_POSTMASTER_DEATH flag to WaitLatch, WaitLatchOrSocket or WaitEventSetWait, and/or fails to handle it in the loop with if (rc & WL_POSTMASTER_DEATH) { proc_exit(1); } like it should.
  • A PostgreSQL backend that is stuck in an LWLockAcquire(...) call on a PostgreSQL LWLock that another backend is failing to release, or in some cases an LWLock already held by the same backend. (LWLocks are not re-entrant, so re-acquiring an already held lock will deadlock as the backend blocks on itself.) This is easy to diagnose from gdb output. If the issue is repeatable, you may be able to diagnose it with perf dynamic user-space probes placed on LWLockAcquire and LWLockRelease, provided you have debuginfo and your perf is new enough to capture function call arguments. (Whether LWLockAcquire waits respond to cancellation signals may vary; confirm from the gdb backtrace.)
  • A backend stuck in an uninterruptible system call (made by PostgreSQL itself, or by a library used by PostgreSQL) where control does not return to PostgreSQL code in order to check for interrupts. If a backend stays in the D state in ps for extended periods this may be the cause, but it could also just be doing lots of I/O. Check dmesg output for possible I/O errors, and check the wchan field in ps to see which system call the process is waiting in, or use strace or gdb.
  • A backend that has called into a non-PostgreSQL library or language runtime that does not promptly return control to PostgreSQL-aware code. For example, a procedural language function that invokes some external library's own event loop and sets up its own signal handling that isn't aware of the PostgreSQL and PL/whatever signal handling.
  • A PostgreSQL extension, or a library used by an extension, where the extension has overwritten PostgreSQL's signal handlers. That can happen when running as a background worker or as a regular function in a user backend. If the InterruptPending global is not set to true when you send a SIGTERM, this might be the issue; you can observe the signal handler in gdb to be sure (see the sketch after this list).
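
As mentioned in the last item, a sketch of how the backend's interrupt handling can be inspected from a shell, assuming debug symbols are available and 12345 is the stuck backend's pid:

# send the terminate signal, then see whether the backend's interrupt flags were set
kill -TERM 12345
gdb -p 12345 -batch \
    -ex 'p InterruptPending' \
    -ex 'p ProcDiePending'
# both still 0 suggests the signal handler never ran (e.g. it was overwritten);
# 1 suggests the flags were set but nothing is reaching a CHECK_FOR_INTERRUPTS() call

# which signals the process currently catches, blocks or ignores (bitmasks):
grep -E '^Sig(Cgt|Blk|Ign)' /proc/12345/status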
