Terminate hung postgres backends waiting for oracledblinks

Jakub Wartak
Jakub Wartak

How to terminate hung postgres backends waiting for oracledblinks.

Background

  • When sending data between Oracle & Postgres, hung backends can be caused by a combination of networking issues, the oracle libraries, and postgres. The stuck backends can hold back the xmin, and prevent vacuums on an object. The oracle library doesn't let any of the following kill the backends:

  • pg_terminate_backend

  • pg_cancel_backend

  • $ kill.

  • $ kill -9 should never be used, as it may kill the postgres cluster.

  • Restarting postgres removes the problem, but is a very inconvenient solution.

  • This article describes reproducing this and closing the backends, using a workaround of finding the associated file descriptor, and closing the file descriptor. This is the current recommended way to resolve this issue.

How to reproduce hung Oracle dblink

Tested on EPAS 11.x and 14.x (with edb_oci_dblink 1.0)

First, we are going to reproduce the same situation to understand what's happening:

In session sql1:

test1=# select pg_backend_pid();
pg_backend_pid
6617

test1=# select dbms_lock.sleep(300) from global_name@dblink3;

In session SSH2:

[root@rockyora ~]# netstat -ntpo | grep 6617 ## connection from PID=6617 over localhost on port 54874 ($PORT)
tcp 0 0 127.0.0.1:54874 127.0.0.1:1521 ESTABLISHED 6617/postgres: ente off (0.00/0/0)
[root@rockyora ~]# iptables -I INPUT -p tcp --dport 54874 -j DROP
[root@rockyora ~]# iptables -I INPUT -p tcp --sport 54874 -j DROP

Connection is now blocked both ways and nothing is working :

test1=# select dbms_lock.sleep(300) from global_name@dblink3;
^CCancel request sent
ERROR: canceling statement due to user request
^CCancel request sent
^CCancel request sent
^CCancel request sent
^CCancel request sent
# this is also not working:
[root@rockyora ~]# kill -TERM 6617
[root@rockyora ~]#

That sleep(300) can be blocked here almost infinitely (days). The problem on PostgreSQL side might be backend_xid that is going to be kept for this session might block VACUUM from performing removal of deadrows and/or preventing XID wraparounds. The only possible choice of killing that session was to kill -9 the PID (do NOT try that as it is going to crash whole PostgreSQL/EPAS instance and cause outage)

Root-cause

The EPAS/PostgreSQL session hangs on Oracle (Instant) Client which waits on read() syscall:

(gdb) where
#0 0x00007f2c35d63a82 in read () from /lib64/libpthread.so.0
#1 0x00007f2bdf8a7c04 in snttread () from /u01/app/oracle/product/11.2.0/xe/lib/libclntsh.so
#2 0x00007f2bdf8a6b95 in ?? () from /u01/app/oracle/product/11.2.0/xe/lib/libclntsh.so
[..]
#10 0x00007f2bddd3e2c4 in upirtr () from /u01/app/oracle/product/11.2.0/xe/lib/libclntsh.so
#11 0x00007f2bde8de1dc in kputxabt () from /u01/app/oracle/product/11.2.0/xe/lib/libclntsh.so
#12 0x00007f2bde97115f in OCITransRollback () from /u01/app/oracle/product/11.2.0/xe/lib/libclntsh.so
#13 0x00007f2be0113d67 in oci_endtrans () from /usr/edb/as11/lib/edb_dblink_oci.so
#14 0x00007f2be0113e60 in oci_xact_callback () from /usr/edb/as11/lib/edb_dblink_oci.so
#15 0x000000000050717c in AbortTransaction ()
#16 0x0000000000507eb7 in AbortCurrentTransaction ()
#17 0x00000000007cbbfc in PostgresMain ()
#18 0x000000000074f606 in ServerLoop ()
#19 0x000000000075052d in PostmasterMain ()
#20 0x00000000004889af in main ()

while the proper EPAS/PostgreSQL SIGINT signal handler is blocked and handled by the Oracle sighandler instead (unable to escape read() on this malfunctioning socket), strace output:

read(14, <CTRL-C> 0x25812a6, 8208) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
-- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=9218, si_uid=992} --
rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x7f2bdefc5c14, sa_mask=~[ILL ABRT BUS FPE SEGV USR2 XCPU XFSZ SYS RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f2c35d64cf0}, {sa_handler=0x7f2bdefc5c14, sa_mask=~[ILL ABRT BUS FPE KILL SEGV USR2 STOP XCPU XFSZ SYS RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f2c35d64cf0}, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
rt_sigreturn({mask=[]}) = 0
read(14,

(note it entered read() again after being delivered SIGINT)

As such OCITransRollback() is not capable of finishing/aborting quickly, so it blocks (almost forever) due to:

  • missing socket configuration for DCD/TCD (native Oracle's keepalive and/or TCP keepalives in Oracle 12c+)
  • libclntsh having its own signal handlers and also masking signals

In addition it might take hours before the session is killed - if at all it works (due to how OS is configured, i.e. Linux is having by default 2h? TCP keepalives).

Injecting call (void *)exit(1) won't work, because exit() will try to again call AbortCurrentTransaction() even with overwriting memory for edb_oci_dblink's static bool xact_got_connection to 0x0 because there are "on_exit()" handlers present to close DBlink connections. So when calling exit() the following will happen: exit() -> proc_exit_prepare() -> RemoveTempRelationsCallback() -> AbortOutOfAnyTransactionInternal -> AbortTransaction() -> oci_xact_callback() oci_endtrans() -> OCITransRollback () , so it's broken at this point due to various cleanup handlers.

Please do not use SIGKILL as it crashes the whole DB.

Workaround for hung backend

First identify the FD number of this dblink connection for this specific PID (here it's 14):

[root@rockyora ~]# lsof -nPp 6617 | grep -e ^COMMAND -e IPv4
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
edb-postg 6617 enterprisedb 14u IPv4 144843 0t0 TCP 127.0.0.1:54874->127.0.0.1:1521 (ESTABLISHED)

(if there are many remote TCP connections, pick the fd for the one having :1521 remote port (Oracle port).)

Inject the code to force-close that fd=14u:

[root@rockyora ~]# gdb -q --batch -p 6617 -ex "call (int)close(14)"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f2c35d63a82 in read () from /lib64/libpthread.so.0
$1 = 0
[Inferior 1 (process 10450) detached]

This will immediately result in generating error in the DBLINK session and allow running AbortTransaction() with added benefit handling other signals (SIGTERM)//killing just the backend:

WARNING: OCI error: ORA-03113: end-of-file on communication channel
Process ID: 6620
Session ID: 9 Serial number: 95
FATAL: terminating connection due to administrator command

References

Case involved: https://techsupport.enterprisedb.com/rt/ticket/89752/

JIRA enhancement/patch proposal: https://enterprisedb.atlassian.net/browse/DB-2156

Was this article helpful?

0 out of 0 found this helpful