Collecting backtrace on the startup of Postgres server

Luciano Botti
Luciano Botti

Sometimes you end up with a server which does not start up completely by the presence of some sort of corruption. In cases like this having the postmaster's backtrace during the starting phase would be highly valuable.

If you are facing a situation similar to this and you need to collect that backtrace, this solution guide would be the right place to start.

Pre-requisites

Postgresql server should have installed the *-debuginfo packages as well (for Red Hat / Centos / Fedora) or the *-dbgsym/ *-dbg packages ( for Debian / Ubuntu). See related Knowledge Base articles for more information about.

You will also need the gdb package installed.

Collecting debug data

The postgres service must be launched by gdb to properly collect the backtrace before the crash and gdb would have to be set to follow the child process of the server.

1) Preparing gdb commands

First we need to create a cmds file with some gdb commands that will be used later:

cat > cmds << EOF
set logging file gdb.txt
set logging on
set pagination off
set confirm off
set detach-on-fork off
set schedule-multiple on
break AuxiliaryProcessMain
commands 1
watch InRecovery
bt full
break errfinish if (errordata[0].elevel >= 20)
commands 3
print errordata[0]
bt full
cont
end
cont
end
run
cont
quit
EOF

A few notes about the GDB commands configured above:

  • LINE 5: It is important to set detach-on-fork off so both processes (parent and child) will be held under the control of GDB and capture the backtrace needed.
  • LINE 7: We set a breakpoint on AuxiliaryProcessMain, the function that bumps up the "startup" process. When that function is hit, it will continue with the group of commands in commands 1.
  • LINE 9: A watchpoint for the variable InRecovery is set. So this will run forever until the variable InRecovery changes. As this will happen only during the startup process when WALs are being replayed, we assure to capture any posible error during this phase.
  • LINE 10: A full backtrace will be collected after setting the watchpoint.
  • LINE 11: As we are in the startup process and WALs are being replayed, we need to set another breakpoint on the errfinish function so we can capture the backtrace only on an error situation. Note that we are interested only in Error Level >= 20 (ERROR=20, FATAL=21, PANIC=22).
  • LINES 12 to 16: When the errfinish function with the correct parameter is hit, the errordata variable will be printed to the logging file along with the full backtrace.
  • LINE 17: We use cont here so GBD will resume and continue the execution until the next breakpoint(errorfinish) or watchpoint (InRecovery)
  • LINE 19-21: The execution of GDB will be launched by issuing the run and the cont commands. After the full execution of the process GDB will return to the console so we are using quit to end GDB process and get again into the shell.

2) Attach gdb to postmaster process

Execute gdb attaching postmaster process using the cmds file created before:

gdb postmaster -x cmds

Note that in this example gdb will break by reaching the errfinish function on a conditional expression. You can change this behavior according to your needs.

This last step will end up with a gdb.txt file with the collected backtrace of the failing startup of the postmaster process. For example:

Breakpoint 1 at 0x507930: file bootstrap.c, line 195.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
2019-11-29 20:05:43 UTC [@//:17437]: [1] LOG: loading BDR 3.6.10 (030610)
2019-11-29 20:05:43 UTC [@//:17437]: [2] LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-11-29 20:05:43 UTC [@//:17437]: [3] LOG: listening on IPv6 address "::", port 5432
2019-11-29 20:05:43 UTC [@//:17437]: [4] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2019-11-29 20:05:43 UTC [@//:17437]: [5] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-11-29 20:05:43 UTC [@//:17437]: [6] LOG: ending log output to stderr
2019-11-29 20:05:43 UTC [@//:17437]: [7] HINT: Future log output will go to log destination "syslog".
[New process 17441]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Switching to Thread 0x7ffff7fda8c0 (LWP 17441)]

Breakpoint 1, AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fffffffe3e0) at bootstrap.c:195
195 bootstrap.c: No such file or directory.
Hardware watchpoint 2: InRecovery
#0 AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fffffffe3e0) at bootstrap.c:195
progname = 0x1 <Address 0x1 out of bounds>
flag = <optimized out>
userDoption = <optimized out>
__func__ = "AuxiliaryProcessMain"
#1 0x00000000006b0850 in StartChildProcess (type=StartupProcess) at postmaster.c:5367
pid = <optimized out>
av = {0x86b115 "postgres", 0x7fffffffe430 "-x2", 0x0, 0x7fffffffe430 "-x2", 0x0, 0x1 <Address 0x1 out of bounds>, 0x7fffffffe440 "\001", 0x7ffff7ffb84c "H\213K\bH\272\317\367S\343\245\233\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a[1\300A\\]\303\213\005\337\310\377\377[A\211\004$\213\005\330\310\377\377A\211D$\004\061\300A\\]\303f\017\037\204", 0x0, 0x7ffff7ffb84c "H\213K\bH\272\317\367S\343\245\233\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a[1\300A\\]\303\213\005\337\310\377\377[A\211\004$\213\005\330\310\377\377A\211D$\004\061\300A\\]\303f\017\037\204"}
ac = 2
typebuf = "-x2\000\000\000\000\000\000S\307\022\036\\\253\b\001", '\000' <repeats 14 times>
#2 0x00000000006b3169 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xd1da60) at postmaster.c:1356
opt = <optimized out>
status = <optimized out>
userDoption = <optimized out>
listen_addr_saved = <optimized out>
i = <optimized out>
output_config_variable = <optimized out>
__func__ = "PostmasterMain"
#3 0x000000000047c1bf in main (argc=1, argv=0xd1da60) at main.c:228
No locals.
Breakpoint 3 at 0x81d5a0: errfinish. (2 locations)

Breakpoint 3, errfinish (dummy=0) at elog.c:414
414 elog.c: No such file or directory.
$1 = {elevel = 22, output_to_server = 1 '\001', output_to_client = 0 '\000', show_funcname = 0 '\000', hide_stmt = 0 '\000', hide_ctx = 0 '\000', filename = 0x86a375 "xlog.c", lineno = 6601, funcname = 0x872499 <__func__.26603> "StartupXLOG", domain = 0x964bc9 "postgres-10", context_domain = 0x964bc9 "postgres-10", sqlerrcode = 2600, message = 0xd20438 "could not locate a valid checkpoint record", detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0, message_id = 0x870cc8 "could not locate a valid checkpoint record", schema_name = 0x0, table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, assoc_context = 0xd1dbd8}
#0 errfinish (dummy=0) at elog.c:414
edata = <optimized out>
elevel = <optimized out>
oldcontext = <optimized out>
econtext = <optimized out>
__func__ = "errfinish"
#1 0x00000000004fb8fd in StartupXLOG () at xlog.c:6283
Insert = <optimized out>
checkPoint = <optimized out>
wasShutdown = <optimized out>
reachedStopPoint = <optimized out>
haveBackupLabel = <optimized out>
haveTblspcMap = <optimized out>
RecPtr = <optimized out>
checkPointLoc = <optimized out>
EndOfLog = <optimized out>
EndOfLogTLI = <optimized out>
PrevTimeLineID = <optimized out>
record = <optimized out>
oldestActiveXID = <optimized out>
backupEndRequired = <optimized out>
backupFromStandby = <optimized out>
dbstate_at_startup = <optimized out>
xlogreader = <optimized out>
private = {emode = 15, fetching_ckpt = 1 '\001', randAccess = 1 '\001'}
fast_promoted = <optimized out>
st = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0}, __unused = {0, 0, 0}}
__func__ = "StartupXLOG"
#2 0x00000000006b3731 in StartupProcessMain () at startup.c:211
No locals.
#3 0x0000000000507d3a in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fffffffe3e0) at bootstrap.c:426
progname = 0x86b115 "postgres"
flag = <optimized out>
userDoption = 0x0
__func__ = "AuxiliaryProcessMain"
#4 0x00000000006b0850 in StartChildProcess (type=StartupProcess) at postmaster.c:5367
pid = <optimized out>
av = {0x86b115 "postgres", 0x7fffffffe430 "-x2", 0x0, 0x7fffffffe430 "-x2", 0x0, 0x1 <Address 0x1 out of bounds>, 0x7fffffffe440 "\001", 0x7ffff7ffb84c "H\213K\bH\272\317\367S\343\245\233\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a[1\300A\\]\303\213\005\337\310\377\377[A\211\004$\213\005\330\310\377\377A\211D$\004\061\300A\\]\303f\017\037\204", 0x0, 0x7ffff7ffb84c "H\213K\bH\272\317\367S\343\245\233\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a[1\300A\\]\303\213\005\337\310\377\377[A\211\004$\213\005\330\310\377\377A\211D$\004\061\300A\\]\303f\017\037\204"}
ac = 2
typebuf = "-x2\000\000\000\000\000\000S\307\022\036\\\253\b\001", '\000' <repeats 14 times>
#5 0x00000000006b3169 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xd1da60) at postmaster.c:1356
opt = <optimized out>
status = <optimized out>
userDoption = <optimized out>
listen_addr_saved = <optimized out>
i = <optimized out>
output_config_variable = <optimized out>
__func__ = "PostmasterMain"
#6 0x000000000047c1bf in main (argc=1, argv=0xd1da60) at main.c:228
No locals.

Program received signal SIGABRT, Aborted.
0x00007ffff5681207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);

In this test case scenario, our postgres server is getting a PANIC error "could not locate a valid checkpoint record" and gets the SIGABRT signal, ending the postgres process and the gdb execution.

Final recommendations

As a final recommendation, reading the GDB Documentation it is highly advisable as it contains more information of all GDB commands used here and many more: GDB Online Documentation

Also reading and understanding the Postgres source code is most welcome and could be needed for other situations. For example if you need to change the breakpoint to any other function in the code instead of the errfinishfunction.

Was this article helpful?

0 out of 0 found this helpful