Sometimes you end up with a server which does not start up completely by the presence of some sort of corruption. In cases like this having the postmaster's backtrace during the starting phase would be highly valuable.
If you are facing a situation similar to this and you need to collect that backtrace, this solution guide would be the right place to start.
Postgresql server should have installed the *-debuginfo
packages as well (for Red Hat / Centos / Fedora) or the *-dbgsym
/ *-dbg
packages ( for Debian / Ubuntu). See related Knowledge Base articles for more information about.
You will also need the gdb
package installed.
The postgres service must be launched by gdb
to properly collect the backtrace before the crash and gdb
would have to be set to follow the child process of the server.
1) Preparing gdb commands
First we need to create a cmds
file with some gdb
commands that will be used later:
cat > cmds << EOF
set logging file gdb.txt
set logging on
set pagination off
set confirm off
set detach-on-fork off
set schedule-multiple on
break AuxiliaryProcessMain
commands 1
watch InRecovery
bt full
break errfinish if (errordata[0].elevel >= 20)
commands 3
print errordata[0]
bt full
cont
end
cont
end
run
cont
quit
EOF
A few notes about the GDB commands configured above:
- LINE 5: It is important to set
detach-on-fork off
so both processes (parent and child) will be held under the control of GDB and capture the backtrace needed. - LINE 7: We set a breakpoint on AuxiliaryProcessMain, the function that bumps
up the "startup" process. When that function is hit, it will continue with the group
of commands in
commands 1
. - LINE 9: A watchpoint for the variable
InRecovery
is set. So this will run forever until the variableInRecovery
changes. As this will happen only during the startup process when WALs are being replayed, we assure to capture any posible error during this phase. - LINE 10: A full backtrace will be collected after setting the watchpoint.
- LINE 11: As we are in the startup process and WALs are being replayed, we need
to set another breakpoint on the
errfinish
function so we can capture the backtrace only on an error situation. Note that we are interested only in Error Level >= 20 (ERROR=20, FATAL=21, PANIC=22). - LINES 12 to 16: When the
errfinish
function with the correct parameter is hit, theerrordata
variable will be printed to the logging file along with the full backtrace. - LINE 17: We use
cont
here so GBD will resume and continue the execution until the next breakpoint(errorfinish
) or watchpoint (InRecovery
) - LINE 19-21: The execution of GDB will be launched by issuing the
run
and thecont
commands. After the full execution of the process GDB will return to the console so we are usingquit
to end GDB process and get again into the shell.
2) Attach gdb to postmaster process
Execute gdb
attaching postmaster
process using the cmds
file created before:
gdb postmaster -x cmds
Note that in this example gdb
will break by reaching the errfinish
function on a conditional expression. You can change this behavior according to your needs.
This last step will end up with a gdb.txt
file with the collected backtrace of the failing startup of the postmaster process. For example:
Breakpoint 1 at 0x507930: file bootstrap.c, line 195.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
2019-11-29 20:05:43 UTC [@//:17437]: [1] LOG: loading BDR 3.6.10 (030610)
2019-11-29 20:05:43 UTC [@//:17437]: [2] LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-11-29 20:05:43 UTC [@//:17437]: [3] LOG: listening on IPv6 address "::", port 5432
2019-11-29 20:05:43 UTC [@//:17437]: [4] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2019-11-29 20:05:43 UTC [@//:17437]: [5] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-11-29 20:05:43 UTC [@//:17437]: [6] LOG: ending log output to stderr
2019-11-29 20:05:43 UTC [@//:17437]: [7] HINT: Future log output will go to log destination "syslog".
[New process 17441]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Switching to Thread 0x7ffff7fda8c0 (LWP 17441)]
Breakpoint 1, AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fffffffe3e0) at bootstrap.c:195
195 bootstrap.c: No such file or directory.
Hardware watchpoint 2: InRecovery
#0 AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fffffffe3e0) at bootstrap.c:195
progname = 0x1 <Address 0x1 out of bounds>
flag = <optimized out>
userDoption = <optimized out>
__func__ = "AuxiliaryProcessMain"
#1 0x00000000006b0850 in StartChildProcess (type=StartupProcess) at postmaster.c:5367
pid = <optimized out>
av = {0x86b115 "postgres", 0x7fffffffe430 "-x2", 0x0, 0x7fffffffe430 "-x2", 0x0, 0x1 <Address 0x1 out of bounds>, 0x7fffffffe440 "\001", 0x7ffff7ffb84c "H\213K\bH\272\317\367S\343\245\233\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a[1\300A\\]\303\213\005\337\310\377\377[A\211\004$\213\005\330\310\377\377A\211D$\004\061\300A\\]\303f\017\037\204", 0x0, 0x7ffff7ffb84c "H\213K\bH\272\317\367S\343\245\233\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a[1\300A\\]\303\213\005\337\310\377\377[A\211\004$\213\005\330\310\377\377A\211D$\004\061\300A\\]\303f\017\037\204"}
ac = 2
typebuf = "-x2\000\000\000\000\000\000S\307\022\036\\\253\b\001", '\000' <repeats 14 times>
#2 0x00000000006b3169 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xd1da60) at postmaster.c:1356
opt = <optimized out>
status = <optimized out>
userDoption = <optimized out>
listen_addr_saved = <optimized out>
i = <optimized out>
output_config_variable = <optimized out>
__func__ = "PostmasterMain"
#3 0x000000000047c1bf in main (argc=1, argv=0xd1da60) at main.c:228
No locals.
Breakpoint 3 at 0x81d5a0: errfinish. (2 locations)
Breakpoint 3, errfinish (dummy=0) at elog.c:414
414 elog.c: No such file or directory.
$1 = {elevel = 22, output_to_server = 1 '\001', output_to_client = 0 '\000', show_funcname = 0 '\000', hide_stmt = 0 '\000', hide_ctx = 0 '\000', filename = 0x86a375 "xlog.c", lineno = 6601, funcname = 0x872499 <__func__.26603> "StartupXLOG", domain = 0x964bc9 "postgres-10", context_domain = 0x964bc9 "postgres-10", sqlerrcode = 2600, message = 0xd20438 "could not locate a valid checkpoint record", detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0, message_id = 0x870cc8 "could not locate a valid checkpoint record", schema_name = 0x0, table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, assoc_context = 0xd1dbd8}
#0 errfinish (dummy=0) at elog.c:414
edata = <optimized out>
elevel = <optimized out>
oldcontext = <optimized out>
econtext = <optimized out>
__func__ = "errfinish"
#1 0x00000000004fb8fd in StartupXLOG () at xlog.c:6283
Insert = <optimized out>
checkPoint = <optimized out>
wasShutdown = <optimized out>
reachedStopPoint = <optimized out>
haveBackupLabel = <optimized out>
haveTblspcMap = <optimized out>
RecPtr = <optimized out>
checkPointLoc = <optimized out>
EndOfLog = <optimized out>
EndOfLogTLI = <optimized out>
PrevTimeLineID = <optimized out>
record = <optimized out>
oldestActiveXID = <optimized out>
backupEndRequired = <optimized out>
backupFromStandby = <optimized out>
dbstate_at_startup = <optimized out>
xlogreader = <optimized out>
private = {emode = 15, fetching_ckpt = 1 '\001', randAccess = 1 '\001'}
fast_promoted = <optimized out>
st = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0}, __unused = {0, 0, 0}}
__func__ = "StartupXLOG"
#2 0x00000000006b3731 in StartupProcessMain () at startup.c:211
No locals.
#3 0x0000000000507d3a in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fffffffe3e0) at bootstrap.c:426
progname = 0x86b115 "postgres"
flag = <optimized out>
userDoption = 0x0
__func__ = "AuxiliaryProcessMain"
#4 0x00000000006b0850 in StartChildProcess (type=StartupProcess) at postmaster.c:5367
pid = <optimized out>
av = {0x86b115 "postgres", 0x7fffffffe430 "-x2", 0x0, 0x7fffffffe430 "-x2", 0x0, 0x1 <Address 0x1 out of bounds>, 0x7fffffffe440 "\001", 0x7ffff7ffb84c "H\213K\bH\272\317\367S\343\245\233\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a[1\300A\\]\303\213\005\337\310\377\377[A\211\004$\213\005\330\310\377\377A\211D$\004\061\300A\\]\303f\017\037\204", 0x0, 0x7ffff7ffb84c "H\213K\bH\272\317\367S\343\245\233\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a[1\300A\\]\303\213\005\337\310\377\377[A\211\004$\213\005\330\310\377\377A\211D$\004\061\300A\\]\303f\017\037\204"}
ac = 2
typebuf = "-x2\000\000\000\000\000\000S\307\022\036\\\253\b\001", '\000' <repeats 14 times>
#5 0x00000000006b3169 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xd1da60) at postmaster.c:1356
opt = <optimized out>
status = <optimized out>
userDoption = <optimized out>
listen_addr_saved = <optimized out>
i = <optimized out>
output_config_variable = <optimized out>
__func__ = "PostmasterMain"
#6 0x000000000047c1bf in main (argc=1, argv=0xd1da60) at main.c:228
No locals.
Program received signal SIGABRT, Aborted.
0x00007ffff5681207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
In this test case scenario, our postgres server is getting a PANIC error "could not locate a valid checkpoint record" and gets the SIGABRT signal, ending the postgres process and the gdb execution.
As a final recommendation, reading the GDB Documentation it is highly advisable as it contains more information of all GDB commands used here and many more: GDB Online Documentation
Also reading and understanding the Postgres source code is most welcome and could be needed for other situations. For example if you need to change the breakpoint
to any other function in the code instead of the errfinish
function.