Diagnosing high activity of startup process on a standby system / slow replay

Mark Wong
Mark Wong

Replication lag is increasing. Standby displaying bursts of heavy read activity on data volume, startup process is using 100% CPU or I/O despite only moderate load on master

The standby server is taking an extraordinary amount of time processing WAL files. We can examine the content of the WAL files to determine if there are any particular locations that are taking more time to process. Overall system load information and an indication of whether it's bottlenecked on disk or CPU can be gained from vmstat, sar and top.

A good starting point for examining WAL is the pg_waldump (pg_xlogdump on 9.6-) tool's --stats option, which is available since PostgreSQL 9.5.

For diagnostic purposes the perf tool is invaluable, especially when used with Brendan Gregg's perf tools like the perf heat map. On newer systems eBPF tools may help with in depth analysis.

There are also scripts in the internal supporttools repository, under the xpoller directory that will help analyse xlog access patterns. See the README.

For issues related to read I/O during apply, the fast_redo extension (previously pg_xlog_prefetch) may be helpful. PostgreSQL's startup process does single-threaded I/O with no lookahead on WAL, so whenever it has to do something like read an index block to modify it and write it back out, it will block. This adds up fast, and a bit of intelligent lookahead in WAL makes a huge difference.

Was this article helpful?

0 out of 0 found this helpful