walsender timeout errors in BDR/pglogical/logical replication

Craig Ringer

On PostgreSQL point releases older than those listed under "Fixed in" below, logical decoding sessions can be aborted by spurious walsender timeouts.

A bug (now fixed and backported) could cause logical decoding sessions to be terminated incorrectly with:

ERROR: terminating walsender process due to replication timeout

BDR and pglogical subscribers may report:

ERROR: connection to other side has died

or, for some older versions:

ERROR: epoll_ctl() failed: Invalid argument

This occurs when the replica keeps the walsender so busy that it never executes the slow path that updates the timeout timer.

Fixed in

  • 10.2
  • 9.6.7
  • 9.5.11
  • 9.4.16
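As a rough aid for checking a server against the list above, the following sketch compares the value of `SHOW server_version_num` with the first fixed minor release of each major version. The function names are hypothetical, and the table covers only the releases this article lists; for any other major version the sketch refuses to guess.

```python
# First minor release carrying the fix, keyed by major version.
# Taken from the "Fixed in" list; other major versions are deliberately absent.
FIXED_IN = {"10": 2, "9.6": 7, "9.5": 11, "9.4": 16}


def split_version_num(version_num):
    """Split an integer server_version_num into (major string, minor int).

    PostgreSQL 10 and later encode the version as major * 10000 + minor
    (10.2 -> 100002); 9.x encodes it as 9 * 10000 + second * 100 + minor
    (9.6.7 -> 90607).
    """
    if version_num >= 100000:
        return str(version_num // 10000), version_num % 10000
    prefix, minor = divmod(version_num, 100)
    return f"{prefix // 100}.{prefix % 100}", minor


def has_fix(version_num):
    """Return True if this point release is at or past the listed fix."""
    major, minor = split_version_num(version_num)
    if major not in FIXED_IN:
        # Not covered by the article's list, so no answer is given.
        raise ValueError(f"no fix release listed for PostgreSQL {major}")
    return minor >= FIXED_IN[major]
```

For example, `has_fix(90606)` reports that 9.6.6 still needs the update, while `has_fix(90607)` reports that 9.6.7 has it.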

This bug can interact with reorder buffer bugs (see related articles links) to cause logical decoding sessions to replay transactions multiple times, appear to get stuck, produce spurious conflicts, etc.

On an unfixed release the problem will probably recur once it has been seen, but setting an extremely high wal_sender_timeout (many hours) is generally sufficient to work around it. The side effect is that replication will no longer promptly detect problems caused by subscriber crashes, network interruptions, etc., so manual intervention may be needed to pg_terminate_backend the walsender in those cases.
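The workaround can be sketched as a small helper that builds the SQL an operator might run as a superuser on the upstream (provider) node. This is only an illustration: the function names and the default of `12h` are assumptions, not values recommended by this article, and the statements are returned as strings rather than executed.

```python
import re

# Accept only a plain integer followed by a PostgreSQL time unit.
_TIMEOUT_RE = re.compile(r"\d+(ms|s|min|h|d)")

# Lists the active walsenders so the right pid can be chosen by hand.
LIST_WALSENDERS = (
    "SELECT pid, application_name, client_addr, state "
    "FROM pg_stat_replication;"
)


def raise_timeout_statements(timeout="12h"):
    """Build SQL that raises wal_sender_timeout and reloads the config.

    The default of 12h is an illustrative assumption.
    """
    if not _TIMEOUT_RE.fullmatch(timeout):
        raise ValueError(f"unexpected timeout value: {timeout!r}")
    return [
        f"ALTER SYSTEM SET wal_sender_timeout = '{timeout}';",
        "SELECT pg_reload_conf();",
    ]


def terminate_walsender_statement(pid):
    """Build SQL that terminates one walsender, identified by its pid."""
    return f"SELECT pg_terminate_backend({int(pid)});"
```

With the timeout raised, a walsender whose subscriber has crashed or become unreachable can linger, so `LIST_WALSENDERS` followed by `terminate_walsender_statement(pid)` covers the manual cleanup step.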

See commit, mailing list discussion.

Related to
