On non-current PostgreSQL point releases, logical decoding sessions can be aborted with false walsender timeouts.
A bug (now fixed and backported) could cause logical decoding sessions to be terminated incorrectly with:
ERROR: terminating walsender process due to replication timeout
BDR and pglogical subscribers may report:
ERROR: connection to other side has died
or, for some older versions:
ERROR: epoll_ctl() failed: Invalid argument
This occurs when the replica keeps the walsender busy, so the walsender never reaches the slow path that updates the timer.
Fixed in
- 10.2
- 9.6.7
- 9.5.11
- 9.4.16
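
To check whether a given server already includes the fix, a query along the following lines could be run on the upstream (provider) node. This is only a sketch: it compares server_version_num against the minor releases listed above, and it assumes that major versions newer than 10 also carry the fix. The column alias has_timeout_fix is made up for illustration.

```sql
-- Sketch: does this server include the walsender timeout fix?
-- Thresholds are the "Fixed in" releases above, encoded as server_version_num.
SELECT v AS server_version_num,
       CASE
           WHEN v >= 100000 THEN v >= 100002  -- 10.2 or later (assumes newer majors are fixed)
           WHEN v >= 90600  THEN v >= 90607   -- 9.6.7 or later
           WHEN v >= 90500  THEN v >= 90511   -- 9.5.11 or later
           WHEN v >= 90400  THEN v >= 90416   -- 9.4.16 or later
           ELSE NULL                          -- branch not covered by this article
       END AS has_timeout_fix
FROM (SELECT current_setting('server_version_num')::int AS v) AS s;
```

A result of false means the server is on an affected minor release and should be updated to at least the release listed for its branch.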
This bug can interact with reorder buffer bugs (see the related article links) to cause logical decoding sessions to replay transactions multiple times, appear to get stuck, produce spurious conflicts, etc.
On an affected release the problem will probably occur again, but setting an extremely high wal_sender_timeout (many hours) is generally sufficient to work around it. The side effect is that replication will no longer promptly detect problems caused by subscriber crashes, network interruptions, etc., so manual intervention with pg_terminate_backend may be needed to stop the walsender in those cases.
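
The workaround described above could be applied roughly as follows on the upstream node. Treat this as a sketch rather than a recommended procedure: the '12h' value and the pid 12345 are placeholders chosen for illustration, and ALTER SYSTEM requires superuser privileges.

```sql
-- Raise the timeout to "many hours"; '12h' is an illustrative value only.
ALTER SYSTEM SET wal_sender_timeout = '12h';
SELECT pg_reload_conf();

-- With the long timeout in place, dead subscribers are no longer detected
-- promptly. List the current walsenders to find any that need attention.
SELECT pid, application_name, client_addr, state
FROM pg_stat_replication;

-- Terminate a walsender whose subscriber is known to be gone.
-- 12345 is a placeholder; substitute a pid from the query above.
SELECT pg_terminate_backend(12345);
```

Once the server has been updated to a fixed release, the override can be removed with ALTER SYSTEM RESET wal_sender_timeout followed by another configuration reload.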
See commit, mailing list discussion.
Related to