terminating walsender process due to replication timeout

Craig Ringer

Issue

BDR replication stopped syncing data.

On the sender you can find messages like:

receive LOG: terminating walsender process due to replication timeout

and on the receiver node:

apply ERROR: connection to other side has died

Resolution

A likely cause is a large transaction that takes longer than wal_sender_timeout to decode. The default of one minute is usually sufficient, but large transactions may need more time than that. Because the walsender must process the entire transaction before transmitting it to the waiting replication connection, PostgreSQL can treat the delay as a replication timeout.

If the problem is caused by a large transaction, raising wal_sender_timeout on the sending node to a much higher value, such as 3600 seconds or more, and then reloading the server configuration may resolve it.
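
A minimal sketch of that change, run as a superuser on the sending node. It assumes the setting is managed with ALTER SYSTEM (available in PostgreSQL 9.4 and later) rather than by editing postgresql.conf directly, and that no other configuration source overrides it. The value 3600s is only the example figure mentioned above and should be sized to your largest transactions.

```sql
-- Check the value currently in effect (the PostgreSQL default is 60s).
SHOW wal_sender_timeout;

-- Raise the timeout to one hour. ALTER SYSTEM writes the setting to
-- postgresql.auto.conf; it does not take effect until the configuration
-- is reloaded.
ALTER SYSTEM SET wal_sender_timeout = '3600s';

-- Ask the server to reload its configuration files. A restart is not
-- required for this parameter.
SELECT pg_reload_conf();

-- Verify the new value. The reload is processed asynchronously, so it may
-- take a moment before this session reports the change.
SHOW wal_sender_timeout;
```

If you prefer to edit postgresql.conf by hand, the equivalent line is wal_sender_timeout = 3600s, followed by the same reload.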

There are also two known PostgreSQL bugs that may apply if you are not on a current point-release:

Root Cause

A large transaction takes longer than wal_sender_timeout to decode.
