BDR replication stopped syncing data.
On the sender you can find messages like:
receive LOG: terminating walsender process due to replication timeout
and on the receiver node:
apply ERROR: connection to other side has died
One possible cause is a large transaction that takes longer than wal_sender_timeout to decode. The default of one minute is usually sufficient, but a large transaction can take longer than that to process. Because the walsender must decode the entire transaction before it can transmit it to the waiting replication connection, PostgreSQL can interpret the delay as a replication timeout.
If the problem is actually due to a large transaction, raising wal_sender_timeout to a higher value, such as 3600 seconds or more, and reloading the server configuration could solve the problem.
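As a minimal sketch, the change could be applied from a superuser session on the sending node as shown here. This assumes a PostgreSQL version that supports ALTER SYSTEM (9.4 or later); the same setting can be edited in postgresql.conf instead. The 3600s value is only the example figure mentioned above, not a tuned recommendation.

```sql
-- Check the current value (the default is one minute).
SHOW wal_sender_timeout;

-- Persist a higher timeout; 3600s is illustrative only.
ALTER SYSTEM SET wal_sender_timeout = '3600s';

-- Ask the server to re-read its configuration; this setting
-- does not require a restart.
SELECT pg_reload_conf();

-- Confirm the new value (it may take a moment for the reload to be processed).
SHOW wal_sender_timeout;
```

Once the large transaction has been replicated, the timeout can be lowered again the same way if a shorter failure-detection window is preferred.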
There are also two known PostgreSQL bugs that may apply if you are not on a current point-release:
- walsender timeout errors in BDR/pglogical/logical replication
- Spurious logical CONFLICTs, walsender timeouts, apply ERRORs in logical decoding
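To judge whether these bugs could apply, it may help to confirm which point release each node is running. A minimal check, run on every node and compared against the current point release for that major version:

```sql
-- Report the PostgreSQL version of the node this session is connected to.
SHOW server_version;
```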