How to retain the replication slot after EFM failover/switchover

Mohini Ogale
Mohini Ogale

One of the common problems with using replication slots is that it needs to be reconfigured after a failover and switchover. Replication Slot doesn’t resume at the same restart LSN.

This problem doesn’t exist with EFM v3.10 and later using database version 12 or later.

You need to use the update.physical.slots.period property to define the slot advance frequency for database version 12 and later. When update.physical.slots.period is set to a positive integer value, the primary agent reads the current restart_lsn of the physical replication slots after every update.physical.slots.period seconds and sends this information with its pg_current_wal_lsn and primary_slot_name (if it is set in the postgresql.conf file) to the standbys. The physical slots must already exist on the primary for the agent to find them. If physical slots do not already exist on the standbys, standby agents create the slots and then update restart_lsn parameter for these slots. A non-promotable standby doesn't create new slots but updates them if they exist. Note: all slot names, including one set on the current primary if desired, must be unique.

Period in seconds between having the primary agent update promotable standbys with physical replication slot information so that the cluster will continue to use replication slots after a failover. This parameter value can be set > 0 to enable it.
Set to zero to turn off.

1. Setup Standby server with replication slot

[enterprisedb@localhost ~]# /usr/edb/as14/bin/psql -p 5468 edb -U enterprisedb
psql (14.2.1, server 14.2.1)
Type "help" for help.

edb=# select pg_create_physical_replication_slot('replica');
pg_create_physical_replication_slot 
(replica,)
(1 row)

edb=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_l
sn | wal_status | safe_wal_size | two_phase 
replica | | physical | | | f | t | 26830 | | | 0/4000148 | 
| reserved | | f
(1 row)

edb=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backe
nd_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | 
reply_time 
26830 | 10 | enterprisedb | walreceiver | 172.19.12.61 | | 53709 | 21-APR-22 04:01:23.599989 -07:00 | 
| streaming | 0/4000148 | 0/4000148 | 0/4000148 | 0/4000148 | | | | 0 | async | 21-A
PR-22 04:01:43.770416 -07:00
(1 row)

2. Create replication slot on Standby server as well which will initially be inactive

[enterprisedb@localhost ~]$ /usr/edb/as14/bin/psql -p 5468 edb
psql (14.2.1, server 14.2.1)
Type "help" for help.

edb=# select pg_create_physical_replication_slot('replica');
pg_create_physical_replication_slot 
(replica,)
(1 row)

edb=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_l
sn | wal_status | safe_wal_size | two_phase 
replica | | physical | | | f | f | | | | | 
| | | f
(1 row)

3. Set update.physical.slots.period=10 on all efm.nodes and start/restart EFM

[root@localhost efm-4.4]# /usr/edb/efm-4.4/bin/efm cluster-status efm
Cluster Status: efm

Agent Type Address DB VIP
Witness 172.19.10.189 N/A 
Primary 172.19.11.73 UP 
Standby 172.19.12.61 UP 

Allowed node host list:
172.19.11.73 172.19.12.61 172.19.10.189

Membership coordinator: 172.19.11.73

Standby priority host list:
172.19.12.61

Promote Status:

DB Type Address WAL Received LSN WAL Replayed LSN Info
Primary 172.19.11.73 0/4000148 
Standby 172.19.12.61 0/4000148 0/4000148 

Standby database(s) in sync with primary. It is safe to promote.

4. Testing EFM switchover

[root@localhost efm-4.4]# /usr/edb/efm-4.4/bin/efm promote efm -switchover
Promote/switchover command accepted by local agent. Proceeding with promotion and will reconfigure original primary. Run the 'cluster-status' command for information about the new cluster state.
[root@localhost efm-4.4]# /usr/edb/efm-4.4/bin/efm cluster-status efm
Cluster Status: efm

Agent Type Address DB VIP
Witness 172.19.10.189 N/A 
Standby 172.19.11.73 UP 
Primary 172.19.12.61 UP 

Allowed node host list:
172.19.11.73 172.19.12.61 172.19.10.189

Membership coordinator: 172.19.11.73

Standby priority host list:
172.19.11.73

Promote Status:

DB Type Address WAL Received LSN WAL Replayed LSN Info
Primary 172.19.12.61 0/40002D8 
Standby 172.19.11.73 0/40002D8 0/40002D8 

Standby database(s) in sync with primary. It is safe to promote.

5. Verify that slot is active on new primary now

[enterprisedb@localhost ~]$ /usr/edb/as14/bin/psql -p 5468 edb
psql (14.2.1, server 14.2.1)
Type "help" for help.

edb=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_l
sn | wal_status | safe_wal_size | two_phase 
replica | | physical | | | f | t | 106061 | | | 0/40002D8 | 
| reserved | | f
(1 row)

edb=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | back
end_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | 
reply_time 
106061 | 10 | enterprisedb | walreceiver | 172.19.11.73 | | 35062 | 21-APR-22 04:28:42.667902 -07:00 | 
| streaming | 0/40002D8 | 0/40002D8 | 0/40002D8 | 0/40002D8 | | | | 0 | async | 21-
APR-22 04:29:22.783566 -07:00
(1 row)

6. Slot is inactive on new standby (old primary)

[root@localhost efm-4.4]# /usr/edb/as14/bin/psql -p 5468 edb -U enterprisedb
psql (14.2.1, server 14.2.1)
Type "help" for help.

edb=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn 
| write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time 
(0 rows)

edb=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_l
sn | wal_status | safe_wal_size | two_phase 
replica | | physical | | | f | f | | | | 0/40002D8 | 
| reserved | | f
(1 row)

Primary:

2022-04-21 04:26:08 com.enterprisedb.efm.utils.PhysicalSlotMonitor stopMonitoring INFO: Stopped the physical slot monitor
2022-04-21 04:26:11 com.enterprisedb.efm.utils.ClusterUtils getObjectFromNode INFO: Sending request physical_slot_info_request to node localhost-2166(172.19.12.61)
2022-04-21 04:26:13 com.enterprisedb.efm.nodes.EfmAgent advancePhysicalReplicationSlots INFO: slot info to advance: {replica=0/40002D8}

7. Testing EFM failover

i. Shutdown Primary database

[root@localhost ~]# /usr/edb/efm-4.4/bin/efm cluster-status efm
Cluster Status: efm

Agent Type Address DB VIP
Witness 172.19.10.189 N/A 
Primary 172.19.11.73 UP 
Standby 172.19.12.61 UP 

Allowed node host list:
172.19.11.73 172.19.10.189 172.19.12.61

Membership coordinator: 172.19.11.73

Standby priority host list:
172.19.12.61

Promote Status:

DB Type Address WAL Received LSN WAL Replayed LSN Info
Standby 172.19.12.61 0/4000518 0/4000518 
UNKNOWN 172.19.11.73 UNKNOWN UNKNOWN Connection to 172.19.11.73:5468 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

No primary database was found.

[root@localhost ~]# /usr/edb/efm-4.4/bin/efm cluster-status efm
Cluster Status: efm

Agent Type Address DB VIP
Witness 172.19.10.189 N/A 
Idle 172.19.11.73 UNKNOWN 
Primary 172.19.12.61 UP 

Allowed node host list:
172.19.11.73 172.19.10.189 172.19.12.61

Membership coordinator: 172.19.11.73

Standby priority host list:
(List is empty.)

Promote Status:

DB Type Address WAL Received LSN WAL Replayed LSN Info
Primary 172.19.12.61 0/4000630 

No standby databases were found.

Idle Node Status (idle nodes ignored in WAL LSN comparisons):

Address WAL Received LSN WAL Replayed LSN Info
172.19.11.73 UNKNOWN UNKNOWN Connection to 172.19.11.73:5468 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

ii. Slot is inactive at the moment:

edb=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | 
wal_status | safe_wal_size | two_phase 
replica | | physical | | | f | f | | | | 0/40002D8 | | 
reserved | | f
(1 row)

8. Rebuild Standby server

-bash-4.2$ /usr/edb/as14/bin/pg_basebackup -D /var/lib/edb/as14/data/ -h 172.19.12.61 -p 5468
-bash-4.2$ touch /var/lib/edb/as14/data/standby.signal

9. EFM cluster-status after failover

[root@localhost ~]# /usr/edb/efm-4.4/bin/efm cluster-status efm
Cluster Status: efm

Agent Type Address DB VIP
Witness 172.19.10.189 N/A 
Standby 172.19.11.73 UP 
Primary 172.19.12.61 UP 

Allowed node host list:
172.19.11.73 172.19.10.189 172.19.12.61

Membership coordinator: 172.19.11.73

Standby priority host list:
172.19.11.73

Promote Status:

DB Type Address WAL Received LSN WAL Replayed LSN Info
Primary 172.19.12.61 0/6000060 
Standby 172.19.11.73 0/6000060 0/6000060 

Standby database(s) in sync with primary. It is safe to promote.

10. Verify that slot is active on new primary now

edb=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | 
wal_status | safe_wal_size | two_phase 
replica | | physical | | | f | t | 114527 | | | 0/6000060 | | 
reserved | | f
(1 row)

edb=# select * from pg_stat_replication; 
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_x
min | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | re
ply_time 
114527 | 10 | enterprisedb | walreceiver | 172.19.11.73 | | 35290 | 21-APR-22 22:15:45.318252 -07:00 | 
| streaming | 0/6000060 | 0/6000060 | 0/6000060 | 0/6000060 | | | | 0 | async | 21-APR-22 22:
16:15.407942 -07:00
(1 row)

Was this article helpful?

0 out of 0 found this helpful