Generally it is not possible to recover replication on the subscriber/downstream/receiver after dropping a logical replication slot on the provider/upstream node. A re-clone, drop and re-join, or similar is required.
However, in some special cases workarounds are possible.
If replication is broken in one or both directions between two BDR2 or BDR3 nodes, or in one direction on pglogical, it may be possible to recover without fully resyncing both nodes by re-creating a logical replication slot manually.
Important: re-creating a slot is not normally an acceptable option for a missing logical replication slot.
The receiver will resume replay from the slot happily but it will have no idea that a large volume of data between the last confirmed-flushed point on the old slot and the start-point of the new slot has been skipped over.
There is no way to make a replication slot start "back in time" and since the receiver cannot detect if a slot skips ahead it will proceed thinking all is fine. In reality that missing window of data will usually cause apply errors, DDL failures leaving the whole cluster stuck in a DDL lock, even corruptions in some corner cases.
So this is only useful where one node is read-only or read-mostly on a distinct subset of data and/or there's selective replication of only a few tables from the node with the missing slot. Otherwise it offers no benefit over dropping/parting the node and recreating it, as all the data must be copied anyway. Doing a drop and re-create is safer, doesn't require interrupting the application, and is preferable for most cases.
But, if the slot on node A that replicates to node B is missing/broken and:
- node A does not receive any writes (it's functionally a read replica); or
- node A accepts direct writes from client app(s) to a small-ish and well separated subset of the data so we can re-sync just that subset; and
- There are three or more BDR nodes, so if a slot on A for A->B replication is missing, node C will still have received all A->C and B->C data and has a complete picture (in this case we can use node C to resync A and B's tables); or
- The customer does not mind discarding the data written on node A and replacing it with data from node B; then
... then we can take a shortcut to repair.
This is generally only worthwhile when the database is big enough that a resync is worth avoiding. The manual table resync requires stopping writes to all the affected tables on all nodes, often as an app interruption. A full node drop and re-join does not, and it's also a simpler less mistake-prone process.
The following is pretty raw material drafted for a Support ticket and will need further revision before use.
It should be OK for this specific case because we know the data-flow is mostly from PPA->PR1.
We still need to make sure there weren't any important schema changes run on PR1 that have not replicated to PPA before we proceed.. We also need to know the list of tables for which PR1 does send data back to PPA; then since it sent the data along the working link from PR1->PR2 we can resync PPA's copies using the copies on PR2 that have all changes from all nodes. We can use the same list to check for relevant schema changes by querying pg_attribute
on the tables on pr1, pr2 and ppa and comparing.
There will be two steps involved in resolution. Slot creation then re-sync.
To create the slot we can use SQL on pr1:
SELECT * FROM pg_catalog.pg_create_logical_replication_slot('$SLOT_NAME', 'bdr');
Then to resync the list of tables of interest $tables we will, pending testing, do something like
- Stop writes to $tables on all nodes. Note that you can't just
LOCK TABLE $tables IN EXCLUSIVE MODE
since that'll also stop BDR apply if there are any outstanding changes, you should do it at the application level. - create
dump_script.sql
as:
BEGIN;
-- for each $table in $tables
\copy $table TO '$table.tsv'
-- end loop
ROLLBACK;
and load_script.sql
as
BEGIN;
-- for $table in $tables:
-- Important, does not use TRUNCATE
DELETE FROM $table;
\copy $table FROM '$table.tsv'
VACUUM ANALYSE $table;
-- end loop
COMMIT;
- run
dump_script.sql
on pr2
psql -v ON_ERROR_STOP=1 -f dump_script.sql
-
Copy
*.tsv
to ppa and pr1 -
Run on both ppa and pr1, with a special
psql
command:
psql -v ON_ERROR_STOP=1 "options='-c bdr.do_not_replicate=on -c bdr.skip_ddl_locking=on -c bdr.skip_ddl_replication=on -c bdr.permit_unsafe_ddl_commands=on' user=their_superuser dbname=... ...otherparams..." -f load_script.sql
Note that it's actually technically possible to make a slot in the past using the low level replication slot APIs and a simple extension. See the first revision of the failover_slots
patch set for an example extension, which can be found in src/test/modules. The confirmed_flush_lsn
may be obtained from the downstream's pg_replication_origin_status
corresponding to the slot, catalog_xmin
may be initially set to the global xmin, and the restart_lsn
to the start-of-wal.
However, decoding from such an artificial slot is unsafe and may lead to wrong results or crashes. For a number of reasons:
- The
catalog_xmin
will not have been tracked and enforced after the slot was dropped. So autovacuum will have removed catalog tuples needed to correctly understand the historical WAL. It's only safe to make a slot with acatalog_xmin
as far back as the server's current global catalog_xmin as held down by another slot, or the server's current global xmin as held down by a prepared xact, slot xmin, or backend. - It's difficult to correctly determine the correct
restart_lsn
, the xlog offset of the oldest txn that hadn't committed or aborted as-of the WAL position identified byconfirmed_flush_lsn
. It's possible this could be derived by inspection of standby status update information recorded to WAL? - There may not be enough WAL retained since
restart_lsn
won't have forced retention of the needed WAL. It may be possible to extract it from a WAL archive and restore it if the correct start position can be identified though.