How to recover a PGD node from a Barman backup and join it into an existing PGD cluster?

William Ivanski

This article describes how to restore a Barman backup into a Postgres instance and join that instance into an existing PGD cluster, following a procedure called "empty join". We also discuss the pros and cons of this procedure and what needs to be taken into consideration.

0- Initial cluster state

We started from a cluster with 2 Data nodes (RT1 and RT2) and 1 Witness node. The Barman node takes backups from RT1.

We added a table with some data, which can be seen on both Data nodes:

bdrdb=# SELECT * FROM test;
 id | msg
----+-----
  1 | AAA
(1 row)
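
For reference, the table could have been created along these lines (a sketch, since the original DDL isn't shown; note that PGD generally needs a primary key or replica identity on a table to replicate updates and deletes):

CREATE TABLE test (
    id integer PRIMARY KEY, -- primary key so UPDATEs/DELETEs can replicate
    msg text
);

INSERT INTO test VALUES (1, 'AAA');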

1- Destroy one of the nodes

On RT2, we destroy the Postgres instance:

systemctl stop harp-manager
systemctl stop postgres
rm -rf /opt/postgres/data/*

After this, the RT2 node status will no longer be correct on the other nodes. We then need to execute the PART and DROP commands from any of the remaining nodes. For example, from node RT1 execute:

SELECT bdr.part_node(
    node_name := 'nodert2',
    wait_for_completion := true,
    force := false
);

SELECT bdr.drop_node(
    node_name := 'nodert2',
    cascade := true,
    force := false
);
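
To confirm the node was removed, we can check bdr.node_summary from RT1 (a quick check using the standard PGD view; not part of the original procedure):

-- nodert2 should no longer appear in the output
SELECT node_name, peer_state_name
FROM bdr.node_summary;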

2- Recover the backup

As both nodes RT1 and RT2 have the same data, Barman can be used to rebuild either of them.
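
If you want a specific backup instead of the latest, you can first list the backups available on the Barman node. Any backup ID listed can then be used in place of latest in the recover command below:

# List all backups Barman holds for server nodert1
barman list-backup nodert1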

From the Barman node, we recover the latest backup of node RT1 into RT2:

barman recover \
    --remote-ssh-command 'ssh postgres@nodert2' \
    --get-wal \
    nodert1 latest /opt/postgres/data/

Then we start Postgres on RT2:

systemctl start postgres
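
Once WAL replay completes, the instance promotes itself out of recovery. Before proceeding, we can confirm it is no longer in recovery (pg_is_in_recovery() is a core Postgres function and should return false):

bdrdb=# SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)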

And we confirm the data is still there on RT2:

bdrdb=# SELECT * FROM test;
 id | msg
----+-----
  1 | AAA
(1 row)

3- Clean up BDR metadata on the recovered instance

Even though it was recovered from a backup of RT1, node RT2 is not ready to join the PGD cluster yet: it lacks the required replication slots and replication origins, while still carrying RT1's BDR catalog metadata.

So the first step to get node RT2 back into the cluster is to clean up its BDR metadata:

SELECT bdr.consensus_disable();

SELECT bdr.part_node(
    node_name := t.node_name,
    wait_for_completion := false,
    force := true
)
FROM bdr.node_summary t
WHERE t.node_name != (
    SELECT node_name FROM bdr.local_node_summary
);

SELECT bdr.part_node(
    node_name := t.node_name,
    wait_for_completion := false,
    force := true
)
FROM bdr.local_node_summary t;

-- Now restart Postgres before running the DROP commands below

SELECT bdr.drop_node(
    node_name := t.node_name,
    cascade := true,
    force := true
)
FROM bdr.node_summary t
WHERE t.node_name != (
    SELECT node_name FROM bdr.local_node_summary
);

SELECT bdr.drop_node(
    node_name := t.node_name,
    cascade := true,
    force := true
)
FROM bdr.local_node_summary t;
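
As a sanity check (not part of the original procedure), both node views should now be empty:

-- Both queries should return zero rows after the drops above
SELECT * FROM bdr.node_summary;
SELECT * FROM bdr.local_node_summary;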

IMPORTANT!!: Before doing the next step of dropping the extension, please make sure CRDTs are not being used, by following the Knowledge Base article "How to check if CRDTs are being used".

DROP EXTENSION bdr CASCADE;

4- The "empty join" procedure

The next step is to join node RT2 back to the cluster. This is the trick for joining without losing data; it's what we call an "empty join". These are the steps:

First we create the BDR extension:

CREATE EXTENSION bdr CASCADE;

Then we create the local node:

SELECT bdr.create_node(
    node_name := 'nodert2',
    local_dsn := 'host=nodert2 port=5432 dbname=bdrdb user=postgres'
);
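
As a quick check (not in the original procedure), the newly created local node should now be visible:

-- nodert2 should be listed here
SELECT * FROM bdr.local_node_summary;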

Now we need to configure the local RT2 node to subscribe to an empty replication set that already exists in the cluster. In a TPAexec-deployed cluster, we can use the already existing witness-only replication set:

SELECT bdr.alter_node_replication_sets(
    node_name := 'nodert2',
    set_names := ARRAY['witness-only']::text[]
);

IMPORTANT!!: On PGD 5, there is no witness-only replication set because the Witness is defined differently, as a node kind. In this case, you will need to create an empty, temporary replication set on node RT1, for example as sketched below.
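
A minimal sketch of creating such an empty set on RT1 (the name empty_set is hypothetical; bdr.create_replication_set() and its parameters are per the PGD documentation):

SELECT bdr.create_replication_set(
    set_name := 'empty_set',      -- hypothetical temporary set name
    autoadd_tables := false,      -- don't auto-add newly created tables
    autoadd_existing := false     -- don't add existing tables either
);

You would then use 'empty_set' in place of 'witness-only' in the surrounding calls, and drop the set once the join is complete.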

Here is what the join looks like on RT2. Note we are using the RT1 connection string, joining the bdrgroup group, and also not copying any schemas, tables, or definitions, as we have synchronize_structure := 'none':

SELECT bdr.join_node_group(
    join_target_dsn := 'host=nodert1 port=5432 dbname=bdrdb user=postgres',
    node_group_name := 'bdrgroup',
    pause_in_standby := NULL,
    wait_for_completion := true,
    synchronize_structure := 'none'
);

After that we can configure the RT2 node to subscribe to the correct replication sets:

SELECT bdr.alter_node_replication_sets(
    node_name := 'nodert2',
    set_names := ARRAY['witness-only','bdrgroup']::text[]
);
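
To confirm RT2 is replicating normally again, we can check the node states and subscriptions (a hedged check using the standard PGD monitoring views):

-- All nodes should report an ACTIVE state
SELECT node_name, peer_state_name FROM bdr.node_summary;

-- Subscriptions on RT2 should show as replicating
SELECT * FROM bdr.subscription_summary;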

5- Test replication

Now let's insert a row on RT1:

bdrdb=# INSERT INTO test VALUES (2, 'BBB');
INSERT 0 1
bdrdb=# SELECT * FROM test;
 id | msg
----+-----
  1 | AAA
  2 | BBB
(2 rows)

This row is replicated to RT2 as expected.

Going the other way around, we insert a row on RT2:

bdrdb=# INSERT INTO test VALUES (3, 'CCC');
INSERT 0 1
bdrdb=# SELECT * FROM test;
 id | msg
----+-----
  1 | AAA
  2 | BBB
  3 | CCC
(3 rows)

This row is replicated to RT1 as expected.

With that we confirm the cluster is working as expected.

Discussion about the "empty join" technique

The logical join procedure is composed of 2 steps:

1- Schema copy, which is controlled by the synchronize_structure parameter of the bdr.join_node_group() function. Possible values are 'all' (meaning the entire schema is created on the joining node) or 'none' (meaning the schema is not created);

2- Data copy, which is always performed. The joining node will always copy all the data from the tables associated with the replication sets it subscribes to. By default it subscribes to all replication sets from the upstream node, but we can configure which replication sets the joining node will subscribe to by executing bdr.alter_node_replication_sets() before executing bdr.join_node_group().

The trick to do an "empty join" is to skip both steps:

1- We skip the schema copy by joining with synchronize_structure := 'none';

2- We skip the data copy by executing bdr.alter_node_replication_sets() beforehand, configuring the joining node to subscribe to an empty replication set, for example the witness-only replication set.

Essentially we are adding another Witness node; the difference is that it holds all the data from a restored backup.

Right after the join, we run bdr.alter_node_replication_sets() again to configure the joining node to subscribe to the proper replication sets, which causes the node to start replicating normally.

In the simple scenario above it worked because nothing was written to node RT1 while RT2 was being restored, so there was no data loss.

However, in production this "empty join" technique might incur data loss: data written to the RT1 node while RT2 is being restored might not be included in the backup and WALs on Barman, and hence might not be available on node RT2 after the restore, resulting in data divergence between RT1 and RT2.

We usually recommend restoring from Barman only when creating a new cluster from scratch, for example when building UAT using data from production, or in the event of a disaster that causes the loss of all PGD nodes. While there is at least 1 remaining node in the cluster (in this case, node RT1 was always there), we recommend rebuilding the node (RT2 in this case) not from Barman, but from the other node using logical join, physical join, or a TPAexec deploy.
