The purpose of this article is to demonstrate a scenario where a node needs to be parted and dropped from a BDR cluster. We discuss the concepts, the available options, and best practices for parting and dropping BDR nodes.
The following query shows the nodes that belong to the cluster:
SELECT node_name,
peer_state_name
FROM bdr.node_summary
ORDER BY 1;
Output would be, for example:
 node_name | peer_state_name
-----------+-----------------
 node1     | ACTIVE
 node2     | ACTIVE
 node2ls   | ACTIVE
 node3     | ACTIVE
 node4so   | ACTIVE
(5 rows)
Even though node2 is still part of the catalog, and its state says ACTIVE (which can be misleading), the node is actually down. We can confirm whether a node is online by checking Raft consensus with the following query:
SELECT node_name,
state
FROM bdr.group_raft_details
ORDER BY 1;
Output would be:
 node_name |     state
-----------+---------------
 node1     | RAFT_LEADER
 node2     |
 node2ls   | RAFT_FOLLOWER
 node3     | RAFT_FOLLOWER
 node4so   | RAFT_FOLLOWER
(5 rows)
Note how node2 doesn't have a Raft state, which means it's not participating in Raft consensus and is therefore offline.
Now let's assume, for the purpose of this article, that node2 crashed and is not recoverable, so we need to remove it from the cluster.
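Before parting, it can be useful to confirm why a dead node must not simply be left in the catalog: the surviving nodes keep an inactive replication slot for it, which retains WAL indefinitely. A minimal check using the standard PostgreSQL catalog (a sketch; slot names vary by cluster):

```sql
-- Run on a surviving node. An inactive slot whose retained WAL keeps
-- growing typically belongs to the dead peer.
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;
```

An ever-growing retained_wal for the slot belonging to the dead peer will eventually fill up the disk on the surviving nodes, which is the main operational reason to part a dead node promptly.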
The correct way to do this would be:
SELECT bdr.part_node(
node_name := 'node2',
wait_for_completion := true,
force := false
);
On the parameters:
- wait_for_completion: The part operation happens in the background and involves multiple rounds of Raft communication among the remaining nodes in the cluster. The part operation also triggers the PART_CATCHUP step, in which the remaining nodes exchange any data from the node being parted that is still available in their own databases. So it's important to set wait_for_completion := true, otherwise we risk performing the next step, i.e., the drop, before the part has successfully finished, which can cause severe catalog issues and even data inconsistency.
- force: A regular part is an operation orchestrated by Raft, and one of its steps is PART_CATCHUP, as explained earlier. If we set force := true, the part operation skips PART_CATCHUP and also becomes a local operation that does not happen on the other nodes. This immediately causes catalog inconsistency (because the other nodes will still see the node as ACTIVE instead of PARTED), and since the PART_CATCHUP step is skipped, we risk data inconsistency too. So using force := true is never recommended, except in cases where it's not possible to achieve Raft majority, or there are issues with Raft or other catalog problems. Only the Support and PGD teams are positioned to recommend force := true, depending on the case.
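If the part is executed with wait_for_completion := false, or monitored from another session, the progress can be followed via the same bdr.node_summary catalog used above. A sketch; the node moves through intermediate states (such as PART_CATCHUP) before reaching PARTED:

```sql
-- Repeat while the part is running and watch peer_state_name change.
SELECT node_name,
       peer_state_name
FROM bdr.node_summary
WHERE node_name = 'node2';
```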
The output of the part execution above should be:
NOTICE:  node node2 has been removed from the bdr group
 part_node
-----------

(1 row)
You can use the following query to confirm if the node has been parted correctly on all other nodes in the cluster:
WITH node_statuses AS (
SELECT jsonb_array_elements(
bdr.run_on_all_nodes($$
SELECT node_name,
peer_state_name
FROM bdr.node_summary s
WHERE node_name = 'node2'
$$)::jsonb
) AS j
)
SELECT s.node_name AS response_from,
n.j->'response'->'command_tuples'->0->>'node_name' AS node_name,
n.j->'response'->'command_tuples'->0->>'peer_state_name' AS peer_state_name
FROM node_statuses n
INNER JOIN bdr.node_summary s
ON s.node_id = (n.j->>'node_id')::oid
ORDER BY 1;
Output should be:
 response_from | node_name | peer_state_name
---------------+-----------+-----------------
 node1         | node2     | PARTED
 node2ls       | node2     | PARTED
 node3         | node2     | PARTED
 node4so       | node2     | PARTED
(4 rows)
This means that the part was successful.
After we have parted the node, it's worth mentioning that if we want to rejoin a node using the same name, we can simply do so, even if a node with the same name is in the catalog as PARTED.
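For reference, a rejoin with the same name would follow the usual join procedure on a freshly provisioned node. The sketch below assumes the standard bdr.create_node() / bdr.join_node_group() flow; all DSNs and the group name are placeholder values, so check the PGD documentation for your version before using it:

```sql
-- On the re-provisioned node2 (connection strings and group name are
-- hypothetical examples):
SELECT bdr.create_node(
    node_name := 'node2',
    local_dsn := 'host=node2 dbname=bdrdb'
);

SELECT bdr.join_node_group(
    join_target_dsn := 'host=node1 dbname=bdrdb',
    node_group_name := 'mygroup'
);
```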
However, in some situations it might be desirable to drop the parted node, i.e., remove all information about it from the catalog.
After we have parted the node, as described above, the cluster now looks like this:
 node_name | peer_state_name
-----------+-----------------
 node1     | ACTIVE
 node2     | PARTED
 node2ls   | ACTIVE
 node3     | ACTIVE
 node4so   | ACTIVE
(5 rows)
The correct way to drop the node is as follows:
SELECT bdr.drop_node(
node_name := 'node2',
cascade := true,
force := false
);
On the parameters:
- cascade: On BDR 3.6 and 3.7, this parameter had to be set to true in order to drop the underlying pglogical catalog entries for the node. On PGD 4, this parameter exists for compatibility purposes. Always set it to true.
- force: If for any reason the node is still being used by any of the workers, dropping it will not be possible with force := false. If that's the case, you need to contact Support to troubleshoot the problem; Support can then instruct you to use force := true, but only after making sure it's safe.
Note that bdr.drop_node() is a local operation, so you need to run the query above on all nodes in the cluster. You can optionally use bdr.run_on_all_nodes() to drop the node on all nodes at the same time. First, confirm the node shows as PARTED on all nodes with the following query:
WITH node_statuses AS (
SELECT jsonb_array_elements(
bdr.run_on_all_nodes($$
SELECT node_name,
peer_state_name
FROM bdr.node_summary s
WHERE node_name = 'node2'
$$)::jsonb
) AS j
)
SELECT s.node_name AS response_from,
n.j->'response'->'command_tuples'->0->>'node_name' AS node_name,
n.j->'response'->'command_tuples'->0->>'peer_state_name' AS peer_state_name
FROM node_statuses n
INNER JOIN bdr.node_summary s
ON s.node_id = (n.j->>'node_id')::oid
ORDER BY 1;
Output should be as follows:
 response_from | node_name | peer_state_name
---------------+-----------+-----------------
 node1         | node2     | PARTED
 node2ls       | node2     | PARTED
 node3         | node2     | PARTED
 node4so       | node2     | PARTED
(4 rows)
Now execute the following on any master node still remaining in the cluster:
SELECT jsonb_pretty(
bdr.run_on_all_nodes($$
SELECT bdr.drop_node(
node_name := 'node2',
cascade := true,
force := false
)
$$)::jsonb
);
On any node, you can see that the dropped node is no longer in the catalog:
 node_name | peer_state_name
-----------+-----------------
 node1     | ACTIVE
 node2ls   | ACTIVE
 node3     | ACTIVE
 node4so   | ACTIVE
(4 rows)
Now use the node status query above to confirm the dropped node no longer exists on any node in the cluster:
 response_from | node_name | peer_state_name
---------------+-----------+-----------------
 node1         |           |
 node2ls       |           |
 node3         |           |
 node4so       |           |
(4 rows)
This means that the node was successfully dropped.