Impact of Raft Consensus in PGD and Assessing and Resolving Issues

William Ivanski

This article provides an introduction to Raft Consensus (what it is for, how it works and the concepts involved), and then discusses in depth the two most common problems with Raft Consensus:

  • Bloat in Raft Consensus Journal tables;

  • Raft Consensus duplicate request IDs (also known as the "Birthday Paradox").

For each issue, we outline a few common symptoms and how to diagnose them, as well as possible fixes and workarounds where applicable.

At the end we provide a generic procedure to fix Raft Consensus issues, from the safest to the most intrusive approach, depending on the situation.

1- Concepts about Raft Consensus

Raft stands for Replicated Available Fault Tolerance. It's a consensus algorithm that uses votes from a quorum of machines in a distributed cluster. Given the total number of nodes, Raft looks for a consensus of the majority (total number of nodes divided by 2 plus 1) voting for any decision.

The purpose of the Raft Consensus algorithm is to allow multiple servers to agree on the state of something. For more details about the Raft Consensus algorithm, you can find a list of resources here.

1.1- The impact of Raft Consensus in PGD

PGD has its own implementation of the Raft Consensus algorithm. In PGD, Raft Consensus is not used for data replication. Instead, PGD leverages Raft Consensus for a number of important features, listed below. Here we consider Raft Consensus to be not working properly when:

  • A majority of nodes is not online; or

  • It is broken because of bugs, as discussed later in this article; or

  • It is extremely slow because of bugs, as discussed later in this article.

  • Global DDL Lock, Global DML Lock: Required for every transparent DDL replication. If Raft is not working properly, DDL is not possible by default.

  • Allocate new chunks for Galloc Sequences: Required every time a node runs out of values for a Galloc Sequence. If Raft is not working properly, INSERTs on tables using a Galloc Sequence start to fail.

  • Cluster maintenance operations: Join node, part node, promote standby. If Raft is not working properly, these operations are not possible; part is still possible if forced.

  • Advance the BDR Group Replication Slot position: Also known as the cluster-wide LSN horizon. If Raft is not working properly, WAL files start to accumulate on all nodes, risking disk exhaustion. The slot also holds back the xmin horizon, causing autovacuum to not work properly, leading to performance issues and even transaction wraparound.

  • AutoPartition: For maintenance of globally managed partitioned tables. If Raft is not working properly, INSERTs fail because there are no new partitions to receive new data.

  • Eager Replication: In Eager, every transaction is orchestrated by Raft. If Raft is not working properly, every Eager transaction fails.

  • Switch CAMO to Local Mode if require_raft is enabled: CAMO can be configured to require Raft before switching to Local Mode to prevent split brain. If Raft is not working properly, CAMO configured like that doesn't switch to Local Mode and, if the CAMO partner is really offline, COMMITs start to hang indefinitely.

  • Routing to the Write Leader: Starting with PGD 5, there are separate Raft instances per location dedicated to keeping track of which node is the Write Leader; before PGD 5, the same Raft instance is used for everything listed here. Without Raft Consensus to tell HARP Proxy or PGD Proxy which node to point to, applications won't reach the database, resulting in downtime.

1.2- Monitoring Raft Consensus

As you can see above, Raft Consensus is essential for PGD, so it's strongly recommended to monitor it at all times. You can monitor Raft Consensus with:

-- Summary:
SELECT * FROM bdr.monitor_group_raft();

-- Details:
SELECT * FROM bdr.group_raft_details;

Examples:

SELECT * FROM bdr.monitor_group_raft();

OK   Raft Consensus is working correctly
SELECT * FROM bdr.group_raft_details;

node_id  node_name  state  leader_id  current_term  commit_index  nodes  voting_nodes  protocol_version
3326594109  node1  RAFT_LEADER    3326594109  2122  85463589  4  3  4005
2728265565  node2  RAFT_FOLLOWER  3326594109  2122  85463589  4  3  4005
3326374859  node3  RAFT_FOLLOWER  3326594109  2122  85463589  4  3  4005
4235231024  node4  RAFT_FOLLOWER  3326594109  2122  85463589  4  3  4005

Both bdr.monitor_group_raft() and bdr.group_raft_details leverage bdr.run_on_all_nodes() to query the Raft Consensus state from all nodes.

Locally, the Consensus Worker keeps the Raft Consensus state for the node in table bdr.local_consensus_state.
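
For automated monitoring, a simple check is to alert whenever the summary status is anything other than OK. A minimal sketch, assuming the status and message columns shown in the example output above:

-- Alert if this query returns any rows:
SELECT status, message
FROM bdr.monitor_group_raft()
WHERE status <> 'OK';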

1.3- Raft Leader and Followers

One of the voting nodes is elected to be the Raft Leader. In PGD, Data nodes and Witness nodes are voting nodes and can be Raft Leader.

The other nodes that are not the Raft Leader are known as Raft Followers.

1.4- Raft Election and Term

If the Raft Followers lose connection to the Raft Leader, they start a new Raft Election, putting themselves forward as Raft Candidates and voting for themselves or for each other until a new Raft Leader is elected.

Every time a new Raft Leader is elected, the Raft Term is incremented by 1. This is the field current_term as seen in the bdr.group_raft_details view.

1.5- Raft Requests and Responses

Raft communication starts with a Raft Request. For example, a node that needs to perform DDL requests a Global Lock on a table from the Raft Leader.

The Raft Leader then consults the other voting nodes in the cluster (including itself) about the request. For example, for a Global Lock to be acquired cluster-wide, each voting node will only approve if it can also acquire a local lock on the table. Each message the Raft Leader sends to a node asking for its vote is also a Raft Request, and each voting node responds with its vote in a Raft Response.

Then the Raft Leader sends a final Raft Response to the node that originated the Raft Request. For example, once the Global Lock is acquired, the requester node can finally perform the DDL.

Each Raft message (Request or Response) has an ID also called Commit Index. This is the field commit_index as seen in the bdr.group_raft_details view.

1.6- Raft Snapshot and Journal

Every piece of information stored by the Raft Consensus algorithm is frozen into a set of information called the Raft Snapshot. Each node keeps a copy of the Raft Snapshot in table bdr.local_consensus_snapshot.

In PGD 4 and older, as there is a single Raft instance, there is a single Raft Snapshot. Starting with PGD 5, as there can be multiple Raft instances (one per group, including the top group), each Raft instance has a separate Raft Snapshot.

The Raft Journal contains all the recent Raft Requests and Responses exchanged among all nodes in the cluster.

Here, by recent we mean since the last Raft Pruning as explained in the section below.

  • Table bdr.global_consensus_journal contains all Raft messages, whether Raft Requests or Responses, originated from all nodes.

  • Table bdr.global_consensus_response_journal contains every Raft Request and their corresponding Raft Response side by side.

Both tables are replicated among the nodes and contain the actual Raft message encoded in their payload fields.

  • View bdr.global_consensus_journal_details is equivalent to the bdr.global_consensus_response_journal table, but after applying function bdr.decode_message_payload() to decode the payloads.
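
To get a feel for what the Raft Journal contains on a given node, you can inspect these relations directly. A minimal sketch (the exact columns of the details view vary between BDR / PGD versions):

-- Number of Raft messages currently kept in the journal on this node:
SELECT count(*) FROM bdr.global_consensus_journal;

-- Decoded journal entries, for ad-hoc inspection:
SELECT * FROM bdr.global_consensus_journal_details LIMIT 10;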

1.7- Raft Pruning

The current value of each piece of information maintained by Raft Consensus (let's say, key A for the sake of this example) is derived from two sources, the Raft Snapshot and the Raft Journal:

  • First the node checks the value of A in the Raft Snapshot;

  • Then the node logically walks through all the messages in the Raft Journal, which can contain messages changing the value of A;

  • Only after the node has processed the Raft Journal does it know the current value of A.

Because of this, the Raft Journal can't grow indefinitely. So every time the number of messages in the Raft Journal reaches bdr.raft_keep_min_entries, the Raft Leader starts a process called Raft Pruning, which is composed of the following steps:

  • First the Raft Leader checks which is the newest Raft Request ID in the Raft Journal and considers this the target ID;

  • Then the Raft Leader processes the Raft Journal, i.e., walks over every message in the Raft Journal up to the target ID, changing, inserting or removing keys in the Raft Snapshot;

  • When the Raft Leader reaches the target ID, it DELETEs all the messages from the Raft Journal up to the target ID and sends the updated Raft Snapshot to all Raft Followers.

Raft Pruning is important because the higher the number of messages in the Raft Journal, the slower some Raft messages will be, for example those that need to check the current value of a key stored in Raft Consensus.
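
Because pruning is triggered by journal size, comparing the current number of journal entries against the threshold gives a rough idea of how close the next Raft Pruning is. A minimal sketch:

SELECT (SELECT count(*) FROM bdr.global_consensus_journal) AS journal_entries,
       current_setting('bdr.raft_keep_min_entries')        AS raft_keep_min_entries;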

1.8- Raft Vacuum

The section above explains why the Raft Journal can't grow indefinitely, which is handled by the Raft Pruning operation happening every time the Raft Journal reaches bdr.raft_keep_min_entries messages.

However, it's expected that tables receiving UPDATEs and DELETEs accumulate dead rows, also called bloat.

Usually Postgres autovacuum would take care of the bloat on the Raft Consensus tables, but the following tables have autovacuum_enabled = false:

  • bdr.local_consensus_state;
  • bdr.local_consensus_snapshot;
  • bdr.global_consensus_journal;
  • bdr.global_consensus_response_journal.

This is by design, to prevent autovacuum from interfering with the Raft Consensus operation.
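
You can confirm this on your own cluster by inspecting the storage options of these tables in pg_class; a minimal sketch:

SELECT c.oid::regclass AS table_name,
       c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'bdr'
  AND c.relname IN ('local_consensus_state',
                    'local_consensus_snapshot',
                    'global_consensus_journal',
                    'global_consensus_response_journal');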

However, as bloat on these tables can't be allowed to grow indefinitely, the Consensus Worker runs a procedure called Raft Vacuum every 24 hours, or every time the Consensus Worker is started on the node.

From Postgres logs, we can see when Raft Vacuum was executed with messages like:

Aug 22 17:02:46 rt39916n1 postgres[9616]: [12-2] 2024-08-22 17:02:46 UTC [postgres@/pglogical bdr consensus 949376428 17073:0/bdrdb:9616]: [2] STATEMENT:  Raft VACUUM

Raft Vacuum runs the low-level equivalent of a VACUUM FULL on the 4 tables above.

1.9- Raft Key-Value Store

Besides the information Raft Consensus already tracks by default (such as the Global Lock, which node is the Write Leader, the BDR Group Slot LSN, etc.), which is crucial for PGD cluster health and operation, PGD provides an API for persisting arbitrary information in the Raft Snapshot. This API consists of:

  • bdr.consensus_kv_data: Pseudo-table where it's possible to see in tabular format the custom values stored in the Raft Snapshot. Built-in values are not listed here;

  • bdr.consensus_kv_store(): Function that allows users to write any key and its corresponding value into the Key-Value Store. Essentially this function does a Raft Request to the Raft Leader, which then propagates the value to all Raft Followers.

  • bdr.consensus_kv_fetch(): Function that allows users to read the current value, given a key, from the Key-Value Store. As explained above, the Raft Snapshot doesn't contain the current values, so bdr.consensus_kv_fetch() might need to read the Raft Snapshot and process the Raft Journal in order to get the current value of a key.

HARP stores all its information into the Raft Key-Value Store, leveraging the 2 functions above.

IMPORTANT!!: Even though nothing prevents users from leveraging the Key-Value Store, it's strongly advised not to use the Key-Value Store for anything besides HARP. The reason is to not overpopulate and slow down Raft Consensus, which is needed for other important aspects of PGD. Starting with PGD 5, the PGD Proxy no longer uses the Key-Value Store and, in a future PGD version, the Key-Value Store will be removed.
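
Purely to illustrate the shape of this API (and, per the warning above, not something to adopt for application data), storing and then fetching a key looks roughly like this; the exact signatures may vary slightly between BDR / PGD versions:

-- Illustration only: write a key and read it back through Raft
SELECT bdr.consensus_kv_store('my_key', '"my_value"'::jsonb);
SELECT bdr.consensus_kv_fetch('my_key');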

Now that we are familiar with the concepts about Raft Consensus, let's talk about the 2 main problems involving Raft Consensus in the next 2 sections.

2- Problem: Raft Bloat

We found a problem in the Raft Vacuum routine: while it was running the low-level equivalent of VACUUM FULL, it was not running the equivalent of ANALYZE, so the statistics on the 4 Raft Consensus tables were not being updated.

With outdated statistics, queries on these tables (essentially any internal Raft operation such as Requests, Responses, Pruning, Key-Value Store, etc) can have severely degraded performance.

2.1- Symptoms of Raft Bloat

Degraded Raft performance can impact all operations controlled by Raft, listed in section 1.1 above.

Time-sensitive operations requiring Raft Consensus are usually the ones impacted, because if Raft doesn't respond in time, something unintended happens. For example:

2.1.1- Slow Raft messages

By default, bdr.raft_log_min_message_duration is set to 5s. Any Raft message taking longer than bdr.raft_log_min_message_duration will be logged to the Postgres logs, for example:

Aug 22 16:51:33 rt39916n1 postgres[8335]: [25-1] 2024-08-22 16:51:33 UTC [postgres@/pglogical manager 17073/bdrdb:8335]: [32] LOG:  BDR consensus request id: 2332851996956199055 took 33059ms

2.1.2- HARP misbehavior

As HARP is constantly storing and retrieving data from the Raft Key-Value Store, which as explained in section 1.9 above is completely controlled by Raft Consensus, HARP in general can be impacted by degraded Raft performance. Common symptoms include, but are not limited to:

  • Unintended / unexpected failovers;

  • Constant Write Leader flapping;

  • Complete inability to route traffic to the Write Leader because Raft is timing out.

A few example snippets of HARP log messages indicating Raft Bloat are:

"err": "timeout: context deadline exceeded", "sql": "select * from bdr.consensus_kv_fetch($1)"
no leader to connect to
lost dcs connection, cleaning connection and client information

As in BDR 3.7 and PGD 4 the application reaches the database through HARP, Raft Bloat can result in severe situations, up to and including Severity 1 incidents.

2.1.3- Global Lock timeout

Each transparently replicated DDL operation needs to acquire a Global Lock, which is a Raft-orchestrated operation. If Raft messages are slow because of Raft Bloat, it's possible that the entire DDL transaction will error out because bdr.global_lock_timeout (1 minute by default) wasn't enough.

Sample error message:

ERROR:  canceling statement due to global lock timeout
CONTEXT:  while handling global lock GLOBAL_LOCK_DDL in database 16926 requested by node-id 1609217846 in local acquire stage acquiring_local
for ddl_epoch 20, ddl_epoch_lsn 0/0
"xact rep": true, "ddl rep": true, "ddl locking": all

2.2- Diagnosing Raft Bloat

The Postgres view pg_stat_all_tables, or the Lasso report file postgresql/dbs/BDRDB/tables.out, contains the table statistics. To determine whether a table is bloated, and by what percentage, we consider the proportion of dead tuples over the total number of tuples:

bloat_perc = n_dead_tup / (n_live_tup + n_dead_tup)
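
If you have access to the live database, the same calculation can be done directly against pg_stat_all_tables; a minimal sketch:

SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) AS bloat_perc
FROM pg_stat_all_tables
WHERE schemaname = 'bdr'
  AND relname IN ('local_consensus_state',
                  'local_consensus_snapshot',
                  'global_consensus_journal',
                  'global_consensus_response_journal');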

If you have multiple Lasso reports (usually from different nodes in a PGD cluster) extracted in a folder, you can run a command like below to get the n_live_tup and n_dead_tup from the 4 Raft Consensus tables:

find . -type f -name tables.out -print -exec bash -c "grep consensus {} | awk '{print \$4,\$40,\$41}' | grep -v kv; echo" \;

Examples of Raft bloat:

-- bdr.raft_keep_min_entries = 100

relname                            n_live_tup  n_dead_tup
local_consensus_state                       1         278
local_consensus_snapshot                    1         179
global_consensus_journal                  100       17935
global_consensus_response_journal         100       17935

-- bdr.raft_keep_min_entries = 10000

relname                            n_live_tup  n_dead_tup
local_consensus_state                       1        8848
local_consensus_snapshot                    1          22
global_consensus_journal                10001      226213
global_consensus_response_journal       10001      226188

In the examples above, the bloat percentage is close to 100%, so we can consider the Raft tables to be severely bloated.

2.3- Potential fixes for Raft Bloat

The PGD developers fixed the Raft Vacuum routine to run the equivalent of an ANALYZE after the equivalent of VACUUM FULL is executed, essentially allowing the statistics for the 4 Raft tables to be updated:

Run ANALYZE on the internal raft tables

As the Raft Vacuum routine runs every 24 hours, for most of the workloads this should be enough to keep Raft Bloat under control.

Another fix that helps reduce Raft Bloat is:

Increased default bdr.raft_keep_min_entries to 1000 from 100.

As we saw in section 1.7 above, Raft Pruning essentially UPDATEs the Raft Snapshot entry and DELETEs all entries from the Raft Journal, which means Raft Pruning generates bloat. A higher bdr.raft_keep_min_entries means Raft Pruning is executed fewer times, which results in less Raft Bloat.

The following BDR / PGD versions received the 2 fixes above:

  • BDR 3.7.24 ELS;

  • PGD 4.3.4;

  • PGD 5.4.0.

NOTE: ELS stands for Extended Life Support, for customers with that kind of subscription.

2.4- Caveats about Raft Bloat

Workloads with some or all of the following characteristics can still show elevated Raft Bloat, even with the ANALYZE fix mentioned in section 2.3 above:

  • Running BDR 3.7 or PGD 4;

  • Frequent DDL, for example partition creation every minute or every few minutes: every DDL requires the Global Lock, which needs Raft communication;

  • HARP making intensive use of the Key-Value Store, or any unadvised additional usage of the Key-Value Store: every operation against the Key-Value Store needs Raft communication;

  • bdr.raft_keep_min_entries set too low: the old default of 100 is too low and leads to more Raft Pruning, which leads to more Raft Bloat. This setting was increased to 1000 in BDR 3.7.24 ELS, PGD 4.3.4 and PGD 5.4.0, which is considered an optimal value for the general use case;

  • Multiple recent cluster operations (joins, parts or Logical Standby promotions), which also need Raft communication.

This is why we have internal ticket BDR-5424 being worked on by the PGD developers. There are 2 potential approaches under discussion:

1- Introduce new setting(s) to make the Raft Vacuum interval configurable and not hard-coded at every 24 hours; OR

2- Make the Raft tables no longer have the autovacuum_enabled = false table option, or at least keep only vacuum_truncate = false, which prevents autovacuum from acquiring a strong table lock to perform tail truncation.

2.5- Other recommendations about Raft Bloat

2.5.1- Review Raft Consensus settings to account for Raft Bloat

  • bdr.raft_keep_min_entries: Make sure it's set to 1000 or higher. The higher the value, the less often Raft Pruning is executed, and the less often Raft Pruning is executed, the less bloat. A value of 10000 might still be fine, but check the logs to make sure Raft messages are not slowing down, because depending on the nature of the Raft message the entire Raft Journal needs to be processed.

  • bdr.raft_election_timeout, bdr.raft_global_election_timeout and bdr.raft_group_election_timeout: These settings determine how long to wait for a Raft Election to finish before triggering another Raft Election. The default is 6 seconds for the Global Raft and 3 seconds for the Group Raft. If there is Raft Bloat, messages are slower, so Raft Elections will take longer. In this case, we recommend increasing the Election Timeout, for example to 30 seconds. If the timeout is too low, it's possible to reach a situation where a Raft Election never finishes. If a Raft Election never finishes, there is no Raft Leader, and without a Raft Leader, Raft stops working.

  • bdr.raft_response_timeout: This setting determines how long the node that just sent a Raft Request waits for a Raft Response before retrying the Raft Request. The default is -1, which means it waits as long as the Election Timeout. You might consider setting it to a higher value to take into account that Raft messages are slower because of the Raft Bloat. An illustrative configuration sketch follows below.
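
As an illustration only, this is roughly how these settings might look in postgresql.conf on a cluster suffering from Raft Bloat; the values are examples to adapt, not recommendations, and depending on your BDR / PGD version some of them may require a restart to take effect:

# Illustrative values only; tune to your workload
bdr.raft_keep_min_entries = 1000
bdr.raft_election_timeout = '30s'   # on PGD 5, see also bdr.raft_global_election_timeout
                                    # and bdr.raft_group_election_timeout
bdr.raft_response_timeout = '60s'   # the default of -1 follows the election timeout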

2.5.2- Consider AutoPartition with locally managed tables

If there are several DDL statements for partition maintenance being transparently replicated, each of them needs to take the Global Lock, which means more Raft Requests / Responses, which means more entries in the Raft Journal, which means Raft Pruning happens more often, which means more Raft Bloat.

The PGD AutoPartition feature provides a way to automatically manage partitioned tables, including retention policies. PGD runs DDL commands with DDL replication disabled, so it doesn't require a Global Lock. With AutoPartition, each partitioned table can be:

  • Globally managed: AutoPartition leverages Raft Consensus to make sure the partition operation (create or drop) is done at the same time on all nodes. This behaves similarly to the Global Lock, but only for partitioned tables managed by AutoPartition. This is the default; OR

  • Locally managed: In this case, AutoPartition performs partition operations locally, with DDL replication disabled, according to the rules. As the rules are the same on all nodes, in practice the same partition operations happen at more or less the same time on all nodes, without needing a Global Lock or even Raft Consensus. The only requirement is that the server clocks are kept in sync.

So, in order to significantly reduce Raft Consensus usage, instead of running DDL to manage partitioned tables yourself (which requires the Global Lock, which requires Raft Consensus), consider migrating your partitioned tables to AutoPartition with locally managed tables, as sketched below.
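
As a rough sketch only, a locally managed AutoPartition setup might look like the following; the table name, increment and retention values are placeholders, and the parameter names (in particular managed_locally) should be double-checked against the AutoPartition documentation for your PGD version:

-- Placeholder example: daily partitions, 30-day retention, managed locally
SELECT bdr.autopartition(
  relation              := 'public.measurements',
  partition_increment   := '1 day',
  data_retention_period := '30 days',
  managed_locally       := true
);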

2.5.3- Consider upgrading to PGD 5

If you are using BDR 3.7 or PGD 4 with HARP, keep in mind that HARP intensively using the Key-Value Store means more Raft Requests / Responses, which means more entries in the Raft Journal, which means Raft Pruning happens more often, which means more Raft Bloat.

In PGD 5, HARP was replaced with PGD Proxy, which leverages a LISTEN / NOTIFY mechanism instead of communicating everything through the Key-Value Store as HARP does. PGD 5 still leverages Raft Consensus to store which node is the Write Leader, but this is done in separate, dedicated Raft Consensus instances per subgroup. Being separate, these instances are not affected by, for example, Global Lock activity.

2.6- Workaround for Raft Bloat

It's strongly recommended to upgrade to at least BDR 3.7.24 ELS, PGD 4.3.4 or PGD 5.4.0 because of the fixes that reduce Raft Bloat.

If even after upgrading, Raft Bloat is still an issue (check section 2.4 above), then you can apply the following workaround.

1- On all nodes except the Raft Leader, ONE BY ONE, do the following:

WARNING!!: Do NOT execute it on all nodes! Instead, do rolling VACUUM FULL as described below.

1.1- Disable Raft Consensus:

SELECT bdr.consensus_disable();

1.2- VACUUM FULL the 4 Raft tables:

VACUUM FULL bdr.global_consensus_response_journal;
VACUUM FULL bdr.global_consensus_journal;
VACUUM FULL bdr.local_consensus_state;
VACUUM FULL bdr.local_consensus_snapshot;

1.3- Re-enable Raft Consensus:

SELECT bdr.consensus_enable();

1.4- Wait for at least 5 minutes before doing the same in another node.

2- After applying the above on all nodes except the Raft Leader:

2.1- Transfer Raft Leadership to another node:

SELECT bdr.raft_leadership_transfer('ANOTHER_NODE_NAME');

Replacing ANOTHER_NODE_NAME with the name of another voting Raft Follower.

2.2- Run steps 1.1, 1.2 and 1.3 on the former Raft Leader.

NOTE!!: It's possible to schedule the above to be automatically performed via cron, but the nature of the operations might make it difficult to automate.

NOTE!!: If you are running BDR 3.6, then there is no fix for Raft Bloat (not even the ANALYZE fix), so you need to rely on the workaround above without leveraging bdr.raft_leadership_transfer() (which was introduced in BDR 3.7), which means that once you disable Raft on the Raft Leader, a new Raft Election will automatically start.

3- Problem: Birthday Paradox

The classical mathematical, statistical and computing problem called "Birthday Paradox" states that:

Considering the computed probability of at least two people sharing a birthday versus the number of people, it's counter-intuitive that only 23 people are needed for that probability to exceed 50%.

In PGD Raft Consensus, the Birthday Paradox is what we internally call the issue where the Raft Consensus Worker dies trying to insert an entry with a duplicate Raft Request ID into the Raft Journal.

The timeline for this to happen is as follows. Note how Raft Pruning and Raft Requests / Responses run in separate threads:

1- Raft Leader starts to perform Raft Pruning;

2- While the Raft Leader is doing Raft Pruning, a Raft Follower sends a Raft Request (let's say it has Raft Request ID A) to the Raft Leader. The Raft Follower appends this Raft Request with ID A into its own Raft Journal;

3- While the Raft Leader is doing Raft Pruning, it receives the Raft Request with ID A. The Raft Leader has already started Raft Pruning with an older target ID, so the Raft Request with ID A is included in the Raft Journal of the Raft Leader, but won't be included in the Raft Snapshot this time;

4- Raft Leader finishes doing Raft Pruning and sends the new Raft Snapshot and Raft Journal to all nodes;

5- The Raft Leader then sends a Raft Response about Raft Request ID A to the Raft Follower;

6- The original Raft Follower receives the new Raft Snapshot and Raft Journal from the Raft Leader and replaces its local Raft Snapshot and Raft Journal with them;

7- Raft Follower doesn't receive the Raft Response from the Raft Leader within bdr.raft_response_timeout.

IMPORTANT!!: The Raft Request and/or Raft Response might be slow, potentially because of Raft Bloat as described in the previous section. From our experience, Raft Bloat significantly increases the likelihood of the Birthday Paradox.

8- The Raft Follower then retries the Raft Request with the same ID A. When retrying, as it just applied the new Raft Journal, the Raft Follower assumes the Raft Journal doesn't contain the Request with ID A and that the Request was simply lost;

9- As the Raft Journal actually does contain an entry with the same ID A, which came with the Raft Journal from the Raft Leader, trying to insert another Raft Request with the same ID A into the local Raft Journal triggers a duplicate ID violation in the Raft Journal.

In other words, the Raft Consensus Birthday Paradox issue is a race condition caused by a Raft Follower retrying a Raft Request while the Raft Leader is doing Raft Pruning at the same time.

3.1- Symptoms

Symptoms can vary, for example:

  • Frequent Write Leader changes;

  • HARP Proxy or PGD Proxy not routing to the Write Leader at all, with messages like these:

"err": "ERROR: could not find consensus worker (SQLSTATE 55000)", "sql": "SELECT * FROM bdr.get_raft_status()",
  • One or more nodes not able to run DDL;

  • Cluster operations such as join and part error out with:

could not find raft leader; try again later

Depending on the number of nodes affected by the Birthday Paradox issue, Raft can be partially or entirely impaired, manifesting differently depending on the situation and the nodes impacted. See section 1.1 above for more details about the impact of Raft Consensus.

3.2- Diagnosing the Birthday Paradox

One or more nodes can have their corresponding entries empty in bdr.group_raft_details, for example:

SELECT * FROM bdr.group_raft_details;

node_id  node_name  state  leader_id  current_term  commit_index  nodes  voting_nodes  protocol_version
3326594109  node1  RAFT_LEADER    3326594109  2122  85463589  4  3  4005
2728265565  node2  RAFT_FOLLOWER  3326594109  2122  85463589  4  3  4005
3326374859  node3
4235231024  node4  RAFT_FOLLOWER  3326594109  2122  85463589  4  3  4005

This means that either the node is unreachable or the Raft Consensus worker is down.

If Postgres is up and supposed to be reachable, then you can check the bdr.workers view. Typically the following workers should be up, for example:

worker_pid  worker_role  worker_role_name          worker_subid  worker_commit_timestamp        worker_local_timestamp
29479       1            manager                   \N            \N                             \N
29480       5            bdr consensus 4235231024  \N            \N                             \N
29505       5            bdr autopart 4235231024   \N            \N                             \N
33388       2            receiver                  1214322886    2024-09-17 15:25:23.047782+00  2024-09-17 15:25:23.048896+00
33389       3            writer                    1214322886    \N                             \N
33390       2            receiver                  3477284359    2024-09-17 15:25:23.050432+00  2024-09-17 15:25:23.051373+00
33391       3            writer                    3477284359    \N                             \N

If the bdr consensus worker is not listed here, and you confirm that there is no consensus process running via ps output, then it means the Consensus Worker is down.
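
As a quick cross-check at the operating system level, you can look for the consensus worker process by its title, which includes the worker name as seen in the log examples above:

ps -ef | grep '[b]dr consensus'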

If impacted by the Birthday Paradox, the Consensus Worker actually keeps trying to restart from time to time. In the Postgres logs, you can see messages like this:

2024-08-29 07:17:57.395 UTC 10.52.113.120 [259844 start=2024-08-29 07:17:52 UTC]: u=[postgres] db=[dbaas] app=[pglogical bdr consensus 1136925621 16414:0] c=[] s=[66d020a0.3f704:1] tx=[10/854942:30184769] 23505 ERROR:  duplicate key value violates unique constraint "global_consensus_journal_origin_req_id_key"

This confirms the node is affected by the Birthday Paradox issue.
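
To look for this signature across the Postgres logs on a node, something like the following can help; the log directory below is just a placeholder for your actual log location:

# /var/log/postgres is a placeholder; point this at your actual log directory
grep -r 'global_consensus_journal_origin_req_id_key' /var/log/postgres/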

3.3- Fix for the Birthday Paradox

As with most race conditions, the Birthday Paradox is unpredictable. From experience, we have seen:

  • The issue self-healing when the Raft Follower fell behind the Raft Leader by more than bdr.raft_keep_min_entries messages. This triggers the Raft Follower to request a new Raft Snapshot from the Raft Leader, and sometimes we have seen this being enough for the node to recover from the issue.

  • The issue going away after a Postgres restart, which essentially restarts the Consensus Worker.

It's clear that performing Raft Pruning less often reduces the likelihood of the issue happening. In previous tickets, we suggested increasing bdr.raft_keep_min_entries to 10000, and the issue didn't happen again in that cluster for months.

At the time, the PGD developers also believed that just increasing bdr.raft_keep_min_entries was enough to prevent the Birthday Paradox from happening. So BDR 3.7.24 ELS, PGD 4.3.4 and PGD 5.4.0 changed the default value of bdr.raft_keep_min_entries from 100 to 1000. From the BDR 3.7.24 ELS release notes:

  • Increase default bdr.raft_keep_min_entries to 1000. (BDR-4367)

In a PGD cluster, the Raft Leader periodically prunes the global_consensus_journal and global_consensus_response_journal tables. This pruning process does not occur simultaneously across all replicas. Previously, Raft Consensus journal pruning was based on the journal size as set by the bdr.raft_keep_min_entries configuration option. However, because Raft Consensus requests are retried using the same origin and request ID for every attempt, a situation could arise where the Raft Consensus journal on the Raft Leader is pruned while a retried command is sent to a replica that has yet to prune its journal. This discrepancy could lead to duplicate primary keys in the Raft Consensus journal table, causing the Raft Consensus worker to crash as it cannot insert new entries. To address this issue, the default value of bdr.raft_keep_min_entries has been increased to 1000. This adjustment ensures more consistent and reliable pruning across replicas, preventing duplicate primary keys and maintaining the stability of the consensus worker.

But when the Birthday Paradox happened again on a cluster with bdr.raft_keep_min_entries set to 10000, the PGD developers investigated further and found that there was a separate code path that could still trigger the issue.

A definitive fix was implemented per internal ticket BDR-5275: even when duplicate Request IDs happen in this race condition, they are now handled gracefully instead of resulting in the Consensus Worker dying with a duplicate key violation.

The fix per internal BDR-5275 was delivered as a hotfix on top of BDR 3.7.24 ELS and will be included in the next versions:

  • BDR 3.7.25 ELS;

  • PGD 4.3.6;

  • PGD 5.5.2.

3.4- Workaround for the Birthday Paradox

Once the definitive fix for the Birthday Paradox reaches the BDR / PGD major version you are running, the recommendation is to upgrade.

However, until that release is available, if you are affected by the Birthday Paradox issue, you need to apply the Generic Raft Consensus Fix Routine described in the next section.

4- Generic Raft Consensus Fix Routine

The routine below can be applied to any generic Raft Consensus issue or bug, including the Birthday Paradox (but not Raft Bloat, which has its own workaround in section 2.6 above), in order to attempt to re-establish Raft Consensus.

4.1- Try restarting Postgres

If the Consensus Worker is not running on the node, try restarting Postgres. The reason is that it's possible that the Manager Worker is no longer trying to start the Consensus Worker due to a previously known bug.

If this doesn't fix Raft Consensus on the node, try the next step.

4.2- Try disabling / re-enabling Raft Consensus

On the affected node:

SELECT bdr.consensus_disable();
SELECT bdr.consensus_enable();

If this doesn't fix Raft Consensus on the node, try the next step.

4.3- Try exporting / re-importing the local Raft Consensus snapshot

On the affected node:

SELECT bdr.consensus_disable();
SELECT bdr.consensus_snapshot_import(bdr.consensus_snapshot_export());
SELECT bdr.consensus_enable();

If this doesn't fix Raft Consensus on the node, try the next step.

4.4- Try copying the Raft Consensus snapshot from a working node

Let's imagine that node1 is the affected node.

Now let's pick node2, where we know Raft is working. In the next steps we will export the Raft Snapshot from node2 and import it into node1.

On both nodes, disable Raft Consensus:

SELECT bdr.consensus_disable();

On node2, export the Raft Consensus snapshot to a file:

\copy (SELECT snap FROM bdr.consensus_snapshot_export() AS snap) TO 'snap.data'

On node2, re-enable Raft Consensus:

SELECT bdr.consensus_enable();

Copy the file snap.data to node1, as postgres or enterprisedb user:

scp snap.data node1:~/

On node1, import the Raft Snapshot:

CREATE TEMPORARY TABLE raftsnap(snap bytea);
\copy raftsnap FROM 'snap.data'
SELECT bdr.consensus_snapshot_import(snap) FROM raftsnap;

On node1, re-enable Raft Consensus:

SELECT bdr.consensus_enable();

If this doesn't fix Raft Consensus on the node, try the next step.

4.5- Try parting the defective node(s)

If there is still Raft Consensus majority, then from any working node you can try to part the defective node out of the cluster:

SELECT bdr.part_node(
  node_name           := 'badnode',
  wait_for_completion := true,
  force               := false
);

If you don't have Raft Consensus majority, then from any working node you can try to force-part the defective node:

SELECT bdr.part_node(
  node_name := 'badnode',
  force     := true
);

If this still doesn't fix Raft Consensus, try the next step.

4.6- Try resetting Raft Consensus on the entire cluster

IMPORTANT!!: This should be attempted only as a last resort, because it will take down Raft Consensus on all nodes, preventing DDL and all the other features maintained by Raft Consensus. If the application connects through HARP Proxy or PGD Proxy, this will cause a disruption!

On all nodes, disable Raft Consensus:

SELECT bdr.consensus_disable();

On all nodes, check that the Consensus worker is gone:

bdrdb=# SELECT * FROM bdr.workers;
 worker_pid | worker_role | worker_role_name | worker_subid
------------+-------------+------------------+--------------
       6492 |           8 | taskmgr          |
       6494 |           2 | receiver         |    889088549
       6490 |           1 | manager          |
       6493 |          10 | routing_manager  |
       6495 |           2 | receiver         |   1615258951
       6496 |           3 | writer           |    889088549
       6497 |           3 | writer           |   1615258951
       6498 |           3 | writer           |    889088549
       6499 |           3 | writer           |   1615258951
(9 rows)

On any node, run the following command to TRUNCATE the Raft Snapshot and Journal on all nodes:

SELECT jsonb_pretty(
  bdr.run_on_all_nodes($$
    TRUNCATE bdr.global_consensus_journal;
    TRUNCATE bdr.local_consensus_snapshot;
    TRUNCATE bdr.global_consensus_response_journal;
    UPDATE bdr.local_consensus_state SET apply_index = 0, current_term = 1;
  $$)
);

On all nodes, enable Raft Consensus:

SELECT bdr.consensus_enable();

Wait a couple of minutes and check Raft Consensus status on all nodes:

SELECT * FROM bdr.group_raft_details ORDER BY 4, 3;

The output should be similar to:

-[ RECORD 1 ]----+--------------
instance_id      | 1
node_id          | 4245713078
node_name        | bdr1
node_group_name  | rb01
state            | RAFT_FOLLOWER
leader_type      | NODE
leader_id        | 1630662325
leader_name      | bdr3
current_term     | 1
commit_index     | 1
nodes            | 3
voting_nodes     | 3
protocol_version | 5001
-[ RECORD 2 ]----+--------------
instance_id      | 1
node_id          | 1609217846
node_name        | bdr2
node_group_name  | rb01
state            | RAFT_FOLLOWER
leader_type      | NODE
leader_id        | 1630662325
leader_name      | bdr3
current_term     | 1
commit_index     | 1
nodes            | 3
voting_nodes     | 3
protocol_version | 5001
-[ RECORD 3 ]----+--------------
instance_id      | 1
node_id          | 1630662325
node_name        | bdr3
node_group_name  | rb01
state            | RAFT_LEADER
leader_type      | NODE
leader_id        | 1630662325
leader_name      | bdr3
current_term     | 1
commit_index     | 1
nodes            | 3
voting_nodes     | 3
protocol_version | 5001

Note how Raft is working again, with commit_index and current_term reset to 1.

Wait a couple of minutes and check Raft status again, to confirm the commit_index is advancing.

While not recommended, because it's one of the most disruptive techniques to recover Raft Consensus, from our experience this has proved to fix most of the situations where the previous attempts failed. But if this still doesn't fix Raft Consensus, try the next step.

4.7- Get the cluster down to a single node, reconfigure PGD and rebuild all the other nodes

If none of the attempts above work, unfortunately you might need to:

  • First of all, open a Support Ticket with us;

  • Grab a Lasso report from all nodes for further investigation and attach them to the ticket;

  • Part all nodes except one;

  • Drop the BDR extension;

  • Re-configure PGD on the node;

  • Rebuild all the other nodes.
