This article provides an introduction to Raft Consensus (what it is for, how it works, and the concepts involved), and then discusses in depth the two most common problems with Raft Consensus:
- Bloat in the Raft Consensus Journal tables;
- Raft Consensus duplicate request IDs (also known as the "Birthday Paradox").
For each issue, we outline common symptoms, how to diagnose it, and possible fixes and workarounds where applicable.
At the end, we provide a generic procedure to fix Raft Consensus issues, ordered from the safest to the most intrusive approach, depending on the situation.
Raft stands for Replicated Available Fault Tolerance. It's a consensus algorithm that uses votes from a quorum of machines in a distributed cluster. Given the total number of nodes, Raft requires a majority (the total number of nodes divided by 2, rounded down, plus 1) to vote for any decision; for example, in a 5-node cluster the majority is 3.
The purpose of the Raft Consensus algorithm is to allow multiple servers to agree on the state of something. For more details about the Raft Consensus algorithm, you can find a list of resources here.
PGD has its own implementation of the Raft Consensus algorithm. In PGD, Raft Consensus is not used for data replication. Instead, PGD leverages Raft Consensus for a number of important features, listed in the table below. Here we consider Raft Consensus not to be working properly when:
- There is no majority of nodes online; or
- It is broken because of bugs, as discussed later in this article; or
- It is extremely slow because of bugs, as discussed later in this article.
Feature | Description | What happens if Raft is not working properly?
---|---|---
Global DDL Lock, Global DML Lock | Required for every transparently replicated DDL statement | DDL is not possible by default
Allocate new chunks for Galloc Sequences | Required every time a node runs out of values for a Galloc Sequence | INSERTs on tables using a Galloc Sequence start to fail
Cluster maintenance operations | Join node, part node, promote standby | These operations are not possible. Part is possible if forced
Advance the BDR Group Replication Slot position | Also known as the cluster-wide LSN horizon | WAL files start to accumulate on all nodes, risking disk exhaustion. The slot also holds back the xmin horizon, preventing autovacuum from working properly, which leads to performance issues and even transaction wraparound
AutoPartition | For maintenance of globally managed partitioned tables | Without new partitions to receive new data, INSERTs fail
Eager Replication | In Eager, every transaction is orchestrated by Raft | Every Eager transaction fails
Switch CAMO to Local Mode if require_raft is enabled | CAMO can be configured to require Raft before switching to Local Mode, to prevent split brain | CAMO configured like that doesn't switch to Local Mode and, if the CAMO partner is really offline, COMMITs start to hang indefinitely
Routing to the Write Leader | Starting with PGD 5, there are separate Raft instances dedicated to keeping track of which node is the Write Leader, per location. Before PGD 5, the same Raft instance is used for everything listed here | Without Raft Consensus to tell HARP Proxy or PGD Proxy which node to point to, applications won't reach the database, resulting in downtime
As you can see above, Raft Consensus is essential for PGD, so it's strongly recommended to monitor it at all times. You can monitor Raft Consensus with:
-- Summary:
SELECT * FROM bdr.monitor_group_raft();
-- Details:
SELECT * FROM bdr.group_raft_details;
Examples:
SELECT * FROM bdr.monitor_group_raft();
OK Raft Consensus is working correctly
SELECT * FROM bdr.group_raft_details;
node_id node_name state leader_id current_term commit_index nodes voting_nodes protocol_version
3326594109 node1 RAFT_LEADER 3326594109 2122 85463589 4 3 4005
2728265565 node2 RAFT_FOLLOWER 3326594109 2122 85463589 4 3 4005
3326374859 node3 RAFT_FOLLOWER 3326594109 2122 85463589 4 3 4005
4235231024 node4 RAFT_FOLLOWER 3326594109 2122 85463589 4 3 4005
Both bdr.monitor_group_raft() and bdr.group_raft_details leverage bdr.run_on_all_nodes() to query the Raft Consensus state from all nodes.
Locally, the Consensus Worker keeps the Raft Consensus state for the node in the bdr.local_consensus_state table.
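For example, you can inspect this local state directly (the exact columns vary by BDR / PGD version):
SELECT * FROM bdr.local_consensus_state;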
One of the voting nodes is elected to be the Raft Leader. In PGD, Data nodes and Witness nodes are voting nodes and can be Raft Leader.
The other nodes that are not the Raft Leader are known as Raft Followers.
If the Raft Followers lose connection to the Raft Leader, they start a new Raft Election, putting themselves forward as Raft Candidates and voting for themselves or each other until a new Raft Leader is elected.
Every time a new Raft Leader is elected, the Raft Term is incremented by 1. This is the current_term field as seen in the bdr.group_raft_details view.
Raft communication starts with a Raft Request. For example, a node needs to perform DDL, so it asks the Raft Leader for a Global Lock on a table.
The Raft Leader then consults the other voting nodes in the cluster (including itself) about the request. For example, for a Global Lock to be acquired cluster-wide, each voting node will only approve if it can also acquire a local lock on the table. Each time the Raft Leader contacts a node to ask for its vote, that is also a Raft Request, and each voting node responds with its vote in a Raft Response.
Then the Raft Leader sends a final Raft Response to the node that originated the Raft Request. For example, with the Global Lock acquired, the requester node can finally perform the DDL.
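While a Global Lock is being negotiated or held, you can observe it from SQL. A minimal check, assuming your BDR / PGD version exposes the bdr.global_locks view (columns vary by version):
-- Global Locks currently being acquired or held, as seen from this node
SELECT * FROM bdr.global_locks;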
Each Raft message (Request or Response) has an ID, also called the Commit Index. This is the commit_index field as seen in the bdr.group_raft_details view.
Every piece of information stored by the Raft Consensus algorithm is frozen into a set of information called the Raft Snapshot. Each node keeps a copy of the Raft Snapshot in the bdr.local_consensus_snapshot table.
In PGD 4 and older, as there is a single Raft instance, there is a single Raft Snapshot. Starting with PGD 5, as there can be multiple Raft instances (one per group, including the top group), each Raft instance has a separate Raft Snapshot.
The Raft Journal contains all the recent Raft Requests and Responses exchanged among all nodes in the cluster.
Here, by recent we mean since the last Raft Pruning as explained in the section below.
- Table bdr.global_consensus_journal contains all Raft messages, whether Raft Requests or Responses, originated from all nodes.
- Table bdr.global_consensus_response_journal contains every Raft Request and its corresponding Raft Response side by side.

Both tables are replicated among the nodes and contain the actual Raft message encoded in their payload fields.

- View bdr.global_consensus_journal_details is equivalent to the bdr.global_consensus_response_journal table, but after applying the function bdr.decode_message_payload() to decode the payloads.
The status of each piece of information maintained by Raft Consensus (let's say key A, for the sake of this example) is composed of 2 parts:
- First the node checks the value of A in the Raft Snapshot;
- Then the node logically walks through all the messages in the Raft Journal, which can contain messages changing the value of A;
- After the node has processed the Raft Journal, it knows the current value of A.
Because of this, the Raft Journal can't grow indefinitely. So every time the number of messages in the Raft Journal reaches bdr.raft_keep_min_entries, the Raft Leader starts a process called Raft Pruning, which is composed of the following steps:
- First the Raft Leader checks which is the newest Raft Request ID in the Raft Journal and considers this the target ID;
- Then the Raft Leader processes the Raft Journal, i.e., walks over every message in the Raft Journal up to the target ID, changing, inserting or removing keys in the Raft Snapshot;
- When the Raft Leader reaches the target ID, it DELETEs all the messages from the Raft Journal up to the target ID and sends the updated Raft Snapshot to all Raft Followers.
Raft Pruning is important because the more messages there are in the Raft Journal, the slower some Raft messages will be, for example if they need to check the current value of a key stored in Raft Consensus.
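A quick way to see how close the Raft Journal is to the next Raft Pruning is to compare its current size against the configured threshold, for example:
-- Current pruning threshold
SHOW bdr.raft_keep_min_entries;
-- Number of messages currently in the Raft Journal on this node
SELECT count(*) FROM bdr.global_consensus_journal;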
The section above explains why the Raft Journal can't grow indefinitely, which is handled by the Raft Pruning operation that happens every time the Raft Journal reaches bdr.raft_keep_min_entries messages.
However, it's expected that tables receiving UPDATEs and DELETEs can accumulate dead rows, also called bloat.
Usually Postgres autovacuum would take care of the bloat on the Raft Consensus tables, but the following tables have autovacuum_enabled = false:
- bdr.local_consensus_state;
- bdr.local_consensus_snapshot;
- bdr.global_consensus_journal;
- bdr.global_consensus_response_journal.
This is by design, to prevent autovacuum from interfering with the Raft Consensus operation.
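You can confirm this storage option directly from the Postgres catalogs, for example:
-- Show the per-table options of the 4 Raft Consensus tables
SELECT c.relname, c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'bdr'
  AND c.relname IN ('local_consensus_state', 'local_consensus_snapshot',
                    'global_consensus_journal', 'global_consensus_response_journal');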
However, as bloat on these tables shouldn't be allowed to grow indefinitely, the Consensus Worker runs a procedure called Raft Vacuum every 24 hours, and also every time the Consensus Worker is started on the node.
From Postgres logs, we can see when Raft Vacuum was executed with messages like:
Aug 22 17:02:46 rt39916n1 postgres[9616]: [12-2] 2024-08-22 17:02:46 UTC [postgres@/pglogical bdr consensus 949376428 17073:0/bdrdb:9616]: [2] STATEMENT: Raft VACUUM
Raft Vacuum runs the low-level equivalent of a VACUUM FULL on the 4 tables above.
Besides the information Raft Consensus already tracks by default (such as the Global Lock, which node is the Write Leader, the BDR Group Slot LSN, etc.), which is crucial for PGD cluster health and operation, PGD provides an API for persisting arbitrary information in the Raft Snapshot. This API consists of:
- bdr.consensus_kv_data: Pseudo-table where it's possible to see, in tabular format, the custom values stored in the Raft Snapshot. Built-in values are not listed here;
- bdr.consensus_kv_store(): Function that allows users to write any key and its corresponding value into the Key-Value Store. Essentially this function performs a Raft Request to the Raft Leader, which then propagates the value to all Raft Followers;
- bdr.consensus_kv_fetch(): Function that allows users to read the current value of a given key from the Key-Value Store. As explained above, the Raft Snapshot doesn't contain the current values, so bdr.consensus_kv_fetch() might need to read the Raft Snapshot and process the Raft Journal in order to get the current value of a key.
HARP stores all its information into the Raft Key-Value Store, leveraging the 2 functions above.
IMPORTANT!!: Even though nothing prevents users from leveraging the Key-Value Store, it's strongly advised not to use the Key-Value Store for anything besides HARP. The reason is to avoid overpopulating and slowing down Raft Consensus, which is needed for other important aspects of PGD. Starting with PGD 5, PGD Proxy no longer uses the Key-Value Store and, in a future PGD version, the Key-Value Store will be removed.
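For troubleshooting, read-only inspection of the Key-Value Store is harmless. A minimal sketch (replace 'some/key' with an actual key returned by the first query; the key names depend on what HARP stores in your environment):
-- List the custom values currently stored in the Key-Value Store
SELECT * FROM bdr.consensus_kv_data;
-- Fetch the current value of a single key
SELECT bdr.consensus_kv_fetch('some/key');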
Now that we are familiar with the concepts about Raft Consensus, let's talk about the 2 main problems involving Raft Consensus in the next 2 sections.
We found a problem in the Raft Vacuum routine: while it was running the low-level equivalent of VACUUM FULL, it was not running the equivalent of ANALYZE, so the statistics for the 4 Raft Consensus tables were not being updated.
With outdated statistics, queries on these tables (essentially any internal Raft operation such as Requests, Responses, Pruning, Key-Value Store, etc) can have severely degraded performance.
Degraded Raft performance can impact all operations controlled by Raft, listed in section 1.1 above.
Time-sensitive operations requiring Raft Consensus are usually the most impacted, because if Raft doesn't respond in time, something unintended happens. For example:
By default, bdr.raft_log_min_message_duration is set to 5s. Any Raft message taking longer than bdr.raft_log_min_message_duration will be logged to the Postgres logs, for example:
Aug 22 16:51:33 rt39916n1 postgres[8335]: [25-1] 2024-08-22 16:51:33 UTC [postgres@/pglogical manager 17073/bdrdb:8335]: [32] LOG: BDR consensus request id: 2332851996956199055 took 33059ms
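To check the current threshold on a node, and optionally lower it while diagnosing so that more slow Raft messages get logged, something along these lines can be used (a temporary, hypothetical tuning; depending on your BDR / PGD version this parameter may require a restart rather than a reload, and it should be reverted after the diagnosis):
SHOW bdr.raft_log_min_message_duration;
-- Optional and temporary: log any Raft message slower than 1 second
ALTER SYSTEM SET bdr.raft_log_min_message_duration = '1s';
SELECT pg_reload_conf();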
As HARP is constantly storing and retrieving data from the Raft Key-Value Store, which, as explained in section 1.9 above, is completely controlled by Raft Consensus, HARP in general can be impacted by degraded Raft performance. Common symptoms include but are not limited to:
- Unintended / unexpected failovers;
- Constant Write Leader flapping;
- Complete inability to route traffic to the Write Leader because Raft is timing out.
A few example snippets of HARP log messages indicating Raft Bloat are:
"err": "timeout: context deadline exceeded", "sql": "select * from bdr.consensus_kv_fetch($1)"
no leader to connect to
lost dcs connection, cleaning connection and client information
As in BDR 3.7 and PGD 4 the application reaches the database through HARP, Raft Bloat can result in severe situations, even Severity 1 incidents.
Each transparently replicated DDL operation needs to acquire a Global Lock, which is a Raft-orchestrated operation. If Raft messages are slow because of Raft Bloat, it's possible that the entire DDL transaction will error out, because bdr.global_lock_timeout (1min by default) wasn't enough.
Sample error message:
ERROR: canceling statement due to global lock timeout
CONTEXT: while handling global lock GLOBAL_LOCK_DDL in database 16926 requested by node-id 1609217846 in local acquire stage acquiring_local
for ddl_epoch 20, ddl_epoch_lsn 0/0
"xact rep": true, "ddl rep": true, "ddl locking": all
The Postgres view pg_stat_all_tables, or the Lasso report file postgresql/dbs/BDRDB/tables.out, contains the table statistics. To determine whether, and by what percentage, a table is bloated, we consider the proportion of dead tuples over the total number of tuples:
bloat_perc = n_dead_tup / (n_live_tup + n_dead_tup)
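On a live node, a query along these lines computes the same percentage directly from pg_stat_all_tables:
-- Bloat percentage of the 4 Raft Consensus tables
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS bloat_perc
FROM pg_stat_all_tables
WHERE schemaname = 'bdr'
  AND relname IN ('local_consensus_state', 'local_consensus_snapshot',
                  'global_consensus_journal', 'global_consensus_response_journal');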
If you have multiple Lasso reports (usually from different nodes in a PGD cluster) extracted in a folder, you can run a command like the one below to get the n_live_tup and n_dead_tup values for the 4 Raft Consensus tables:
find . -type f -name tables.out -print -exec bash -c "grep consensus {} | awk '{print \$4,\$40,\$41}' | grep -v kv; echo" \;
Examples of Raft bloat:
-- bdr.raft_keep_min_entries = 100
relname n_live_tup n_dead_tup
local_consensus_state 1 278
local_consensus_snapshot 1 179
global_consensus_journal 100 17935
global_consensus_response_journal 100 17935
-- bdr.raft_keep_min_entries = 10000
relname n_live_tup n_dead_tup
local_consensus_state 1 8848
local_consensus_snapshot 1 22
global_consensus_journal 10001 226213
global_consensus_response_journal 10001 226188
In the examples above, the bloat percentage is close to 100% (for instance, 17935 / (100 + 17935) ≈ 99.4% for global_consensus_journal in the first example), so we can consider that the Raft tables are severely bloated.
PGD developers fixed the Raft Vacuum routine to run the equivalent of an ANALYZE after the equivalent of VACUUM FULL is executed, essentially allowing the statistics for the 4 Raft tables to be updated:
Run ANALYZE on the internal raft tables
As the Raft Vacuum routine runs every 24 hours, for most workloads this should be enough to keep Raft Bloat under control.
Another fix that helps reduce Raft Bloat is:
Increased default bdr.raft_keep_min_entries to 1000 from 100.
As we saw in section 1.7 above, Raft Pruning essentially UPDATEs the Raft Snapshot entry and DELETEs all entries from the Raft Journal, which means Raft Pruning generates Raft Bloat. A higher bdr.raft_keep_min_entries means Raft Pruning is executed fewer times, which results in less Raft Bloat.
The following BDR / PGD versions received the 2 fixes above:
- BDR 3.7.24 ELS;
- PGD 4.3.4;
- PGD 5.4.0.
NOTE: ELS stands for Extended Life Support, for customers with that kind of subscription.
For workloads with some or all of the following characteristics:
- Running BDR 3.7 or PGD 4;
- Frequent DDL (for example, partition creation every minute or every few minutes), which can increase Raft Bloat because every DDL requires the Global Lock, which needs Raft communication;
- HARP intensively making use of the Key-Value Store, or any unadvised additional usage of the Key-Value Store, which can increase Raft Bloat because every operation against the Key-Value Store needs Raft communication;
- Setting bdr.raft_keep_min_entries too low. The old default of 100 is too low, which leads to more Raft Pruning, which leads to more Raft Bloat. This setting was increased to 1000 in BDR 3.7.24 ELS, PGD 4.3.4 and 5.4.0, which is considered an optimal value for the general use case;
- Multiple recent cluster operations (joins, parts or Logical Standby promotions), which also need Raft communication.
All of the above can still result in elevated Raft Bloat, even with the ANALYZE fix mentioned in section 2.3 above.
This is why internal ticket BDR-5424 is being worked on by the PGD developers. There are 2 potential approaches under discussion:
1- Introduce new setting(s) to make the Raft Vacuum interval configurable and not hard-coded at every 24 hours; OR
2- Remove the autovacuum_enabled = false table option from the Raft tables, or at least set only vacuum_truncate = false, which prevents autovacuum from acquiring a strong table lock to perform tail truncation.
- bdr.raft_keep_min_entries: Make sure it's set to 1000 or higher. The higher the value, the less often Raft Pruning is executed, and the fewer times Raft Pruning is executed, the less the bloat. A value of 10000 might still be good, but check the logs to make sure Raft messages are not slowing down, because depending on the nature of the Raft message the entire Raft Journal needs to be processed.
- bdr.raft_election_timeout, bdr.raft_global_election_timeout and bdr.raft_group_election_timeout: These settings determine how long to wait for a Raft Election to finish before triggering another Raft Election. The default is 6 seconds for the Global Raft and 3 seconds for the Group Raft. If there is Raft Bloat, messages are slower, so Raft Elections will naturally take longer. In this case, we recommend increasing the Election Timeout, to 30 seconds for example. If this value is too low, it's possible to reach a situation where a Raft Election never finishes. If a Raft Election never finishes, there is no Raft Leader, and without a Raft Leader, Raft stops working.
- bdr.raft_response_timeout: This setting determines how long the node that just sent a Raft Request waits for a Raft Response before retrying the Raft Request. The default is -1, which means it waits for the same time as the Election Timeout. You might consider setting it to a higher value to take into account that Raft messages are slower because of the Raft Bloat.
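These parameters are normally set in postgresql.conf or via ALTER SYSTEM. A hypothetical example (the values are illustrative, not a universal recommendation, and depending on your BDR / PGD version the timeout parameters may use different names or require a restart instead of a reload):
ALTER SYSTEM SET bdr.raft_keep_min_entries = 1000;
ALTER SYSTEM SET bdr.raft_election_timeout = '30s';
SELECT pg_reload_conf();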
If there are several DDL statements for partition maintenance being transparently replicated, each of them needs to take the Global Lock, which means more Raft Requests / Responses, which means more entries in the Raft Journal, which means Raft Pruning happens more often, which means more Raft Bloat.
The PGD AutoPartition feature provides a way to automatically manage partitioned tables, including retention policies. AutoPartition runs its DDL commands with DDL replication disabled, so it doesn't require a Global Lock. With AutoPartition, each partitioned table can be:
- Globally managed: AutoPartition leverages Raft Consensus to make sure the partition operation (create or drop) is done at the same time on all nodes. This behaves similarly to the Global Lock, but only for partitioned tables managed by AutoPartition. This is the default; OR
- Locally managed: In this case, AutoPartition just performs partition operations locally, with DDL replication disabled, according to the rules. As the rules are the same on all nodes, in practice the same partition operations happen more or less at the same time on all nodes, without needing a Global Lock or even Raft Consensus. The only requirement is that the server clocks are kept in sync.
So, in order to significantly reduce Raft Consensus usage, instead of running DDL to manage partitioned tables yourself (which requires the Global Lock, which requires Raft Consensus), consider migrating your partitioned tables to AutoPartition using locally managed tables, as in the example below.
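A hypothetical example of enrolling a time-partitioned table as locally managed (the table name and values are illustrative; check your BDR / PGD version's documentation for the exact bdr.autopartition() parameters):
-- Illustrative only: daily partitions, 30-day retention, managed locally (no Global Lock / Raft)
SELECT bdr.autopartition(
    relation := 'public.measurements',
    partition_increment := '1 day',
    data_retention_period := '30 days',
    managed_locally := true
);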
If you are using BDR 3.7 or PGD 4 with HARP, keep in mind that HARP intensively using the Key-Value Store means more Raft Requests / Responses, which means more entries in the Raft Journal, which means Raft Pruning happens more often, which means more Raft Bloat.
In PGD 5, HARP was replaced with PGD Proxy, which leverages a LISTEN / NOTIFY mechanism instead of communicating everything through the Key-Value Store as HARP does. PGD 5 still leverages Raft Consensus to store which node is the Write Leader, but this is done in separate, dedicated Raft Consensus instances, one per subgroup. Being separate, these instances are not affected by the Global Lock, for example.
It's strongly recommended to upgrade to at least BDR 3.7.24 ELS, PGD 4.3.4 or PGD 5.4.0 because of the fixes that reduce Raft Bloat.
If, even after upgrading, Raft Bloat is still an issue (check section 2.4 above), then you can apply the following workaround.
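Before starting, identify the current Raft Leader, for example:
SELECT node_name, state FROM bdr.group_raft_details WHERE state = 'RAFT_LEADER';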
1- On all nodes except the Raft Leader, ONE BY ONE, do the following:
WARNING!!: Do NOT execute this on all nodes at the same time! Instead, do a rolling VACUUM FULL as described below.
1.1- Disable Raft Consensus:
SELECT bdr.consensus_disable();
1.2- VACUUM FULL
the 4 Raft tables:
VACUUM FULL bdr.global_consensus_response_journal;
VACUUM FULL bdr.global_consensus_journal;
VACUUM FULL bdr.local_consensus_state;
VACUUM FULL bdr.local_consensus_snapshot;
1.3- Re-enable Raft Consensus:
SELECT bdr.consensus_enable();
1.4- Wait for at least 5 minutes before doing the same in another node.
2- After applying the above on all nodes except the Raft Leader:
2.1- Transfer Raft Leadership to another node:
SELECT bdr.raft_leadership_transfer('ANOTHER_NODE_NAME');
Replace ANOTHER_NODE_NAME with the name of another voting Raft Follower.
2.2- Run steps 1.1, 1.2 and 1.3 on the former Raft Leader.
NOTE!!: It's possible to schedule the above to be automatically performed via cron, but the nature of the operations might make it difficult to automate.
NOTE!!: If you are running BDR 3.6, there is no fix for Raft Bloat (not even the ANALYZE fix), so you need to rely on the workaround above without leveraging bdr.raft_leadership_transfer() (which was introduced in BDR 3.7). This means that once you disable Raft on the Raft Leader, a new Raft Election will automatically start.
The classical mathematical, statistical and computing problem called "Birthday Paradox" states that:
Considering the computed probability of at least two people sharing a birthday versus the number of people, it's counter-intuitive that only 23 people are needed for that probability to exceed 50%.
In PGD Raft Consensus, the "Birthday Paradox" is the name we use internally for the issue where the Raft Consensus worker dies trying to insert an entry with a duplicate Raft Request ID into the Raft Journal.
The timeline for this to happen is as follows. Note how Raft Pruning and Raft Requests / Responses run in separate threads:
1- Raft Leader starts to perform Raft Pruning;
2- While the Raft Leader is doing Raft Pruning, a Raft Follower sends a Raft Request (let's say it has Raft Request ID A) to the Raft Leader. The Raft Follower appends this Raft Request with ID A to its own Raft Journal;
3- While the Raft Leader is doing Raft Pruning, it receives the Raft Request with ID A. The Raft Leader has already started Raft Pruning with an older target ID, so the Raft Request with ID A is included in the Raft Journal of the Raft Leader, but won't be included in the Raft Snapshot this time;
4- Raft Leader finishes doing Raft Pruning and sends the new Raft Snapshot and Raft Journal to all nodes;
5- The Raft Leader then sends a Raft Response about Raft Request ID A to the Raft Follower;
6- The original Raft Follower receives the new Raft Snapshot and Raft Journal from the Raft Leader and replaces its local Raft Snapshot and Raft Journal with them;
7- The Raft Follower doesn't receive the Raft Response from the Raft Leader within bdr.raft_response_timeout.
IMPORTANT!!: The Raft Request and/or Raft Response might be slow, potentially because of the Raft Bloat described in the previous section. From our experience, Raft Bloat significantly increases the likelihood of the Birthday Paradox.
8- The Raft Follower then retries the Raft Request with the same ID A. When retrying, as it has just applied the new Raft Journal, the Raft Follower assumes the Raft Journal doesn't contain the Request with ID A and that the Request was simply lost;
9- As the Raft Journal actually does have an entry with the same ID A, which came within the Raft Journal from the Raft Leader, trying to insert another Raft Request with the same ID A into the local Raft Journal triggers the duplicate ID violation in the Raft Journal.
In other words, the Raft Consensus Birthday Paradox issue is a race condition caused by a Raft Follower retrying a Raft Request while the Raft Leader is doing Raft Pruning at the same time.
Symptoms can vary, for example:
- Frequent Write Leader changes;
- HARP Proxy or PGD Proxy not routing to the Write Leader at all, with messages like these:
"err": "ERROR: could not find consensus worker (SQLSTATE 55000)", "sql": "SELECT * FROM bdr.get_raft_status()",
- One or more nodes not able to run DDL;
- Cluster operations such as join and part erroring out with:
could not find raft leader; try again later
Depending on the number of nodes affected by the Birthday Paradox issue, Raft can be partially or entirely impaired, manifesting differently depending on the situation and the nodes impacted. See section 1.1 above for more details about the impact of Raft Consensus not working properly.
One or more nodes can have their corresponding entries empty in bdr.group_raft_details, for example:
SELECT * FROM bdr.group_raft_details;
node_id node_name state leader_id current_term commit_index nodes voting_nodes protocol_version
3326594109 node1 RAFT_LEADER 3326594109 2122 85463589 4 3 4005
2728265565 node2 RAFT_FOLLOWER 3326594109 2122 85463589 4 3 4005
3326374859 node3
4235231024 node4 RAFT_FOLLOWER 3326594109 2122 85463589 4 3 4005
This means that either the node is unreachable or the Raft Consensus worker is down.
If Postgres is up and supposed to be reachable, then you can check the bdr.workers view. Typically the following workers should be up, for example:
worker_pid worker_role worker_role_name worker_subid worker_commit_timestamp worker_local_timestamp
29479 1 manager \N \N \N
29480 5 bdr consensus 4235231024 \N \N
29505 5 bdr autopart 4235231024 \N \N
33388 2 receiver 1214322886 2024-09-17 15:25:23.047782+00 2024-09-17 15:25:23.048896+00
33389 3 writer 1214322886 \N \N
33390 2 receiver 3477284359 2024-09-17 15:25:23.050432+00 2024-09-17 15:25:23.051373+00
33391 3 writer 3477284359 \N \N
If the bdr consensus worker is not listed here, and you confirm that there is no consensus process running via the ps output, then the Consensus Worker is down.
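A quick way to check specifically for the Consensus Worker from SQL is, for example:
-- No rows returned means the Consensus Worker is not running on this node
SELECT worker_pid, worker_role_name, worker_subid
FROM bdr.workers
WHERE worker_role_name LIKE '%consensus%';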
If impacted by the Birthday Paradox, the Consensus Worker is actually trying to restart from time to time. From Postgres logs, you can see messages like this:
2024-08-29 07:17:57.395 UTC 10.52.113.120 [259844 start=2024-08-29 07:17:52 UTC]: u=[postgres] db=[dbaas] app=[pglogical bdr consensus 1136925621 16414:0] c=[] s=[66d020a0.3f704:1] tx=[10/854942:30184769] 23505 ERROR: duplicate key value violates unique constraint "global_consensus_journal_origin_req_id_key"
This confirms the node is affected by the Birthday Paradox issue.
As with most race conditions, the Birthday Paradox is unpredictable. From experience, we have seen:
- The issue self-healing when the Raft Follower fell behind the Raft Leader by more than bdr.raft_keep_min_entries messages. This triggers the Raft Follower to request a new Raft Snapshot from the Raft Leader, and sometimes we have seen this being enough for the node to recover from the issue;
- The issue going away after a Postgres restart, which essentially restarts the Consensus Worker.
It's clear that performing Raft Pruning less often reduces the likelihood of the issue happening. In previous tickets, we suggested increasing bdr.raft_keep_min_entries to 10000, and the issue didn't happen again in that cluster for months.
At the time, the PGD developers also believed that increasing bdr.raft_keep_min_entries was enough to prevent the Birthday Paradox from happening. So BDR 3.7.24 ELS, PGD 4.3.4 and PGD 5.4.0 changed the default value of bdr.raft_keep_min_entries from 100 to 1000. From the BDR 3.7.24 ELS release notes:
- Increase default bdr.raft_keep_min_entries to 1000. (BDR-4367)
In a PGD cluster, the Raft Leader periodically prunes the global_consensus_journal and global_consensus_response_journal tables. This pruning process does not occur simultaneously across all replicas. Previously, Raft Consensus journal pruning was based on the journal size as set by the bdr.raft_keep_min_entries configuration option. However, because Raft Consensus requests are retried using the same origin and request ID for every attempt, a situation could arise where the Raft Consensus journal on the Raft Leader is pruned while a retried command is sent to a replica that has yet to prune its journal. This discrepancy could lead to duplicate primary keys in the Raft Consensus journal table, causing the Raft Consensus worker to crash as it cannot insert new entries. To address this issue, the default value of bdr.raft_keep_min_entries has been increased to 1000. This adjustment ensures more consistent and reliable pruning across replicas, preventing duplicate primary keys and maintaining the stability of the consensus worker.
But when the Birthday Paradox happened again on a cluster with bdr.raft_keep_min_entries set to 10000, the PGD developers investigated further and found that there was a separate code path that could still trigger the Birthday Paradox issue.
A definitive fix was implemented per internal ticket BDR-5275: even when duplicate Request IDs occur in this race condition, they are handled gracefully instead of resulting in the Consensus Worker dying with a duplicate key violation.
The fix per internal ticket BDR-5275 was delivered as a hotfix on top of BDR 3.7.24 ELS and will be included in the following versions:
- BDR 3.7.25 ELS;
- PGD 4.3.6;
- PGD 5.5.2.
Once the definitive fix for the Birthday Paradox reaches the BDR / PGD major version you are running, the recommendation is to upgrade.
However, before the release, if you are affected by the Birthday Paradox issue, then you need to apply the Generic Raft Consensus Fix Routine as described in the next section.
The routine below can be applied to any generic issue or bug in order to attempt to re-establish Raft Consensus, including the Birthday Paradox, but not Raft Bloat.
If the Consensus Worker is not running on the node, try restarting Postgres. The reason is that it's possible that the Manager Worker is no longer trying to start the Consensus Worker due to a previously known bug.
If this doesn't fix Raft Consensus on the node, try the next step.
On the affected node:
SELECT bdr.consensus_disable();
SELECT bdr.consensus_enable();
If this doesn't fix Raft Consensus on the node, try the next step.
On the affected node:
SELECT bdr.consensus_disable();
SELECT bdr.consensus_snapshot_import(bdr.consensus_snapshot_export());
SELECT bdr.consensus_enable();
If this doesn't fix Raft Consensus on the node, try the next step.
Let's imagine that node1 is the affected node. Now let's pick node2, where we know Raft is working. In the next steps, we will export the Raft Snapshot from node2 and import it into node1.
On both nodes, disable Raft Consensus:
SELECT bdr.consensus_disable();
On node2, export the Raft Consensus snapshot to a file:
\copy (SELECT snap FROM bdr.consensus_snapshot_export() AS snap) TO 'snap.data'
On node2, re-enable Raft Consensus:
SELECT bdr.consensus_enable();
Copy the file snap.data to node1, as the postgres or enterprisedb user:
scp snap.data node1:~/
On node1, import the Raft Snapshot:
CREATE TEMPORARY TABLE raftsnap(snap bytea);
\copy raftsnap FROM 'snap.data'
SELECT bdr.consensus_snapshot_import(snap) FROM raftsnap;
On node1, re-enable Raft Consensus:
SELECT bdr.consensus_enable();
If this doesn't fix Raft Consensus on the node, try the next step.
If there is still a Raft Consensus majority, then from any working node you can try to part the defective node out of the cluster:
SELECT bdr.part_node(
node_name := 'badnode',
wait_for_completion := true,
force := false
);
If you don't have a Raft Consensus majority, then from any working node you can try to force-part the defective node:
SELECT bdr.part_node(
node_name := 'badnode',
force := true
);
If this still doesn't fix Raft Consensus, try the next step.
IMPORTANT!!: This should be attempted only as a last resort, because it will take down Raft Consensus on all nodes, preventing DDL and all other features maintained by Raft Consensus. If the application connects through HARP Proxy or PGD Proxy, this will cause a disruption!
On all nodes, disable Raft Consensus:
SELECT bdr.consensus_disable();
On all nodes, check that the Consensus worker is gone:
bdrdb=# SELECT * FROM bdr.workers;
worker_pid | worker_role | worker_role_name | worker_subid
------------+-------------+------------------+--------------
6492 | 8 | taskmgr |
6494 | 2 | receiver | 889088549
6490 | 1 | manager |
6493 | 10 | routing_manager |
6495 | 2 | receiver | 1615258951
6496 | 3 | writer | 889088549
6497 | 3 | writer | 1615258951
6498 | 3 | writer | 889088549
6499 | 3 | writer | 1615258951
(9 rows)
On any node, run the following command to TRUNCATE the Raft Snapshot and Journal on all nodes:
SELECT jsonb_pretty(
bdr.run_on_all_nodes($$
TRUNCATE bdr.global_consensus_journal;
TRUNCATE bdr.local_consensus_snapshot;
TRUNCATE bdr.global_consensus_response_journal;
UPDATE bdr.local_consensus_state SET apply_index = 0, current_term = 1;
$$)
);
On all nodes, enable Raft Consensus:
SELECT bdr.consensus_enable();
Wait a couple of minutes and check Raft Consensus status on all nodes:
SELECT * FROM bdr.group_raft_details ORDER BY 4, 3;
The output should be similar to:
-[ RECORD 1 ]----+--------------
instance_id | 1
node_id | 4245713078
node_name | bdr1
node_group_name | rb01
state | RAFT_FOLLOWER
leader_type | NODE
leader_id | 1630662325
leader_name | bdr3
current_term | 1
commit_index | 1
nodes | 3
voting_nodes | 3
protocol_version | 5001
-[ RECORD 2 ]----+--------------
instance_id | 1
node_id | 1609217846
node_name | bdr2
node_group_name | rb01
state | RAFT_FOLLOWER
leader_type | NODE
leader_id | 1630662325
leader_name | bdr3
current_term | 1
commit_index | 1
nodes | 3
voting_nodes | 3
protocol_version | 5001
-[ RECORD 3 ]----+--------------
instance_id | 1
node_id | 1630662325
node_name | bdr3
node_group_name | rb01
state | RAFT_LEADER
leader_type | NODE
leader_id | 1630662325
leader_name | bdr3
current_term | 1
commit_index | 1
nodes | 3
voting_nodes | 3
protocol_version | 5001
Note how Raft is working, with the commit_index and the current_term reset to 1.
Wait a couple of minutes and check the Raft status again, to confirm that the commit_index is advancing.
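For example:
SELECT node_name, state, current_term, commit_index
FROM bdr.group_raft_details;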
While not recommended, because it's one of the most disruptive techniques to recover Raft Consensus, in our experience this has proved to fix most of the situations where the previous attempts failed. But if this still doesn't fix Raft Consensus, try the next step.
If none of the attempts above work, unfortunately you might need to:
- First of all, open a Support Ticket with us;
- Grab a Lasso report from all nodes for further investigation and attach them to the ticket;
- Part all nodes except one;
- Drop the BDR extension;
- Re-configure PGD on the node;
- Rebuild all the other nodes.