How to properly manage roles in PGD

William Ivanski
William Ivanski

The PGD documentation explains how to manage roles in a PGD cluster. However, we often see Support tickets where not following the documented practices causes replication to break. This article extends the documentation explaining how incorrect role management can cause replication to break, how to fix replication, the root cause and best practices going forward.

Diagnosing broken replication

In a PGD cluster, first we detect replication broken (inactive replication slot), for example from node1 to node2.

From worker errors we can see replication was broken when trying to replicate a GRANT from node1 to node2:

sub_name worker_role worker_pid error_time error_message error_context_message
bdr_cmm_pg_db_bdrgroup_node1_node2 2 98422 2023-04-04 17:29:37.140136-03 writer has died \N
bdr_cmm_pg_db_bdrgroup_node1_node2 3 98423 2023-04-04 17:29:37.139406-03 "role ""role_select"" does not exist" "during execution of queued SQL statement: GRANT USAGE ON SCHEMA ""ICGUAT"" TO role_select
during apply of DDL message in xact with commit-end lsn 4A/A88C25B0 xid 116182475 committs 2023-03-15 18:18:45.67864-03 (action #1) (effective sess origin id=2 lsn=4A/A88C25B0)
while consuming 'M' message from receiver for subscription bdr_cmm_pg_db_bdrgroup_node1_node2 (id=184670765) on node node2 (id=1290460219) from upstream node node1 (id=3797181331, reporiginid=2)"

As we can see above, the GRANT statement, executed on node1, is this:

GRANT USAGE ON SCHEMA "ICGUAT" TO role_select;

However, role role_select doesn't exist on node2.

How to fix replication

We can fix this by doing the following:

1- Connect to node2, on the postgres database.

IMPORTANT: You need to connect to the postgres database or the edb database, not the cmm_pg_db database, which is the BDR-enabled database.

2- While connected to the postgres database, on node2, create the role_select role, in the same way you have created on node1.

After that, replication should get past that point.

Root cause

The root cause is as follows:

1- Roles are global objects;

2- Customer had bdr.role_replication enabled, which means that role-related DDL commands (CREATE ROLE, DROP ROLE, ALTER ROLE, GRANT) are replicated when executed while connected to a BDR-enabled database;

3- Customer executed CREATE ROLE role_select while on the postgres database on node1, so the role role_select was created locally only on node1 and not replicated to the other nodes;

4- Later, customer executed GRANT USAGE ON SCHEMA "ICGUAT" TO role_select while connected to the BDR-enabled database (cmm_pg_db in this case), so the GRANT command was replicated by PGD;

5- As the role role_select doesn't exist on node2, then when the GRANT arrives on node2, it fails and replication breaks.

The PGD documentation explains this behavior:

Users are global objects in a PostgreSQL instance. A CREATE ROLE commands or its alias CREATE USER is replicated automatically if they're executed in a PGD replicated database. If a role or user is created in a non-PGD, un-replicated database, the role will only exist for that PostgreSQL instance. GRANT ROLE and DROP ROLE work in the same way, only replicating if applied to a PGD replicated database.

This automatic replication behavior can be disabled by turning off the bdr.role_replication setting.

However it might not be clear to the reader the order of the commands and that this can actually break replication, which is the point this article explores.

Best practices

There are 3 possible strategies as explained below. The customer needs to choose 1 strategy and follow it consistently.

1- Execute all role-related DDL while connected the BDR-enabled database, so they get replicated by BDR; OR

2- Execute all role-related DDL while connected to a non-BDR-enabled database, so they don't get replicated by BDR; OR

3- Disable bdr.role_replication, so any role-related DDL will never be replicated by BDR, regardless of the database the user is connected to.

In case the strategy of choice is either 2 or 3 above, then the customer needs to make sure they are executing all role-related DDL on all nodes.

Was this article helpful?

0 out of 0 found this helpful