Troubleshooting BDR replay lag caused by lack of unique indexes

Israel Barth

The purpose of this article is to illustrate replay lag in a BDR cluster, caused by the lack of unique indexes in the downstream nodes. The scenario is used to demonstrate how to detect and fix the issue.

Introduction

Schema design issues can cause performance problems in BDR clusters. The lack of unique indexes on tables in the downstream node is a well-known cause of such problems in BDR.

That problem comes up because of how logical replication works: for each change reported by the upstream node, the downstream node will have to look for the tuple which has the same identity and apply the change.

Assuming a scenario where there is no unique index to uniquely identify the tuple in the downstream, the downstream node will end up performing a sequential scan on the referenced table for each change that it needs to apply.

That can become a real headache: the more changes that are being replicated to the table, and the bigger the table, the more time the downstream node will spend looking for tuples to apply the changes. In the meantime, the upstream node may continue generating changes, which will start to pile up and increase replay lag between upstream and downstream nodes.
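To make this concrete, below is a minimal SQL sketch of the lookup the downstream node effectively has to perform for each replicated UPDATE or DELETE. The table demo_no_pk is hypothetical and used only for illustration:

-- Hypothetical table with no primary key or unique index:
CREATE TABLE demo_no_pk (
    id integer NOT NULL,
    value text NOT NULL
);
INSERT INTO demo_no_pk
SELECT generate_series(1, 100000),
       'test';
ANALYZE demo_no_pk;

-- With no usable unique index, locating the tuple for a single replicated
-- change requires a sequential scan over the whole table:
EXPLAIN DELETE FROM demo_no_pk WHERE id = 42 AND value = 'test';

-- With a unique index, the same lookup becomes an index scan:
CREATE UNIQUE INDEX ON demo_no_pk (id);
EXPLAIN DELETE FROM demo_no_pk WHERE id = 42 AND value = 'test';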

Troubleshooting BDR replay lag caused by lack of unique indexes

In this article we will show how to troubleshoot the issue described earlier, namely replay lag caused by the lack of unique indexes in BDR clusters.

Creating the reproducible test case

The test environment that will be analyzed is composed of the following hosts:

  • ts-55-bdr-node-1: BDR master node 1
  • ts-55-bdr-node-2: BDR master node 2
  • ts-55-bdr-witness: BDR witness node

For testing purposes, and to keep the environment lightweight, these hosts are actually Docker containers.

The name of the BDR database is bdrdb.

Creating a test environment

For the sake of example, you can find below the steps that were used to create a test environment for this issue. The cluster was deployed with the TPAexec tool, version v23.6.

1- Run tpaexec configure to create the cluster configuration:

tpaexec configure /custom/Clusters/ts-55 \
--architecture BDR-Always-ON \
--layout bronze \
--bdr-version 4 \
--harp-consensus-protocol bdr \
--use-volatile-subscriptions \
--platform docker \
--enable-local-repo

2- Adjust the cluster configuration to keep only 3 BDR nodes: 2 masters and 1 witness. This is how the final config.yml looks:

---
architecture: BDR-Always-ON
cluster_name: ts-55
cluster_tags: {}

cluster_vars:
  bdr_database: bdrdb
  bdr_node_group: bdrgroup
  bdr_version: '4'
  enable_pg_backup_api: false
  postgres_coredump_filter: '0xff'
  postgres_version: '14'
  postgresql_flavour: postgresql
  preferred_python_version: python3
  tpa_2q_repositories:
  - products/bdr4/release
  - products/harp/release
  use_volatile_subscriptions: true

locations:
- Name: a

instance_defaults:
  image: tpa/rocky:8
  local_source_directories:
  - /custom/Clusters/ts-55/local-repo:/var/opt/tpa/local-repo:ro
  platform: docker
  vars:
    ansible_user: root

instances:
- Name: ts-55-bdr-node-1
  location: a
  node: 1
  role:
  - bdr
- Name: ts-55-bdr-node-2
  location: a
  node: 2
  role:
  - bdr
- Name: ts-55-bdr-witness
  location: a
  node: 3
  role:
  - bdr
  - witness

3- Pre-download the packages to make the deploy faster later:

tpaexec download-packages /custom/Clusters/ts-55 \
--os RedHat \
--os-version 8

4- Provision the cluster:

tpaexec provision /custom/Clusters/ts-55

5- Deploy the cluster:

tpaexec deploy /custom/Clusters/ts-55

When the above command finishes you will have a functional BDR cluster.
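As a quick sanity check, you can confirm that the BDR subscriptions are in place by running the query below with psql against the bdrdb database on any of the nodes; it uses the same bdr.subscription_summary view that we will query again during troubleshooting:

-- Each node should show one subscription per peer node:
SELECT sub_id,
       origin_name,
       target_name
FROM bdr.subscription_summary;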

Now, the following steps will be used to introduce the issue described earlier:

1- Create a test table with no indexes and fill it with around 400MB worth of data:

docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
CREATE TABLE test_table (
id integer NOT NULL,
value text NOT NULL
)
"

docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
INSERT INTO test_table (
id,
value
)
SELECT generate_series(1, 10000000),
'test'
"

2- Wait for ts-55-bdr-node-2 to catch up with ts-55-bdr-node-1:

docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
SELECT bdr.wait_slot_confirm_lsn(NULL, NULL)
"

3- Change the test table replica identity to full:

docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
ALTER TABLE test_table REPLICA IDENTITY FULL
"

4- Introduce replay lag from ts-55-bdr-node-1 to ts-55-bdr-node-2. The lag will be introduced by deleting all tuples from the test table in node ts-55-bdr-node-1, which will cause the node ts-55-bdr-node-2 to start performing a lot of sequential scans when replicating the changes:

docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
DELETE
FROM test_table
"

Troubleshooting the issue

This section will describe how to troubleshoot and identify the issue. We will describe how to do that both by using PostgreSQL queries and bash commands on the PostgreSQL hosts, and by using Lasso reports from the 2 nodes involved in the issue.

Using PostgreSQL queries and bash commands on PostgreSQL hosts

1- Check replay lag from ts-55-bdr-node-1 to ts-55-bdr-node-2:

  • Command:
docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
SELECT origin_name,
target_name,
target_dbname,
active,
state,
replay_lag_size
FROM bdr.node_slots
WHERE NOT is_group_slot;
"
  • Sample output:
origin_name | target_name | target_dbname | active | state | replay_lag_size
ts-55-bdr-node-1 | ts-55-bdr-witness | bdrdb | t | streaming | 0 bytes
ts-55-bdr-node-1 | ts-55-bdr-node-2 | bdrdb | t | streaming | 777 MB
(2 rows)

Note: in this case state should be either catchup or streaming, which indicates the downstream node is actively trying to apply the changes performed by the upstream node.

As shown above, the node ts-55-bdr-node-2 is struggling to replay the changes generated by ts-55-bdr-node-1. So we need to further analyze ts-55-bdr-node-2.

2- Check that there are no worker errors in node ts-55-bdr-node-2:

  • Command:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT origin_name,
target_name,
sub_name,
worker_role_name,
worker_pid,
error_time,
error_message
FROM bdr.worker_errors
"
  • Sample output:
origin_name | target_name | sub_name | worker_role_name | worker_pid | error_time | error_message
(0 rows)

There are no worker errors, so the replay lag is not piling up due to errors in the worker.

3- Check the PID and CPU usage of the pglogical writer process. That process is the one in charge of applying changes received by the downstream node. In this particular issue (lack of unique indexes), it is very likely that the pglogical writer process will be using around 100% of CPU due to the large number of sequential scans being performed:

  • Command:
docker exec -it ts-55-bdr-node-2 bash -c "
top -b -c -n 1 | grep -P 'pglogical writer|%CPU' | grep -v grep
"
  • Sample output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5887 postgres 20 0 1224120 454996 452940 R 100.0 11.3 1:32.23 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:0
5888 postgres 20 0 1223844 16464 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:0
5889 postgres 20 0 1223844 16464 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:1
5890 postgres 20 0 1223848 16464 14584 S 0.0 0.4 0:00.09 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:1 waiting

Please note that the pglogical writer process command line looks like this:

pglogical writer <datoid>:<subid>:<seq>

<datoid> being the OID of the BDR database, <subid> being the ID of the BDR subscription, and <seq> being the sequence of the parallel worker of that subscription.

In this case, we can see that we have these processes:

  • 5887 and 5890 for replicating changes of subscription 4150863556
  • 5888 and 5889 for replicating changes of subscription 249268445
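If you want to double-check the <datoid> part, a query such as the one below (using the value 16562 from the sample output above) maps it back to the BDR database; the <subid> values are mapped to node names in step 4 below:

-- Resolve the database OID shown in the pglogical writer process title:
SELECT oid,
       datname
FROM pg_catalog.pg_database
WHERE oid = 16562;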

4- Check the BDR subscriptions to understand which process is replicating changes from which BDR node:

  • Command:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT sub_id,
origin_name,
target_name
FROM bdr.subscription_summary;
"
  • Sample output:
sub_id | origin_name | target_name
4150863556 | ts-55-bdr-node-1 | ts-55-bdr-node-2
249268445 | ts-55-bdr-witness | ts-55-bdr-node-2
(2 rows)

So, processes 5887 and 5890 are replicating changes from ts-55-bdr-node-1, and processes 5888 and 5889 are replicating changes from ts-55-bdr-witness.

As the node ts-55-bdr-node-2 is having issues to replicate the changes originated by ts-55-bdr-node-1, we are particularly interested in processes 5887 and 5890.

5- Check which locks are being held by the processes of interest:

  • Command:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT locktype,
relation AS relation_oid,
relation::regclass AS relation,
pid,
mode,
granted
FROM pg_catalog.pg_locks
WHERE pid IN (5887, 5890);
"
  • Sample output:
locktype | relation_oid | relation | pid | mode | granted
relation | 18177 | test_table | 5887 | RowExclusiveLock | t
virtualxid | | | 5887 | ExclusiveLock | t
transactionid | | | 5887 | ExclusiveLock | t
transactionid | | | 5890 | ShareLock | f
(4 rows)

We can see that locks on relation test_table are being held by process 5887.

6- Check the relation information and statistics:

  • Command:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT c.oid AS relation_oid,
       c.oid::regclass AS relation,
       c.relreplident,
       coalesce(i.has_unique_index, false) AS has_unique_index,
       coalesce(i.has_primary_key, false) AS has_primary_key,
       s.seq_tup_read
FROM pg_catalog.pg_class c
INNER JOIN pg_catalog.pg_stat_user_tables s
    ON c.oid = s.relid
LEFT JOIN (
    SELECT indrelid,
           bool_or(indisunique) AS has_unique_index,
           bool_or(indisprimary) AS has_primary_key
    FROM pg_catalog.pg_index
    GROUP BY 1
) i
    ON c.oid = i.indrelid
WHERE c.oid = 'test_table'::regclass;
"
  • Sample output:
relation_oid | relation | relreplident | has_unique_index | has_primary_key | seq_tup_read
18177 | test_table | f | f | f | 0
(1 row)

As we can see, the relation has REPLICA IDENTITY FULL and has no unique indexes nor a primary key. That is likely the culprit of the issue as the pglogical writer process is very busy applying changes on that table.

Important: please note that in BDR 3.6, if the table has REPLICA IDENTITY FULL, BDR will always ignore the unique indexes for UPDATE and DELETE changes being replicated. So, it would not even help if we had unique indexes in this case.

Important: please note that table statistics are only updated at the end of the transaction. In this case the pglogical writer process is performing a lot of sequential reads on the table test_table, but that will only be reflected in its statistics once the transaction is either committed or rolled back by that process.
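As a small sketch, once the applying transaction has committed or rolled back, the accumulated sequential scans can be confirmed with a query such as:

-- seq_scan and seq_tup_read only reflect the pglogical writer activity after
-- its transaction has finished:
SELECT relname,
       seq_scan,
       seq_tup_read
FROM pg_catalog.pg_stat_user_tables
WHERE relname = 'test_table';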

Using Lasso reports

In all commands that will be shared in this section, please replace LASSO_REPORT_NODE_1 and LASSO_REPORT_NODE_2 with the path to the Lasso reports of ts-55-bdr-node-1 and ts-55-bdr-node-2, respectively.

Besides that, please replace RELOID with the OID of the related table.

1- Check replay lag from ts-55-bdr-node-1 to ts-55-bdr-node-2:

  • Command:
awk -F '\t' '$10 != "t" {print($4,$5,$1,$17,$30,$45)}' \
LASSO_REPORT_NODE_1/postgresql/dbs/bdrdb/bdr/node_slots.data
  • Sample output:
origin_name target_name target_dbname active state replay_lag_size
ts-55-bdr-node-1 ts-55-bdr-witness bdrdb t streaming 0 bytes
ts-55-bdr-node-1 ts-55-bdr-node-2 bdrdb t streaming 777 MB

Note: in this case state should be either catchup or streaming, which indicates the downstream node is actively trying to apply the changes performed by the upstream node.

As shown above, the node ts-55-bdr-node-2 is struggling to replay the changes generated by ts-55-bdr-node-1. So we need to further analyze ts-55-bdr-node-2.

2- Check that there are no worker errors in node ts-55-bdr-node-2:

  • Command:
awk -F '\t' '{print($2,$4,$5,$7,$8,$9,$11)}' \
LASSO_REPORT_NODE_2/postgresql/dbs/bdrdb/bdr/worker_errors.data
  • Sample output:
origin_name target_name sub_name worker_role_name worker_pid error_time error_message

There are no worker errors, so the replay lag is not piling up due to errors in the workers.

3- Check the PID and CPU usage of the pglogical writer process. That process is the one in charge of applying changes received by the downstream node. In this particular issue (lack of unique indexes), it is very likely that the pglogical writer process will be using around 100% of CPU due to the large number of sequential scans being performed:

  • Command:
grep -P 'pglogical writer|%CPU' \
LASSO_REPORT_NODE_2/linux/top.data
"
  • Sample output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5887 postgres 20 0 1224120 454992 452940 R 100.0 11.3 3:51.31 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:0
5888 postgres 20 0 1223844 16460 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:0
5889 postgres 20 0 1223844 16456 14576 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:1
5890 postgres 20 0 1223848 16452 14580 S 0.0 0.4 0:00.15 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:1 waiting
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5887 postgres 20 0 1224120 454992 452940 R 99.7 11.3 3:54.31 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:0
5888 postgres 20 0 1223844 16460 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:0
5889 postgres 20 0 1223844 16456 14576 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:1
5890 postgres 20 0 1223848 16452 14580 S 0.0 0.4 0:00.15 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:1 waiting
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5887 postgres 20 0 1224120 454992 452940 R 100.0 11.3 3:57.31 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:0
5888 postgres 20 0 1223844 16460 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:0
5889 postgres 20 0 1223844 16456 14576 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:1
5890 postgres 20 0 1223848 16452 14580 S 0.0 0.4 0:00.15 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:1 waiting

Please note that the pglogical writer process command line looks like this:

pglogical writer <datoid>:<subid>:<seq>

<datoid> being the OID of the BDR database, <subid> being the ID of the BDR subscription, and <seq> being the sequence of the parallel worker of that subscription.

In this case, we can see that we have these processes:

  • 5887 and 5890 for replicating changes of subscription 4150863556
  • 5888 and 5889 for replicating changes of subscription 249268445

4- Check the BDR subscriptions to understand which process is replicating changes from which BDR node:

  • Command:
awk -F '\t' '{print($15,$3,$5)}' \
LASSO_REPORT_NODE_2/postgresql/dbs/bdrdb/bdr/subscription_summary.data
  • Sample output:
sub_id origin_name target_name
4150863556 ts-55-bdr-node-1 ts-55-bdr-node-2
249268445 ts-55-bdr-witness ts-55-bdr-node-2

So, processes 5887 and 5890 are replicating changes from ts-55-bdr-node-1, and processes 5888 and 5889 are replicating changes from ts-55-bdr-witness.

As the node ts-55-bdr-node-2 is having issues to replicate the changes originated by ts-55-bdr-node-1, we are particularly interested in processes 5887 and 5890.

5- Check which locks are being held by the processes of interest:

  • Command:
awk -F '\t' '{print($1,$3,$12,$13,$14)}' \
LASSO_REPORT_NODE_2/postgresql/running_locks.out
"
  • Sample output:
locktype relation pid mode granted
relation 18177 5887 RowExclusiveLock t
virtualxid \N 5887 ExclusiveLock t
transactionid \N 5890 ShareLock f
transactionid \N 5887 ExclusiveLock t

We can see that locks on the relation whose OID is 18177 are being held by process 5887, while process 5890 is waiting on process 5887's transaction (its ShareLock on the transactionid is not granted).

6- Check the relation information and statistics. Note the filter on the relation OID, applied with grep:

  • Commands:
awk -F '\t' '{print($1,$4,$24,$32)}' \
LASSO_REPORT_NODE_2/postgresql/dbs/bdrdb/tables.out \
| grep -P "^(reloid|RELOID)"
awk -F '\t' 'NR == 1 || $15 == "t" || $16 == "t" {print($4,$5,$15,$16,$24,$26)}' \
LASSO_REPORT_NODE_2/postgresql/dbs/bdrdb/indexes.out \
| grep -P "^(reloid|RELOID)"
  • Sample outputs:
reloid relname relreplident seq_tup_read
18177 test_table f 0
reloid relname indisunique indisprimary indisreplident indexdef

As we can see, the relation has REPLICA IDENTITY FULL and has no unique indexes nor a primary key. That is likely the culprit of the issue as the pglogical writer process is very busy applying changes on that table.

Fixing the issue

In order to fix this issue you should create either a primary key or a unique index on the poorly performing table. That will allow the pglogical writer process to perform much faster lookups in the table while replaying changes from its upstream.

During the troubleshooting performed earlier we identified that the problematic table is test_table. It lacks a unique index and a primary key, is around 400 MB in size, and the pglogical writer worker is applying a delete of every single tuple in that table.

In order to fix this issue, you should perform the following steps in the node that is struggling to replay the changes (ts-55-bdr-node-2):

1- Stop PostgreSQL

2- Remove bdr from shared_preload_libraries

Note: if you are running a BDR major version older than 4, then you will need to also remove pglogical from shared_preload_libraries

3- Start PostgreSQL

4- Create a unique index (or a primary key) on the test_table:

BEGIN;
SET LOCAL bdr.ddl_replication TO off;
SET LOCAL bdr.xact_replication TO off;
CREATE UNIQUE INDEX ON test_table(id);
COMMIT;

Note: we are disabling transaction and DDL replication so that the change does not require Raft consensus and is not replicated to the other nodes. Please note that there is replay lag between the nodes, and Raft is negotiated through normal WAL streaming, so a replicated DDL statement would otherwise be held back by that lag.

5- Stop PostgreSQL

6- Add bdr back to shared_preload_libraries

Note: if you are running a BDR major version older than 4, then you will need to also add pglogical back to shared_preload_libraries

7- Start PostgreSQL

After applying these steps in the node ts-55-bdr-node-2, it should eventually catch up with ts-55-bdr-node-1 if there are no other issues.
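You can monitor the catch-up by re-running the replay lag query from the troubleshooting section on ts-55-bdr-node-1, for example:

-- replay_lag_size for ts-55-bdr-node-2 should eventually drop back to 0 bytes:
SELECT origin_name,
       target_name,
       active,
       state,
       replay_lag_size
FROM bdr.node_slots
WHERE NOT is_group_slot;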

Important: it's strongly recommended that you later apply these steps in all other nodes of the cluster, so that the table is properly indexed across the entire BDR cluster.

Important: the above steps touch shared_preload_libraries and restart PostgreSQL because otherwise the pglogical writer process would hold the lock on table test_table, which would prevent our CREATE UNIQUE INDEX from being performed. When applying these steps to the other nodes, or whenever the object is not locked by the pglogical writer, you only need to perform step 4 above, meaning there is no need to touch shared_preload_libraries and restart the cluster twice.
