The purpose of this article is to illustrate replay lag in a BDR cluster, caused by the lack of unique indexes in the downstream nodes. The scenario is used to demonstrate how to detect and fix the issue.
Schema design issues can cause performance problems in BDR clusters. The lack of unique indexes in the downstream node is a known issue in BDR.
That problem comes up because of how logical replication works: for each change reported by the upstream node, the downstream node has to look up the tuple with the same identity and apply the change.
In a scenario where there is no unique index to uniquely identify the tuple on the downstream, the downstream node ends up performing a sequential scan on the affected table for each change it needs to apply.
That can become a real headache: the more changes that are being replicated to the table, and the bigger the table, the more time the downstream node will spend looking for tuples to apply the changes. In the meantime, the upstream node may continue generating changes, which will start to pile up and increase replay lag between upstream and downstream nodes.
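As a rough illustration of why that hurts, consider a hypothetical table (the names below are made up for this example and are not part of the test environment described later). Every replicated UPDATE or DELETE effectively triggers a lookup by the row's identity, and without a unique index the only available plan is a sequential scan:
-- Hypothetical table with no primary key and REPLICA IDENTITY FULL
CREATE TABLE demo_no_index (
    id integer NOT NULL,
    value text NOT NULL
);
ALTER TABLE demo_no_index REPLICA IDENTITY FULL;
-- This is roughly the lookup the downstream apply worker has to perform
-- for each replicated UPDATE/DELETE; with no unique index it can only be
-- resolved with a sequential scan:
EXPLAIN SELECT * FROM demo_no_index WHERE id = 42 AND value = 'test';
-- Once a unique index exists (and the table is big enough), the same
-- lookup can be resolved with an index scan instead:
CREATE UNIQUE INDEX ON demo_no_index (id);
EXPLAIN SELECT * FROM demo_no_index WHERE id = 42 AND value = 'test';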
In this article we will teach how to troubleshoot the issue that was described earlier, meaning the replay lag caused by lack of unique indexes in BDR clusters.
The test environment that will be analyzed is composed of the following hosts:
- ts-55-bdr-node-1: BDR master node 1
- ts-55-bdr-node-2: BDR master node 2
- ts-55-bdr-witness: BDR witness node
For testing purposes, and to keep the environment lightweight, these hosts are actually Docker containers.
The name of the BDR database is bdrdb.
For the sake of example, you can find below the steps that were used to create a test environment for this issue. The cluster was deployed with the TPAexec tool, using version v23.6.
1- Run tpaexec configure to create the cluster configuration:
tpaexec configure /custom/Clusters/ts-55 \
--architecture BDR-Always-ON \
--layout bronze \
--bdr-version 4 \
--harp-consensus-protocol bdr \
--use-volatile-subscriptions \
--platform docker \
--enable-local-repo
2- Adjust the cluster configuration to keep only 3 BDR nodes: 2 masters and 1 witness. This is how the final config.yml looks:
---
architecture: BDR-Always-ON
cluster_name: ts-55
cluster_tags: {}
cluster_vars:
bdr_database: bdrdb
bdr_node_group: bdrgroup
bdr_version: '4'
enable_pg_backup_api: false
postgres_coredump_filter: '0xff'
postgres_version: '14'
postgresql_flavour: postgresql
preferred_python_version: python3
tpa_2q_repositories:
- products/bdr4/release
- products/harp/release
use_volatile_subscriptions: true
locations:
- Name: a
instance_defaults:
image: tpa/rocky:8
local_source_directories:
- /custom/Clusters/ts-55/local-repo:/var/opt/tpa/local-repo:ro
platform: docker
vars:
ansible_user: root
instances:
- Name: ts-55-bdr-node-1
location: a
node: 1
role:
- bdr
- Name: ts-55-bdr-node-2
location: a
node: 2
role:
- bdr
- Name: ts-55-bdr-witness
location: a
node: 3
role:
- bdr
- witness
3- Pre-download the packages to make the deploy faster later:
tpaexec download-packages /custom/Clusters/ts-55 \
--os RedHat \
--os-version 8
4- Provision the cluster:
tpaexec provision /custom/Clusters/ts-55
5- Deploy the cluster:
tpaexec deploy /custom/Clusters/ts-55
When the above command finishes you will have a functional BDR cluster.
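As an optional sanity check, you can confirm that all nodes are up and have joined the group. This is a sketch assuming the bdr.node_summary view and its node_name and peer_state_name columns, which may vary slightly between BDR versions:
docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
SELECT node_name,
       peer_state_name
FROM bdr.node_summary
"
All three nodes should typically be reported as ACTIVE.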
Now, the following steps will be used to introduce the issue described earlier:
1- Create a test table with no indexes and fill it with around 400MB worth of data:
docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
CREATE TABLE test_table (
id integer NOT NULL,
value text NOT NULL
)
"
docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
INSERT INTO test_table (
id,
value
)
SELECT generate_series(1, 10000000),
'test'
"
2- Wait for ts-55-bdr-node-2 to catch up with ts-55-bdr-node-1:
docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
SELECT bdr.wait_slot_confirm_lsn(NULL, NULL)
"
3- Change the test table replica identity to full:
docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
ALTER TABLE test_table REPLICA IDENTITY FULL
"
4- Introduce replay lag from ts-55-bdr-node-1 to ts-55-bdr-node-2. The lag will be introduced by deleting all tuples from the test table in node ts-55-bdr-node-1, which will cause the node ts-55-bdr-node-2 to start performing a lot of sequential scans when replicating the changes:
docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
DELETE
FROM test_table
"
This section describes how to troubleshoot and identify the issue. We will show how to do that both by using PostgreSQL queries and bash commands on the PostgreSQL hosts, and by using Lasso reports from the 2 nodes involved in the issue.
1- Check replay lag from ts-55-bdr-node-1 to ts-55-bdr-node-2:
- Command:
docker exec -u postgres -it ts-55-bdr-node-1 psql -d bdrdb -c "
SELECT origin_name,
target_name,
target_dbname,
active,
state,
replay_lag_size
FROM bdr.node_slots
WHERE NOT is_group_slot;
"
- Sample output:
origin_name | target_name | target_dbname | active | state | replay_lag_size
ts-55-bdr-node-1 | ts-55-bdr-witness | bdrdb | t | streaming | 0 bytes
ts-55-bdr-node-1 | ts-55-bdr-node-2 | bdrdb | t | streaming | 777 MB
(2 rows)
Note: in this case state should be either catchup or streaming, which indicates the downstream node is actively trying to apply the changes performed by the upstream node.
As shown above, the node ts-55-bdr-node-2 is struggling to replay the changes generated by ts-55-bdr-node-1. So we need to further analyze ts-55-bdr-node-2.
2- Check that there are no worker errors in node ts-55-bdr-node-2:
- Command:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT origin_name,
target_name,
sub_name,
worker_role_name,
worker_pid,
error_time,
error_message
FROM bdr.worker_errors
"
- Sample output:
origin_name | target_name | sub_name | worker_role_name | worker_pid | error_time | error_message
(0 rows)
There are no worker errors, so the replay lag is not piling up due to errors in the worker.
3- Check the PID and CPU usage of the pglogical writer process. That process is the one in charge of applying the changes received by the downstream node. In this particular issue (lack of unique indexes), it is very likely that the pglogical writer process will be using around 100% of CPU due to the high amount of sequential scans being performed:
- Command:
docker exec -it ts-55-bdr-node-2 bash -c "
top -b -c -n 1 | grep -P 'pglogical writer|%CPU' | grep -v grep
"
- Sample output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5887 postgres 20 0 1224120 454996 452940 R 100.0 11.3 1:32.23 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:0
5888 postgres 20 0 1223844 16464 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:0
5889 postgres 20 0 1223844 16464 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:1
5890 postgres 20 0 1223848 16464 14584 S 0.0 0.4 0:00.09 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:1 waiting
Please note that the pglogical writer process command line looks like this:
pglogical writer <datoid>:<subid>:<seq>
with <datoid> being the OID of the BDR database, <subid> being the ID of the BDR subscription, and <seq> being the sequence of the parallel worker of that subscription.
In this case, we can see that we have these processes:
- 5887 and 5890 for replicating changes of subscription 4150863556
- 5888 and 5889 for replicating changes of subscription 249268445
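If you want to double check which database the <datoid> in the process title refers to (16562 in the sample output above), you can resolve it against pg_database:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT oid,
       datname
FROM pg_catalog.pg_database
WHERE oid = 16562
"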
4- Check the BDR subscriptions to understand which process is replicating changes from which BDR node:
- Command:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT sub_id,
origin_name,
target_name
FROM bdr.subscription_summary;
"
- Sample output:
sub_id | origin_name | target_name
4150863556 | ts-55-bdr-node-1 | ts-55-bdr-node-2
249268445 | ts-55-bdr-witness | ts-55-bdr-node-2
(2 rows)
So, processes 5887 and 5890 are replicating changes from ts-55-bdr-node-1, and processes 5888 and 5889 are replicating changes from ts-55-bdr-witness.
As the node ts-55-bdr-node-2 is having issues replicating the changes originated by ts-55-bdr-node-1, we are particularly interested in processes 5887 and 5890.
5- Check which locks are being held by the processes of interest:
- Command:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT locktype,
relation AS relation_oid,
relation::regclass AS relation,
pid,
mode,
granted
FROM pg_catalog.pg_locks
WHERE pid IN (5887, 5890);
"
- Sample output:
locktype | relation_oid | relation | pid | mode | granted
relation | 18177 | test_table | 5887 | RowExclusiveLock | t
virtualxid | | | 5887 | ExclusiveLock | t
transactionid | | | 5887 | ExclusiveLock | t
(3 rows)
We can see that locks on relation test_table are being held by process 5887.
6- Check the relation information and statistics:
- Command:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT c.oid AS relation_oid,
c.oid::regclass AS relation,
c.relreplident,
coalesce(i.has_unique_index, false) AS has_unique_index,
coalesce(i.has_primary_key, false) AS has_primary_key,
s.seq_tup_read
FROM pg_catalog.pg_class c
INNER JOIN pg_catalog.pg_stat_user_tables s
ON c.oid = s.relid
LEFT JOIN (
SELECT indrelid,
bool_or(indisunique) AS has_unique_index,
bool_or(indisprimary) AS has_primary_key
FROM pg_catalog.pg_index
GROUP BY 1
) i
ON c.oid = i.indrelid
WHERE c.oid = 'test_table'::regclass;
"
- Sample output:
relation_oid | relation | relreplident | has_unique_index | has_primary_key | seq_tup_read
18177 | test_table | f | f | f | 0
(1 row)
As we can see, the relation has REPLICA IDENTITY FULL and has no unique indexes nor a primary key. That is likely the culprit of the issue, as the pglogical writer process is very busy applying changes on that table.
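As a complementary check (a sketch derived from the query above, not part of the original procedure), you can list every user table in the database that has no unique index at all, which highlights other tables that could run into the same problem:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT c.oid::regclass AS relation,
       c.relreplident
FROM pg_catalog.pg_class c
INNER JOIN pg_catalog.pg_namespace n
        ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
        SELECT 1
        FROM pg_catalog.pg_index i
        WHERE i.indrelid = c.oid
          AND i.indisunique
      );
"
Keep in mind this will also list tables from internal schemas (for example the bdr schema) if they happen to lack unique indexes.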
Important: please note that in BDR 3.6, if the table has REPLICA IDENTITY FULL, BDR will always ignore the unique indexes for UPDATE and DELETE changes being replicated. So, it would not even help if we had unique indexes in this case.
Important: please note that table statistics are only updated at the end of the transaction. In this case the pglogical writer process is performing a lot of sequential reads on the table test_table, but that will only be reflected in its statistics once the transaction is either committed or rolled back by that process.
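Since the table statistics will not move until that transaction finishes, it can be useful to look directly at the busy writer backend instead. A quick check (the PID 5887 below comes from the sample top output above) is to inspect pg_stat_activity for that backend and confirm it has a long-running transaction:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT pid,
       state,
       xact_start,
       wait_event_type,
       wait_event
FROM pg_catalog.pg_stat_activity
WHERE pid = 5887
"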
In all commands that will be shared in this section, please replace LASSO_REPORT_NODE_1 and LASSO_REPORT_NODE_2 with the path to the Lasso reports of ts-55-bdr-node-1 and ts-55-bdr-node-2, respectively. Besides that, please replace RELOID with the OID of the related table.
1- Check replay lag from ts-55-bdr-node-1 to ts-55-bdr-node-2:
- Command:
awk -F '\t' '$10 != "t" {print($4,$5,$1,$17,$30,$45)}' \
LASSO_REPORT_NODE_1/postgresql/dbs/bdrdb/bdr/node_slots.data
- Sample output:
origin_name target_name target_dbname active state replay_lag_size
ts-55-bdr-node-1 ts-55-bdr-witness bdrdb t streaming 0 bytes
ts-55-bdr-node-1 ts-55-bdr-node-2 bdrdb t streaming 777 MB
Note: in this case state should be either catchup or streaming, which indicates the downstream node is actively trying to apply the changes performed by the upstream node.
As shown above, the node ts-55-bdr-node-2 is struggling to replay the changes generated by ts-55-bdr-node-1. So we need to further analyze ts-55-bdr-node-2.
2- Check that there are no worker errors in node ts-55-bdr-node-2:
- Command:
awk -F '\t' '{print($2,$4,$5,$7,$8,$9,$11)}' \
LASSO_REPORT_NODE_2/postgresql/dbs/bdrdb/bdr/worker_errors.data
- Sample output:
origin_name target_name sub_name worker_role_name worker_pid error_time error_message
There are no worker errors, so the replay lag is not piling up due to errors in the workers.
3- Check the PID and CPU usage of the pglogical writer process. That process is the one in charge of applying the changes received by the downstream node. In this particular issue (lack of unique indexes), it is very likely that the pglogical writer process will be using around 100% of CPU due to the high amount of sequential scans being performed:
- Command:
grep -P 'pglogical writer|%CPU' \
LASSO_REPORT_NODE_2/linux/top.data
"
- Sample output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5887 postgres 20 0 1224120 454992 452940 R 100.0 11.3 3:51.31 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:0
5888 postgres 20 0 1223844 16460 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:0
5889 postgres 20 0 1223844 16456 14576 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:1
5890 postgres 20 0 1223848 16452 14580 S 0.0 0.4 0:00.15 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:1 waiting
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5887 postgres 20 0 1224120 454992 452940 R 99.7 11.3 3:54.31 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:0
5888 postgres 20 0 1223844 16460 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:0
5889 postgres 20 0 1223844 16456 14576 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:1
5890 postgres 20 0 1223848 16452 14580 S 0.0 0.4 0:00.15 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:1 waiting
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5887 postgres 20 0 1224120 454992 452940 R 100.0 11.3 3:57.31 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:0
5888 postgres 20 0 1223844 16460 14580 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:0
5889 postgres 20 0 1223844 16456 14576 S 0.0 0.4 0:00.05 postgres: ts-55-bdr-node-2: pglogical writer 16562:249268445:1
5890 postgres 20 0 1223848 16452 14580 S 0.0 0.4 0:00.15 postgres: ts-55-bdr-node-2: pglogical writer 16562:4150863556:1 waiting
Please note that the pglogical writer process command line looks like this:
pglogical writer <datoid>:<subid>:<seq>
with <datoid> being the OID of the BDR database, <subid> being the ID of the BDR subscription, and <seq> being the sequence of the parallel worker of that subscription.
In this case, we can see that we have these processes:
- 5887 and 5890 for replicating changes of subscription 4150863556
- 5888 and 5889 for replicating changes of subscription 249268445
4- Check the BDR subscriptions to understand which process is replicating changes from which BDR node:
- Command:
awk -F '\t' '{print($15,$3,$5)}' \
LASSO_REPORT_NODE_2/postgresql/dbs/bdrdb/bdr/subscription_summary.data
- Sample output:
sub_id origin_name target_name
4150863556 ts-55-bdr-node-1 ts-55-bdr-node-2
249268445 ts-55-bdr-witness ts-55-bdr-node-2
So, processes 5887 and 5890 are replicating changes from ts-55-bdr-node-1, and processes 5888 and 5889 are replicating changes from ts-55-bdr-witness.
As the node ts-55-bdr-node-2 is having issues replicating the changes originated by ts-55-bdr-node-1, we are particularly interested in processes 5887 and 5890.
5- Check which locks are being held by the processes of interest:
- Command:
awk -F '\t' '{print($1,$3,$12,$13,$14)}' \
LASSO_REPORT_NODE_2/postgresql/running_locks.out
"
- Sample output:
locktype relation pid mode granted
relation 18177 5887 RowExclusiveLock t
virtualxid \N 5887 ExclusiveLock t
transactionid \N 5890 ShareLock f
transactionid \N 5887 ExclusiveLock t
We can see that locks on the relation whose OID is 18177 are being held by process 5887.
6- Check the relation information and statistics. Note the filter for the relation, using grep:
- Commands:
awk -F '\t' '{print($1,$4,$24,$32)}' \
LASSO_REPORT_NODE_2/postgresql/dbs/bdrdb/tables.out \
| grep -P "^(reloid|RELOID)"
awk -F '\t' 'NR == 1 || $15 == "t" || $16 == "t" {print($4,$5,$15,$16,$24,$26)}' \
LASSO_REPORT_NODE_2/postgresql/dbs/bdrdb/indexes.out \
| grep -P "^(reloid|RELOID)"
- Sample outputs:
reloid relname relreplident seq_tup_read
18177 test_table f 0
reloid relname indisunique indisprimary indisreplident indexdef
As we can see, the relation has REPLICA IDENTITY FULL and has no unique indexes nor a primary key. That is likely the culprit of the issue, as the pglogical writer process is very busy applying changes on that table.
In order to fix this issue you should create either a primary key or a unique index on the poorly performing table. That will allow the pglogical writer process to perform much faster lookups in the table while replaying changes from its upstream.
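To get a feel for the kind of plan the apply worker is currently facing, you can run an EXPLAIN of a lookup that matches the full row identity on the struggling node (a rough illustration; the literals 42 and 'test' are just example values). Before the fix this resolves to a (possibly parallel) sequential scan; after step 4 below the same lookup can use the new unique index on id:
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
EXPLAIN
SELECT *
FROM test_table
WHERE id = 42
  AND value = 'test'
"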
During the troubleshooting performed earlier we identified that the problematic table is test_table. It lacks unique indexes and a primary key, is around 400 MB in size, and the pglogical writer worker is performing a delete of every single tuple in that table.
In order to fix this issue, you should perform the following steps in the node that is struggling to replay the changes (ts-55-bdr-node-2):
1- Stop PostgreSQL
2- Remove bdr from shared_preload_libraries
Note: if you are running a BDR major version older than 4, then you will need to also remove pglogical from shared_preload_libraries
3- Start PostgreSQL
4- Create a unique index (or a primary key) on test_table:
BEGIN;
SET LOCAL bdr.ddl_replication TO off;
SET LOCAL bdr.xact_replication TO off;
CREATE UNIQUE INDEX ON test_table(id);
COMMIT;
Note: we are disabling transaction and DDL replication so that the change won't require Raft consensus and won't be replicated to the other nodes. Please note that there is replication lag among the nodes, and Raft is negotiated through the normal WAL streaming, so a replicated DDL would likely get stuck behind the existing lag.
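Optionally, you can confirm the index was created locally before moving on (a quick check using the pg_indexes view):
docker exec -u postgres -it ts-55-bdr-node-2 psql -d bdrdb -c "
SELECT indexname,
       indexdef
FROM pg_catalog.pg_indexes
WHERE tablename = 'test_table'
"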
5- Stop PostgreSQL
6- Add bdr back to shared_preload_libraries
Note: if you are running a BDR major version older than 4, then you will need to also add pglogical back to shared_preload_libraries
7- Start PostgreSQL
After applying these steps in the node ts-55-bdr-node-2, it should eventually catch up with ts-55-bdr-node-1 if there are no other issues.
Important: it's strongly recommended that you later apply these steps in all other nodes of the cluster, so the table is indexed consistently across the entire BDR cluster.
Important: the above steps touch shared_preload_libraries and restart PostgreSQL because otherwise the pglogical writer process would keep holding its lock on table test_table, which would prevent our CREATE UNIQUE INDEX from being performed.
While applying these steps to the other nodes, or when the object is not locked by the pglogical writer, you only need to perform step 4 above, meaning there is no need to touch shared_preload_libraries and restart the cluster twice.