How to triage PGD

William Ivanski
William Ivanski
  • Updated

Among other useful information, Lasso gathers PGD metadata from PostgreSQL instances into a tarball. Having multiple Lasso tarballs (one per node from a PGD cluster) downloaded locally, it's possible to run some commands to quickly check the PGD cluster state and health.

The following instructions are useful for BDR 3.7 and newer, but we also include the command for BDR 3.6.

Download Lasso reports

In our example case, the customer has a 4-node BDR cluster, so they sent 4 Lasso tarballs. The first step is to download them all and extract them in the same directory, for example:

$ ls -lh
total 1.6M
drwxr-xr-x 8 william.ivanski  256 Oct 10  2022 edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948
-rw-r--r-- 1 william.ivanski 423K Nov  7  2022 edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948-bba36c2c201309f348d34aef10e5f1e2.tar.bz2
drwxr-xr-x 9 william.ivanski  288 Nov  7  2022 edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413
-rw-r--r-- 1 william.ivanski 418K Nov  7  2022 edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413-8382b7797ca90629cb162082f096ec5d.tar.bz2
drwxr-xr-x 8 william.ivanski  256 Oct 10  2022 edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836
-rw-r--r-- 1 william.ivanski 386K Nov  7  2022 edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836-d7c39d35160faa6f45c6ca23e0c61156.tar.bz2
drwxr-xr-x 9 william.ivanski  288 Nov  7  2022 edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842
-rw-r--r-- 1 william.ivanski 380K Nov  7  2022 edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842-d291d0b52c845d8c9ef206a938a59015.tar.bz2

Locate BDR-enabled databases

Each Postgres instance can have multiple databases, but we need to identify the databases where the bdr extension is installed. When Lasso detects the bdr extension installed, then it creates a bdr directory under that database directory, so the command to locate the BDR-enabled databases in all Lasso reports is as simple as:

find . -type d -name bdr

Output will be, for example:

$ find . -type d -name bdr
./edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948/postgresql/dbs/paysourcing/bdr
./edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413/postgresql/dbs/paysourcing/bdr
./edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836/postgresql/dbs/paysourcing/bdr
./edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842/postgresql/dbs/paysourcing/bdr

So we know that the database name is paysourcing in all 4 nodes.

Linux distribution and version

Knowing the operating system version is important to provide targeted instructions, as well as for the developers to generate hotfix packages for the customer system if needed. The command to check that is:

find . -type f -name release.data -print -exec bash -c "cat {}; echo" \;

Sample output:

$ find . -type f -name release.data -print -exec bash -c "cat {}; echo" \;
./edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948/linux/release.data
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal
./edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413/linux/release.data
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal
./edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836/linux/release.data
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal
./edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842/linux/release.data
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal

So we know the customer is running Ubuntu 20.04 on all nodes.

BDR version and edition

Another crucial piece of information in terms of Support ticket metadata is the BDR version and edition. If running BDR 3.7 or newer, the command to check this information is:

find . -type f -name group_versions_details.data -print -exec bash -c "cat {}; echo" \;

Sample output:

$ find . -type f -name group_versions_details.data -print -exec bash -c "cat {}; echo" \;
./edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948/postgresql/dbs/paysourcing/bdr/group_versions_details.data
node_id	node_name	postgres_version	pglogical_version	bdr_version	bdr_edition
285269477	ist02_vm_db_ps02	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
835501687	ist01_vm_db_ps01	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
3115261467	ist01_vm_db_ps02	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
1600578929	ist02_vm_db_ps01	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
./edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413/postgresql/dbs/paysourcing/bdr/group_versions_details.data
node_id	node_name	postgres_version	pglogical_version	bdr_version	bdr_edition
285269477	ist02_vm_db_ps02	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
835501687	ist01_vm_db_ps01	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
3115261467	ist01_vm_db_ps02	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
1600578929	ist02_vm_db_ps01	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
./edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836/postgresql/dbs/paysourcing/bdr/group_versions_details.data
node_id	node_name	postgres_version	pglogical_version	bdr_version	bdr_edition
3115261467	ist01_vm_db_ps02	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
835501687	ist01_vm_db_ps01	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
1600578929	ist02_vm_db_ps01	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
285269477	ist02_vm_db_ps02	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
./edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842/postgresql/dbs/paysourcing/bdr/group_versions_details.data
node_id	node_name	postgres_version	pglogical_version	bdr_version	bdr_edition
1600578929	ist02_vm_db_ps01	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
3115261467	ist01_vm_db_ps02	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
835501687	ist01_vm_db_ps01	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE
285269477	ist02_vm_db_ps02	13.8 (2ndQPG 13.8r1.1.10) (Ubuntu 2:13.8r2ndq1.1.10-1.focal+1)	3.7.17	3.7.17	EE

So we know the customer is running BDR 3.7.17, Enterprise Edition, on top of Postgres Extended (formerly known as 2ndQPG).

If the customer was on BDR 3.6, then the command would be this one instead:

find . -type f -name monitor_group_versions_details.data -print -exec bash -c "cat {}; echo" \;

Raft Consensus status

In a PGD cluster, Raft Consensus needs to be working at all times so checking this is one of the first steps when assessing a PGD cluster health. The command is:

find . -type f -name bdr_monitor_group_raft.out -print -exec bash -c "cat {}; echo" \;

We know Raft Consensus is working when the output is:

$ find . -type f -name bdr_monitor_group_raft.out -print -exec bash -c "cat {}; echo" \;
./edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948/postgresql/dbs/paysourcing/bdr_monitor_group_raft.out
monitor_group_raft
"(OK,""Raft Consensus is working correctly"")"
./edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413/postgresql/dbs/paysourcing/bdr_monitor_group_raft.out
monitor_group_raft
"(OK,""Raft Consensus is working correctly"")"
./edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836/postgresql/dbs/paysourcing/bdr_monitor_group_raft.out
monitor_group_raft
"(OK,""Raft Consensus is working correctly"")"
./edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842/postgresql/dbs/paysourcing/bdr_monitor_group_raft.out
monitor_group_raft
"(OK,""Raft Consensus is working correctly"")"

Replication status and lag

When there are any issues with BDR replication, then we will be able to see the replication slots inactive or replication lag increasing. The command to check this is:

find . -type f -name node_slots.data -print -exec bash -c "awk -F'\t' '{print \$4,\$5,\$9,\$17,\$42,\$43,\$44,\$45}' {}; echo" \;

Sample output:

$ find . -type f -name node_slots.data -print -exec bash -c "awk -F'\t' '{print \$4,\$5,\$9,\$17,\$42,\$43,\$44,\$45}' {}; echo" \;
./edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948/postgresql/dbs/paysourcing/bdr/node_slots.data
origin_name target_name slot_name active sent_lag_size write_lag_size flush_lag_size replay_lag_size
ist01_vm_db_ps01 ist02_vm_db_ps02 bdr_paysourcing_bdr_paysourcing_ist02_vm_db_ps02 t 60 MB 107 MB 108 MB 108 MB
ist01_vm_db_ps01 ist01_vm_db_ps02 bdr_paysourcing_bdr_paysourcing_ist01_vm_db_ps02 t 0 bytes 0 bytes 0 bytes 0 bytes
ist01_vm_db_ps01 ist02_vm_db_ps01 bdr_paysourcing_bdr_paysourcing_ist02_vm_db_ps01 t 0 bytes 0 bytes 0 bytes 0 bytes
\N \N bdr_paysourcing_bdr_paysourcing f 108 MB 108 MB 108 MB 108 MB
./edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413/postgresql/dbs/paysourcing/bdr/node_slots.data
origin_name target_name slot_name active sent_lag_size write_lag_size flush_lag_size replay_lag_size
ist01_vm_db_ps02 ist02_vm_db_ps02 bdr_paysourcing_bdr_paysourcing_ist02_vm_db_ps02 t 0 bytes 0 bytes 0 bytes 0 bytes
ist01_vm_db_ps02 ist01_vm_db_ps01 bdr_paysourcing_bdr_paysourcing_ist01_vm_db_ps01 t 0 bytes 0 bytes 0 bytes 0 bytes
ist01_vm_db_ps02 ist02_vm_db_ps01 bdr_paysourcing_bdr_paysourcing_ist02_vm_db_ps01 t 0 bytes 0 bytes 0 bytes 0 bytes
\N \N bdr_paysourcing_bdr_paysourcing f 4745 kB 4745 kB 4745 kB 4745 kB
./edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836/postgresql/dbs/paysourcing/bdr/node_slots.data
origin_name target_name slot_name active sent_lag_size write_lag_size flush_lag_size replay_lag_size
ist02_vm_db_ps01 ist01_vm_db_ps02 bdr_paysourcing_bdr_paysourcing_ist01_vm_db_ps02 t 5608 bytes 5608 bytes 5608 bytes 5608 bytes
ist02_vm_db_ps01 ist01_vm_db_ps01 bdr_paysourcing_bdr_paysourcing_ist01_vm_db_ps01 t 5608 bytes 5608 bytes 5608 bytes 5608 bytes
ist02_vm_db_ps01 ist02_vm_db_ps02 bdr_paysourcing_bdr_paysourcing_ist02_vm_db_ps02 t 5608 bytes 5608 bytes 5608 bytes 5608 bytes
\N \N bdr_paysourcing_bdr_paysourcing f 6674 kB 6674 kB 6674 kB 6674 kB
./edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842/postgresql/dbs/paysourcing/bdr/node_slots.data
origin_name target_name slot_name active sent_lag_size write_lag_size flush_lag_size replay_lag_size
ist02_vm_db_ps02 ist02_vm_db_ps01 bdr_paysourcing_bdr_paysourcing_ist02_vm_db_ps01 t 0 bytes 0 bytes 0 bytes 0 bytes
ist02_vm_db_ps02 ist01_vm_db_ps02 bdr_paysourcing_bdr_paysourcing_ist01_vm_db_ps02 t 0 bytes 0 bytes 0 bytes 0 bytes
ist02_vm_db_ps02 ist01_vm_db_ps01 bdr_paysourcing_bdr_paysourcing_ist01_vm_db_ps01 t 0 bytes 0 bytes 0 bytes 0 bytes
\N \N bdr_paysourcing_bdr_paysourcing f 3463 kB 3463 kB 3463 kB 3463 kB

In this case, all replication slots are active (except the BDR group slot, which is always inactive). However, there is one peer replication slot with considerable replication lag:

origin_name target_name slot_name active sent_lag_size write_lag_size flush_lag_size replay_lag_size
ist01_vm_db_ps01 ist02_vm_db_ps02 bdr_paysourcing_bdr_paysourcing_ist02_vm_db_ps02 t 60 MB 107 MB 108 MB 108 MB

So we can try to investigate further replication between these 2 nodes.

If the customer was running BDR 3.6, then the command would be the following instead:

find . -type f -name node_slots.data -print -exec bash -c "awk -F'\t' '{print \$4,\$5,\$9,\$15,\$40,\$41,\$42,\$43}' {}; echo" \;

Worker errors

In most of the cases, an inactive replication slot and/or replication lag can be explained by checking the BDR worker errors on the target. The following command shows worker errors in all nodes:

find . -type f -name worker_errors.data  -print -exec bash -c "cat {}; echo" \;

Sample output:

$ find . -type f -name worker_errors.data  -print -exec bash -c "cat {}; echo" \;
./edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948/postgresql/dbs/paysourcing/bdr/worker_errors.data
node_group_name	origin_name	source_name	target_name	sub_name	worker_role	worker_role_name	worker_pid	error_time	error_ageerror_message	error_context_message	remoterelid	subwriter_id	subwriter_name
bdr_paysourcing	ist01_vm_db_ps02	ist01_vm_db_ps02	ist01_vm_db_ps01	bdr_paysourcing_bdr_paysourcing_ist0b9af1e1_ist031ccbe7	3	writer	596548	2022-10-10 07:59:40.404119+00	01:31:01.129739	pglogical worker received fast finish request, exiting	"while handling global lock GLOBAL_LOCK_DDL in database 16384 requested by node-id 3115261467 in local acquire stage acquiring_local
for ddl_epoch 94500184816528, ddl_epoch_lsn 55F2/8982A4C0
during apply of message in xact with commit-end lsn 0/0 xid 0 committs <unset> (action #0) (effective sess origin id=2 lsn=293/B2388601)
while consuming 'M' message from receiver for subscription bdr_paysourcing_bdr_paysourcing_ist0b9af1e1_ist031ccbe7 (id=2759508289) on node ist01_vm_db_ps01 (id=835501687) from upstream node ist01_vm_db_ps02 (id=3115261467, reporiginid=2)"	0	2617452294	bdr_paysourcing_bdr_paysourcing_ist0b9af1e1_ist031ccbe7
./edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413/postgresql/dbs/paysourcing/bdr/worker_errors.data
node_group_name	origin_name	source_name	target_name	sub_name	worker_role	worker_role_name	worker_pid	error_time	error_ageerror_message	error_context_message	remoterelid	subwriter_id	subwriter_name
./edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836/postgresql/dbs/paysourcing/bdr/worker_errors.data
node_group_name	origin_name	source_name	target_name	sub_name	worker_role	worker_role_name	worker_pid	error_time	error_ageerror_message	error_context_message	remoterelid	subwriter_id	subwriter_name
./edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842/postgresql/dbs/paysourcing/bdr/worker_errors.data
node_group_name	origin_name	source_name	target_name	sub_name	worker_role	worker_role_name	worker_pid	error_time	error_ageerror_message	error_context_message	remoterelid	subwriter_id	subwriter_name

Check specific configuration

The full PostgreSQL configuration (which includes PGD configuration) can be found in the configuration.out file in each Lasso report. However, in some cases we want to compare a single setting among the nodes. It's possible to do this with the following command, for example comparing setting *writers_per_subscription:

find . -type f -name configuration.out -print -exec bash -c "grep writers_per_subscription {}; echo" \;

Sample output:

$ find . -type f -name configuration.out -print -exec bash -c "grep writers_per_subscription {}; echo" \;
./edb-lasso-report-ist01-vm-db-ps01-5432-20221010-092948/postgresql/configuration.out
pglogical.max_writers_per_subscription	16	configuration file	16	8	16	/data/postgres/conf.d/bdr_configuration.conf	11	f
pglogical.writers_per_subscription	4	configuration file	4	1	4	/data/postgres/conf.d/bdr_configuration.conf	12	f
./edb-lasso-report-ist01-vm-db-ps02-5432-20221010-093413/postgresql/configuration.out
pglogical.max_writers_per_subscription	16	configuration file	16	8	16	/data/postgres/conf.d/bdr_configuration.conf	11	f
pglogical.writers_per_subscription	4	configuration file	4	1	4	/data/postgres/conf.d/bdr_configuration.conf	12	f
./edb-lasso-report-ist02-vm-db-ps01-5432-20221010-092836/postgresql/configuration.out
pglogical.max_writers_per_subscription	16	configuration file	16	8	16	/data/postgres/conf.d/bdr_configuration.conf	11	f
pglogical.writers_per_subscription	1	configuration file	1	1	1	/data/postgres/conf.d/bdr_configuration.conf	12	f
./edb-lasso-report-ist02-vm-db-ps02-5432-20221010-092842/postgresql/configuration.out
pglogical.max_writers_per_subscription	16	configuration file	16	8	16	/data/postgres/conf.d/bdr_configuration.conf	11	f
pglogical.writers_per_subscription	1	configuration file	1	1	1	/data/postgres/conf.d/bdr_configuration.conf	12	f

We can clearly see the difference (4 in half of the nodes and 1 in the other half).

Was this article helpful?

0 out of 0 found this helpful