When pglogical is in use within a cluster of Postgres nodes, there are many possible failure scenarios where logical replication may cease. In these cases, it's important to collect several pieces of forensic information from Postgres catalogs and logs.
On every provider and subscriber with the pglogical
extension installed, open a psql
command line utility and execute these commands, with "nodename" changed to the node name as specified in pglogical.node
, and "dbname" changed to the name of the database. Depending on the version of postgres you are running the query you'll have to choose one or another of the following SQL
scripts to run.
If the node is running postgres 9.4 or 9.5 use the follow snippet:
\o nodename-dbname-pglogical-info.txt
SELECT pglogical.pglogical_version();
\echo "pglogical replication sets"
SELECT * FROM pglogical.replication_set;
\echo "pglogical subscriptions"
SELECT * FROM pglogical.subscription;
\echo "pglogical nodes"
SELECT * FROM pglogical.node;
\echo "pglogical node_interface"
SELECT * FROM pglogical.node_interface;
\echo "pglogical stat activity"
SELECT * FROM pg_stat_activity
WHERE application_name LIKE '%pglogical%';
\echo "pglogical tables in replication sets"
SELECT s.set_name, st.*
FROM pglogical.replication_set_table st
JOIN pglogical.replication_set s USING (set_id);
\echo "pglogical replication stats and lag"
SELECT *,
pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) as restart_lsn_lag
FROM pg_replication_slots;
\echo "pglogical queue"
SELECT *
FROM pglogical.queue
ORDER BY queued_at DESC
LIMIT 20;
\o
On postgres 9.6 use the following code instead:
\o nodename-dbname-pglogical-info.txt
SELECT pglogical.pglogical_version();
\echo "pglogical replication sets"
SELECT * FROM pglogical.replication_set;
\echo "pglogical subscriptions"
SELECT * FROM pglogical.subscription;
\echo "pglogical nodes"
SELECT * FROM pglogical.node;
\echo "pglogical node_interface"
SELECT * FROM pglogical.node_interface;
\echo "pglogical stat activity"
SELECT * FROM pg_stat_activity
WHERE application_name LIKE '%pglogical%';
\echo "pglogical tables in replication sets"
SELECT s.set_name, st.*
FROM pglogical.replication_set_table st
JOIN pglogical.replication_set s USING (set_id);
\echo "pglogical replication stats and lag"
SELECT *,
pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) as restart_lsn_lag,
pg_xlog_location_diff(pg_current_xlog_insert_location(), confirmed_flush_lsn) as confirmed_flush_lag
FROM pg_replication_slots;
\echo "pglogical queue"
SELECT *
FROM pglogical.queue
ORDER BY queued_at DESC
LIMIT 20;
\o
Finally if running postgres 10 or better, execute the SQL
script below:
\o nodename-dbname-pglogical-info.txt
SELECT pglogical.pglogical_version();
\echo "pglogical replication sets"
SELECT * FROM pglogical.replication_set;
\echo "pglogical subscriptions"
SELECT * FROM pglogical.subscription;
\echo "pglogical nodes"
SELECT * FROM pglogical.node;
\echo "pglogical node_interface"
SELECT * FROM pglogical.node_interface;
\echo "pglogical stat activity"
SELECT * FROM pg_stat_activity
WHERE application_name LIKE '%pglogical%';
\echo "pglogical tables in replication sets"
SELECT s.set_name, st.*
FROM pglogical.replication_set_table st
JOIN pglogical.replication_set s USING (set_id);
\echo "pglogical replication stats and lag"
SELECT *,
pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn) as restart_lsn_lag,
pg_wal_lsn_diff(pg_current_wal_insert_lsn(), confirmed_flush_lsn) as confirmed_flush_lag
FROM pg_replication_slots;
\echo "pglogical queue"
SELECT *
FROM pglogical.queue
ORDER BY queued_at DESC
LIMIT 20;
\o
This file will contain several diagnostic elements necessary for determining the current state of pglogical within the cluster, as well as what may potentially be wrong.
In addition to the Postgres catalog, quite a bit of information may be available in the Postgres instance logs as well. Before obtaining this, it's best to enable extended logging by setting the following parameter in postgresql.conf
:
log_line_prefix='%m [%p] [%d] [%a] '
The log_line_prefix
parameter does not need to change if it already contains all of the listed placeholders; it's merely important that they're included. Afterwards, Postgres should be reloaded so the new settings are applied.
Once the extended prefix is enabled, the easiest way to obtain informative log lines is with a simple egrep
command on the existing Postgres log file.
egrep 'ERROR|FATAL|PANIC|pglogical' postgresql.log
Again, this should be repeated on all physical nodes running pglogical, whether as a provider or subscriber.