This article provides a way to fix duplicated rows in the pg_depend catalog table in order to reestablish replication on a BDR cluster.
In some specific scenarios, when the BDR Autopartition worker cannot create a partition due to a lock and has to retry, the corresponding dependency record in pg_depend is stored twice. This issue applies when:
- You are using a BDR version earlier than 3.7.16 (the fix is included as of that version);
- And the following error message appears in the database logs:
"ERROR: unexpected number of extension dependency records"
Before proceeding with the workaround of removing the duplicates from pg_depend, you will need to:
a)- Schedule an application downtime;
b)- As root, stop Postgres on the affected node. In our example, node01 is the affected node;
c)- Remove pglogical,bdr from the shared_preload_libraries setting in the Postgres configuration;
d)- Start Postgres again; this time it will not start the pglogical/BDR workers, which is a precondition for the procedure below. A sketch of these steps follows this list.
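A minimal sketch of steps b) through d), assuming a systemd-managed instance with a service named postgres and the setting kept in postgresql.conf (both are assumptions; adjust to your installation):

# As root on node01:
systemctl stop postgres

# In postgresql.conf (or postgresql.auto.conf, if the setting was applied
# with ALTER SYSTEM), change:
#   shared_preload_libraries = 'pglogical,bdr'
# to:
#   shared_preload_libraries = ''

systemctl start postgres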
First, we provide a TEST procedure that performs the removal against custom-created copies of the catalog tables, so you can rehearse the steps before running them on the real catalogs.
1- We need to check the oid values of pg_class and pg_extension:
bdrdb=# SELECT oid, relname FROM pg_class WHERE relname IN ('pg_class', 'pg_extension');
 oid  |   relname
------+--------------
 1259 | pg_class
 3079 | pg_extension
(2 rows)
These values will be used in the following steps.
2- We create custom tables loaded with the same data as pg_depend and pg_class, in order to test the
procedure before running it on the real catalog tables:
SELECT * INTO pg_depend_e_node01 FROM pg_depend WHERE deptype = 'e';
SELECT oid,* INTO pg_class_node01 FROM pg_class;
NOTE: BDR is disabled on the node, so these tables will not be replicated to the other nodes.
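Also note that on PostgreSQL 12 and newer, oid is an ordinary column of pg_class and is already included in SELECT *, so listing it explicitly would fail with a duplicate-column error. On those versions, create the copy like this instead:

SELECT * INTO pg_class_node01 FROM pg_class;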
3- After loading your pg_depend and pg_class data into the custom tables, we can use the following query to find
the duplicated records (note the custom pg_depend and pg_class table names and the oid values obtained above):
SELECT c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype,
count(*)
FROM pg_depend_e_node01 d
JOIN pg_class_node01 c ON c.oid = d.objid
WHERE d.deptype = 'e'
AND d.classid = 1259
AND d.refclassid = 3079
AND c.relname LIKE 'conflict_history%'
GROUP BY c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype
HAVING count(*) > 1
ORDER BY 1;
That query returns all problematic tables; in our test example there is only one:
                  relname                   | classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype | count
---------------------------------------------+---------+-------+----------+------------+----------+-------------+---------+-------
 conflict_history_part_1147479802_547832982 |    1259 | 18516 |        0 |       3079 |    17675 |           0 | e       |     2
(1 row)
You can confirm this situation by checking the table bdr.autopartition_local_work_queue_status on node node01; it should
still show that work as pending:
 workid |       relname        |                  partition                  | workkind |                                             sql                                             | status
--------+----------------------+---------------------------------------------+----------+----------------------------------------------------------------------------------------------+--------
     40 | bdr.conflict_history | conflict_history_part_1147479802_547832982 | d        | SELECT bdr.autopartition_drop_partition('bdr.conflict_history_part_1147479802_547832982'); | \N
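For reference, the pending work shown above can be listed with a query like the following (the column list is taken from the output above):

SELECT workid, relname, partition, workkind, sql, status
FROM bdr.autopartition_local_work_queue_status;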
4- After identifying the duplicates, we rely on the ctid hidden system column to choose which of the two row versions
to delete. We can see both row versions with this query:
SELECT c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype,
count(*),
min(d.ctid),
max(d.ctid)
FROM pg_depend_e_node01 d
JOIN pg_class_node01 c ON c.oid = d.objid
WHERE d.deptype = 'e'
AND d.classid = 1259
AND d.refclassid = 3079
AND c.relname LIKE 'conflict_history%'
GROUP BY c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype
HAVING count(*) > 1
ORDER BY 1;
In our example case, the output will be the following (note the min and max columns):
                  relname                   | classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype | count |  min   |  max
---------------------------------------------+---------+-------+----------+------------+----------+-------------+---------+-------+--------+--------
 conflict_history_part_1147479802_547832982 |    1259 | 18516 |        0 |       3079 |    17675 |           0 | e       |     2 | (6,23) | (6,34)
(1 row)
5- Then we can try the DELETE statement, still against the custom test tables. As the postgres user, on the bdrdb database on node node01, start a new transaction:
BEGIN;
6- Execute the DELETE
:
WITH cte AS (
SELECT c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype,
count(*),
min(d.ctid),
max(d.ctid)
FROM pg_depend_e_node01 d
JOIN pg_class_node01 c ON c.oid = d.objid
WHERE d.deptype = 'e'
AND d.classid = 1259
AND d.refclassid = 3079
AND c.relname LIKE 'conflict_history%'
GROUP BY c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype
HAVING count(*) > 1
)
DELETE
FROM pg_depend_e_node01
USING cte
WHERE cte.deptype = 'e'
AND cte.classid = 1259
AND cte.refclassid = 3079
AND cte.min = ctid;
DELETE 1
The DELETE 1 output confirms that exactly one row (the min ctid version of the duplicate) was removed.
IMPORTANT: If you don't see the expected number of rows being deleted, abort the transaction with ROLLBACK;
7- Before finishing the transaction, execute the check query again:
WITH cte AS (
SELECT c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype,
count(*),
min(d.ctid),
max(d.ctid)
FROM pg_depend_e_node01 d
JOIN pg_class_node01 c ON c.oid = d.objid
WHERE d.deptype = 'e'
AND d.classid = 1259
AND d.refclassid = 3079
AND c.relname LIKE 'conflict_history%'
GROUP BY c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype
HAVING count(*) > 1
)
SELECT d.*
FROM pg_depend_e_node01 d
JOIN cte c ON c.min = d.ctid;
It should return no duplicates now:
 classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
(0 rows)
8- Commit the previous transaction:
COMMIT;
1- Once you have TESTED the previous procedure, you need to perform the same steps, but against the REAL catalog tables:
SELECT c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype,
count(*),
min(d.ctid),
max(d.ctid)
FROM pg_depend d
JOIN pg_class c ON c.oid = d.objid
WHERE d.deptype = 'e'
AND d.classid = 1259
AND d.refclassid = 3079
AND c.relname LIKE 'conflict_history%'
GROUP BY c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype
HAVING count(*) > 1;
IMPORTANT: You can also run the above query on its own to determine whether a node is affected by this bug.
2- Translating the DELETE query to the REAL catalog table names is easy: just use the query above as the CTE:
WITH cte AS (
SELECT c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype,
count(*),
min(d.ctid),
max(d.ctid)
FROM pg_depend d
JOIN pg_class c ON c.oid = d.objid
WHERE d.deptype = 'e'
AND d.classid = 1259
AND d.refclassid = 3079
AND c.relname LIKE 'conflict_history%'
GROUP BY c.relname,
d.classid,
d.objid,
d.objsubid,
d.refclassid,
d.refobjid,
d.refobjsubid,
d.deptype
HAVING count(*) > 1
)
DELETE
FROM pg_depend
USING cte
WHERE cte.deptype = 'e'
AND cte.classid = 1259
AND cte.refclassid = 3079
AND cte.min = ctid;
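As in the test run, it is safer to wrap this DELETE in an explicit transaction and verify the result before making it permanent, for example:

BEGIN;
-- run the DELETE above; expect one deleted row per duplicated dependency
-- re-run the detection query from step 1; it should now return 0 rows
COMMIT;  -- or ROLLBACK; if the result is not what you expect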
3- After carefully deleting the duplicated rows from pg_depend, on node node01:
a- Stop Postgres;
b- Add pglogical,bdr back to shared_preload_libraries;
c- Start Postgres again (see the sketch below).
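A minimal sketch of these final steps, again assuming a systemd-managed instance with a service named postgres (adjust the service name and configuration path to your installation):

# As root on node01:
systemctl stop postgres

# In postgresql.conf, restore:
#   shared_preload_libraries = 'pglogical,bdr'

systemctl start postgres

After the restart, the pglogical/BDR workers start again, and the Autopartition worker should be able to process the pending work; you can confirm this by checking bdr.autopartition_local_work_queue_status once more.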