In this article we explain the concepts and provide recommendations on how to restore a dump into a PGD cluster.
There are several things to consider when restoring a pg_dump
to a PGD cluster:
- Global Lock
- Raft Consensus
- Replication lag
PGD transparently replicates DDL across the cluster by leveraging the cluster-wide BDR Global Lock. See the PGD documentation on DDL replication for the full details.
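As a minimal sketch of what "transparent" means here (the table name is hypothetical): ordinary DDL run on any node is replicated to the whole cluster with no special syntax, as long as Raft Consensus is healthy and replication lag is low:

-- Run on any node; under the default bdr.ddl_locking setting the statement
-- acquires the Global Lock and is replicated to every other node.
CREATE TABLE myschema.mytable (
    id   bigint PRIMARY KEY,
    data text
);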
The BDR Global Lock is orchestrated by Raft Consensus. The differences among the 3 possible values for bdr.ddl_locking are outlined below (a short example of setting this parameter per session follows the list):

- 'true', 'on' or 'ddl' (Global DDL Lock): Single, cluster-wide lock to serialize general DDL execution, regardless of relation;

- 'dml' (Global DML Lock): Single, cluster-wide lock to serialize general DDL execution per relation. It can be used to allow concurrent DDL on different relations.

  The name "Global DML Lock" can cause some confusion, because the Global DML Lock is also related to DDL replication. It has this name because of what it needs to wait for (pending DML on the relation) before acquiring the Global Lock on the relation. In that regard, the Global DML Lock behaves very similarly to the Postgres local relation lock. In fact, both the Global DDL Lock and the Global DML Lock acquire local relation locks on all nodes in order to apply the DDL;

- 'false' or 'off': Does not perform any checks and simply replicates DDL through WAL.
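Because bdr.ddl_locking is a regular configuration parameter, it can be changed per session or per transaction. A minimal sketch, assuming a hypothetical table myschema.mytable, of running a single DDL statement under the per-relation Global DML Lock:

-- Use the per-relation Global DML Lock for the next statement only,
-- allowing concurrent DDL on other relations of the cluster.
SET bdr.ddl_locking = 'dml';
ALTER TABLE myschema.mytable ADD COLUMN note text;
RESET bdr.ddl_locking;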
If there is replication lag, DDL might fail because it was not possible to acquire the Global Lock within bdr.global_lock_timeout. The transaction is canceled if it holds the Global Lock for more than bdr.global_lock_statement_timeout (if active) or bdr.global_lock_idle_timeout (if idle in transaction).
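Before a large restore it can be useful to check which limits are currently in effect on the node you will restore into. A minimal sketch, assuming the BDR extension is loaded so these parameters are registered in pg_settings:

-- Show the Global Lock timeout settings currently in effect on this node.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('bdr.global_lock_timeout',
               'bdr.global_lock_statement_timeout',
               'bdr.global_lock_idle_timeout');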
The 2 absolute initial conditions for the Global Lock to work properly are listed below (example verification queries follow the list):

1- Raft Consensus needs to be working and able to reach a majority of nodes. Without this, it's impossible to acquire the Global Lock;

2- Replication lag should be as low as needed for:

   - The DDL Epoch to be the same on all nodes. The DDL Epoch is an internal control similar to a schema version: from the point of view of PGD DDL replication, nodes with the same DDL Epoch have the same schema. You can observe the DDL Epoch on each node by inspecting the bdr.ddl_epoch catalog, whose fields are explained in the documentation;
   - The actual local lock (ExclusiveLock) to be acquired on the relation. Acquiring the local lock takes longer if there is a queue of pending transactions to be applied against the relation.
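A sketch of how you might verify these two conditions on each node before starting the restore. The bdr.ddl_epoch catalog is mentioned above; bdr.monitor_group_raft() is assumed to be available in your PGD version, so check your version's documentation if it is not:

-- 1- Check that Raft Consensus is working and has a majority of nodes.
SELECT * FROM bdr.monitor_group_raft();

-- 2- Check the DDL Epoch on this node; run on every node and compare.
SELECT * FROM bdr.ddl_epoch;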
One additional strong requirement for consistency when using bdr.ddl_locking = 'dml' is that you somehow enforce that only 1 of the nodes runs DDL at any given time.
The problem when restoring a pg_dump (or any other large batch job composed of mixed DDL and DML) is that it generates replication lag, and the backlog of pending transactions can become large enough that the Global Lock request times out even before the actual DDL arrives at the peer nodes through the WAL.
Because of this, it's common to see the dump restore fail with a timeout error similar to this:
pg_restore: [archiver (db)] could not execute query: ERROR: cannot drain replication queue for DML lock, another lock is already draining, 16174, 16176
CONTEXT: while acquiring global lock GLOBAL_LOCK_DML ("myschema"."mytable") in database 17731; stage acquiring_local
Increasing bdr.global_lock_timeout from 1 minute to a higher value can help if the replication lag isn't excessive. But even if you set it to 1 hour and the replication lag is longer than 1 hour (i.e., a transaction from the origin node still has not arrived on the target node after 1 hour), you will still see Global Lock timeouts.
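If you want to try raising the timeout first, a minimal sketch, assuming the parameter can be changed at the system level in your PGD version (the 1 hour value is only an example; adjust it to your observed lag):

-- Run on each node, then reload the configuration.
ALTER SYSTEM SET bdr.global_lock_timeout = '1h';
SELECT pg_reload_conf();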
The fastest way to ingest such a large dump into the PGD cluster is by temporarily disabling the Global Lock completely:
bdr.ddl_locking = off
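A minimal sketch of one way to apply this only to the database being restored into, rather than cluster-wide; the database name mydb is hypothetical, and it assumes bdr.ddl_locking can be set at the database level in your PGD version (otherwise set it in the restore session or in postgresql.conf):

-- New connections to mydb (including the pg_restore connection) skip the Global Lock.
ALTER DATABASE mydb SET bdr.ddl_locking = 'off';

-- ... run the pg_restore against mydb ...

-- Revert once the restore has finished and replication lag is back to zero.
ALTER DATABASE mydb RESET bdr.ddl_locking;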
IMPORTANT: As you are skipping the Global Lock, PGD won't be ensuring DDL consistency across the cluster. To account for this, it is crucial that you don't execute DDL on any other node while the dump is being restored, and afterward until the cluster-wide replication lag gets back down to zero, thereby ensuring that the restored schema and data have reached all of the nodes in your cluster.
IMPORTANT: If you have geographically distant PGD nodes, then restoring a large dump with the Global Lock skipped can easily create a huge replication lag. Watch replication lag and disk usage closely from the moment you start the restore until replication lag gets down to zero again.
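A minimal sketch of watching both from the origin node using only core Postgres views; the amount of WAL retained by each replication slot is a rough proxy for both the replication lag and the disk space pg_wal can grow to:

-- WAL still to be consumed by each peer's replication slot.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      confirmed_flush_lsn)) AS retained_wal
FROM pg_replication_slots;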