This article discusses how to configure BDR Replication Sets to implement Data Sharding in a BDR cluster, where common data is available on all nodes while nodes in different locations can hold different data from each other.
By "Data Sharding" we understand the following concepts:
1- Same data in same location;
2- Locations have different data among each other;
3- Optionally a global set of data is replicated among all nodes;
4- A "Shard", a "Silo" or a "Region" is a group of BDR nodes that replicate common data.
It's possible to implement these concepts in BDR 3.6 or newer, provided that:
1- There will be a single "Global" replication set, and all nodes should subscribe to this replication set; and
2- There will be multiple "Shard" replication sets, one per Shard, Silo or Region you are defining. All nodes subscribing to the same Shard replication set are part of the same Shard.
Let's consider as an example a cluster composed of 2 shards, with 2 nodes each:
- s1node1
- s1node2
- s2node1
- s2node2
Nodes s1node1 and s1node2 belong to Shard 1, while nodes s2node1 and s2node2 belong to Shard 2.
When you initialize a BDR cluster, it already has a replication set associated with the BDR Group. By default this replication set has autoadd_tables enabled, which means that every new table is automatically included in the replication set.
We are using the default BDR Group replication set as our "Global" replication set. Let's assume the name of this replication set is bdrgroup. So first we need to configure this replication set to stop automatically adding new tables:
SELECT bdr.alter_replication_set(
set_name := 'bdrgroup',
autoadd_tables := false
);
Now you should remove from this replication set all tables that are not supposed to belong to the Global replication set. This is done with:
SELECT bdr.replication_set_remove_table(
relation := 'mytable',
set_name := 'bdrgroup'
);
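To see which tables still belong to the Global replication set, a query along these lines can help. This is a hedged sketch: it relies on the bdr.tables catalog view, whose availability and columns may vary between BDR versions, so verify it against the documentation for your release:
-- List tables currently included in the bdrgroup replication set,
-- so you can decide which ones to remove:
SELECT nspname, relname
FROM bdr.tables
WHERE set_name = 'bdrgroup';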
Then you can create the Shard replication sets. In our case we are creating only 2, but you can create as many as you need:
SELECT bdr.create_replication_set(
set_name := 'bdrs01',
replicate_insert := true,
replicate_update := true,
replicate_delete := true,
replicate_truncate := true,
autoadd_tables := false,
autoadd_existing := false
);
SELECT bdr.create_replication_set(
set_name := 'bdrs02',
replicate_insert := true,
replicate_update := true,
replicate_delete := true,
replicate_truncate := true,
autoadd_tables := false,
autoadd_existing := false
);
The tables you don't want in the Global replication set are called "Sharded Tables". Those tables need to be added to all Shard replication sets individually. So for each table:
SELECT bdr.replication_set_add_table(
relation := 'mytable',
set_name := 'bdrs01'
);
SELECT bdr.replication_set_add_table(
relation := 'mytable',
set_name := 'bdrs02'
);
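If you have many sharded tables, the per-table calls can be wrapped in a loop. The sketch below is illustrative: it assumes a hypothetical schema myschema that contains only sharded tables, and adds every regular table in it to both Shard replication sets:
DO $$
DECLARE
    t regclass;
BEGIN
    FOR t IN
        SELECT c.oid::regclass
        FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE n.nspname = 'myschema'   -- hypothetical schema holding only sharded tables
          AND c.relkind = 'r'          -- regular tables only
    LOOP
        PERFORM bdr.replication_set_add_table(relation := t, set_name := 'bdrs01');
        PERFORM bdr.replication_set_add_table(relation := t, set_name := 'bdrs02');
    END LOOP;
END;
$$;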
Now, on each node, you specify which Shard it belongs to. Each node should subscribe to exactly 2 replication sets:
1- The Global replication set; and
2- One Shard replication set.
This is done using the bdr.alter_node_replication_sets function, which needs to be executed locally on the node you are configuring. So you need to go to every node and run the command, changing the set_names argument as appropriate:
-- On s1node1:
SELECT bdr.alter_node_replication_sets(
node_name := 's1node1',
set_names := ARRAY['bdrgroup', 'bdrs01']::TEXT[]
);
-- On s1node2:
SELECT bdr.alter_node_replication_sets(
node_name := 's1node2',
set_names := ARRAY['bdrgroup', 'bdrs01']::TEXT[]
);
-- On s2node1:
SELECT bdr.alter_node_replication_sets(
node_name := 's2node1',
set_names := ARRAY['bdrgroup', 'bdrs02']::TEXT[]
);
-- On s2node2:
SELECT bdr.alter_node_replication_sets(
node_name := 's2node2',
set_names := ARRAY['bdrgroup', 'bdrs02']::TEXT[]
);
After this, you can use the following query to check the replication set configuration:
WITH node_repsets AS (
SELECT jsonb_array_elements(
bdr.run_on_all_nodes($$
SELECT s.node_id, s.node_name,
COALESCE(
i.replication_sets,
ARRAY[s.default_repset_name]::NAME[]
) AS replication_sets
FROM bdr.local_node_summary s
INNER JOIN bdr.node_local_info i ON i.node_id = s.node_id
$$)::jsonb
) AS j
)
SELECT j->'response'->0->>'node_id' AS node_id,
j->'response'->0->>'node_name' AS node_name,
j->'response'->0->>'replication_sets' AS replication_sets
FROM node_repsets;
The output should be something like this:
  node_id   | node_name | replication_sets
------------+-----------+-------------------
 2875305183 | s1node1   | {bdrs01,bdrgroup}
 1017466418 | s1node2   | {bdrs01,bdrgroup}
 3995307724 | s2node1   | {bdrs02,bdrgroup}
 2445991795 | s2node2   | {bdrs02,bdrgroup}
(4 rows)
In BDR 3.7 and newer the query is different, because there are two replication set columns for each node:
WITH node_repsets AS (
SELECT jsonb_array_elements(
bdr.run_on_all_nodes($$
SELECT s.node_id,
s.node_name,
COALESCE(i.pub_repsets, s.pub_repsets) AS pub_repsets,
COALESCE(i.sub_repsets, s.sub_repsets) AS sub_repsets
FROM bdr.local_node_summary s
INNER JOIN bdr.node_local_info i
ON i.node_id = s.node_id;
$$)::jsonb
) AS j
)
SELECT j->'response'->'command_tuples'->0->>'node_id' AS node_id,
j->'response'->'command_tuples'->0->>'node_name' AS node_name,
j->'response'->'command_tuples'->0->>'pub_repsets' AS pub_repsets,
j->'response'->'command_tuples'->0->>'sub_repsets' AS sub_repsets
FROM node_repsets
ORDER BY 2;
But the output should be similar:
  node_id   | node_name |    pub_repsets    |    sub_repsets
------------+-----------+-------------------+-------------------
 2875305183 | s1node1   | {bdrs01,bdrgroup} | {bdrs01,bdrgroup}
 1017466418 | s1node2   | {bdrs01,bdrgroup} | {bdrs01,bdrgroup}
 3995307724 | s2node1   | {bdrs02,bdrgroup} | {bdrs02,bdrgroup}
 2445991795 | s2node2   | {bdrs02,bdrgroup} | {bdrs02,bdrgroup}
(4 rows)
If you are working with tables that already contain data, then for each table you are associating with a Shard replication set you need to decide what to do with the existing data. One possibility is to treat the existing data as belonging to the first shard, leaving the other shards empty. In this case you can truncate the table on the other shards, with transaction replication disabled, for example:
-- On nodes belonging to Shard 2:
BEGIN;
SET LOCAL bdr.xact_replication = off;
TRUNCATE TABLE mytable;
COMMIT;
Alternatively, you can specify a condition to split the existing data between the shards, for example:
-- On nodes belonging to Shard 1:
BEGIN;
SET LOCAL bdr.xact_replication = off;
DELETE
FROM mytable
WHERE <condition to NOT belong to Shard 1>;
COMMIT;
-- On nodes belonging to Shard 2:
BEGIN;
SET LOCAL bdr.xact_replication = off;
DELETE
FROM mytable
WHERE <condition to NOT belong to Shard 2>;
COMMIT;
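As a concrete illustration, suppose mytable had a hypothetical region column, and you wanted Shard 1 to keep the 'EU' rows while Shard 2 keeps the 'US' rows:
-- On nodes belonging to Shard 1 (keep only EU rows):
BEGIN;
SET LOCAL bdr.xact_replication = off;
DELETE FROM mytable WHERE region <> 'EU';  -- region is a hypothetical column
COMMIT;
-- On nodes belonging to Shard 2 (keep only US rows):
BEGIN;
SET LOCAL bdr.xact_replication = off;
DELETE FROM mytable WHERE region <> 'US';
COMMIT;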
With the Sharding configuration in place, consider a table mytable, for example:
1- If it's part of the bdrgroup replication set, then all nodes will have the same data for the table;
2- If the table is part of the bdrs01 and bdrs02 replication sets, then nodes s1node1 and s1node2 will see a set of data for this table that is different from the set of data nodes s2node1 and s2node2 will see.
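To visualize the behavior of a sharded table, a quick hedged test follows, assuming mytable is in the Shard replication sets and has hypothetical id and description columns:
-- On s1node1 (Shard 1): insert a row into the sharded table
INSERT INTO mytable (id, description) VALUES (101, 'row created in Shard 1');
-- On s1node2 (same Shard): the row is replicated
SELECT * FROM mytable WHERE id = 101;  -- returns the row
-- On s2node1 (Shard 2): the row never arrives
SELECT * FROM mytable WHERE id = 101;  -- returns no rows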
Bear in mind that, when joining a node, you need to know beforehand which Shard the node will belong to, and join via a peer that is already a member of that same Shard. When joining using the logical mode, you should call bdr.alter_node_replication_sets before calling bdr.join_node_group, and when joining using bdr_init_physical, you should specify the --replication-sets option.
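For example, here is a hedged sketch of a logical join for a hypothetical new node s1node3 joining Shard 1. The DSNs are illustrative assumptions; check the function signatures in the documentation for your BDR version:
-- On the new node s1node3, first create the local BDR node:
SELECT bdr.create_node(
    node_name := 's1node3',
    local_dsn := 'host=s1node3 dbname=bdrdb'
);
-- Set the replication sets BEFORE joining:
SELECT bdr.alter_node_replication_sets(
    node_name := 's1node3',
    set_names := ARRAY['bdrgroup', 'bdrs01']::TEXT[]
);
-- Then join via a peer that is already a member of Shard 1:
SELECT bdr.join_node_group(
    join_target_dsn := 'host=s1node1 dbname=bdrdb'
);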
Last but not least, this approach also implies that you should have a backup from at least one node belonging to each shard.
In BDR 3.7 and newer, while it's still possible to configure sharding with replication sets, there is a new feature called BDR Groups. Essentially the main group is the global one, and it's possible to create subgroups to define the shards. Each group already has its own default replication set. The steps would be as follows (a sketch appears after the list):
1- Create the bdrgroup group on node s1node1;
2- Create (and join) the bdrs01 sub-group (parent is the bdrgroup group) on node s1node1;
3- Create (and don't join) the bdrs02 sub-group (parent is the bdrgroup group) on node s1node1;
4- Join node s1node2 to s1node1 via sub-group bdrs01;
5- Join node s2node1 to s1node1 via sub-group bdrs02;
6- Join node s2node2 to s2node1 via sub-group bdrs02.
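A hedged sketch of those steps, assuming BDR 3.7 function signatures and illustrative DSNs (verify bdr.create_node_group's parent_group_name and join_node_group parameters against the documentation for your exact version):
-- Steps 1-3, on s1node1 (after bdr.create_node):
SELECT bdr.create_node_group(node_group_name := 'bdrgroup');
SELECT bdr.create_node_group(
    node_group_name := 'bdrs01',
    parent_group_name := 'bdrgroup',
    join_node_group := true
);
SELECT bdr.create_node_group(
    node_group_name := 'bdrs02',
    parent_group_name := 'bdrgroup',
    join_node_group := false
);
-- Step 4, on s1node2 (after bdr.create_node):
SELECT bdr.join_node_group(
    join_target_dsn := 'host=s1node1 dbname=bdrdb',
    node_group_name := 'bdrs01'
);
-- Step 5, on s2node1 (after bdr.create_node):
SELECT bdr.join_node_group(
    join_target_dsn := 'host=s1node1 dbname=bdrdb',
    node_group_name := 'bdrs02'
);
-- Step 6, on s2node2 (after bdr.create_node):
SELECT bdr.join_node_group(
    join_target_dsn := 'host=s2node1 dbname=bdrdb',
    node_group_name := 'bdrs02'
);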
IMPORTANT: Please perform any of the steps above in a lower environment before applying to production.
IMPORTANT: Reconfiguring replication sets requires application downtime during the whole process.