PostgreSQL, an open-source relational database management system, provides a comprehensive set of features for managing and manipulating data. One such feature, introduced in PostgreSQL version 15, is the SQL-standard MERGE statement. It is a powerful tool for inserting, updating, and deleting data in a table in a single statement based on specified conditions. This feature is especially useful when working with large amounts of data, and it saves programming effort on the application side.
This article dives into the MERGE feature, covering its usage and benefits, and provides practical test cases to demonstrate its capabilities.
Note: This feature is primarily suited for database use cases that follow an OLAP approach rather than an OLTP one. For OLTP, INSERT ... ON CONFLICT offers better control and performance, so if that is your use case, we recommend using it instead of MERGE.
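For reference, an OLTP-style upsert with INSERT ... ON CONFLICT looks like this (a minimal sketch; the inventory table and its columns are hypothetical, and product_id is assumed to carry a unique constraint):

```sql
-- Insert one row, or add to the stock count if the key already exists
INSERT INTO inventory (product_id, qty_in_stock)
VALUES (1, 5)
ON CONFLICT (product_id)
DO UPDATE SET qty_in_stock = inventory.qty_in_stock + EXCLUDED.qty_in_stock;
```

Unlike MERGE, this form takes a row-level lock on the conflicting row, which is what makes it safe under the high concurrency typical of OLTP workloads.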
To understand and utilize the MERGE feature in PostgreSQL, readers should have a basic understanding of SQL queries and PostgreSQL syntax, and have PostgreSQL installed on their system. Familiarity with table structures and data manipulation operations will be beneficial.
MERGE provides a single SQL statement that can conditionally INSERT, UPDATE, or DELETE rows in a target_table using a data_source, a task that would otherwise require multiple procedural language statements.
First, the MERGE command performs a join from data_source to target_table, producing zero or more candidate change rows. For each candidate change row, the status of MATCHED or NOT MATCHED is determined, after which the WHEN clauses for that status are evaluated in the order specified. The first clause that evaluates to true is executed; no more than one WHEN clause is executed for any candidate change row.
The syntax of the MERGE command in PostgreSQL is as follows:
[ WITH with_query [, ...] ]
MERGE INTO [ ONLY ] target_table [ * ] [ [ AS ] target_alias ]
USING data_source ON join_condition
when_clause [...]
The data_source can be another table, or it can be a subquery. This is the general form:
{ [ ONLY ] source_table_name [ * ] |
( source_query ) }
[ [ AS ] source_alias ]
The when_clause determines what action to execute for each row that is matched or not matched. These are the two possible forms:
WHEN MATCHED [ AND condition ] THEN { merge_update | DELETE | DO NOTHING } |
WHEN NOT MATCHED [ AND condition ] THEN { merge_insert | DO NOTHING }
When a source row is matched in the target table, the row in the target table can be either updated or deleted. To delete it, simply write DELETE. To update it, the merge_update clause is used, which looks like this:
UPDATE SET { column_name = { expression | DEFAULT } |
( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] ) } [, ...]
When a source row is not matched in the target table, one can be inserted:
INSERT [( column_name [, ...] )] [ OVERRIDING { SYSTEM | USER } VALUE ]
{ VALUES ( { expression | DEFAULT } [, ...] ) | DEFAULT VALUES }
To learn more about the parameters used in the MERGE command, refer to the MERGE reference page.
Note: MERGE does not currently support the case where target_table is a materialized view, foreign table, or if it has any rules defined on it.
Let's look at a few examples to understand how MERGE works in PostgreSQL.
Suppose we have two tables, books_available and books_updated, where books_available holds the currently available number of books of each genre, and books_updated contains the updated inventory and the number of books sold.
-- Create the books_available table
CREATE TABLE books_available (
product_id INT PRIMARY KEY,
genre VARCHAR(50),
qty_in_stock INT
);
-- Create the updated books_updated table
CREATE TABLE books_updated (
product_id INT PRIMARY KEY,
genre VARCHAR(50),
books_arrival_qty INT
);
-- Insert values into the books_available table
INSERT INTO books_available (product_id, genre, qty_in_stock)
VALUES
(1, 'Thriller', 10),
(2, 'Romance', 15),
(3, 'Fantasy', 12),
(4, 'Fiction', 10),
(5, 'Sci-fi', 6);
-- Insert values into the updated books_updated table
INSERT INTO books_updated (product_id, genre, books_arrival_qty)
VALUES
(1, 'Thriller', 7),
(2, 'Romance', 9),
(4, 'Fiction', 10),
(6, 'Contemporary', 2),
(7, 'Mystery', 4);
Now let's run a MERGE that reads a series of changes from books_updated and applies them to books_available:
MERGE INTO books_available AS b
USING books_updated AS u
ON b.product_id = u.product_id
WHEN MATCHED THEN
UPDATE SET qty_in_stock = b.qty_in_stock + u.books_arrival_qty
WHEN NOT MATCHED THEN
INSERT (product_id, genre, qty_in_stock)
VALUES (u.product_id, u.genre, u.books_arrival_qty);
edb=# SELECT * FROM books_available ORDER BY product_id ASC;
 product_id |    genre     | qty_in_stock
------------+--------------+--------------
          1 | Thriller     |           17
          2 | Romance      |           24
          3 | Fantasy      |           12
          4 | Fiction      |           20
          5 | Sci-fi       |            6
          6 | Contemporary |            2
          7 | Mystery      |            4
(7 rows)
As we can see, the books_available table has been updated with the data from the books_updated table, and new rows have been inserted for the products with genres "Contemporary" and "Mystery".
Let's look at another example where all of the WHEN clause forms are utilized in the MERGE command.
The modified definitions and values of the same tables, before performing the merge operation, are as below:
CREATE TABLE books_available (
product_id SERIAL PRIMARY KEY,
genre VARCHAR(255),
qty_in_stock INT,
avg_rating NUMERIC(3, 1), -- Assuming average rating is stored with one decimal place
price_per_unit DECIMAL(10, 2) -- Assuming price per unit is stored with two decimal places
);
CREATE TABLE books_updated (
product_id SERIAL PRIMARY KEY,
genre VARCHAR(255),
books_sold_qty INT,
books_arrival_qty INT,
average_rating_update NUMERIC(3, 1), -- Assuming average rating is stored with one decimal place
price_update DECIMAL(10, 2) -- Assuming price per unit is stored with two decimal places
);
INSERT INTO books_available (product_id, genre, qty_in_stock, avg_rating, price_per_unit)
VALUES
(1, 'Thriller', 10, 4.5, 20.0),
(2, 'Romance', 15, 4.2, 18.0),
(3, 'Fantasy', 12, 4.7, 22.0),
(4, 'Fiction', 10, 4.0, 16.0),
(5, 'Sci-fi', 6, 4.3, 24.0);
INSERT INTO books_updated (product_id, genre, books_sold_qty, books_arrival_qty, average_rating_update, price_update)
VALUES
(1, 'Thriller', 7, 6, 4.6, 21.0),
(2, 'Romance', 9, 5, 4.3, 19.0),
(4, 'Fiction', 10, 0, 4.1, 17.0),
(6, 'Contemporary', 0, 0, 0.0, 0.0),
(7, 'Mystery', 0, 10, 4.8, 25.0);
Merge operation:
MERGE INTO books_available A
USING books_updated B
ON B.product_id = A.product_id
WHEN MATCHED AND qty_in_stock - books_sold_qty + books_arrival_qty > 0 THEN
UPDATE SET
qty_in_stock = qty_in_stock - books_sold_qty + books_arrival_qty,
avg_rating = B.average_rating_update,
price_per_unit = B.price_update
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED AND books_arrival_qty > 0 THEN
INSERT VALUES (
B.product_id,
B.genre,
B.books_arrival_qty,
B.average_rating_update,
B.price_update
)
WHEN NOT MATCHED THEN
DO NOTHING;
Updated values post-merge operation:
edb=# SELECT * FROM books_available ORDER BY product_id ASC;
 product_id |  genre   | qty_in_stock | avg_rating | price_per_unit
------------+----------+--------------+------------+----------------
          1 | Thriller |            9 |        4.6 |          21.00
          2 | Romance  |           11 |        4.3 |          19.00
          3 | Fantasy  |           12 |        4.7 |          22.00
          5 | Sci-fi   |            6 |        4.3 |          24.00
          7 | Mystery  |           10 |        4.8 |          25.00
(5 rows)
As we can see in the books_available table, the genres "Thriller" and "Romance" got updated; the genre "Fiction" got deleted because its updated quantity of books is 0; the genre "Mystery" got inserted as it was not present in books_available; and the genre "Contemporary" from books_updated did not get inserted because its books_arrival_qty is not greater than 0.
WITH provides a way to write auxiliary statements for use in a larger query. These statements, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for one query. Each auxiliary statement in a WITH clause can be a SELECT, INSERT, UPDATE, or DELETE; and the WITH clause itself is attached to a primary statement that can be a SELECT, INSERT, UPDATE, DELETE, or MERGE.
Let's use the same books_available and books_updated tables from above with the WITH clause below.
Merge operation:
WITH updated_books AS (
    SELECT
        B.product_id AS updated_product_id,
        B.genre AS updated_genre,
        B.books_arrival_qty AS updated_books_arrival_qty,
        B.price_update AS updated_price,
        B.average_rating_update AS updated_rating,
        B.books_arrival_qty * B.price_update AS total_value
    FROM books_updated B
    WHERE B.books_arrival_qty > 0
)
MERGE INTO books_available A
USING updated_books U
ON A.product_id = U.updated_product_id
WHEN MATCHED THEN
    UPDATE SET
        qty_in_stock = A.qty_in_stock + U.updated_books_arrival_qty,
        price_per_unit = U.updated_price
WHEN NOT MATCHED THEN
    INSERT (product_id, genre, qty_in_stock, avg_rating, price_per_unit)
    VALUES (U.updated_product_id, U.updated_genre, U.updated_books_arrival_qty, U.updated_rating, U.updated_price);
Updated values post-merge operation:
SELECT * FROM books_available ORDER BY product_id ASC;
 product_id |  genre   | qty_in_stock | avg_rating | price_per_unit
------------+----------+--------------+------------+----------------
          1 | Thriller |           16 |        4.6 |          21.00
          2 | Romance  |           20 |        4.3 |          19.00
          3 | Fantasy  |           12 |        4.7 |          22.00
          4 | Fiction  |           10 |        4.0 |          16.00
          5 | Sci-fi   |            6 |        4.3 |          24.00
          7 | Mystery  |           10 |        4.8 |          25.00
(6 rows)
This CTE selects the product ID, genre, new book quantity, updated average rating, and updated price from the books_updated table, but it only includes rows where books_arrival_qty is greater than 0.
We then use the MERGE statement to match records in the books_available table on the product ID. If a match is found, we update the quantity in stock and the price per unit; if there is no match, we insert a new record into the books_available table.
Without the MERGE command, the operations above would take many lines of code with a lot of complexity involved, which in turn would increase the overhead (in terms of time), as data would have to be shuttled between the database and the application multiple times. From a data processing perspective, this command is quite helpful: it reduces the disk I/O that three separate statements would incur, since the data is read from the source only once.
Note: The performance of the MERGE statement greatly depends on proper indexes being used to match the source and target tables. Apart from indexes, it is also essential that the join conditions are optimized. We should also filter the source table so that the statement fetches only the rows necessary for the operations.
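As a sketch of the kind of indexing that helps, assume a source table like books_updated from the examples above (where the primary keys already index product_id). If the source is routinely filtered, a partial index combines both recommendations:

```sql
-- Index the join column of the source; the WHERE clause makes it a
-- partial index that pre-filters rows the MERGE would ignore anyway
CREATE INDEX idx_books_updated_arrivals
    ON books_updated (product_id)
    WHERE books_arrival_qty > 0;
```

For the planner to use a partial index, the query reading the source must include the same filter condition.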
Initially, MATCHED or NOT MATCHED status is determined for each row. If a row is MATCHED and is concurrently modified, but still exists according to the join conditions, then all the WHEN MATCHED clauses are tested again. On the other hand, if it matches but is concurrently modified in a way that makes it no longer visible to the join, then the WHEN NOT MATCHED clauses are executed for it.
For WHEN NOT MATCHED, the reverse does not occur. If a row is considered NOT MATCHED and, because of a concurrent modification, becomes visible (i.e., it becomes MATCHED), then the whole command is aborted with an error, which might be one of:
ERROR: duplicate key value violates unique constraint
ERROR: tuple to be locked was already moved to another partition due to a concurrent update
How to deal with concurrency:
- The easiest approach is to not run two MERGE commands concurrently targeting the same table (perhaps: LOCK TABLE books_available IN SHARE MODE).
- If you must have two, make sure they do not have WHEN NOT MATCHED THEN INSERT clauses.
- If you must allow concurrent insertion, rewrite to INSERT ... ON CONFLICT ... DO UPDATE.
- If you do not lock the table, test extensively for concurrent scenarios; plan to spend much longer testing than developing.
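The table-lock approach might be sketched as follows, using the example tables from above. We choose SHARE ROW EXCLUSIVE mode here (an assumption on our part): unlike SHARE mode, it conflicts with itself, so two sessions running this block queue up instead of risking a deadlock when both later try to acquire row locks.

```sql
BEGIN;
-- Conflicts with itself: a second session running the same
-- LOCK TABLE waits here until this transaction commits
LOCK TABLE books_available IN SHARE ROW EXCLUSIVE MODE;

MERGE INTO books_available AS b
USING books_updated AS u
  ON b.product_id = u.product_id
WHEN MATCHED THEN
  UPDATE SET qty_in_stock = b.qty_in_stock + u.books_arrival_qty
WHEN NOT MATCHED THEN
  INSERT (product_id, genre, qty_in_stock)
  VALUES (u.product_id, u.genre, u.books_arrival_qty);

COMMIT;
```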
After running MERGE, the source data is presumably no longer useful, but we will want to run MERGE again in the future. To allow applications to write new data to the source table, several approaches are possible:
- Use a transaction in serializable or repeatable read isolation mode, and delete the rows in the source table after running MERGE in the same transaction. This will delete the rows that were seen by MERGE; any other rows that were inserted into the source table by transactions that committed after MERGE read the data will persist.
- Have two "source" tables: one which is being inserted into by the application (source_active), and a separate one that is used by MERGE (source_merging). Once MERGE finishes, truncate source_merging, and then use ALTER TABLE ... RENAME to swap each table in place of the other. The application can keep writing continuously to source_active, which MERGE will not touch.
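The isolation-mode approach can be sketched with the example tables from above. In repeatable read mode, both statements share the snapshot taken at the first query, so the DELETE removes exactly the rows the MERGE consumed:

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;

MERGE INTO books_available AS b
USING books_updated AS u
  ON b.product_id = u.product_id
WHEN MATCHED THEN
  UPDATE SET qty_in_stock = b.qty_in_stock + u.books_arrival_qty
WHEN NOT MATCHED THEN
  INSERT (product_id, genre, qty_in_stock)
  VALUES (u.product_id, u.genre, u.books_arrival_qty);

-- Sees the same snapshot as the MERGE above; rows committed by
-- other transactions in the meantime survive for the next run
DELETE FROM books_updated;

COMMIT;
```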
- Use a CTE that deletes the source data at the same time as it returns it:
WITH source_cte AS (DELETE FROM source RETURNING *)
MERGE INTO target USING source_cte ON ...
- As above, but use a flag that marks a row as processed, and ignore rows that have it:
WITH source_cte AS (
    UPDATE source
    SET row_merged = true
    WHERE <some condition> AND NOT row_merged
    RETURNING *
) MERGE INTO target_tbl USING source_cte ON ...
Note that in this last case, history is preserved: rows that have been processed are still in the source table after MERGE is done.
As a more elaborate alternative to this last one, you might make row_merged a timestamp, which is initially NULL and which source_cte sets to current_timestamp (so the WHERE condition would be AND row_merged IS NULL). This lets you know exactly when each source row was merged into the target table.
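Concretely, the timestamp variant might look like this with the example tables from above, assuming a row_merged timestamptz column (initially NULL) has been added to books_updated:

```sql
-- Rows with row_merged still NULL are unprocessed; consuming them
-- records the exact moment they were merged into the target
WITH source_cte AS (
    UPDATE books_updated
    SET row_merged = current_timestamp
    WHERE row_merged IS NULL
    RETURNING *
)
MERGE INTO books_available AS b
USING source_cte AS u
  ON b.product_id = u.product_id
WHEN MATCHED THEN
  UPDATE SET qty_in_stock = b.qty_in_stock + u.books_arrival_qty
WHEN NOT MATCHED THEN
  INSERT (product_id, genre, qty_in_stock)
  VALUES (u.product_id, u.genre, u.books_arrival_qty);
```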
No separate triggers are defined for MERGE. Instead, statement-level and row-level UPDATE, DELETE, and INSERT triggers are fired depending on (for statement-level triggers) what actions are specified in the MERGE query and (for row-level triggers) what actions are performed.
While running a MERGE command, statement-level BEFORE and AFTER triggers are fired for events specified in the actions of the MERGE command, irrespective of whether or not the action is ultimately performed. This is the same as an UPDATE statement that updates no rows: statement-level triggers are fired for it, yet no row-level triggers are fired. Row-level triggers are fired only when a row is actually updated, inserted, or deleted. So it is perfectly legal that while statement-level triggers are fired for certain types of action, no row-level triggers are fired for the same kind of action.
Trigger behaviour:
- BEFORE EACH STATEMENT: INSERT, UPDATE, DELETE
- BEFORE EACH ROW: each row as scanned by the join
- AFTER EACH ROW: each row, in the same order as above
- AFTER EACH STATEMENT: DELETE, UPDATE
To know more about trigger behaviour, please refer to Overview of Trigger Behavior.
Example: Using the same tables from above (example 2), create a function and trigger as below:
-- Create an AFTER INSERT OR UPDATE trigger on books_available
CREATE OR REPLACE FUNCTION check_stock_trigger()
RETURNS TRIGGER AS $$
BEGIN
-- Check if the stock quantity is zero
IF NEW.qty_in_stock = 0 THEN
RAISE WARNING 'Call the editor to place an order for book %', NEW.product_id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Attach the trigger to the books_available table
CREATE TRIGGER after_stock_update
AFTER INSERT OR UPDATE
ON books_available
FOR EACH ROW
EXECUTE FUNCTION check_stock_trigger();
Run the MERGE command:
edb=# MERGE INTO books_available AS A
USING books_updated AS B
ON A.product_id = B.product_id
WHEN MATCHED THEN
UPDATE SET qty_in_stock = B.books_arrival_qty;
WARNING: Call the editor to place an order for book 4
MERGE 3
edb=# SELECT * FROM books_available ORDER BY product_id ASC;
 product_id |  genre   | qty_in_stock | avg_rating | price_per_unit
------------+----------+--------------+------------+----------------
          1 | Thriller |            6 |        4.5 |          20.00
          2 | Romance  |            5 |        4.2 |          18.00
          3 | Fantasy  |           12 |        4.7 |          22.00
          4 | Fiction  |            0 |        4.0 |          16.00
          5 | Sci-fi   |            6 |        4.3 |          24.00
(5 rows)
Here, the trigger raises a warning when the stock becomes 0 for the book with product_id 4.
We have shown how the MERGE command in PostgreSQL is a powerful tool for modifying data in a table based on specified conditions. It simplifies complex data change operations and saves time and resources when working with large amounts of data. Database administrators and developers can leverage this feature to improve the performance and scalability of their applications while making their application code easier to maintain.