Demonstration of deadlock and ways to fix it

Jakub Wartak
Jakub Wartak

A deadlock happens when two or more transactions are waiting for each other to release locks on resources, resulting in a dependency where none of the transactions can proceed. In this article we will show how an example of a deadlock and ways to solve it.

drop table t1;
create table t1 as select g AS id_pk, MOD(g,10)::bigint AS app_id, 'false'::bool AS changed from generate_series(1,1000) g(i);
alter table t1 add primary key (id_pk);
analyze t1;
deadlocks=# select * from t1 limit 15;
 id_pk | app_id | changed
-------+--------+---------
     1 |      1 | f
     2 |      2 | f
     3 |      3 | f
     4 |      4 | f
     5 |      5 | f
     6 |      6 | f
     7 |      7 | f
     8 |      8 | f
     9 |      9 | f
    10 |      0 | f
    11 |      1 | f
    12 |      2 | f
    13 |      3 | f
    14 |      4 | f
    15 |      5 | f

Please note the that app_id is cyclic and that in future queries we are going to adjust app_id.

Let's assume we have application that was running fine:

postgres@hive:~$ cat app_v1.sql
\set r1 random(1,10)
\set r2 random(1,10)
\set r3 random(1,10)
BEGIN;
        UPDATE t1 SET changed='true' WHERE app_id IN (:r1, :r2, :r3);
        \sleep 2 ms
END;

Let's run it:

postgres@hive:~$ /usr/lib/postgresql/13/bin/pgbench -p 5434 -f app_v1.sql -n -c 1 -T 5 -P 1 deadlocks
progress: 1.0 s, 75.6 tps, lat 12.896 ms stddev 13.097
progress: 2.0 s, 72.3 tps, lat 14.116 ms stddev 13.977
progress: 3.0 s, 82.9 tps, lat 11.853 ms stddev 12.230
progress: 4.0 s, 71.1 tps, lat 14.229 ms stddev 15.078
progress: 5.0 s, 69.0 tps, lat 14.568 ms stddev 14.507
[..]

So far so good. But then the increased traffic occured OR data distribution has changed and we have started having problems (note the --clients to 1 to 5 parameter change / parallel backends executing the same transaction):

postgres@hive:~$ /usr/lib/postgresql/13/bin/pgbench -p 5434 -f app_v1.sql -n -c 5 -T 5 -P 1 deadlocks
progress: 1.1 s, 5.7 tps, lat 18.142 ms stddev 9.681
pgbench: error: client 3 script 0 aborted in command 4 query 0: ERROR:  deadlock detected
DETAIL:  Process 3019 waits for ShareLock on transaction 1076122; blocked by process 3017.
Process 3017 waits for ShareLock on transaction 1076121; blocked by process 3019.
HINT:  See server log for query details.
CONTEXT:  while rechecking updated tuple (8,84) in relation "t1"
progress: 2.0 s, 1.0 tps, lat 1030.910 ms stddev NaN
pgbench: error: client 0 script 0 aborted in command 4 query 0: ERROR:  deadlock detected
DETAIL:  Process 3016 waits for ShareLock on transaction 1076123; blocked by process 3020.
Process 3020 waits for ExclusiveLock on tuple (9,1) of relation 117528 of database 117490; blocked by process 3017.
Process 3017 waits for ShareLock on transaction 1076119; blocked by process 3016.
HINT:  See server log for query details.
CONTEXT:  while rechecking updated tuple (11,117) in relation "t1"
progress: 3.0 s, 4.0 tps, lat 771.326 ms stddev 842.976
pgbench: error: client 4 script 0 aborted in command 4 query 0: ERROR:  deadlock detected
DETAIL:  Process 3020 waits for ShareLock on transaction 1076125; blocked by process 3018.
Process 3018 waits for ShareLock on transaction 1076129; blocked by process 3020.
HINT:  See server log for query details.
CONTEXT:  while rechecking updated tuple (22,5) in relation "t1"
progress: 4.0 s, 178.6 tps, lat 32.840 ms stddev 239.454
progress: 5.0 s, 199.9 tps, lat 10.242 ms stddev 5.395
transaction type: app_v1.sql

This is a self-deadlocking simple UPDATE due to concurrency as the rows are not locked in order across many backends, thus causing hazard of backends deadlocking each other. The only thing PostgreSQL can do in this situation is to perform cyclic deadlock detection and abort one of the conflicting transactions, so that the other one succeeds. According to the SQL standard there is no ORDER BY stanza in the UPDATE syntax that would allow to prevent this.

Simple ordering of the values coming from the application is not enough as evidenced below:

postgres@hive:~$ cat app_v1b.sql
BEGIN;
        UPDATE t1 SET changed='true' WHERE app_id IN (1, 2, 3);
        \sleep 2 ms
END;
postgres@hive:~$ /usr/lib/postgresql/13/bin/pgbench -p 5434 -f app_v1b.sql -n -c 5 -T 5 -P 1 deadlocks
progress: 1.0 s, 5.0 tps, lat 29.318 ms stddev 12.094
pgbench: error: client 0 script 0 aborted in command 1 query 0: ERROR:  deadlock detected
DETAIL:  Process 3110 waits for ShareLock on transaction 1076514; blocked by process 3112.
Process 3112 waits for ShareLock on transaction 1076519; blocked by process 3110.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (15,1) in relation "t1"
progress: 2.0 s, 1.0 tps, lat 1065.824 ms stddev NaN
pgbench: error: client 1 script 0 aborted in command 1 query 0: ERROR:  deadlock detected
DETAIL:  Process 3111 waits for ExclusiveLock on tuple (16,142) of relation 117533 of database 117490; blocked by process 3114.
Process 3114 waits for ShareLock on transaction 1076517; blocked by process 3113.
Process 3113 waits for ShareLock on transaction 1076520; blocked by process 3112.
Process 3112 waits for ExclusiveLock on tuple (16,142) of relation 117533 of database 117490; blocked by process 3111.
HINT:  See server log for query details.
pgbench: error: client 3 script 0 aborted in command 1 query 0: ERROR:  deadlock detected
DETAIL:  Process 3113 waits for ShareLock on transaction 1076520; blocked by process 3112.
Process 3112 waits for ExclusiveLock on tuple (16,142) of relation 117533 of database 117490; blocked by process 3114.
Process 3114 waits for ShareLock on transaction 1076517; blocked by process 3113.
HINT:  See server log for query details.
CONTEXT:  while rechecking updated tuple (14,7) in relation "t1"
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
pgbench: error: client 2 script 0 aborted in command 1 query 0: ERROR:  deadlock detected
DETAIL:  Process 3112 waits for ShareLock on transaction 1076518; blocked by process 3114.
Process 3114 waits for ShareLock on transaction 1076520; blocked by process 3112.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (16,142) in relation "t1"
^C

The solution is to rewrite such conflicting DML statements (in this case UPDATE) in the application or stored procedures & functions, so that in each case there is strict ordering of row-locks (consistent sort order):

postgres@hive:~$ cat app_v2_fixed.sql
\set r1 random(1, 10)
\set r2 random(1, 10)
\set r3 random(1, 10)
BEGIN;
        UPDATE t1 SET changed='t' FROM
                (SELECT ctid FROM t1 WHERE app_id IN (:r3, :r1, :r2) ORDER BY id_pk FOR UPDATE) orderedrows
                WHERE t1.ctid = orderedrows.ctid;
        \sleep 2 ms
END;

Let's try it out:

postgres@hive:~$ /usr/lib/postgresql/13/bin/pgbench -p 5434 -f app_v2_fixed.sql -n -c 10 -T 5 -P 1 deadlocks
progress: 1.0 s, 205.8 tps, lat 44.199 ms stddev 35.332
progress: 2.0 s, 170.2 tps, lat 60.539 ms stddev 70.530
progress: 3.0 s, 245.9 tps, lat 40.550 ms stddev 28.843
progress: 4.0 s, 204.1 tps, lat 49.509 ms stddev 47.044
progress: 5.0 s, 249.8 tps, lat 39.220 ms stddev 27.578

It is crucial to provide ORDER BY in SELECT FOR UPDATE above to guarantee lock ordering to eliminate deadlock(s).

As the root-cause is lack of strict row-locks, ordering the issue can also manifest for other DML statements like DELETE and SELECT FOR UPDATE too. The same queries may also work properly, but the issue may start happening simple because of changed data distribution over time. NOTE: in the below example we've just changed app_id to be not modulated, but instead app_id is now derivative of id_pk column:

drop table t1;
create table t1 as select g AS id_pk, (g/100)::bigint AS app_id, 'false'::bool AS changed from generate_series(1,1000) g(i);
alter table t1 add primary key (id_pk);
analyze t1;

deadlocks=# select * from t1 limit 15;
 id_pk | app_id | changed
-------+--------+---------
     1 |      0 | f
     2 |      0 | f
     3 |      0 | f
     4 |      0 | f
     5 |      0 | f
     6 |      0 | f
     7 |      0 | f
     8 |      0 | f
     9 |      0 | f
    10 |      0 | f
    11 |      0 | f
    12 |      0 | f
    13 |      0 | f
    14 |      0 | f
    15 |      0 | f
(15 rows)

deadlocks=# select * from t1 order by id_pk desc limit 15;
 id_pk | app_id | changed
-------+--------+---------
  1000 |     10 | f
   999 |      9 | f
   998 |      9 | f
   997 |      9 | f
   996 |      9 | f
   995 |      9 | f
   994 |      9 | f
   993 |      9 | f
   992 |      9 | f
   991 |      9 | f
   990 |      9 | f
   989 |      9 | f
   988 |      9 | f
   987 |      9 | f
   986 |      9 | f
(15 rows)

So let's take our orginal unfixed application and run it at high concurrency (-c 5):

postgres@hive:~$ /usr/lib/postgresql/13/bin/pgbench -p 5434 -f app_v1.sql -n -c 5 -T 5 -P 1 deadlocks
progress: 1.0 s, 184.9 tps, lat 26.112 ms stddev 16.586
progress: 2.0 s, 193.0 tps, lat 25.580 ms stddev 15.862
progress: 3.0 s, 199.1 tps, lat 25.286 ms stddev 13.049
progress: 4.0 s, 167.9 tps, lat 29.536 ms stddev 20.188
progress: 5.0 s, 189.0 tps, lat 26.568 ms stddev 17.543
postgres@hive:~$ /usr/lib/postgresql/13/bin/pgbench -p 5434 -f app_v1.sql -n -c 5 -T 5 -P 1 deadlocks
progress: 1.0 s, 160.0 tps, lat 30.188 ms stddev 21.525
progress: 2.0 s, 209.0 tps, lat 23.983 ms stddev 13.532
progress: 3.0 s, 55.0 tps, lat 29.024 ms stddev 25.831
pgbench: error: client 3 script 0 aborted in command 4 query 0: ERROR:  deadlock detected
DETAIL:  Process 3585 waits for ShareLock on transaction 1079695; blocked by process 3582.
Process 3582 waits for ShareLock on transaction 1079703; blocked by process 3586.
Process 3586 waits for ShareLock on transaction 1079704; blocked by process 3585.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (338,51) in relation "t1"
progress: 4.0 s, 133.1 tps, lat 50.593 ms stddev 167.500
progress: 5.0 s, 204.0 tps, lat 19.591 ms stddev 11.077
[..]

As evidenced there's just 1 deadlock across two runs of 5 seconds each (much less rare and the query and system are identical, the only thing changed is the layout of the data). Layout of the data is subject to the application.

With different RDMBS product it is the identicial situation, Oracle RDBMS even logs that this is not database product issue, but application one:

DEADLOCK DETECTED
Current SQL statement for this session:
[..SQL query]
The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
[..]

References:

Was this article helpful?

0 out of 0 found this helpful