PostgreSQL gaps in sequences

Rafik Mulla

This article explores the causes of gaps in stored values generated by sequences: special, single-row relations created directly with CREATE SEQUENCE, or created automatically for table columns declared with the serial or bigserial data types or with GENERATED ... AS IDENTITY.

1. Gaps in Sequences Caused by Transaction Rollbacks

Obtaining new values from sequences is done via the nextval(<sequence_name>) function, and the documentation makes clear that the results of calling that function are not rolled back if the transaction that called it is rolled back:

To avoid blocking concurrent transactions that obtain numbers from the same sequence, the value obtained by nextval is not reclaimed for re-use if the calling transaction later aborts. This means that transaction aborts or database crashes can result in gaps in the sequence of assigned values. That can happen without a transaction abort, too. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Thus, PostgreSQL sequence objects cannot be used to obtain “gapless” sequences.

Consider the following table created with an id column specified as its primary key with the bigserial data type, which is a shortcut for a bigint column that takes its default values from nextval() on a sequence that is automatically created and linked to the column:

testdb=# CREATE TABLE test (
testdb(#     id bigserial primary key,
testdb(#     data text unique
testdb(# );
CREATE TABLE
testdb=# \d test
                            Table "public.test"
 Column |  Type  | Collation | Nullable |             Default
--------+--------+-----------+----------+----------------------------------
 id     | bigint |           | not null | nextval('test_id_seq'::regclass)
 data   | text   |           |          |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)
    "test_data_key" UNIQUE CONSTRAINT, btree (data)

testdb=# \d test_id_seq
                        Sequence "public.test_id_seq"
  Type  | Start | Minimum |       Maximum       | Increment | Cycles? | Cache
--------+-------+---------+---------------------+-----------+---------+-------
 bigint |     1 |       1 | 9223372036854775807 |         1 | no      |     1
Owned by: public.test.id

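For reference, declaring the id column as bigserial is roughly equivalent to the following explicit setup (a sketch only; don't run it here since these objects already exist, and in practice PostgreSQL picks the sequence name, test_id_seq in this case):

CREATE SEQUENCE test_id_seq AS bigint;

CREATE TABLE test (
    id bigint primary key DEFAULT nextval('test_id_seq'),
    data text unique
);

ALTER SEQUENCE test_id_seq OWNED BY test.id;
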
Now let's insert a row, commit, and check the table's contents:

testdb=# BEGIN;
BEGIN

testdb=*# INSERT INTO test (data) VALUES ('val1');
INSERT 0 1

testdb=*# COMMIT;
COMMIT

testdb=# SELECT * FROM test;
 id | data
----+------
  1 | val1
(1 row)

Now let's insert a row and roll it back, then insert another row using the same value for the data column:

testdb=# BEGIN;
BEGIN

testdb=*# INSERT INTO test (data) VALUES ('val2');
INSERT 0 1

testdb=*# SELECT * FROM test;
 id | data
----+------
  1 | val1
  2 | val2
(2 rows)

testdb=*# ROLLBACK;
ROLLBACK

testdb=# SELECT * FROM test;
 id | data
----+------
  1 | val1
(1 row)

testdb=# BEGIN;
BEGIN

testdb=*# INSERT INTO test (data) values ('val2');
INSERT 0 1

testdb=*# COMMIT;
COMMIT

testdb=# SELECT * FROM test;
 id | data
----+------
  1 | val1
  3 | val2
(2 rows)

From this we can see:

  1. Since the first transaction was rolled back, the value stored in the data column with the unique constraint was effectively not used, allowing us to use that same value again in the subsequent insert.
  2. However, the sequence value of 2 generated by the first transaction was "lost" to the rollback: the next transaction generated and stored a 3 from the sequence.
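
As the quoted documentation notes, a gap can also appear without any rollback: an INSERT with an ON CONFLICT clause calls nextval() before it detects the conflict. A minimal sketch against the same test table (the exact ids depend on prior activity, so treat the numbers as illustrative):

-- 'val2' already exists, so nothing is inserted here,
-- but nextval('test_id_seq') has already been called for the would-be row
INSERT INTO test (data) VALUES ('val2') ON CONFLICT (data) DO NOTHING;

-- the next successful insert skips the value consumed above,
-- e.g. returning 5 rather than 4 at this point in the examples
INSERT INTO test (data) VALUES ('val3') RETURNING id;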

2. Gaps in Sequences Due to Caching

There are actually two types of caching that can be involved in sequence value generation:

  1. Cache values specified by the CACHE clause of CREATE SEQUENCE.
  2. Internal caching done by the server to minimize WAL writes from sequence updates.

CREATE SEQUENCE's CACHE clause

Let's recreate the table and sequence but this time we'll set up the sequence separately so that we can specify a CACHE value:

testdb=# drop table test;
DROP TABLE

testdb=# CREATE SEQUENCE myseq AS bigint CACHE 10;
CREATE SEQUENCE

testdb=# CREATE TABLE test(
testdb(#   id bigint primary key DEFAULT nextval('myseq'),
testdb(#   data text
testdb(# );
CREATE TABLE

testdb=# ALTER SEQUENCE myseq OWNED BY test.id;
ALTER SEQUENCE

What this cache value means is:

  1. The first time a given session calls nextval() for the sequence, that session is allocated a chunk of 10 unused values from the sequence relation, with the nextval() call returning the first value from the chunk.
  2. The next 9 nextval() calls for the sequence by that same session will return successive values from the already allocated chunk rather than pulling new values from the sequence relation.
  3. After that chunk of 10 values has been returned by 10 total nextval() calls in that same session, the next call to nextval() will allocate another chunk of 10 values from the sequence relation, and the whole process repeats.

Note that we can query the sequence relation directly to see its current values:

testdb=# SELECT * FROM myseq;
 last_value | log_cnt | is_called
------------+---------+-----------
          1 |       0 | f
(1 row)

The last_value determines what will be returned by calls to nextval() that actually update the sequence. If is_called is FALSE then last_value itself is returned, and if it is TRUE then last_value + 1 is returned. (Note: we'll cover the log_cnt value in the next section.)
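
A quick way to see this behavior is with a throwaway sequence (a sketch using a hypothetical sequence named demo_seq):

CREATE SEQUENCE demo_seq;

SELECT last_value, is_called FROM demo_seq;  -- 1, f

SELECT nextval('demo_seq');  -- returns 1: is_called was false, so last_value itself is used
SELECT nextval('demo_seq');  -- returns 2: is_called is now true, so last_value + 1 is used

DROP SEQUENCE demo_seq;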

Now let's insert a single row into the table; the id column will take its value from nextval('myseq') via its default, and we'll also note the session's backend pid as a simple session identifier for the examples that follow.

testdb=# BEGIN;
BEGIN

testdb=*# INSERT INTO test (data)
testdb-*#   SELECT 'foo'
testdb-*#   RETURNING id, data;
 id | data
----+------
  1 | foo
(1 row)

INSERT 0 1

testdb=*# COMMIT;
COMMIT

testdb=# SELECT * FROM myseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         10 |      32 | t
(1 row)

testdb=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
        2094214
(1 row)

Here we can see that the nextval('myseq') call made by the INSERT (via the id column's default) generated a value of 1, but the sequence relation's last_value is now set to 10 with is_called set to TRUE. In other words, the session allocated 10 values and returned the first, so the next 9 calls to nextval('myseq') in this session should return the rest of those 10 values without updating the sequence relation. Let's use generate_series() to quickly do that:

testdb=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
        2094214
(1 row)

testdb=# BEGIN;
BEGIN

testdb=*# INSERT INTO test (data)
testdb-*#   SELECT 'foo' FROM generate_series(1, 9)
testdb-*#   RETURNING id, data;
 id | data
----+------
  2 | foo
  3 | foo
  4 | foo
  5 | foo
  6 | foo
  7 | foo
  8 | foo
  9 | foo
 10 | foo
(9 rows)

INSERT 0 9

testdb=*# COMMIT;
COMMIT

testdb=# SELECT * FROM myseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         10 |      32 | t
(1 row)

As expected, the 9 nextval('myseq') calls returned the still-cached values but did not update the sequence relation. How does this relate to gaps? Let's call nextval('myseq') one more time in this session to allocate 10 more values to it.

testdb=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
        2094214
(1 row)

testdb=# BEGIN;
BEGIN

testdb=*# INSERT INTO test (data)
testdb-*#   SELECT 'foo'
testdb-*#   RETURNING id, data;
 id | data
----+------
 11 | foo
(1 row)

INSERT 0 1

testdb=*# COMMIT;
COMMIT

testdb=# SELECT * FROM myseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         20 |      32 | t
(1 row)

So that allocated the next 10-value chunk, updated the sequence relation to reflect that, and returned the first value of the chunk, which was 11. Now let's do the same thing in a different session.

testdb=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
        2100840
(1 row)

testdb=# BEGIN;
BEGIN

testdb=*# INSERT INTO test (id, data)
testdb-*#   SELECT nextval('myseq'), 'foo'
testdb-*#   RETURNING id, data;
 id | data
----+------
 21 | foo
(1 row)

INSERT 0 1

testdb=*# COMMIT;
COMMIT

testdb=# SELECT * FROM myseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         30 |      32 | t
(1 row)

Because the values 11-20 were cached by the first session (2094214), they cannot be used by the new session (2100840), so it allocated the next 10-value chunk for its own use. Here's the table's contents now:

testdb=# select * from test;
 id | data
----+------
  1 | foo
  2 | foo
  3 | foo
  4 | foo
  5 | foo
  6 | foo
  7 | foo
  8 | foo
  9 | foo
 10 | foo
 11 | foo
 21 | foo
(12 rows)

If session 2094214 exits without any further nextval('myseq') calls then the sequence will never return values 12-20, and if session 2100840 exits without any further nextval('myseq') calls then values 22-30 will never be returned, unless the sequence is reset so that it again starts from 1.
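
For completeness, resetting a sequence like that could look something like the following sketch (note that restarting a sequence that backs a primary key risks duplicate key errors while the existing rows remain):

-- restart the sequence at its start value (1 here)
ALTER SEQUENCE myseq RESTART;

-- or pick an explicit position with setval(); passing false for is_called
-- means the next nextval() will return exactly this value
SELECT setval('myseq', 1, false);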

Internal sequence value caching to reduce WAL writes

Consider that a large database with hundreds of tables receiving constant inserts may be generating many thousands of sequence values per second. If Postgres were to create a WAL record for each of those sequence relation updates, that would result in a huge amount of WAL data to write and replicate. To avoid that, Postgres internally allocates and caches sequence values in chunks of 32, with those allocated values being available to all sessions; this is what a sequence relation's log_cnt value is about.

Let's work with a sequence directly for these examples:

testdb=# create sequence newseq;
CREATE SEQUENCE
testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
          1 |       0 | f
(1 row)

So, before any sequence values have been generated with nextval('newseq') log_cnt starts at 0. Now let's generate a value from it, and we'll again use pg_backend_pid() as a session id for the following examples:

testdb=# select pg_backend_pid();
 pg_backend_pid
----------------
          16788
(1 row)

testdb=# select nextval('newseq');
 nextval
---------
       1
(1 row)

testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
          1 |      32 | t
(1 row)

Now log_cnt is 32, indicating that a new WAL log record will not be generated until 32 new values have been generated from the sequence. Calling it again will increase last_value while decreasing log_cnt:

testdb=# select pg_backend_pid();
 pg_backend_pid
----------------
          16788
(1 row)

testdb=# select nextval('newseq');
 nextval
---------
       2
(1 row)

testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
          2 |      31 | t
(1 row)

Now let's generate 15 more values in another session:

testdb=# select pg_backend_pid();
 pg_backend_pid
----------------
          16791
(1 row)

testdb=# select nextval('newseq') from generate_series(1, 15);
 nextval
---------
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
(15 rows)

testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         17 |      16 | t
(1 row)

And if we generate log_cnt more values in either session, log_cnt will drop to 0:

testdb=# select pg_backend_pid();
 pg_backend_pid
----------------
          16788
(1 row)

testdb=# select nextval('newseq') from generate_series(1, 16);
 nextval
---------
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
(16 rows)

testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         33 |       0 | t
(1 row)

And one more will show it reset back to 32:

testdb=# select pg_backend_pid();
 pg_backend_pid
----------------
          16791
(1 row)

testdb=# select nextval('newseq');
 nextval
---------
      34
(1 row)

testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         34 |      32 | t
(1 row)

Okay, so if these internally cached sequence values are available to all sessions, how can this create gaps in stored values generated from the sequence? There are three ways, all of which at their core come down to WAL recovery:

1. If the server crashes, then when it comes back up it will perform WAL recovery, in which case the last logged 32-value WAL record is used and last_value is set to the end of that block.

Let's cause a crash recovery by hard-killing the backend process for the session with pid 16788, then logging back in and checking the contents of our newseq sequence:

kill -9 16788

Now, back in session 16788, our first attempt to run another query indicates that the connection was lost because the server terminated abnormally; the connection is reset, and the second attempt to run the query works:

testdb=# select * from newseq;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)

testdb=# select nextval('newseq');
 nextval
---------
      67
(1 row)

testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         67 |      32 | t
(1 row)

So, before the server performed crash recovery due to our hard-killing a backend process, the sequence's last_value was 34; now it is 66 with is_called set to TRUE, so the next call to nextval('newseq') returns 67. The previously allocated values from 35 to 66 were lost in the crash.

2. On a standby server, a sequence's last_value will always track the 32-value WAL record increases, since the cached values only exist on the primary.

Let's generate five more values on the primary:

testdb=# select pg_backend_pid();
 pg_backend_pid
----------------
          18665
(1 row)

testdb=# select nextval('newseq') from generate_series(1, 5);
 nextval
---------
      68
      69
      70
      71
      72
(5 rows)

testdb=# select * from newseq;
 last_value | log_cnt | is_called
------------+---------+-----------
         72 |      27 | t
(1 row)

And here is what the contents of newseq look like on this server's standby:

testdb=# select * from newseq ;
 last_value | log_cnt | is_called
------------+---------+-----------
         99 |       0 | t
(1 row)

So if this standby were to be promoted at this point and put into production, the values 73-99 would never be returned, leaving a gap.

3. Recovery from a base backup.

Since recovering from a base backup is also based on WAL recovery, all sequences in the newly recovered server will have their last_value set according to the last 32-value WAL record logged by the backed-up server before the end of the base backup creation.
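
One way to get a feel for the size of such a gap after a restore is to compare the sequence's position with the highest stored value, for example (a sketch using the test table and myseq sequence from the earlier examples):

SELECT (SELECT last_value FROM myseq) AS sequence_position,
       (SELECT max(id) FROM test) AS highest_stored_id;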

3. GENERATED ... AS IDENTITY

Columns that get their values from the GENERATED ALWAYS AS IDENTITY and GENERATED BY DEFAULT AS IDENTITY specifications introduced in PostgreSQL 10 still use sequences to generate those values, and while the sequences aren't shown directly by \d <tablename> in psql, they are still visible in \ds output and can be looked up with pg_get_serial_sequence().

testdb=# CREATE TABLE generated_test (
testdb(#   id bigint primary key generated always as identity,
testdb(#   data text
testdb(# );
CREATE TABLE

testdb=# \d generated_test
                     Table "public.generated_test"
 Column |  Type  | Collation | Nullable |           Default
--------+--------+-----------+----------+------------------------------
 id     | bigint |           | not null | generated always as identity
 data   | text   |           |          |
Indexes:
    "generated_test_pkey" PRIMARY KEY, btree (id)

testdb=# SELECT pg_get_serial_sequence('generated_test', 'id');
    pg_get_serial_sequence
------------------------------
 public.generated_test_id_seq
(1 row)

Thus, gaps in these columns can still appear for all of the rollback and caching reasons demonstrated above.
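
For example, the rollback experiment from section 1 plays out the same way here (a sketch; the exact ids depend on prior activity):

BEGIN;
INSERT INTO generated_test (data) VALUES ('val1');  -- consumes an identity value, e.g. 1
ROLLBACK;

-- the value consumed by the rolled-back insert is not reused
INSERT INTO generated_test (data) VALUES ('val1') RETURNING id;  -- e.g. returns 2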
