On systems doing a non-trivial amount of writes, tuning checkpoints properly is crucial for getting good performance. Yet checkpoints are one of the areas where we often run into configuration issues, both on the community mailing lists and during performance tuning reviews for our customers.
PostgreSQL relies on the write-ahead log (WAL) – all changes are written to WAL first, and only then applied to the data files. The primary purpose of WAL is durability: in case of a crash the database can read the changes from WAL and re-apply them to the data files, rebuilding them into a consistent state.
Let's say the system crashed and the database needs to perform recovery. The simplest approach would be to start from scratch and replay the whole WAL from the moment the database was initialized. In the end we should get a complete (and consistent) database. An obvious disadvantage is of course the need to keep and replay the whole WAL. We often deal with databases that are not extremely big (say, a few hundred GB) but produce several TB of WAL per day because they have a high rate of updates and deletes. So starting from scratch would not be very practical, both because of the disk space required to keep the WAL and the time needed to replay all the changes.
Checkpoints are a way to address this by limiting the amount of WAL that needs to be re-applied during recovery. The database regularly ensures that all changes up to a given point in the WAL are durably written to the data files, so that it's not necessary to look at older WAL during recovery. That means the database does not need to keep that older WAL, nor replay it during recovery, solving both the disk space and duration issues.
The question then is how often checkpoints should be created. Clearly, doing them very infrequently increases the amount of WAL we need to keep and the duration of recovery. Doing them very often (every minute or so) minimizes the amount of WAL and the recovery duration, but it also negatively impacts the performance of the database when not in recovery (which should be most of the time).
How do checkpoints improve performance? In the extreme case, writing the data both to WAL and to the data files may effectively double the amount of writes, but that's rarely the case: blocks are usually modified repeatedly, and the checkpoint then writes each modified data page just once in the background, effectively combining all the writes. This is particularly important because while writes to WAL are naturally sequential, writes to data files (backing tables and indexes) are inherently more random. The checkpointer performs various additional optimizations to make the writes more sequential (sorting the data etc.), but all these benefits only work when there is enough accumulated data to write out.
The database also needs to protect against "torn pages" by writing the whole page (and not just the modified part) into WAL when modifying it for the first time after a checkpoint. So frequent checkpoints may cause write amplification in WAL (which may trigger even more frequent checkpoints).
Finally, checkpoints must ensure that all data is really on disk, not in the operating system buffer cache or a disk controller's volatile write cache. So PostgreSQL must ask the operating system to flush all writes to durable storage, usually with fsync(). This can be time-consuming and can interrupt other concurrent activity.
So doing very frequent checkpoints (say, every couple of seconds or every minute or so) would minimize the amount of WAL required for recovery and the recovery duration. But it would also effectively turn the asynchronous background writes into synchronous ones, seriously disrupting user activity (e.g. increasing COMMIT latency, reducing throughput etc.). In fact, this is one of the issues we see with the default configuration values on large and/or busy databases.
So in practice we want checkpoints to happen infrequently enough not to affect users, but frequently enough to reasonably limit the recovery time and disk space requirements. The goal of the tuning is to find a reasonable compromise.
There are four reasons why a checkpoint can be triggered:
- executing the CHECKPOINT command directly
- executing a command that requires a checkpoint (e.g. pg_start_backup, CREATE DATABASE, or pg_ctl stop|restart, and a couple of others)
- reaching a configured amount of time since the last checkpoint
- generating a configured amount of WAL since the previous checkpoint (aka “running out of WAL” or “filling WAL”)
The first two cases are rather irrelevant here – those are rare, manually triggered events. This article is about configuring the regular periodic checkpoints, triggered either by time and/or by the amount of WAL generated since the last checkpoint.
Those time/size limits are set using two configuration options:
checkpoint_timeout = 5min
max_wal_size = 1GB (before PostgreSQL 9.5 this was checkpoint_segments)
With these (default) values, PostgreSQL will trigger a CHECKPOINT every 5 minutes, or after the WAL grows to about 1GB on disk.
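If you are not sure which values a given system is actually running with, you can check them directly. This is just a convenience query against the standard pg_settings view, nothing the tuning itself depends on:

SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('checkpoint_timeout', 'max_wal_size');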
Note: max_wal_size is a soft limit for the total WAL size, which has two consequences. Firstly, the database will try not to exceed it but is allowed to, so keep enough free space on the partition and monitor it. Secondly, it's not a “per checkpoint” limit – due to spread checkpoints (explained later) the WAL quota is divided between 2 – 3 checkpoints. So with max_wal_size = 1GB the database will start a CHECKPOINT after writing 300 – 500 MB of WAL, depending on checkpoint_completion_target.
The default values are rather low, like most other default values in the sample configuration file, sized to work even on small systems like Raspberry Pi.
But how do you determine good values for your system? The goal is to do checkpoints neither too often nor too infrequently, and our tuning “best practice” consists of two steps:
- choose a “reasonable” checkpoint_timeout value
- set max_wal_size high enough to be rarely reached
What a “reasonable” value for checkpoint_timeout means depends mostly on your recovery time objective (RTO), i.e. what the acceptable maximum recovery duration is.
One caveat is that checkpoint_timeout limits how long it took to generate the WAL, not directly the recovery time. It's not possible to say exactly how long a recovery will take, for a couple of reasons. Firstly, WAL is usually generated by multiple concurrent processes (queries etc.), while recovery is performed by a single process (and this limitation is unlikely to go away anytime soon). Secondly, recovery often happens right after a reboot, when file system caches are cold.
But in general, the default value (5 minutes) is too low and values between 30 minutes and 1 hour are fairly common. PostgreSQL 9.6 even increases the maximum from 1 hour to 1 day. Anyway, 30 minutes may not be perfect, but it's likely more reasonable than the default.
checkpoint_timeout = 30min
The next step is to estimate how much WAL the database produces over 30 minutes, so that we can use that for max_wal_size. There are several ways to determine how much WAL is generated:
- Look at the actual WAL position (essentially an offset in a file) using pg_current_xlog_insert_location every 30 minutes, and compute the difference between the positions.
- Enable log_checkpoints=on and then extract the information from the server log (there will be detailed statistics for each completed checkpoint, including the amount of WAL).
- Use data from pg_stat_bgwriter, which also includes the number of checkpoints (which you can combine with the knowledge of the current max_wal_size value).
Let’s use the first approach, for example. On a test machine running pgbench, you might see this:
postgres=# SELECT pg_current_xlog_insert_location();
 pg_current_xlog_insert_location
---------------------------------
 3D/B4020A58
(1 row)
... after 5 minutes ...
postgres=# SELECT pg_current_xlog_insert_location();
 pg_current_xlog_insert_location
---------------------------------
 3E/2203E0F8
(1 row)
postgres=# SELECT pg_xlog_location_diff('3E/2203E0F8', '3D/B4020A58');
 pg_xlog_location_diff
-----------------------
 1845614240
(1 row)
This means that every 5 minutes, the database generated ~1.8GB of WAL, so about 10GB of WAL every 30 minutes. However, as mentioned before, max_wal_size is a quota for 2 – 3 checkpoints combined, so max_wal_size = 30GB (3 x 10GB) seems about right.
The other approaches use different sources of data, but the idea is the same.
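For instance, for the third approach, pg_stat_bgwriter can tell you how many checkpoints were triggered by the timeout versus requested for other reasons (such as reaching max_wal_size) since the statistics were last reset – a sketch of the idea rather than a complete procedure:

SELECT checkpoints_timed,  -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,    -- requested checkpoints, e.g. due to max_wal_size
       stats_reset         -- when these counters were last reset
  FROM pg_stat_bgwriter;

If most checkpoints end up in checkpoints_req on a system that is supposed to be time-driven, that is a hint that max_wal_size is too low for the WAL rate.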
Tuning checkpoint_timeout and max_wal_size isn't the whole story. There’s another parameter, called checkpoint_completion_target, but to tune it you need to understand what a “spread checkpoint” means.
During a CHECKPOINT, the database needs to perform these three basic steps:
- identify all dirty (modified) blocks in shared buffers
- write all those buffers to disk (or rather to file system cache)
- fsync all the modified files to disk
Only when all those steps finish can the checkpoint be considered complete. You could do these steps “as fast as possible”, i.e. write all the dirty buffers in one go and then call fsync on the files (and that’s what PostgreSQL up to 8.2 did). But that leads to I/O stalls due to filling file system caches and saturating the devices, negatively affecting user sessions.
To address this, PostgreSQL 8.3 introduced the concept of “spread checkpoints” – instead of writing all the data at once, the writes are spread over a longer period of time. That gives the OS time to flush the dirty data in the background, making the final fsync much cheaper. This comes at the cost of reducing PostgreSQL's ability to combine and de-duplicate repeated writes to the same page, so it may increase the overall amount of I/O even as it smooths it out. The latency and performance smoothing improvements offered by spread checkpoints are almost always worth this cost.
The writes are throttled based on progress towards the next checkpoint – the database knows how much time / WAL it has left until another checkpoint will be needed, and it can compute how many buffers should already have been written out. The database however must not keep issuing writes until the very end – that would mean the last batch of writes would still be in the file system cache, making the final fsync calls (issued right before starting the next checkpoint) expensive again.
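Roughly speaking, the checkpointer considers itself “on schedule” (and takes short naps between writes) as long as a condition like this holds – a simplified sketch of the scheduling logic, not the exact source code:

(buffers written so far / buffers to write) >= (time elapsed since checkpoint start / (checkpoint_timeout * checkpoint_completion_target))

An analogous comparison is made against the fraction of max_wal_size already consumed, so whichever limit is being approached faster drives the throttling.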
So we want to spread the writes over the checkpoint duration (essentially checkpoint_timeout) but leave the kernel enough time to flush the data to disk in the background. The writeback of dirty data from the page cache (Linux filesystem cache) is generally driven by time, particularly by this kernel parameter:
vm.dirty_expire_centisecs = 3000
which says that dirty data expires after 30 seconds (by default) and is then written out by the kernel.
Note: When it comes to kernel parameters, it’s important to tune vm.dirty_background_bytes. On systems with a lot of memory the default threshold (derived from vm.dirty_background_ratio) is far too high, allowing the kernel to accumulate a lot of dirty data in the filesystem cache. The kernel then often decides to flush it all at once, reducing the benefit of spread checkpoints. See the related KB article "Linux writeback configuration", linked below.
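As a purely illustrative starting point – the appropriate value depends on the storage, particularly on the size of any battery-backed write cache on the controller – the setting might look something like this:

vm.dirty_background_bytes = 268435456

i.e. the kernel would start background writeback once roughly 256MB of dirty data accumulates, instead of waiting for a percentage of RAM.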
The spread checkpoints are configured by this PostgreSQL parameter:
checkpoint_completion_target = 0.5
which says how far towards the next checkpoint all the writes should be complete. For example, assuming checkpoints are triggered only by checkpoint_timeout = 5min, the database will throttle the writes so that the last write is done 2.5 minutes after the start of the checkpoint. The OS then has another 2.5 minutes to flush the data to disk, so that the fsync calls issued at the end of the checkpoint are cheap and fast.
Leaving the system 2.5 minutes is somewhat excessive, considering the expiration timeout is only 30 seconds, and you might increase checkpoint_completion_target e.g. to 0.85, which would leave the system about 45 seconds, a bit more than the 30 seconds it needs. That’s not recommended though, because in case of intensive writes the checkpoint may be triggered by max_wal_size much sooner than after the 5 minutes, leaving the OS less than the 30 seconds it needs.
Systems that handle write-intensive workloads are however likely to run with much higher checkpoint_timeout values, making the default checkpoint_completion_target value definitely too low. For example, with the timeout set to 30 minutes it forces the database to do all the writes in the first 15 minutes (at double the write rate) and then sit mostly idle for the remaining 15 minutes. So instead, you might try setting checkpoint_completion_target roughly using this formula:
(checkpoint_timeout - 2min) / checkpoint_timeout
which for 30 minutes is about 0.93. It is sometimes recommended not to go above 0.9 – that’s probably OK, and you’re unlikely to observe any significant difference between those two values. (This may change when using very high checkpoint_timeout values, which since PostgreSQL 9.6 can be up to 1 day.)
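Putting the numbers from this example together, the relevant part of postgresql.conf might end up looking roughly like this (illustrative values only – your own WAL rate and recovery time objective will dictate different numbers):

checkpoint_timeout = 30min
max_wal_size = 30GB
checkpoint_completion_target = 0.9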
So now you should know what the purpose of checkpoints is, and also the basics of tuning them. To sum this up:
- most checkpoints should be time-based, i.e. triggered by checkpoint_timeout
- it's a compromise between performance (infrequent checkpoints) and time needed for recovery (frequent checkpoints)
- values between 15-30 minutes are the most common, but going up to 1h can be reasonable
- after deciding on the timeout, choose max_wal_size by estimating the amount of WAL
- set checkpoint_completion_target so that the kernel has enough (but not unnecessarily much) time to flush data to disk
- also tune vm.dirty_background_bytes to prevent the kernel from accumulating a lot of dirty data in the page cache
- consider how long you can tolerate recovery running before the system becomes available after a crash when you size your checkpoints