Our primary goal is to configure our PostgreSQL architecture in such a way that we are unlikely to run out of memory. Secondary to that, if we do run out of memory, our overcommit setting will determine how we deal with that situation.
This article will discuss both factors.
Authors: Jakub Wartak & Matthew Gwillam.
Technical review: Tomas Vondra
Firstly, we will define Total RAM as `MemTotal` from `man 5 proc`:
MemTotal %lu
Total usable RAM (i.e., physical RAM minus a few reserved bits and the kernel binary code).
Good starting points (adjusted based on performance testing) for tuning PostgreSQL memory usage to try and prevent reaching the `CommitLimit` are as follows (a shell sketch of these formulas follows the list):
- Running a single PostgreSQL cluster per server/VM
- `shared_buffers` = 0.25 * Total RAM
- `max_connections` = GREATEST(4 * CPU cores, 100)
- `work_mem` = (0.25 * Total RAM) / max_connections
- If you want higher values of `shared_buffers`, `max_connections` and `work_mem` than these formulas allow with the current RAM, we recommend increasing Total RAM.
- Use of a connection pooler is strongly recommended beyond 100 connections.
- Huge Pages configured (recommended for servers/VMs with >= 64 GB of memory).
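A minimal shell sketch of those formulas (the values are only starting points, assume a host dedicated to a single PostgreSQL cluster, and should be adjusted after performance testing):

```bash
#!/bin/bash
# Sketch: derive starting values from MemTotal and the CPU count using the formulas above.
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)        # Total RAM in kB
cpu_cores=$(nproc)

shared_buffers_mb=$(( mem_total_kb / 4 / 1024 ))                   # 0.25 * Total RAM
max_connections=$(( 4 * cpu_cores > 100 ? 4 * cpu_cores : 100 ))   # GREATEST(4 * CPU cores, 100)
work_mem_mb=$(( mem_total_kb / 4 / 1024 / max_connections ))       # (0.25 * Total RAM) / max_connections

echo "shared_buffers = ${shared_buffers_mb}MB"
echo "max_connections = ${max_connections}"
echo "work_mem = ${work_mem_mb}MB"
```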
The main points are shown here; these and others are elaborated on in the section "High number of backends (database sessions) vs memory usage".
References:
- https://www.enterprisedb.com/postgres-tutorials/how-tune-postgresql-memory
- https://www.enterprisedb.com/promote/white_paper/configuring-and-tuning-postgresql-and-epas-linux
For Linux servers or VMs that are fully dedicated to running PostgreSQL, EDB recommends disabling overcommit by setting `vm.overcommit_memory=2` and `vm.overcommit_ratio=80`.
First of all, note that this is for the majority of use cases, and should not be implemented blindly in production without analysis first. Please see section "Disclaimer".
Disabling memory overcommitment reduces the chance of the OOM killer killing PostgreSQL processes, which forces PostgreSQL to restart the whole cluster and is one of the most frequent causes of downtime.
When overcommit is disabled, instead of the OOM killer, any OS process (including PostgreSQL ones) may start observing memory allocation errors such as `malloc: Cannot allocate memory`, `ERROR: out of memory`, and `DETAIL: Failed on request of size X in memory context "Y"` if memory allocations hit the Linux kernel's `CommitLimit`.
The main problem with overcommit in general is that the allocation seemingly succeeds, but at some later point the process may unexpectedly crash when it tries to access (write to) the memory.
Virtual memory
In Linux, programs see a virtual address space (virtual memory) that may or may not be mapped to actual physical memory.
Virtual memory is the memory that the process has requested from the operating system, not the physical amount of RAM that it is currently using. It includes shared memory, memory-mapped files and other mappings.
Allocation vs real use
In Linux, there is a distinction between allocating and using memory:
- Memory allocation: Allocating memory refers to reserving a portion of a process's virtual address space for future use, typically through system calls such as `brk()` or `mmap()` (or `malloc()`, which wraps around those system calls). At this stage, no physical memory has been assigned, and the allocated memory remains in an uninitialized state.
- Memory usage: Using memory involves writing data into the allocated virtual address space, causing the kernel to assign physical pages for that process's use. At this stage, the allocated memory transitions from an uninitialized state to a usable one, and the amount of available physical memory decreases accordingly.
The difference between allocating and using memory is significant in the context of overcommitting because it allows Linux systems to support more concurrent processes than what would be possible with a strict one-to-one mapping of virtual to physical memory. With overcommit it's possible that allocation seemingly works, but then accessing the memory later may fail in strange ways (when no memory can be freed).
Committed_AS
`Committed_AS` in `/proc/meminfo` is how Linux tracks how much virtual memory has been allocated system-wide. As per `man 5 proc`:
The amount of memory presently allocated on the system. The committed memory is a sum of all of the memory which has been allocated by processes, even if it has not been "used" by them as of yet. A process which allocates 1 GB of memory (using malloc(3) or similar), but touches only 300 MB of that memory will show up as using only 300 MB of memory even if it has the address space allocated for the entire 1 GB.
This 1 GB is memory which has been "committed" to by the VM and can be used at any time by the allocating application. With strict overcommit enabled on the system (mode 2 in /proc/sys/vm/overcommit_memory), allocations which would exceed the CommitLimit will not be permitted. This is useful if one needs to guarantee that processes will not fail due to lack of memory once that memory has been successfully allocated.
`Committed_AS`: as per the above, the committed address space reflects the current allocation status for all tasks in the system. It includes both anonymous and swap cache memory but excludes file-backed pages. Another (old) definition, known from "meminfo documentation take 2", is the following: "An estimate of how much RAM you would need to make a 99.99% guarantee that there never is OOM (out of memory) for this workload."
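Both values can be checked directly on any Linux system, for example:

```bash
# Current system-wide allocations (Committed_AS) and the limit enforced in mode 2 (CommitLimit)
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
# With the sysstat package installed, "sar -r" also reports kbcommit/%commit over time.
```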
The Linux VM subsystem has three built-in strategies that can be configured depending on the workload. The knob to control those strategies is called `overcommit_memory`.
Overcommitting means not setting any realistic or hard limit on total virtual memory usage; practically speaking, that means `overcommit_memory=0` or `overcommit_memory=1`.
Conversely, disabling overcommit means setting a limit (the `overcommit_memory=2` strategy).
The default Linux setting of `overcommit_memory` allows overcommitting of memory, as per `man 5 proc`:
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed".
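To check which mode and ratio a system is currently using:

```bash
# 0 = heuristic overcommit (default), 1 = always overcommit, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory
sysctl vm.overcommit_memory vm.overcommit_ratio
```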
For Linux servers running PostgreSQL, EDB recommends disabling overcommit, by setting `overcommit_memory=2` and `overcommit_ratio=80`, for the majority of use cases. Please see section "Disclaimer". More details:
The PostgreSQL documentation Linux Memory Overcommit states two methods with respect to overcommit and OOM killer on PostgreSQL servers:
1. System-wide disabling of overcommit using `overcommit_memory=2` (note that we still recommend usage of `OOMScoreAdjust=-1000` when disabling overcommit).
2. Not disabling overcommit system-wide (`overcommit_memory=0` or `overcommit_memory=1`), but ensuring the postmaster cannot be targeted by the OOM killer (systemd service value of `OOMScoreAdjust=-1000`).
Nowadays most PostgreSQL / EPAS installations should already have `OOMScoreAdjust=-900` or `OOMScoreAdjust=-1000` in the systemd startup scripts installed from packages.
As per `/usr/include/linux/oom.h`, the `OOM_SCORE_ADJ_MIN` (-1000) value is the better choice as it completely prevents the process from being targeted by the OOM killer.
For HA installations, please ensure that other critical services, including HA agents (e.g. Patroni), have this set as well, so they are less likely candidates for the OOM killer.
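As a sketch, such a value can be set (or verified) with a systemd drop-in; the unit name `postgresql-16.service` below is only an example, and packaged units often already ship this setting:

```bash
# Add OOMScoreAdjust=-1000 via a drop-in for an example unit name; adjust to your installation.
sudo mkdir -p /etc/systemd/system/postgresql-16.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/postgresql-16.service.d/oom.conf
[Service]
OOMScoreAdjust=-1000
EOF
sudo systemctl daemon-reload        # the service must be restarted for the change to apply
systemctl show -p OOMScoreAdjust postgresql-16.service   # verify the configured value
```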
As with most decisions, there are positives and negatives involved:
- Disabling system-wide overcommit can result in failures to request memory.
- Not disabling overcommit increases the chance of child processes being killed ungracefully by the OOM killer. This results in a PostgreSQL outage where the PostgreSQL instance restarts and performs crash recovery in response to the ungraceful termination:
[postgres@postgres15 log]$ tail -f postgresql-Wed.log
2024-05-22 13:39:08.521 UTC [110] LOG: terminating any other active server processes
2024-05-22 13:39:08.523 UTC [110] LOG: all server processes terminated; reinitializing
2024-05-22 13:39:08.535 UTC [18581] LOG: database system was interrupted; last known up at 2024-05-22 13:34:19 UTC
2024-05-22 13:39:08.572 UTC [18581] LOG: database system was not properly shut down; automatic recovery in progress
2024-05-22 13:39:08.573 UTC [18581] LOG: redo starts at 0/221DE50
2024-05-22 13:39:08.573 UTC [18581] LOG: invalid record length at 0/221DFA0: wanted 24, got 0
2024-05-22 13:39:08.573 UTC [18581] LOG: redo done at 0/221DF68 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-05-22 13:39:08.575 UTC [18582] LOG: checkpoint starting: end-of-recovery immediate wait
2024-05-22 13:39:08.580 UTC [18582] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.006 s; sync files=3, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB
In our opinion, method 1) `overcommit_memory=2` is more suitable for PostgreSQL servers in the vast majority of cases, as reducing the occurrences of out-of-memory conditions and PostgreSQL crash recovery outages is a top priority, even if it means denying memory requests:
- PostgreSQL processes (and related third-party daemons such as monitoring agents or backup utilities) are not hit by the OOM killer (which is delivered as a signal), which may crash them or abort useful work. If a single PostgreSQL process/connection is killed like that, then the whole database self-restarts to avoid any potential inconsistency in shared memory areas. This is by design.
- During normal conditions PostgreSQL may receive an `out of memory` error (ENOMEM) during (over)allocation, which PostgreSQL handles by writing PostgreSQL memory context dumps into the log file; these can be used by the user or support to pinpoint the source of a recurring problem.
- Often fork() (the syscall/function for spawning new processes, which is how new connections are started) may also receive an `out of memory` error, preventing the spawning of new backends (connections) in fully allocated memory situations (Committed_AS approaching CommitLimit). This may cause downtime for new connections, but it also protects the stability of existing connections and limits the damage done to them.
- When overcommit is enabled and an OOM condition occurs, the kernel must select victim processes based on its internal algorithms, which may not take NUMA architecture into account (usually NUMA is enabled on servers with >= 32 vCPUs). As a result, this can lead to suboptimal or even counterproductive decisions that negatively impact system performance: the OOM killer may choose a process residing in memory local to one NUMA node while other NUMA nodes have available memory.
- The process that got terminated by the OOM killer is not necessarily the process responsible for the memory pressure; it is simply the process that requested some memory after free memory fell below the minimum watermark.
The negative aspect of disabling overcommit is that when memory load is too high for the server, this setting may cause an application outage, as it leaves no room for PostgreSQL to request additional memory, resulting in errors such as `ERROR: out of memory` and `DETAIL: Failed on request of size 14 in memory context "ExprContext"`. Accordingly, the process in which an allocation failed may not be the process consuming excessive amounts of memory.
Disabling overcommit should be tested in a development environment with similar load before changing it in production. `Committed_AS` vs `CommitLimit` should be graphed during high-load production hours beforehand.
The current system may be consuming too many resources (e.g. via `max_connections`), and disabling overcommit may cause an application outage in production if implemented without checking the prerequisites.
Disabling overcommit works well for the majority of our customers, but a small number of customers prefer allowing overcommit, accepting PostgreSQL crash recovery in exchange for the additional memory allowance.
The information provided is mostly a guideline, and certain behaviour of the Linux kernel may have changed over time; it should be accurate for the standard RHEL kernels as used in RHEL 7.9 and 8.x.
Now that you have decided to disable overcommit server-wide, we recommend testing the following in a load-tested QA / UAT environment:
overcommit_memory=2
overcommit_ratio=80
When overcommit is disabled, the kernel uses the following formula to set the limit:
CommitLimit = (total_RAM - total_huge_pages) * overcommit_ratio / 100 + total_swap
A user tunes the system limit with respect to this formula using either `overcommit_ratio` (a percentage of the system's total RAM, default 50) or `overcommit_kbytes` (a raw number of how many virtual kilobytes can be allocated).
For overcommit-disabled servers, we recommend the use of `overcommit_ratio=80`, a value known to be performant in PostgreSQL circles, allowing more virtual memory usage than the default.
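A small sketch that recomputes the expected limit from `/proc/meminfo` and the current ratio, so it can be compared with the `CommitLimit` the kernel reports (assumes `overcommit_kbytes` is not in use):

```bash
#!/bin/bash
# Recompute CommitLimit from the formula above and compare it with the kernel's value (all in kB).
read_meminfo() { awk -v k="$1" '$1 == k":" {print $2}' /proc/meminfo; }

mem_total=$(read_meminfo MemTotal)
swap_total=$(read_meminfo SwapTotal)
huge_total_kb=$(( $(read_meminfo HugePages_Total) * $(read_meminfo Hugepagesize) ))
ratio=$(cat /proc/sys/vm/overcommit_ratio)

expected=$(( (mem_total - huge_total_kb) * ratio / 100 + swap_total ))
echo "expected CommitLimit: ${expected} kB"
echo "kernel   CommitLimit: $(read_meminfo CommitLimit) kB"
```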
When the previously discussed `Committed_AS` value hits `CommitLimit` on an overcommit-disabled server, requests for memory will fail. When `Committed_AS` gets close to `CommitLimit` (in mode 2), it might mean that your system has a limited capacity to allocate additional memory resources, even temporarily. This can easily lead to issues such as:
- Performance deterioration: the OS might swap (if there is swap), memory might become fragmented, and both of those can result in limited performance.
- Increased risk of OOM conditions and lack of memory: the OS might start hitting OOMs and/or `out of memory` errors.
We recommend graphing `CommitLimit` vs `Committed_AS` if possible, to spot the trend before problems happen.
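If no monitoring stack is available, even a simple sampling loop is enough to see the trend (a sketch; the interval and log path are arbitrary):

```bash
# Append a timestamped sample of Committed_AS vs CommitLimit every 60 seconds.
while true; do
    echo "$(date -Is) $(grep -E 'CommitLimit|Committed_AS' /proc/meminfo | tr -s ' ' | paste -sd' ')" \
        >> /var/log/commit_usage.log
    sleep 60
done
```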
These settings are tweaked via `/proc/sys/vm/` or by using the `sysctl` tool. The simplest way to change them at runtime is:
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
To persist such configuration after reboots, for most Linux OSs, we put values
vm.overcommit_memory=2
vm.overcommit_ratio=80
into `/etc/sysctl.conf` or into `/etc/sysctl.d/01-overcommit.conf` (please check any other files in that directory first; there might be previous overcommit configuration left over from earlier attempts). One can run `sysctl -p` (or `sysctl --system` to load the files under `/etc/sysctl.d/`) to make them effective immediately without a reboot.
Each PostgreSQL connection runs in a dedicated OS process that has certain memory requirements.
- 1] Andres Freund's blog shows that a fresh idle connection can use up to ~2MB without huge pages, and ~8MB with huge pages.
- 2] `SELECT` and manual `VACUUM` can spawn additional PostgreSQL backend workers with their own memory usage.
- 3] The Postgres query executor uses memory based on configured parameters, specifically to avoid hitting the disk when sorting (and thus for GROUP BY too) and hashing, and can spawn additional workers.
- 4] PostgreSQL allocates different amounts of memory to each PostgreSQL process/backend depending on the number of relations (tables, partitions, indexes, sequences) accessed by the session; this is shown in `TopMemoryContext`.
For 4], in addition to the number of relations accessed, this value varies between Postgres versions and builds, and with the maximum longevity of connections (frequently connection pools are refreshed). E.g. on PostgreSQL 16.3 x86_64, the biggest memory contexts (`CacheMemoryContext` and `index info`) depend on the number of relations accessed by the session (here it was 1 table with N table partitions, each with a local index):
| tables (total relations) | mem used (CacheMemoryContext + index info) |
|---|---|
| ~100/200 | 1024 + 257 =~ 1281 kB |
| ~500/1000 | 4096 + 657 =~ 4753 kB |
| ~1000/2000 | 8192 + 1157 =~ 9349 kB |
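On PostgreSQL 14 and newer, the memory contexts of the current session can be inspected with the `pg_backend_memory_contexts` view, which is how numbers like the ones in the table above can be reproduced for your own workload:

```bash
# Top memory contexts of the current session (PostgreSQL 14+); run from the session under test.
psql -c "SELECT name, parent, total_bytes, used_bytes
         FROM pg_backend_memory_contexts
         ORDER BY total_bytes DESC
         LIMIT 10;"
```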
Due to the above, it is very important to set proper limits using PostgreSQL parameters to avoid consuming too much memory.
One should derive a PostgreSQL configuration that sensibly limits `max_connections`, `work_mem`, `hash_mem_multiplier` and `max_parallel_workers`, so that no more memory is used than the system can handle. If you want higher values for these parameters, we recommend increasing Total RAM.
For the majority of our customers, `max_connections` and `work_mem` have been the two most important parameters in relation to this.
max_connections = GREATEST(4 * CPU cores, 100)
This formula can be used as a starting point, but it should be performance tested, as we are often RAM-bound with Postgres, not CPU-bound.
Use of a connection pooler is strongly recommended beyond 100 connections
Our blog Challenges setting max_connections and why you should use connection pooler demonstrates the performance detriments of too many concurrent connections, and how funnelling connections through a pooler is more performant for more demanding workloads.
This is a very workload-dependent parameter: it needs to be low enough not to cause harm through memory stress, but high enough to avoid hitting storage for operations such as sorting (which would be slow).
`work_mem` is the maximum amount of memory that each backend can use for an operation (e.g. sorting or hashing) before having to use temporary disk files. If a query has multiple operations, it can use multiples of `work_mem`.
We recommend as a starting point (adjusted with performance testing) the following formula:
work_mem = Total RAM * 0.25 / max_connections
Additionally, we recommend full logging of all tuplestore operations (enabled via `log_temp_files = 0`) initially, and trying to tune the value so that it avoids causing memory stress but still gives reasonable performance. Hint: it may be worth setting a higher `work_mem` just for some individual queries (by sending `SET work_mem=...` first, issuing the query, and then resetting the effective value using `RESET work_mem`). A different `work_mem` can also be applied to dedicated application accounts/roles.
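For example, a session can raise `work_mem` just for one known-heavy query and then fall back to the server-wide value (the query and the 256MB value are purely illustrative):

```bash
# Raise work_mem for a single heavy query only, then reset it for the rest of the session.
psql <<'SQL'
SET work_mem = '256MB';        -- per-operation limit for this session only (illustrative value)
SELECT region, sum(amount)     -- illustrative reporting query
FROM sales
GROUP BY region
ORDER BY region;
RESET work_mem;                -- back to the server-wide setting
SQL
```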
Additionally, if parallelism is enabled and a query would benefit from Parallel Query, `work_mem` for an operation may be applied individually to each parallel worker, which means the total memory allocation of that operation may be much higher than `work_mem`.
`work_mem` should be evaluated in conjunction with the number of sorts and hash joins in a query, and also with many other parameters, particularly `max_connections`, `max_worker_processes`, `max_parallel_workers`, `max_parallel_workers_per_gather` and `hash_mem_multiplier`.
`SELECT` and manual `VACUUM` can spawn additional PostgreSQL backend workers (up to `max_parallel_workers` total, with up to `max_parallel_workers_per_gather` per parallelized SQL plan node) that may use even more memory (again, each parallel background process uses a constant memory allocation, up to full `work_mem`, and so on).
Performant starting values are:
max_worker_processes = n CPU
max_parallel_workers = n CPU
max_parallel_workers_per_gather = 0.5 * n CPU
However, as discussed, these may need to be reduced depending on `work_mem`, `max_connections` and performance testing.
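A hedged sketch of applying those starting values, derived from `nproc`, via `ALTER SYSTEM` (test in QA first; note that `max_worker_processes` only takes effect after a server restart):

```bash
#!/bin/bash
# Derive the parallel-worker starting values from the CPU count and apply them.
cpu=$(nproc)
psql -c "ALTER SYSTEM SET max_worker_processes = ${cpu};"                   # requires a restart
psql -c "ALTER SYSTEM SET max_parallel_workers = ${cpu};"
psql -c "ALTER SYSTEM SET max_parallel_workers_per_gather = $(( cpu / 2 ));"
psql -c "SELECT pg_reload_conf();"                                          # reload for the reloadable settings
```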
Used to compute the maximum amount (not reserved immediately) of memory that hash-based operations can use. The final limit is determined by multiplying work_mem by hash_mem_multiplier. The default value is 2.0, which makes hash-based operations use twice the usual work_mem base amount.
Consider increasing `hash_mem_multiplier` in environments where spilling by query operations is a regular occurrence, especially when simply increasing `work_mem` results in memory pressure (memory pressure typically takes the form of intermittent out of memory errors). The default setting of 2.0 is often effective with mixed workloads. Higher settings in the range of 2.0 - 8.0 or more may be effective in environments where `work_mem` has already been increased to 40MB or more.
Example: assuming one has 64GB total RAM and
- is using 16GB for `shared_buffers` without Huge Pages
- every long-lived backend accesses on average 2000 relations (like above; let's assume mostly due to lots of partitioning in play)
- `work_mem` is 16MB, and let's say on average there are 3 sort/hash-like operations per active process
- `max_connections` is 1000
- Linux is in overcommit mode 2 (with an 80% overcommit ratio) and has no swap configured
As defined previously, CommitLimit = (total_RAM - total_huge_pages) * overcommit_ratio / 100 + total_swap.
We can estimate that CommitLimit = (64GB - 0GB) * 80 / 100 + 0 = ~51GB. Assuming we allocated 16GB just for shared memory and some other OS agents could possibly take ~2GB, that leaves us with 51 - 16 - 2 = 33GB for the rest (file system cache, Postgres WAL buffers, clog buffers, memory for locks). Given the previous private per-backend memory estimations we can say that:
- An idle backend can consume up to: 8MB + ~9.12MB = ~17MB
- An active (working) backend can consume on average: 8MB + ~9.12MB + 3*16MB = ~65MB
33GB / 17MB > max_connections (1000); however, if there were more than 33GB / 65MB (that is, 519) active working processes, one could still get into a situation where Linux starts returning ENOMEM errors. Those calculations could be enhanced to take into account leaving some more free memory for file system caches, to avoid lots of I/O.
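The arithmetic of this example expressed as a small shell sketch (all numbers are the assumptions listed above):

```bash
#!/bin/bash
# Worked example: 64GB RAM, no huge pages, no swap, overcommit_ratio=80,
# 16GB shared_buffers, ~2GB for other agents, work_mem=16MB, 3 sort/hash operations per active backend.
commit_limit_gb=$(( (64 - 0) * 80 / 100 + 0 ))           # ~51 GB
remaining_gb=$(( commit_limit_gb - 16 - 2 ))             # ~33 GB left for backends and caches
idle_backend_mb=17                                       # 8MB + ~9.12MB
active_backend_mb=$(( 8 + 9 + 3 * 16 ))                  # ~65MB
echo "idle backends that fit:   $(( remaining_gb * 1024 / idle_backend_mb ))"   # well above max_connections=1000
echo "active backends that fit: $(( remaining_gb * 1024 / active_backend_mb ))" # ~519
```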
Leaving some additional free memory is especially important as often the following chain-reaction / spiral of death takes place (with unbounded max_connections):
- the system gets slower due to a higher number of business requests (TPS)
- the application(s) open more connections (this is allowed, because max_connections is too high)
- more work is performed concurrently (DB and OS memory pressure increases, including lots of similar activity that cannot finish quickly)
- it gets even slower (all filesystem cache memory is allocated towards DB processes)
- users abort and start new requests (or requests time out and are started again, without stopping the previous, still-running work), causing the application(s) to create even more connections
- OOM/out of memory happens
Some maintenance operations in PostgreSQL can use more memory than discussed earlier (usually `VACUUM`, `CREATE INDEX` or other maintenance operations). The `maintenance_work_mem` parameter can be used to control this per single process. A good rule of thumb for a starting point is: 0.15 * (total_RAM - shared_buffers) / autovacuum_max_workers, and usually values up to 1GB are a good start. Please note that there can be several PostgreSQL workers at a time using this amount of memory (see PostgreSQL parameters such as `autovacuum_max_workers` and `max_parallel_maintenance_workers`).
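A sketch of that rule of thumb with shell arithmetic (the RAM and `shared_buffers` figures are example values; `autovacuum_max_workers` defaults to 3):

```bash
#!/bin/bash
# Rule of thumb: 0.15 * (total_RAM - shared_buffers) / autovacuum_max_workers, capped at 1GB.
total_ram_mb=65536            # example: 64GB server
shared_buffers_mb=16384       # example: 16GB shared_buffers
autovacuum_max_workers=3      # PostgreSQL default

mwm_mb=$(( (total_ram_mb - shared_buffers_mb) * 15 / 100 / autovacuum_max_workers ))
[ "$mwm_mb" -gt 1024 ] && mwm_mb=1024
echo "maintenance_work_mem = ${mwm_mb}MB"
```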
We see customers with swap and without swap enabled, depending on their preference. Our opinion is that in a properly tuned PostgreSQL installation, one should not see a lot of constant and on-going swapping.
The direct effect of having swap enabled is of course an increase of the `CommitLimit` (with the `overcommit_memory=2` strategy), which may reduce the chance of ENOMEM (cannot allocate memory) under temporary memory pressure. Our opinion is that though adding swap is a beneficial buffer, if having to choose between increasing Total RAM or swap, we would choose increasing RAM.
For a deeper discussion about swap, please see the excellent article by Chris Down (Linux kernel developer): In defence of swap.
Memory leaks do happen even on stable releases of PostgreSQL. If you are running with the reasonable PostgreSQL and OS parameters discussed above and are still getting `out of memory` errors or OOMs, there is a chance that you might be dealing with memory leaks. In such a case, please consider opening a support case. The best course of action is to run with a recent minor point release of the software and demonstrate the memory leak happening for an individual PID over time. The command `pidstat -r -p <pid> 300` (over several hours) is usually a very good way to demonstrate the memory leak to the support team.
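Where `pidstat` (from the sysstat package) is not available, a hedged alternative is to sample `VmRSS` from `/proc` for the suspect backend PID (the script name and interval are arbitrary):

```bash
#!/bin/bash
# Sample the resident set size of one PostgreSQL backend every 5 minutes until it exits.
# Usage: ./track_rss.sh <pid>
pid=$1
while kill -0 "$pid" 2>/dev/null; do
    echo "$(date -Is) $(awk '/^VmRSS:/ {print $2, $3}' /proc/$pid/status)"
    sleep 300
done
```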
- It is important to keep PostgreSQL / EPAS and related extensions and libraries up to date, as this often helps to fix memory leaks. Even before starting to investigate memory leaks, ensure that you are running the most recent minor versions of PostgreSQL / EPAS / extensions / libraries.
- The `pg_dump` tool has several known non-optimal memory allocation issues when dealing with very high counts of Large Objects and/or DDL objects (the same applies to `pg_upgrade`, which uses it underneath).
- `pg_catcheck`: versions older than 1.4.0
- Various memory leaks were present for BDR <= 4.3.3, BDR < 3.7.23
- Running with `vm.overcommit_ratio=100` or `vm.overcommit_kbytes=$TOTAL_RAM` might still trigger the OOM killer (see Red Hat Solutions).