Tuning TCP settings for streaming replication

Arthur Nascimento

This article provides advice on tuning TCP settings, at both the PostgreSQL and kernel levels, in order to terminate hanging connections faster than the default behavior allows.

A TCP connection may hang in one of two states: either it is idle, meaning there is no pending application data queued for delivery, or there is a send queue and the OS has entered a retry loop attempting to retransmit data.

TCP keepalive is a mechanism that lets one end of a TCP connection determine whether the other end has stopped responding while the connection is locally idle.

After a period of idle time, a host sends keepalive probe packets to the other host. If it receives a reply, it can assume that the connection is still up and running; otherwise, after a configured number of unanswered probes, the socket is closed. Keepalives may also prevent stateful network equipment in between (firewalls, intrusion detection systems) from dropping the connection due to inactivity.

Please note that an idle connection switches to retry mode as soon as the local host tries to send data, regardless of how many keepalive packets have been sent. If an application implements its own keepalive mechanism by regularly sending some data, as PostgreSQL's walsender does, this effectively overrides the TCP keepalive mechanism.

The "Connection timed out" error may indicate which mode a connection was in prior to timing out: if a read syscall receives the error, the connection was likely idle; if a write syscall reports the timeout, it was more likely in retry mode.

TCP Retries

Whenever there is data to be sent that for some reason did not reach the destination, TCP retries with an exponential backoff, so as not to further overload a possibly congested network. The following two parameters control TCP retransmission at the (Linux) kernel level.

  • net.ipv4.tcp_retries1 - number of retries until the lower network stack is told to check for different routes to the destination. Has no effect on the user-space application.

  • net.ipv4.tcp_retries2 - total number of retries before giving up and notifying the user-space application of a loss of connectivity (i.e. closing the TCP socket).

For applications like PostgreSQL, only the latter setting is of interest. The default is 15 retries: only after the 15th retransmission attempt expires without acknowledgement does the kernel close the TCP connection. The delay between retries is called the retransmission timeout (RTO); it starts at 200 ms, increases exponentially, and is capped at 120 s between retries. The default of 15 for tcp_retries2 yields a timeout of about 15.4 minutes; the table below lists expected timeouts for other values.

| tcp_retries2 | total timeout (s) | total timeout (min) |
|---|---|---|
| 7 | 51.0 | 0.9 |
| 8 | 102.2 | 1.7 |
| 9 | 204.6 | 3.4 |
| 10 | 324.6 | 5.4 |
| 11 | 444.6 | 7.4 |
| 12 | 564.6 | 9.4 |
| 13 | 684.6 | 11.4 |
| 14 | 804.6 | 13.4 |
| 15 | 924.6 | 15.4 |

Note that RFC 1122, "Requirements for Internet Hosts", recommends (SHOULD) a minimum of 100 s, so a tcp_retries2 setting of 8 appears to be the lowest compliant value.
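Assuming an initial RTO of 200 ms that doubles on each retry and is capped at 120 s, as described above, the timeouts in the table can be reproduced with a short script:

```shell
#!/bin/sh
# Sketch: reproduce the tcp_retries2 timeout table, assuming an
# initial RTO of 200 ms, doubling per retry, capped at 120 s.
for n in 7 8 9 10 11 12 13 14 15; do
  awk -v n="$n" 'BEGIN {
    rto = 0.2; total = 0
    for (i = 0; i <= n; i++) {
      total += (rto < 120 ? rto : 120)  # each wait is capped at 120 s
      rto *= 2                          # exponential backoff
    }
    printf "tcp_retries2=%2d -> %6.1f s (%4.1f min)\n", n, total, total / 60
  }'
done
```

The script is only a model of the kernel's behavior; actual timings vary with the measured round-trip time, but the orders of magnitude match the table.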

There currently is no way of adjusting this for Postgres TCP connections only, so changing this system-wide clearly affects other applications and services.

TCP Keepalive

Keepalive behavior can be configured for PostgreSQL connections only, or globally for the entire host.

PostgreSQL level settings

PostgreSQL has 3 parameters in the postgresql.conf file to handle TCP keepalive settings:

  • tcp_keepalives_idle: number of seconds of inactivity after which TCP should send a keepalive message to the client.
  • tcp_keepalives_interval: number of seconds after which a TCP keepalive message that is not acknowledged by the client should be retransmitted.
  • tcp_keepalives_count: number of lost TCP keepalive messages before the server's connection to the client is considered dead.

Together, these three settings control the time it takes for PostgreSQL to kill an idle connection that has seen no network packets. They can also be tuned to proactively detect dead clients: an aggressive setting ensures that, if a client is no longer able to reach the database, any hung connections are quickly closed.

The amount of time a dead connection can remain idle is given by the formula:

tcp_keepalives_idle + (tcp_keepalives_count * tcp_keepalives_interval)

The value of these settings can be verified in the postgresql.conf file or by running the following SQL command:

SELECT name, setting FROM pg_settings WHERE name ~ 'keepalive';

The parameters are set to 0 by default, meaning that the operating system ones will be used.

The default TCP keepalive settings (in seconds) in the Linux kernel are shown below:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

The tcp_* settings under net.ipv4 also apply to IPv6, so there are no additional parameters to change if you use IPv6 connections.

This means that the TCP keepalive process waits for two hours (7200 s) of socket inactivity before sending the first keepalive probe, and then resends a probe every 75 seconds if no ACK (acknowledgement) is received. If no ACK arrives for nine consecutive probes, the connection is marked as broken.
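Plugging these defaults into the formula above gives the worst-case time before a dead idle connection is closed:

```shell
# Sketch: worst-case detection time for a dead idle connection,
# using the Linux kernel defaults (all values in seconds).
idle=7200 interval=75 probes=9
total=$(( idle + probes * interval ))
echo "dead idle connection detected after ${total}s (~$(( total / 60 )) min)"
# prints: dead idle connection detected after 7875s (~131 min)
```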

These values are too high if you want to properly configure streaming replication between a master and a standby with an RPO close to 0, or synchronous replication, especially when the network is unstable.

The exact values depend on clients and network, but the following settings in the postgresql.conf file could be a good starting point:

tcp_keepalives_idle = 600
tcp_keepalives_interval = 60
tcp_keepalives_count = 6

To change these parameters, it is enough to reload the PostgreSQL service, so no downtime is incurred. Setting them in postgresql.conf ensures they apply only to PostgreSQL connections.

Interaction with statement_timeout

When a query is cancelled due to statement_timeout or other reasons (such as pg_cancel_backend()), the database server must send a message informing the client of the event before it can proceed with releasing the backend resources on the server side.

The PostgreSQL protocol sends data prefixed with the size, and only accepts the next control command after that amount of data has been transmitted.

But if, at the moment a cancellation happens, a large piece of data is in transit or the network is unresponsive, then the cancellation message cannot be sent yet, as it has to wait for the pending data to finish being sent.

That means the statement_timeout setting may not interrupt a query immediately in certain situations, especially when the network is unresponsive.

So in cases where it is important that statement_timeout releases the backend's resources quickly, it is best to adjust it together with the tcp_keepalives_* settings, so that a network disruption is detected just as quickly as one would like statement_timeout to act.

In general, it would mean trying to maintain the following relationship among them:

tcp_keepalives_idle + (tcp_keepalives_count * tcp_keepalives_interval) < statement_timeout

Note that statement_timeout affects the queries and transactions of the application, so its value should be discussed with the application team and tested carefully with the application workload itself.
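As a sketch, with purely hypothetical values (statement_timeout converted to seconds), the relationship can be checked like this:

```shell
# Sketch: check the suggested relationship between the keepalive
# settings and statement_timeout. All values are in seconds and
# purely illustrative; substitute your own configuration.
idle=600 interval=60 count=6
statement_timeout=1200

keepalive_total=$(( idle + count * interval ))
if [ "$keepalive_total" -lt "$statement_timeout" ]; then
  echo "ok: network loss detected (${keepalive_total}s) before statement_timeout (${statement_timeout}s)"
else
  echo "warning: a hung network could delay statement_timeout cancellations"
fi
```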

Setting the parameters in the driver

It is possible to set the TCP keepalive parameters on the client side for each connection. However, the steps to do so may differ for each driver and application.

libpq (both PostgreSQL and EPAS)

When using libpq, the builtin PostgreSQL client library, the options available have similar names to the server settings, but without the tcp_ prefix:

  • keepalives (default 1, enabled): Controls whether client-side TCP keepalives are used.
  • keepalives_idle (default 0, uses system value): Controls the amount of seconds of inactivity after which TCP should send a keepalive message to the server.
  • keepalives_interval (default 0, uses system value): Controls the amount of seconds after which a TCP keepalive message that is not acknowledged by the server should be retransmitted.
  • keepalives_count (default 0, uses system value): Controls the amount of TCP keepalive messages that can be lost before the client's connection to the server is considered dead.

An example of these settings being used in psql, which uses libpq, is shown further below, in the Monitoring section.

More details can be found in the PostgreSQL documentation, section 34.1.1, Connection Strings.

JDBC driver (both community and EDB)

The JDBC driver has only one option, which can be used to enable TCP keepalives but not to control the timeouts or counts:

  • tcpKeepAlive (boolean; default false, disabled): Enable or disable TCP keep-alive probe.

More details on the JDBC driver can be found in the documentation JDBC Connection Parameters.

.Net driver (both community and EDB)

  • Tcp Keepalive (default false, disabled): Whether to use TCP keepalive with system defaults if overrides aren't specified.
  • Keepalive (default 0, disabled): The number of seconds of connection inactivity before Npgsql sends a keepalive query.
  • Tcp Keepalive Time (default 0, disabled): The number of milliseconds of connection inactivity before a TCP keepalive query is sent. Use of this option is discouraged; use Keepalive instead if possible. Supported only on Windows.
  • Tcp Keepalive Interval (default: the value of Tcp Keepalive Time): The interval, in milliseconds, between successive keepalive packets when no acknowledgement is received. Tcp Keepalive Time must be non-zero as well. Supported only on Windows.

More details on the .Net driver can be found in the documentation Npgsql - Timeouts and keepalive.

EDB OCL and ODBC (both community and EDB)

Both OCL and ODBC are based on libpq, so the libpq parameters have to be placed in the connection string.

More details on how to edit the connection string to the OCL driver in OCL - Forming a connection string

More details on how to edit the connection string to the ODBC driver in ODBC Configuration Options.

Kernel level settings

To change them at a global level, the kernel TCP keepalive parameters must be configured. To do this, we recommend first testing the TCP keepalive parameters by setting them via the command line with the following commands:

sudo sysctl net.ipv4.tcp_keepalive_time=600
sudo sysctl net.ipv4.tcp_keepalive_intvl=60
sudo sysctl net.ipv4.tcp_keepalive_probes=6
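The current values can also be read back without root privileges from /proc (assuming a Linux host):

```shell
# Read the effective kernel keepalive values directly from /proc;
# this requires no root privileges and works inside most containers.
for p in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
  printf 'net.ipv4.%s = %s\n' "$p" "$(cat /proc/sys/net/ipv4/$p)"
done
```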

Once you have found a configuration that works for your application, these settings can be rendered persistent across reboots by adding the following lines (using values of your choice) to /etc/sysctl.conf:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 6

Monitoring TCP retries and keepalive

The ss tool (man 8 ss) has the option -o or --options, which causes TCP sockets to be shown with the state of their timers in the format:

timer:(<timer_name>,<expire_time>,<retrans>)

The first field is the timer name (on, keepalive, timewait, persist or unknown). The second field is the time until the timer expires. The third is the number of retransmissions that have occurred. More details in man 8 ss.

Example output

A local TCP connection is established to PostgreSQL with default port, user name and database name:

psql 'host=localhost keepalives_idle=60 keepalives_interval=15 keepalives_count=3'

Afterwards, it's possible to observe it with ss on the database port (5432 or postgresql by default):

% watch -t ss -o dport postgresql
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 0 0 [::1]:58044 [::1]:postgresql timer:(keepalive,12sec,0)

On a healthy connection, the retransmission count remains at 0, and the timer repeatedly counts down from 60 s (the idle time) to 0, then once from 15 s (the keepalive interval) while waiting for the probe's ACK, and then starts again from the idle time.

On an unhealthy connection, the timer counts down from 60 s once, then from 15 s three times, increasing the retransmission count each time, until the connection is terminated.
