Monitoring streaming replication in PostgreSQL

Amit Wakchaure

PostgreSQL streaming replication is a method used to replicate data from one PostgreSQL database server, known as the primary server or master, to one or more standby servers, also called replicas, in real time. It ensures high availability, fault tolerance, and load balancing for PostgreSQL databases. However, it's essential to keep an eye on replication lag. What exactly is replication lag, and how can you effectively monitor it along with PostgreSQL replication overall?

Key views for replication monitoring

PostgreSQL provides several essential views for replication monitoring:

pg_stat_replication

The pg_stat_replication view contains one row per WAL sender process, showing statistics about replication to that sender's connected standby server. The lag times reported in this view measure how long it takes for recent WAL (Write-Ahead Log) to be written, flushed, and replayed on the standby, and for the sender to learn about it.
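
For example, a query along the following lines on the primary lists each connected standby with its replication state and lag columns (assuming PostgreSQL 10 or later, where the *_lsn and *_lag column names are available):

SELECT application_name, client_addr, state,
       sent_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;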

pg_is_in_recovery()

The pg_is_in_recovery() function can be used to check whether a server is in recovery mode: it returns true on a standby and false on a primary running normally.
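
A quick check, runnable on any server, that distinguishes a standby (true) from the primary (false):

SELECT pg_is_in_recovery();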

pg_stat_wal_receiver

The pg_stat_wal_receiver view displays information about the server's WAL receiver. It contains only one row, showing statistics about the WAL receiver from that receiver's connected server.
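
On the standby, a query such as the following shows the receiver's status, the last WAL positions written and flushed locally, and the sender it is connected to. The written_lsn and flushed_lsn columns assume PostgreSQL 13 or later, and the row is only present while a WAL receiver process is running:

SELECT status, sender_host, sender_port,
       written_lsn, flushed_lsn,
       latest_end_lsn, latest_end_time
FROM pg_stat_wal_receiver;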

Important functions for replication monitoring

pg_last_wal_receive_lsn() → pg_lsn

Returns the last write-ahead log location that has been received and synced to disk by streaming replication. If recovery has been completed, this remains static at the location of the last WAL record received and synced to disk during recovery. If streaming replication is disabled, or if it has not yet started, the function returns NULL.

pg_last_wal_replay_lsn() → pg_lsn

Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress, this increases monotonically. If recovery has been completed, this remains static at the location of the last WAL record applied. When the server has been started normally without recovery, the function returns NULL.

pg_last_xact_replay_timestamp() → timestamp with time zone

Returns the timestamp of the last transaction replayed during recovery. This is when the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, the function returns NULL. Otherwise, if recovery is still in progress, this increases monotonically. If recovery has been completed, this remains static at the time of the last transaction applied during recovery. When the server has been started normally without recovery, the function returns NULL.
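
As a quick health check on a standby, the three functions can be combined into a single query; each returns NULL in the situations described above:

SELECT pg_last_wal_receive_lsn()       AS last_received_lsn,
       pg_last_wal_replay_lsn()        AS last_replayed_lsn,
       pg_last_xact_replay_timestamp() AS last_replay_time;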

Lag in Bytes

To calculate lag in bytes, you can use the pg_stat_replication view on the primary with the function pg_wal_lsn_diff:

SELECT client_addr, 
       pg_wal_lsn_diff(pg_stat_replication.sent_lsn, pg_stat_replication.replay_lsn) AS byte_lag 
FROM pg_stat_replication;
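
A common variant compares each standby's replay position against the primary's current WAL position instead of the sent position, which also accounts for WAL that has been generated but not yet sent (pg_current_wal_lsn() is available in PostgreSQL 10 and later):

SELECT application_name, client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS byte_lag
FROM pg_stat_replication;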

Lag in Seconds

To find the lag in seconds, run the following query on the standby:

SELECT CASE 
           WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0 
           ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) 
       END AS log_delay;

Additional Tools for Monitoring

The Streaming Replication Analysis Dashboard of Postgres Enterprise Manager (PEM) displays statistical information about WAL activity for a monitored server. For more information about the PEM Streaming Replication Analysis Dashboard, visit the PEM documentation.

Causes of Replication Lag

There are various reasons for replication lag, including:

Network Latency

This refers to the delay incurred during the transit of data between the primary and standby databases. Factors such as the physical distance between the databases and network congestion can contribute to network latency.

Slow Disk I/O

Slow disk I/O can introduce latency due to factors like disk fragmentation or insufficient disk space, impacting the timely writing of data to the standby databases.

Long-running Transactions

Long-running or very large transactions can delay replication: their changes only become visible on the standby once the transaction's commit record is replayed, and large transactions can generate bursts of WAL that the standby must catch up on.

Poor Configuration

Setting max_wal_senders too low for the number of standbys while handling a high volume of transactions can lead to latency.
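
As a sketch, the current value can be inspected and raised with SQL; max_wal_senders is a postmaster-level parameter, so the change only takes effect after a server restart (the value 10 below is just an example):

SHOW max_wal_senders;
ALTER SYSTEM SET max_wal_senders = 10;  -- example value; requires a restart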

WAL Segment Recycling

In some cases, the server may recycle old WAL segments before the standby has received them, making it impossible for the standby to obtain the required WAL segment from the primary. This is tied to checkpointing behavior, which rotates and recycles WAL segments. If you use streaming replication without file-based continuous archiving and this happens, the standby will need to be reinitialized from a new base backup.

This situation can be avoided by setting wal_keep_size to a value large enough that WAL segments are not recycled too early, or by configuring a replication slot for the standby.

wal_keep_size (integer)

Specifies the minimum size of past WAL files kept in the pg_wal directory, in case a standby server needs to fetch them for streaming replication. If a standby server connected to the sending server falls behind by more than wal_keep_size megabytes, the sending server might remove a WAL segment that the standby still needs, in which case the replication connection is terminated.
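
wal_keep_size replaced wal_keep_segments in PostgreSQL 13. A minimal sketch of setting it, with 1GB as an arbitrary example value (size it to how far your standbys can realistically fall behind):

ALTER SYSTEM SET wal_keep_size = '1GB';  -- example value
SELECT pg_reload_conf();                 -- wal_keep_size can be changed with a reload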

Using a Replication Slot

Replication slots provide an automated way to ensure that the primary does not remove WAL segments until they have been received by all standbys, and that the primary does not remove rows that could cause a recovery conflict even when a standby is disconnected. However, when using replication slots it is essential to monitor free space on the primary server, since a standby that is down or unreachable for a long time will cause WAL to accumulate on the primary until the slot catches up or is dropped.
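
For illustration, a physical replication slot can be created on the primary and its retained WAL monitored as follows; the slot name standby1_slot is only an example, and the standby would reference it through its primary_slot_name setting:

SELECT pg_create_physical_replication_slot('standby1_slot');

-- How much WAL each slot is currently retaining on the primary
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;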

For more information about PostgreSQL replication, visit the PostgreSQL documentation on replication.

Continuous WAL Archiving

Continuous WAL Archiving helps keep the standby in sync and is used for Point-in-Time Recovery (PITR). Archiving can be configured with the parameters archive_mode and archive_command.

  • archive_mode: Enabling archive_mode starts WAL archiving.
  • archive_command: This parameter sets the command that copies each completed WAL segment to an archive location, which must be accessible from the standby even when the primary is unavailable. The archive should therefore reside on the standby itself or on another trusted server, not on the primary. To ship WAL to a remote server, set up password-less authentication (for example, SSH keys) from the primary to that server.

The restore_command is the local shell command that needs to be configured on the standby to retrieve an archived segment of the WAL file series.
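
A minimal sketch of these settings, assuming an archive directory /archive/wal that the standby can also read; the paths and copy commands are illustrative only and should be replaced with your own archiving mechanism:

-- On the primary (archive_mode requires a restart; archive_command only a reload)
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /archive/wal/%f';

-- On the standby, in postgresql.conf:
-- restore_command = 'cp /archive/wal/%f %p'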

For more information about Write-Ahead Logs, Continuous Archiving, and Point-in-Time Recovery (PITR), visit the PostgreSQL documentation on continuous archiving and PITR.
