Lost async page write (Linux) and fsync() errors

Craig Ringer

On some storage configurations Linux can drop writes, failing to perform a write requested by the application, and then continue to use the storage. If this occurs you will see "lost page write" or "lost async page write" errors in the kernel logs, and PostgreSQL may report:

ERROR: could not fsync file "...": Input/output error

This can result in severe PostgreSQL data corruption. Do not ignore these errors. They are very unlikely on most configurations, but important.

Prevalence

Most storage configurations on Linux are affected to some degree. NFS, multipath, and SANs backed by thin-provisioned storage are at high risk. For most other storage configurations problems won't occur unless there is a physical failure in the underlying storage system, such as bad blocks on a disk.

Cause

The cause is a disagreement about the error-handling semantics of fsync(), which seems to be unspecified in POSIX and undocumented in the Linux kernel manpages. Some UNIX-like operating systems also have Linux-like behaviour.

  • When writeback of a dirty page fails, Linux marks the page clean and clears the error flag once the error has been reported, so the page will not be written again and a subsequent fsync() on the same file descriptor will succeed; but
  • PostgreSQL assumes that if it retries fsync() after it reports an error, the kernel will retry the writes that failed and will only report success once all writes since the last successful fsync() have actually hit disk.

PostgreSQL thinks that fsync()'s contract is "everything written since your last successful fsync is on disk"; Linux thinks that fsync()'s contract is "everything written since your last successful or unsuccessful fsync is on disk". Linux expects you to retry any writes after an unsuccessful fsync, but it doesn't provide any way to find out which writes they were.
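
To make this concrete, the following is a minimal C sketch, not PostgreSQL's actual code, of the "write, then fsync(), then retry fsync() on error" pattern that this assumption leads to; the file name is a placeholder:

/* Illustrative only: the unsafe-on-Linux "retry fsync()" assumption. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[8192] = {0};
    if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf) {
        perror("write");              /* a buffered write rarely fails here... */
        return 1;                     /* ...it usually fails later, at writeback time */
    }

    /* The assumption: keep retrying fsync() until it succeeds, and take
       success to mean that everything written above is durably on disk. */
    while (fsync(fd) != 0) {
        fprintf(stderr, "fsync failed: %s, retrying\n", strerror(errno));
        /* On Linux the pages whose writeback failed were already marked clean,
           so this loop never rewrites them; a later fsync() can return 0 even
           though the data never reached disk. */
        sleep(1);
    }

    close(fd);
    return 0;
}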

As a result, PostgreSQL can treat a checkpoint as having succeeded and write an updated control file, when in fact the checkpoint was not safely completed. It will never retry those writes, causing arbitrary corruption. On some file systems, re-reading the lost blocks after cache eviction will return the old data; on others, it may return the new-but-never-really-written data instead.

Additionally, it has been found that PostgreSQL may not see fsync() errors at all. On kernels prior to 4.13 the first process to see the fsync() error clears it. On kernel 4.13 and newer, each process that has the file open is guaranteed to see the error if it fsync()s the file. Unfortunately, PostgreSQL write()s to a file and then close()s it; if a buffered write fails before the checkpointer re-open()s the file to fsync() it, PostgreSQL never gets the news. We also fail to notice if the write happens to fail while the file is open in a user backend or the bgwriter but not in the checkpointer, because we don't fsync() the file before closing it except at checkpoint time.
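
The write-then-close / reopen-then-fsync() pattern described above looks roughly like this; again a simplified sketch rather than PostgreSQL source, with a placeholder file name:

/* Illustrative only: how a writeback error can be missed entirely. */
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

static const char path[] = "somefile";   /* stand-in for a relation file such as base/<db>/<relfilenode> */

/* A user backend or the bgwriter: write dirty data, then close the file. */
static void backend_write(const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd >= 0) {
        (void) write(fd, data, len);  /* buffered: only queued for later writeback */
        close(fd);                    /* no fsync() here outside a checkpoint */
    }
    /* If writeback of those pages fails after close(), the error is attached to
       a file that no PostgreSQL process has open, and nobody is told about it. */
}

/* The checkpointer, later: reopen the file and fsync() it. */
static void checkpointer_sync(void)
{
    int fd = open(path, O_WRONLY);
    if (fd >= 0) {
        if (fsync(fd) == 0)
            puts("checkpoint believes the data is safely on disk");
        /* Depending on kernel version, this fsync() on a freshly opened
           descriptor may report success even though an earlier writeback of
           this file failed. */
        close(fd);
    }
}

int main(void)
{
    backend_write("hello", 5);
    checkpointer_sync();
    return 0;
}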

Discussion of these issues is ongoing on pgsql-hackers. No fixes were available at the time of writing.

Diagnosis

PostgreSQL may fail to start, crash, raise ERRORs on queries, complain about "unexpected zero page" errors, report clog errors, and so on. Pretty much any corruption symptom is possible.

dmesg may show logs like:

kernel: Buffer I/O error on device dm-0, logical block 3055047803
kernel: lost page write due to I/O error on dm-0

and PostgreSQL may have logs like:

ERROR: could not fsync file "base/24305/1709567143.1": Input/output error

Repair

If you see these errors in the kernel logs, and PostgreSQL reports fsync() failures on persistent parts of the data directory, restore the database from backups if possible.

Contact support if restoration from backup is not possible or practical. Complete repair may be possible if you have a complete and continuous WAL archive from the current database state back to before the first error, and a matching base backup from which a control file can be extracted. We create a custom backup label that forces the database to repeat work from prior to the last checkpoint, hopefully re-writing the failed blocks. Otherwise you'll want assistance with data dumping and restoration into a newly initdb'd PostgreSQL instance.

Prevention

No comprehensive list of configurations in which lost page writes can occur is known. The evidence suggests they are possible on all or nearly all configurations, but they are mainly a risk for a few:

  • multipath I/O using default settings, if connections are interrupted
  • NFS, if it runs out of disk space
  • storage backed by lvm-thin or device mapper thin-pool, if it runs out of space
  • storage backed by a thin-provisioning SAN, if it runs out of space

To safeguard against these main cases:

  • Don't use NFS
  • Don't use lvm-thin or the device-mapper thin-pool
  • If you have thin provisioned SAN storage, fully provision your PostgreSQL volumes or monitor space extremely carefully
  • Ensure that features='1 queue_if_no_path' is set in multipath storage configurations; see /etc/multipath.conf.
  • Monitor dmesg or the kernel logs for I/O errors
  • Monitor the PostgreSQL logs for fsync() errors
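
For the last two points, a periodic search of the logs is often enough to start with. The exact log locations depend on your distribution and PostgreSQL packaging, so treat these as example commands rather than a finished monitoring setup:

# kernel log: look for lost page writes and I/O errors
dmesg | grep -iE 'lost (async )?page write|I/O error'

# PostgreSQL log (the path shown is just one common layout)
grep -iE 'could not fsync|Input/output error' /var/log/postgresql/*.log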

We also suggest initialising your PostgreSQL instances' data directories with page checksums, using the --data-checksums option to initdb. Although this feature, introduced in PostgreSQL 9.3, won't prevent data corruption from happening, it increases the probability of detecting an I/O error on PostgreSQL data files.
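
For example, when creating a new instance (the data directory path below is just a placeholder):

initdb --data-checksums -D /path/to/datadir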

Thin-provisioning

This issue was originally discovered on a SAN-backed, thin-provisioned volume.

Thin-provisioned storage is at particular risk because it evades the normal guarantee made by Linux file systems that when write() returns, the storage space is actually reserved. Out-of-space errors can occur lazily, after write() returns, and trigger the same problems that are otherwise only hit by storage failure related I/O errors. The exact behaviour will depend on the thin provisioned storage implementation and how that storage is connected to the host system.

If you are using thin-provisioned storage, expand your PostgreSQL volumes to their maximum possible size so that they are fully provisioned, or closely monitor the free space in the thin provisioning pool.
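
If the thin pool is managed with lvm-thin or a device-mapper thin-pool (two of the configurations listed above), pool usage can be checked with lvs; for SAN-side thin provisioning you will need the vendor's own monitoring tools. The volume group name below is a placeholder:

# Data% and Meta% report how full the thin pool is
lvs -o lv_name,pool_lv,data_percent,metadata_percent myvg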

NFS

PostgreSQL on NFS volumes faces the biggest risk because the error can happen when the backing volume runs out of space. Most file systems reserve disk space immediately when PostgreSQL performs a write, so they cannot report the kind of delayed error that causes this problem. NFS does not perform such an immediate reservation.

Additionally, NFS flushes storage on close(), unlike most file systems, and PostgreSQL does not treat I/O errors on close() as indications to enter crash recovery.

Problems are especially likely when a volume is mounted with the soft and/or intr options. (async is also unsafe, but for reasons unrelated to this bug).

Discontinue use of PostgreSQL on NFS if possible. If that is not possible, make sure NFS is mounted with sync,hard,nointr and monitor disk space closely. Monitor the PostgreSQL logs for errors relating to fsync or close, and plan a migration off NFS.
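
As an illustration, an /etc/fstab entry using the safer options might look like the following; the server name, export path, and mount point are placeholders:

nfsserver:/export/pgdata  /var/lib/pgsql  nfs  sync,hard,nointr  0  0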

Multipath I/O

Users of multipath I/O should ensure that their multipath.conf specifies features "1 queue_if_no_path", and either does not set no_path_retry or sets it to queue (see the sketch below). These settings cause multipath to queue I/O and retry indefinitely rather than reporting an error back to the kernel and the application.
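
As a sketch, the relevant part of /etc/multipath.conf might look like this; real configurations often set these per device rather than in defaults, and will contain other settings as well:

defaults {
    features        "1 queue_if_no_path"
    no_path_retry   queue
}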

The effect is that PostgreSQL will hang on path failures, until a transient failure is resolved or indefinitely if it is not, but this guards against potential corruption.
