Lost async page write (Linux) and fsync() errors

Craig Ringer

On some storage configurations Linux can drop writes, failing to perform a write requested by the application, then continue to use the storage. You will see "lost page write" or "lost async page write" errors if this occurs, and PostgreSQL may report ERROR: could not fsync file "...": Input/output error. This can result in severe PostgreSQL data corruption. Do not ignore these errors. They are very unlikely on most configurations, but important.

On some storage configurations Linux can drop writes, failing to perform a write requested by the application, then continue to use the storage. If this occurs you will see "lost page write" or "lost async page write" errors in the kernel logs, and possibly fsync() errors in the PostgreSQL logs.

This can result in PostgreSQL data corruption. Do not ignore these errors.


Most storage configurations on Linux are affected in principle. NFS, multipath I/O, and SANs backed by thin-provisioned storage are at high risk. On most other storage configurations problems won't occur unless there's a physical failure in the underlying storage system, such as bad blocks on a disk.


Cause

The cause is a disagreement about the error-handling semantics of fsync(), which seems to be unspecified in POSIX and undocumented in the Linux kernel manpages. Some UNIX-like operating systems also have Linux-like behaviour.

  • Linux clears the error flag on dirty writeback pages when it reports a write error, so the pages will not be written again. A subsequent fsync() on the same file descriptor will succeed; but
  • PostgreSQL assumes that if it retries fsync() after it returns an error, the kernel will retry the writes that failed before, and that fsync() will only return success once all writes since the last successful fsync() have hit disk.

PostgreSQL thinks that fsync()'s contract is "everything written since your last successful fsync is on disk"; Linux thinks that fsync()'s contract is "everything written since your last successful or unsuccessful fsync is on disk". Linux expects you to retry any writes after an unsuccessful fsync, but it doesn't provide any way to find out which writes they were.
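The mismatch can be sketched in Python. The loop below follows PostgreSQL's assumed contract: if fsync() fails, retry it until it succeeds and treat success as durability. The function name fsync_and_retry is illustrative, not actual PostgreSQL code; the dangerous case needs a real I/O error, so the demonstration below only shows the happy path.

```python
import errno
import os
import tempfile

def fsync_and_retry(fd, max_retries=3):
    """Retry fsync() on EIO, as PostgreSQL's assumed contract would permit.

    On Linux this loop is NOT safe: once fsync() has reported EIO, the
    kernel clears the error state and drops the dirty pages, so a retried
    fsync() returns success even though the data never reached disk.
    """
    for _attempt in range(max_retries):
        try:
            os.fsync(fd)
            return True  # Success does NOT imply the earlier writes hit disk.
        except OSError as e:
            if e.errno != errno.EIO:
                raise
    return False

# Happy-path demonstration on a healthy filesystem.
fd0, path = tempfile.mkstemp()
os.close(fd0)
fd = os.open(path, os.O_WRONLY)
os.write(fd, b"checkpoint data")
print(fsync_and_retry(fd))  # True
os.close(fd)
os.remove(path)
```

On a healthy disk the loop behaves as PostgreSQL expects; the point of this article is that after one failed writeback it does not.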

As a result, PostgreSQL can treat a checkpoint as having succeeded and write an updated control file, when in fact the checkpoint was not safely completed. It will never retry those writes, causing arbitrary corruption. On some file systems, re-reading the lost blocks after cache eviction will return the old data; on others, it may return the new-but-never-really-written data instead.

Additionally, it has been found that PostgreSQL may not see fsync() errors at all. On kernels prior to 4.13 the first process to see the fsync() error clears it. On kernel 4.13 and newer, each process that has the file open is guaranteed to see the error if it fsync()s the file. But unfortunately, PostgreSQL write()s to a file and then close()s it; if a buffered write fails after the close() and before the checkpointer re-open()s the file to fsync() it, PostgreSQL never gets the news. We also fail to notice if the write happens to fail while we have the file open in a user backend or the bgwriter but not in the checkpointer, because we don't fsync() the file before we close it except at checkpoint time.
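The write-then-close pattern described above can be sketched as follows. The function names backend_write and checkpointer_fsync are hypothetical labels for the two roles, not PostgreSQL internals; the comments mark the window in which an error can become invisible.

```python
import os
import tempfile

def backend_write(path, data):
    """A backend-style write: buffered write(), then close(), no fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    os.write(fd, data)   # Data goes to the kernel page cache only.
    os.close(fd)         # The file now has no opener.
    # If background writeback fails HERE, pre-4.13 kernels may report the
    # error to nobody, and 4.13+ only reports it to processes that had the
    # file open at (or after) the time of the error.

def checkpointer_fsync(path):
    """Checkpointer-style flush: re-open() the file and fsync() it."""
    fd = os.open(path, os.O_WRONLY)
    os.fsync(fd)         # May succeed even though data was lost above.
    os.close(fd)

fd0, path = tempfile.mkstemp()
os.close(fd0)
backend_write(path, b"relation page")
checkpointer_fsync(path)  # In the lost-write scenario, no error surfaces here.
os.remove(path)
```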

Discussion on these issues is ongoing on pgsql-hackers. No fixes were available at time of writing.


Symptoms

PostgreSQL may fail to start, crash, raise ERRORs on queries, report "unexpected zero page" warnings, clog errors, etc. Pretty much any corruption symptom is possible.

dmesg may show logs like:

kernel: Buffer I/O error on device dm-0, logical block 3055047803
kernel: lost page write due to I/O error on dm-0

and PostgreSQL may have logs like:

ERROR: could not fsync file "base/24305/1709567143.1": Input/output error


What to do

If you see these kernel errors in the logs along with errors from PostgreSQL relating to fsync() of persistent parts of the data directory, restore the database from backups if possible.

Contact support if restoration from backup is not possible or practical. Complete repair may be possible if you have a complete and continuous WAL archive from the current database state back to before the first error, and a matching base backup from which a control file can be extracted. We create a custom backup label that forces the database to repeat work from prior to the last checkpoint, hopefully re-writing the failed blocks. Otherwise you'll want assistance with data dumping and restoration into a newly initdb'd PostgreSQL instance.


Affected configurations

No comprehensive list of configurations in which lost page writes can occur is known. Evidence suggests they are possible on all or nearly all configurations, but they are mainly a risk on a few:

  • multipath I/O using default settings, if connections are interrupted
  • NFS, if it runs out of disk space
  • storage backed by lvm-thin or device mapper thin-pool, if it runs out of space
  • storage backed by a thin-provisioning SAN, if it runs out of space

To safeguard against these main cases:

  • Don't use NFS
  • Don't use lvm-thin or the device-mapper thin-pool
  • If you have thin provisioned SAN storage, fully provision your PostgreSQL volumes or monitor space extremely carefully
  • Ensure that features "1 queue_if_no_path" is set in multipath storage configurations; see /etc/multipath.conf.
  • Monitor dmesg or the kernel logs for I/O errors
  • Monitor the PostgreSQL logs for fsync() errors
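As a minimal sketch of the log monitoring suggested above, the snippet below matches the kernel and PostgreSQL messages shown in this article. The function name scan_log_lines is illustrative; extend the patterns for your own environment and feed it lines from dmesg and the PostgreSQL log.

```python
import re

# Patterns taken from the kernel and PostgreSQL messages quoted in this
# article; extend them for your environment.
ALERT_PATTERNS = [
    re.compile(r"lost (async )?page write"),
    re.compile(r"Buffer I/O error on device"),
    re.compile(r'could not fsync file ".*": Input/output error'),
]

def scan_log_lines(lines):
    """Return the lines that indicate a possibly lost write."""
    return [ln for ln in lines if any(p.search(ln) for p in ALERT_PATTERNS)]

sample = [
    "kernel: Buffer I/O error on device dm-0, logical block 3055047803",
    "kernel: lost page write due to I/O error on dm-0",
    'ERROR: could not fsync file "base/24305/1709567143.1": Input/output error',
    "LOG: checkpoint complete",
]
for hit in scan_log_lines(sample):
    print(hit)
```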

We also suggest initialising your PostgreSQL instances' data directories with page checksums, using initdb's --data-checksums option. Although this feature, introduced in PostgreSQL 9.3, won't prevent data corruption from happening, it increases the probability of detecting an I/O error on PostgreSQL data files.


Thin-provisioned storage

This issue was originally discovered on a SAN-backed thin-provisioned volume.

Thin-provisioned storage is at particular risk because it evades the normal guarantee made by Linux file systems that when write() returns, the storage space is actually reserved. Out-of-space errors can occur lazily, after write() returns, and trigger the same problems that are otherwise only hit by storage failure related I/O errors. The exact behaviour will depend on the thin provisioned storage implementation and how that storage is connected to the host system.

If you are using thin-provisioned storage, expand your PostgreSQL volumes to their maximum possible size so that they are fully provisioned, or closely monitor the free space in the thin provisioning pool.


NFS

PostgreSQL on NFS volumes faces the biggest risk because the error can happen when the backing volume runs out of space. Most file systems reserve disk space immediately when PostgreSQL performs a write, so they cannot report the kind of delayed error that causes this problem. NFS does not perform such an immediate reservation.

Additionally, NFS flushes storage on close(), unlike most file systems, and PostgreSQL does not treat I/O errors on close() as indications to enter crash recovery.

Problems are especially likely when a volume is mounted with the soft and/or intr options. (async is also unsafe, but for reasons unrelated to this bug).

Discontinue use of PostgreSQL on NFS if possible. If not, make sure NFS is mounted with sync,hard,nointr, and monitor disk space closely. Monitor the PostgreSQL logs for errors relating to fsync() or close(). Plan a migration off NFS.
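As a sketch, an /etc/fstab entry using the safer mount options might look like the following; the server name, export path, and mount point are placeholders to be replaced with your own.

```
# /etc/fstab — hypothetical entry; substitute your server, export and mount point.
nfsserver:/export/pgdata  /var/lib/postgresql  nfs  sync,hard,nointr  0 0
```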

Multipath I/O

Users of multipath I/O should ensure that their multipath.conf specifies features "1 queue_if_no_path", and either does not specify no_path_retry or sets it to queue. These settings cause multipath to retry indefinitely rather than report an error back to the kernel and the application.
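An illustrative /etc/multipath.conf fragment with those settings might look like this; merge it with your existing configuration rather than replacing it, and verify the result with your storage vendor's guidance.

```
# /etc/multipath.conf — illustrative fragment, not a complete configuration.
defaults {
    features        "1 queue_if_no_path"
    no_path_retry   queue
}
```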

The effect will be that PostgreSQL hangs indefinitely on a persistent I/O error, or until a transient error is resolved, but this guards against potential corruption.
