Minimizing database corruption: best practices

Luciano Botti

Contributors

Co-authors Luciano Botti and Matthew Gwillam-Kelly.

Technical additions by Jakub Wartak and Arthur Nascimento.

Abstract

This article is part of a series on database corruption.

This article is intended as a general guide, giving recommendations and best practices to prevent database corruption.

Introduction

What is database corruption?

Database corruption occurs when parts of the database data are in the wrong format and become unreadable through normal usage, due to unintended changes to the data at the byte level.

This results in application errors and lost data (unless it can be recovered).

Postgres logs can be useful for identifying relations that error due to corruption, and the queries executed around the time the corruption occurred. However, they will not show the cause of the corruption.

Look out for errors that complain about internal database issues rather than user-facing problems. The following error messages suggest some sort of corruption:

ERROR: could not access status of transaction 3881522
DETAIL: could not open file "pg_xact/0003": No such file or directory
ERROR: could not read block 123 in file "base/45678/90123": read only 0 of 8192 bytes
ERROR: failed to re-find parent key in index "xxx_idx" for split pages 123/456
ERROR: cache lookup failed for relation 123456
ERROR: found multixact 123 from before relminmxid 456
ERROR: uncommitted xmin 1234 from before xid cutoff 5678 needs to be frozen
ERROR: MultiXactId 12345 has not been created yet -- apparent wraparound
ERROR: could not fsync file "base/24305/1709567143.1": Input/output error
ERROR: missing chunk number 0 for toast value XXXXX in pg_toast_XXXXX

General corruption causes

There can be a multitude of corruption causes:

  • Hardware issues (disk, CPU, RAM), often relating to power loss.
  • Postgres files being changed by a non-postgres process (e.g. antivirus).
  • OS bugs.
  • Upgrading OS incorrectly.
  • Upgrading postgres incorrectly.
  • Bad filesystem setup.
  • Bad postgres setup.
  • Backups being made incorrectly/altered.
  • Postgres bugs.

In our experience, the majority of database corruption stems from hardware issues and from upgrading the OS or postgres incorrectly.

If you can reproducibly create corruption through your application use, then please create a support ticket so that it can be investigated in depth by EDB.

Best practices to minimize data corruption

In this section you will find some recommendations and best practices to avoid or minimize corruption on a database system.

Backup and restore

1) Take backups regularly.

If you face corruption later, you will have a restore point to recover from.

If using tablespaces, make sure that the backups include the tablespace data.

Regularly restore the backups into a test environment to ensure that they are functional, and check that the data makes sense by running application tests against it.
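
For example, after restoring a backup into a test instance (with your physical backup tool, or pg_restore for a logical dump), a basic sanity check might look like the following sketch; the database and table names are hypothetical.

# Connect to the restored test instance and run simple sanity checks
psql -d mydb -c "SELECT count(*) FROM important_table;"
psql -d mydb -c "SELECT max(created_at) FROM important_table;"  # is recent data present?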

2) Use a tool for physical backup and restore.

A good tool, such as Barman or pgbackrest, eases the burden of taking and maintaining backups.

Configure your backup tool to take continuous backups and WAL archiving. For more information, please see continuous-archiving.
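
As a minimal sketch, WAL archiving can be switched on in postgresql.conf as below; the archive directory is just an example, and in practice the archive_command is usually provided or managed by Barman or pgbackrest.

# postgresql.conf - minimal WAL archiving sketch (archive directory is an example)
wal_level = replica          # default since PostgreSQL 10
archive_mode = on            # requires a restart
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'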

3) Evaluate periodic logical backups.

If you are concerned about your vulnerability to database corruption, you could also evaluate creating regular logical backups alongside your physical backups, with pg_dump or pg_dumpall. This gives additional insurance against database corruption.

Corruption is passed on to a physical backup, but is not passed on to a logical backup.

However, this creates additional complexity and needs to be tested against database performance, as logical backups consume resources, can impede other operations, and can block maintenance processes. If desired, test whether the dump can be taken during low-usage hours; if not, it could be taken during scheduled downtime.

Additionally, logical backups can be slow for large databases.
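
A minimal sketch of such a logical backup, assuming a database named mydb and a /backups directory (both hypothetical):

# Custom-format dump of one database, plus global objects (roles, tablespaces)
pg_dump -Fc -d mydb -f /backups/mydb_$(date +%F).dump
pg_dumpall --globals-only -f /backups/globals_$(date +%F).sql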

Postgres administration

1) Do not use fsync=off.

An operating system crash while fsync=off is set can severely corrupt the database.

2) Do not use full_page_writes=off.

Turning this parameter off might lead to unrecoverable or silent data corruption after a system failure.
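
Both settings can be checked quickly from psql; they should both report 'on':

psql -c "SELECT name, setting FROM pg_settings WHERE name IN ('fsync', 'full_page_writes');"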

3) Run the database instance with checksums enabled.

To enable checksums, use initdb -k when initializing the cluster. On postgres 12+, checksums can also be enabled on an offline cluster with pg_checksums --enable.
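
As a sketch (the data directory path is an example, and pg_checksums --enable requires the instance to be cleanly shut down):

# New cluster, checksums enabled from the start
initdb -k -D /var/lib/pgsql/data

# Existing cluster, postgres 12 or later, instance stopped
pg_checksums --enable -D /var/lib/pgsql/data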

It is recommended to enable checksums and to alert on corruption-related errors in the postgres logs. This will not prevent your database from becoming corrupted, but it makes corruption far more likely to be detected automatically, so diagnosis, prevention of further damage and remediation can happen at the time of the issue rather than at a later date.

For example, consider a table whose tuple header has been corrupted with incorrect data.

The SELECT itself still works, so application errors would not immediately reveal the corruption:

psql -d edb -c "select * from test_corruption"

 i 
---
 1
 2
 3
 4
 5
(5 rows)

However, postgres logs the corruption:

tail -f enterprisedb-2024-04-18_163505.log
2024-04-18 16:36:01 UTC LOG:  statement: select * from test_corruption;
2024-04-18 16:36:36 UTC WARNING:  page verification failed, calculated checksum 13021 but expected 10829
2024-04-18 16:36:36 UTC ERROR:  invalid page in block 0 of relation base/14319/73740

Additionally, issues can then be discovered by running pg_checksums --check on the offline node. Please see the subsequent article in this series detecting-and-fixing-corruption for more information.
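
For example (a sketch; the instance must be stopped and -D adjusted to your data directory):

pg_checksums --check -D /var/lib/pgsql/data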

4) Do not manually modify the postgres data files in any way.

In particular, do not remove files from the database directory manually – not from pg_wal (formerly pg_xlog), pg_xact (formerly pg_clog), or base.

If you are out of disk space on the server and are an EDB customer, create a support ticket.

5) Do not remove postmaster.pid.

Removing it could allow multiple postmaster processes of the same PostgreSQL instance to run at the same time, which will corrupt data. Recent postgres versions have better protection against running multiple postgres services in the same PGDATA directory, but it is still good to keep this recommendation in mind.

6) Use an optimal wal_sync_method value for your Operating System.

wal_sync_method is used not only when writing WAL but also when flushing pages during a checkpoint, so a bad setting could lead to table corruption as well as WAL corruption.

On Linux, check the pg_test_fsync output; your chosen method should not be unrealistically fast, as that would suggest writes are not actually reaching stable storage.

On MacOS X, use fsync_writethrough.

On Windows, use fsync or fsync_writethrough, or disable write caching.
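
For example, on Linux you might compare the current setting against the timings reported by pg_test_fsync (a sketch; the file path is an example and should sit on the same filesystem as pg_wal):

psql -c "SHOW wal_sync_method;"
pg_test_fsync -f /var/lib/pgsql/data/pg_wal/pg_test_fsync.out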

OS administration

1) Keep Postgres and operating system packages up to date.

Have a reasonable upgrade schedule (update as often as possible).

Bear in mind that upgrading requires a Postgres restart.
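
After applying updates and restarting, it is worth confirming that the server is actually running the new release, for example:

psql -Atc "SELECT version();"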

2) Use local storage, not a network filesystem.

There are fewer ways for a directly connected disk to fail. It may be possible to set up NFS or iSCSI reliably, but many setups seem to be unreliable.

If you must use NFS, export it with the sync option, mount it with the v4, hard and sync options, and test it carefully.
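
A hypothetical /etc/fstab entry following that recommendation might look like this (server name, export path and mount point are examples only):

# /etc/fstab - NFS mount for the Postgres data directory (sketch only)
nfs-server:/export/pgdata  /var/lib/pgsql/data  nfs  vers=4,hard,sync  0 0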

3) Choose a reliable filesystem.

ext4 and xfs seem to be more reliable than other Linux filesystems.
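
You can confirm which filesystem the data directory is on with, for example:

df -T /var/lib/pgsql/data   # the "Type" column should show ext4 or xfs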

4) Do not run anti-virus software on the database server.

Anti-virus software and OS security agents have caused corruption in some cases, by altering the OS and making unintended changes to data files.

Conclusion

As this is an extensive and open-ended topic, we have tried to summarize some of the most common cases in a few points. We hope this helps you put your environment in a better position against corruption issues.

Still, we invite you to continue reading other related blogs and Knowledge Base articles listed below.

Source

The articles in this series are based on the following articles, postgres documentation and lessons learnt from support tickets involving experts:
