Some kinds of corruptions can be repaired by forcing PostgreSQL to restart redo at an earlier checkpoint than the restart point in the control file. Here's how to do it.
WARNING: Please check this procedure on a test instance before following it live.
Take a full snapshot of the system if at all possible, before doing anything more. Doing this wrong or doing it when it's not the appropriate fix can make a bad problem worse.
Used correctly, you can force PostgreSQL to retry WAL that was misapplied the first time around due to disk errors, creative operator intervention, postgres or kernel bugs, etc, and repair a broken cluster.
You must have a continuous archive of xlogs from before whatever the corruption was until the current state to attempt this fix. It will only help if the xlogs are correct and intact, and it's just the heap that's corrupt.
- Examine the xlog filesystem timestamps and/or use
pg_xlogdump
/pg_waldump
to find roughly the LSN range at which forced redo should start. - Use
pg_xlogdump
/pg_waldump
to find the exact LSN of a checkpoint at which redo should start. Note down the checkpoint LSN and the file it is in. - Set
hot_standby = on
inpostgresql.conf
- Start the server with a
recovery.conf
. This updates the control data, setting the cluster status asarchive recovery
(the status of a running standby). Make sure that the recovery.conf has: standby_mode = on
pause_at_recovery_target = on
- A
restore_command
if not all WAL is inpg_xlog
- Stop postgres with
pg_ctl stop -m immediate
to avoid performing a clean shutdown that would change the status. If it was already shut down, start it then stop it again with this command. Checkpg_controldata
to make sure it shows the status asarchive_recovery
. - Copy all WAL from that file through to the current state into
pg_xlog
, or copy the first oldest few segments including the start location segment and set arestore_command
inrecovery.conf
. - Create a
backup_label
file that lists the lsn and xlog segment file of the checkpoint, e.g.
START WAL LOCATION: 44C4/ED7A7EF8 (file 00000001000044C4000000ED)
CHECKPOINT LOCATION: 44C4/ED7A7EF8
BACKUP METHOD: streamed
BACKUP FROM: standby
START TIME: Fri 24 Feb 2017 05:01:07 PM UTC
LABEL: force recovery
- Start the server
- Wait for it to replay all WAL and get back to where it was. It will remain in standby mode
- Shut the server down and remove
recovery.conf
- Start the server; it should now operate normally
This procedure should not increment the server's timeline.