Corruption

VITALLY IMPORTANT FIRST RESPONSE

Do you think you have one of those rare but nasty index or table corruption problems?

Before doing anything else:

Stop the postmaster and take a file system level copy of your database NOW!. Don't use pg_dump, instead make a copy of all the files in the data directory (the one that contains the "base", "pg_xlog", "pg_clog" etc folders).

If possible, put a copy of it on an external hard drive, DVD, or other storage you can disconnect from your computer so you don't accidentally modify your snapshot after making it.

People often destroy potentially recoverable data by trying to repair it. Make a file-system-level copy of your database before attempting any repair. It might save your data.

The experimental hex editor toolkit pg_hexedit can be used for low-level analysis of PostgreSQL relation files. It should only be run against a copy of the database.

Your backup copy is also important evidence that may help figure out what caused the problem, allowing you to prevent it from happening again. Repair efforts usually destroy that evidence. Having the damaged data may be the only way to figure out why it was damaged and how to prevent the same thing happening again.

Detecting corruption

As of PostgreSQL 10, there is am extension called amcheck. Although it only verifies the integrity of B-Tree indexes, it is reasonably well suited for use as a general purpose corruption smoke test, especially because the indexes can be verified against the heap with the "heapallindexed" verification option (later versions only). Note that there is also a version on Github that targets earlier PostgreSQL versions.

PostgreSQL 9.6 introduced a more specialized tool for verifying the integrity of the visibility map, called pg_visibility. This may be of interest in cases where visibility map corruption is suspected specifically.

PostgreSQL 12 and above have the ability to do pg_checksums. These will rewrite the database files in place. There is a tutorial for setting them up.

Causes

Remember that most corruption is caused by hardware issues:

RAID controllers with faulty / worn out battery backup, and an unexpected power loss
Hard disk drives with write-back cache enabled, and an unexpected power loss
Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss
Defective RAM
Defective or overheating CPU(s)

Other causes can be:

Transitioning to a newer Operating System via pg_basebackup or streaming replication where the source and target machines run different versions of glibc. Common boundaries are CentOS 7 --> CentOS 8 and Ubuntu 18.04 --> Ubuntu 20.04 but corruption can occur anywhere the underlying c library change impacts LOCALE collation.
Database systems configured with fsync=off and an OS crash or power loss
Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM was a particular culprit but full barrier support was added by Liunx 2.6.33-rc1.
PostgreSQL bugs
Operating system bugs
Admin error
- directly modifying Postgres data-directory contents
- faulty fail-over procedures like rsync'ing without pg_start_backup
- and lots more

What next?

Consider contacting a professional support service provider. Corruption issues can be time consuming and difficult to diagnose and fix, and are often hardware- or configuration-specific.

Read the Guide_to_reporting_problems and post to the pgsql-general mailing list with information about the problem. Remember that your hardware settings and postgresql configuration are vital information, as are any error messages, your Pg version, etc.

Christophe Pettus's FOSDEM presentation on database corruption is also well worth a read, but you don't get the spoken warnings that come with it. In particular, the recovery techniques described there are for recovering what you can from a DB, with the understanding that you'll have missing chunks of data, inconsistent FK relationships, reappearing deleted rows / visible uncommitted rows, and all sorts of problems depending on the repair technique used. It assumes you've already done the sensible initial investigation and fixed any less serious problems that might be causing your issue.

Corruption

Contents

VITALLY IMPORTANT FIRST RESPONSE

Detecting corruption

Causes

What next?

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Tools

Search