Corruption

From PostgreSQL wiki

Jump to: navigation, search

Contents

VITALLY IMPORTANT FIRST RESPONSE

Do you think you have one of those rare but nasty index or table corruption problems?

Before doing anything else:

Stop the postmaster and take a file system level copy of your database NOW!. Don't use pg_dump, instead make a copy of all the files in the data directory (the one that contains the "base", "pg_xlog", "pg_clog" etc folders).

If possible, put a copy of it on an external hard drive, DVD, or other storage you can disconnect from your computer so you don't accidentally modify your snapshot after making it.

People often destroy potentially recoverable data by trying to repair it. Make a file-system-level copy of your database before attempting any repair. It might save your data.

The experimental hex editor toolkit pg_hexedit can be used for low-level analysis of PostgreSQL relation files. It should only be run against a copy of the database.

Your backup copy is also important evidence that may help figure out what caused the problem, allowing you to prevent it from happening again. Repair efforts usually destroy that evidence. Having the damaged data may be the only way to figure out why it was damaged and how to prevent the same thing happening again.

Detecting corruption

PostgreSQL 10 has a contrib extension called amcheck. Although it only verifies the integrity of B-Tree indexes, it is reasonably well suited for use as a general purpose corruption smoke test. Note that there is also a version on Github that targets earlier PostgreSQL versions.

PostgreSQL 9.6 introduced a more specialized tool for verifying the integrity of the visibility map, called pg_visibility. This may be of interest in cases where visibility map corruption is suspected specifically.

Causes

Remember that most corruption is caused by hardware issues:

  • RAID controllers with faulty / worn out battery backup, and an unexpected power loss
  • Hard disk drives with write-back cache enabled, and an unexpected power loss
  • Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss
  • Defective RAM
  • Defective or overheating CPU(s)

Other causes can be:

  • Database systems configured with fsync=off and an OS crash or power loss
  • Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit.
  • PostgreSQL bugs
  • Operating system bugs
  • Admin error
    • directly modifying Postgres data-directory contents
    • faulty fail-over procedures like rsync'ing without pg_start_backup
    • and lots more

What next?

Consider contacting a professional support service provider. Corruption issues can be time consuming and difficult to diagnose and fix, and are often hardware- or configuration-specific.

Read the Guide_to_reporting_problems and post to the pgsql-general mailing list with information about the problem. Remember that your hardware settings and postgresql configuration are vital information, as are any error messages, your Pg version, etc.

Christophe Pettus's FOSDEM presentation on database corruption is also well worth a read, but you don't get the spoken warnings that come with it. In particular, the recovery techniques described there are for recovering what you can from a DB, with the understanding that you'll have missing chunks of data, inconsistent FK relationships, reappearing deleted rows / visible uncommitted rows, and all sorts of problems depending on the repair technique used. It assumes you've already done the sensible initial investigation and fixed any less serious problems that might be causing your issue.

Personal tools