Corruption Detection and Containment
Introduction
PostgreSQL relies heavily on the filesystem returning the same data that was previously written. Unfortunately, due to the complexity of filesystems (and the storage systems upon which they rely), returning different (corrupt) data is common enough to cause serious problems. For large systems with many disks, you can expect this to happen with some regularity.
The largest problem is that these errors are silent: the system gives no indication that the data is corrupt, and PostgreSQL is left to continue processing it. Furthermore, because the filesystem itself is returning the corrupt data, taking a backup or syncing a new replica is likely to copy the corrupt block, leaving the backup or replica corrupt as well.
Checksums are a way to make the error known early, before returning results to the user and before copying the data to a backup/replica. That allows a number of potential solutions, the simplest of which is to replace the faulty hardware and restore from a backup.
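As a minimal sketch of the idea (not PostgreSQL's implementation), the following keeps a checksum next to each block and refuses to return data that no longer matches it. The names `write_block`, `read_block`, and `ChecksumMismatch`, the dict-based store, and the use of CRC-32 are all assumptions made for illustration.

```python
import zlib


class ChecksumMismatch(Exception):
    """Raised when a block read back does not match its stored checksum."""


def write_block(store: dict, block_no: int, data: bytes) -> None:
    # Store the payload together with a checksum computed at write time.
    store[block_no] = (zlib.crc32(data), data)


def read_block(store: dict, block_no: int) -> bytes:
    # Recompute the checksum on every read and refuse to return bad data.
    stored_sum, data = store[block_no]
    if zlib.crc32(data) != stored_sum:
        raise ChecksumMismatch(f"block {block_no} failed verification")
    return data


store = {}
write_block(store, 0, b"tuple data")
clean = read_block(store, 0)

# Simulate silent corruption: the storage layer changes a byte
# without reporting any error.
checksum, _ = store[0]
store[0] = (checksum, b"tuple dat\x00")
try:
    read_block(store, 0)
    detected = False
except ChecksumMismatch:
    detected = True
```

Because the check runs on the read path, the error surfaces before the data reaches the user, and the same check can run before a block is copied to a backup or replica.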
Requirements
- Determine when a data page has been corrupted
- Determine when a CLOG page has been corrupted
- Determine when a temporary table or file has been corrupted
- Enable/Disable corruption detection online
- Upgrade from a system without corruption detection to a system with corruption detection
- Detect corruption before processing the data
- Detect corruption when taking a base backup or syncing a new replica
- Detect corruption in background or when offline (e.g. test a backup)
High-Level Design
- initdb-time option to enable checksums
- Add 16-bit checksum to every page in place of pd_tli
- When setting a hint bit, a full page image is now required if it's the first modification since the last checkpoint (otherwise, the xlog action can return early).
- Add bits to the page header so that we can differentiate between a page with checksums and one without. This will help satisfy the upgrade requirement later, by offering a transition state where some pages have checksums and some don't.
- Detect zeroed pages
- WAL log all relation extensions
- For performance, probably need to extend in bulk (TODO: explore alternatives)
- Always initialize all pages
- Add GUC to detect checksum failures in temporary tables and files (e.g. for Sort).
- TODO: what to do about UNLOGGED tables?
- Option to VACUUM, or other command, that allows turning checksums on/off while online.
- TODO: details
- Detect corruption in CLOG
- TODO: other SLRU?
- Background/offline checker, and checker in pg_basebackup
- Challenge: how to deal with a race condition where Postgres is writing data and the other utility is reading it (kind of like an online torn page)?
- Should be very rare: retry?
- PostgreSQL could also provide a function that rechecks an individual block, so that if the retry fails, the user can confirm that the block really is corrupt by asking PostgreSQL.
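Several of the design points above can be illustrated together: a 16-bit per-page checksum, a header bit that marks whether a page carries a checksum, and an external checker that re-reads a page before reporting it as corrupt. This is only a sketch under stated assumptions. The two-field header, the `HAS_CHECKSUM` bit, the helper names, and the use of CRC-CCITT are all hypothetical; none of them is PostgreSQL's actual page layout or checksum algorithm.

```python
import binascii
import struct

PAGE_SIZE = 8192                # PostgreSQL's default block size
HEADER = struct.Struct("<HH")   # hypothetical header: flags, checksum
HAS_CHECKSUM = 0x0001           # hypothetical "page carries a checksum" bit


def checksum16(body: bytes) -> int:
    # Stand-in 16-bit checksum (CRC-CCITT), chosen only for illustration.
    return binascii.crc_hqx(body, 0)


def make_page(payload: bytes, with_checksum: bool = True) -> bytes:
    body_size = PAGE_SIZE - HEADER.size
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_size, b"\x00")
    if with_checksum:
        return HEADER.pack(HAS_CHECKSUM, checksum16(body)) + body
    return HEADER.pack(0, 0) + body


def page_is_valid(page: bytes) -> bool:
    flags, stored = HEADER.unpack_from(page)
    if not flags & HAS_CHECKSUM:
        return True  # page written without a checksum: nothing to verify
    return checksum16(page[HEADER.size:]) == stored


def check_with_retry(read_page, attempts: int = 3) -> bool:
    # An external checker may read a page while it is being written,
    # so re-read a few times before declaring the page corrupt.
    for _ in range(attempts):
        if page_is_valid(read_page()):
            return True
    return False


old = make_page(b"-" * 600 + b"v1")
new = make_page(b"-" * 600 + b"v2")
torn = new[:512] + old[512:]    # first 512 bytes from the new write, rest still old

reads = iter([torn, new])
recovered = check_with_retry(lambda: next(reads))   # second read sees the whole page
persistent = check_with_retry(lambda: torn)         # every read fails
```

In this sketch a page without the flag bit is accepted unverified, which models the transition state where some pages have checksums and some do not. A block that still fails after the retries would be a candidate for the server-side recheck mentioned above.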