Full page writes
PostgreSQL relies on page-level atomicity for correct operation, so it must do extra work to create that atomicity itself unless you declare that the storage provides it. After a crash in the middle of a write, we need to be able to read back either the old page or the new page, but not a mash-up of the two or arbitrary corrupted data. Such corruption is referred to as "torn writes" in the PostgreSQL community, and sometimes "fractured blocks" or partial writes in other database communities.
The reason it needs page-level atomicity is because it uses "physio-logical" logging, in the terminology of the book Transaction Processing: Concepts and Techniques by Reuters and Gray. Transaction log entries identify the physical page to be modified and then describe the change to the page in logical terms. For example, rather than saying "overwrite bytes 10-20 with the following data", they say "delete tuple 7", where to understand what tuple 7 means, you need to consult a table of tuples on the page. For that to be possible, the recovery mechanism has to be able to (1) read the page in the first place, including having its checksum pass, and (2) understand the existing contents of the page, such as the tuple table. Therefore, we need changes to pages to be atomic with respect to crashes: a partially overwritten page is corrupted and its contents can't be consulted.
The full_page_writes option, on by default and quite rarely turned off, defends against torn writes.
The idea is to switch to "physical" logging for the first modification of each page after a checkpoint, which is a fancy way of saying that we dump the whole 8KB page into the transaction log instead of describing how to change it. Future updates to that page can be done with the usual and much cheaper physiological logging approach, where records are usually one or two orders of magnitude smaller, right up until the next checkpoint.
Whenever data is being written out, whether in the background in the bgwriter or checkpointer, or in the foreground when dirty pages are evicted from the buffer pool, the standard WAL log-before-data policy is always applied: the WAL must be flushed up to the LSN that last touch the page, so that the WAL is on disk first. Therefore, if we lose power in the middle of a write of a data page, only one of those two copies could be torn. The WAL that has the full page must already be on disk before we begin writing the data page. In crash recovery, we'll just restore the page directly into the buffer pool from the log, and never consult the torn page. At the next checkpoint, it'll be overwritten.
If on the other hand we lose power while writing the WAL and the WAL record is torn, recovery will end at that point because the WAL checksum will fail, but we didn't start writing the data page, so again the data page is not corrupted. Atomicity is achieved.
Consequences for performance
- On a busy system, every checkpoint is followed by an increase in WAL traffic, and usually lower transaction throughput. This is one reason why long running benchmarks on PostgreSQL show a wave pattern in TPS, which can be flattened out by turning off full_page_writes.
- In crash recovery, full_page_writes throws away any progress due to data that was already written out before we crashed. We always have to replay the full set of changes to every page from the last checkpoint, because the first mention of each page includes the page image, and then we replay all changes on top of that. We lose the opportunity to load a page from disk, determine that this page already has all changes up to some future LSN, and skip replaying WAL records with earlier LSNs.
- The flip side of the previous point is that in streaming replication and crash recovery, we never have to read data pages from disk the first time they are touched after a checkpoint: we load them into the buffer pool directly out of the WAL. This avoids I/O stalls, but only for as long as the page stays in the buffer pool. This means that the recovery_prefetch feature does nothing useful until the page falls out of the buffer pool and has to be loaded back in again, if full_page_writes was on. In other words, recovery_prefetch's cache-miss avoidance potential varies over the checkpoint cycle, and it generally only helps with very large working sets that don't fit in cache, but receive multiple updates per checkpoint cycle. This makes full_page_writes seem like a good feature for recovery performance, but it comes at a very high cost to the on-line database.
When can you turn it off?
ZFS has record-level atomicity, so with record_size set >= 8KB it is expected to be safe to turn full_page_writes off. Then you should expect much smoother and higher TPS with smoother and smaller WAL write volume. On the flip side, replication and crash recovery might suffer more I/O stalls, because of the lack of full page images in the WAL to avoid data file reads. This situation should be improved by the PostgreSQL 15 recovery_prefetch feature, though note that it only works on ZFS versions that support posix_fadvise, which is anticipated to appear in OpenZFS 2.2 (not yet released). That closes a loop: work on recovery_prefetch began after studying reports from large scale PostgreSQL-on-ZFS users who were (probably) disabling full page writes for online performance and then suffering from poor redo performance, and wrote an external tool to prefault referenced data to fix that.
There may be many other systems than can provide 8KB atomicity as a result of their copy-on-write and/or data-journaling designs.
Some overwrite systems can probably provide 4KB filesystem block-level atomicity. It may technically be possible to switch PostgreSQL 4KB pages, modern sector size. However, it's very hard to get concrete published promises about that: no relevant standards for user space programming interfaces cover this, so you have to construct the belief by investigating your whole storage stack. There have been proposals for various mechanisms to expose information, or construct atomicity, for example O_ATOMIC, that would make promises to user space applications. A topic for future research.
What other kinds of atomicity requirements are lurking in PostgreSQL?
We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints. If there is a system that can lose power having updated only some of those bytes, the cluster will not be able to recovery. There are nvram/pmem devices that can be mounted with or without artificial sector-level atomicity; without that, PostgreSQL is not crash-safe, a topic for future research.
There was a similar requirement for a "relation map" file, but it was removed in 16.
Contrast with other RDBMSs
As far as I know (corrections welcome), commercial RDBMSs do not attempt to protect against torn writes on single node databases in software, because:
- they were developed for high-end storage that would make guarantees about power-loss atomicity (battery backed controller cards, uninterruptible power supply, etc),
- they might have page recovery strategies involving asking a peer node
In contrast, open source RBDMSs like PostgreSQL and MySQL historically ran on a wide range of hardware, including unreliable consumer equipment, and often without battery protection. Therefore, torn writes were a real phenomenon that could create an unrecoverable database and had to be solved.
Briefly: The MySQL/InnoDB solution has in common that writes data pages out twice with a synchronisation barrier in between. That's relying on the same observation that both copies can't be torn, but differs significantly in the choice of when and where to do it. Instead of dumping page images into the WAL sometimes, it writes a copy of all dirty pages out to a double-write file before writing to the main data files. This moves the problem to the background, getting it out of the path of transactions, but also has to do it for all writes, not just after checkpoints.