Updating the WAL infrastructure

This page collects ideas for improving our usage of WAL: updating the on-disk format (which hasn't been modified since the great WAL format unification in late 2014, released with 9.5), revising the rules for when to log what, and changing how recovery is run.

The areas mentioned on this page are relevant as of May 2023, during the beta window for PostgreSQL 16.

Removing unused (or otherwise useless) bytes from the format

The WAL record format contains bytes that are often unused or useless:

Alignment losses in XLogRecord

The XLogRecord struct currently contains 2 bytes of alignment padding. We already complain about adding 4 bytes to some records, so why don't we fix this, too?
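
For reference, this is the fixed header every WAL record carries, as defined in src/include/access/xlogrecord.h during the PostgreSQL 16 cycle; it totals 24 bytes, of which 2 are pure padding:

 typedef struct XLogRecord
 {
     uint32      xl_tot_len;   /* total length of entire record */
     TransactionId xl_xid;     /* xact id */
     XLogRecPtr  xl_prev;      /* ptr to previous record in log */
     uint8       xl_info;      /* flag bits */
     RmgrId      xl_rmid;      /* resource manager for this record */
     /* 2 bytes of padding here, initialized to zero */
     pg_crc32c   xl_crc;       /* CRC of this record */
 } XLogRecord;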

XLogRecord's TransactionID is unused in many records

None of the core index access methods use the TransactionId in the XLogRecord, and neither do many other RMGRs. If it's not used, we shouldn't be logging those 4 bytes in every record. Once 64-bit XIDs land, those will become 8 bytes in each record - I'd rather prevent that from happening. One possible encoding is sketched below.
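
A minimal sketch of one way this could look, assuming a hypothetical XLR_HAS_XID flag bit and a hypothetical slimmed-down header (neither exists today): RMGRs that need an xid would set the flag, and the xid would follow the fixed header.

 #include "postgres.h"
 #include <string.h>
 #include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/xlogdefs.h"
 #include "port/pg_crc32c.h"
 
 #define XLR_HAS_XID 0x04      /* hypothetical flag bit in xl_info */
 
 /* Fields reordered so the 8-byte xl_prev stays aligned without padding. */
 typedef struct XLogRecordSlim
 {
     XLogRecPtr  xl_prev;      /* ptr to previous record in log */
     uint32      xl_tot_len;   /* total length of entire record */
     uint8       xl_info;      /* flag bits, including XLR_HAS_XID */
     RmgrId      xl_rmid;      /* resource manager for this record */
     uint16      xl_extra;     /* hypothetical: reuses the old padding */
     pg_crc32c   xl_crc;       /* CRC of this record */
     /* if (xl_info & XLR_HAS_XID), a TransactionId follows here */
 } XLogRecordSlim;
 
 /* 20 bytes on disk, down from 24 */
 #define SizeOfXLogRecordSlim (offsetof(XLogRecordSlim, xl_crc) + sizeof(pg_crc32c))
 
 /* Reading side: fetch the xid only when it is present. */
 static TransactionId
 record_get_xid(const char *rec)
 {
     const XLogRecordSlim *hdr = (const XLogRecordSlim *) rec;
     TransactionId xid = InvalidTransactionId;
 
     if (hdr->xl_info & XLR_HAS_XID)
         memcpy(&xid, rec + SizeOfXLogRecordSlim, sizeof(xid));
     return xid;
 }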

Average length of XLogRecord is <<< 2^16

The full size of a WAL record rarely exceeds 2^16 bytes and is often under 255 bytes, yet xl_tot_len, the field storing the record's length, is currently a uint32, so 2 or 3 of its bytes usually go unused. If we used a variable-length encoding (or something similar) for that field, we could save those bytes.
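
For illustration, a minimal sketch of an LEB128-style unsigned varint, one plausible variable-length encoding (the actual patch may encode lengths differently):

 #include <stdint.h>
 #include <stddef.h>
 
 /* Encode 'value' using 7 data bits per byte; the high bit of each byte
  * says whether another byte follows.  Returns the number of bytes used. */
 static size_t
 varint_encode(uint32_t value, uint8_t *buf)
 {
     size_t n = 0;
 
     do
     {
         uint8_t b = value & 0x7F;
 
         value >>= 7;
         if (value != 0)
             b |= 0x80;        /* continuation bit */
         buf[n++] = b;
     } while (value != 0);
 
     return n;
 }
 
 /* Decode a varint from 'buf'; assumes well-formed input. */
 static size_t
 varint_decode(const uint8_t *buf, uint32_t *value)
 {
     uint32_t result = 0;
     int      shift = 0;
     size_t   n = 0;
 
     for (;;)
     {
         uint8_t b = buf[n++];
 
         result |= (uint32_t) (b & 0x7F) << shift;
         if ((b & 0x80) == 0)
             break;
         shift += 7;
     }
     *value = result;
     return n;
 }

Under this scheme a record shorter than 128 bytes stores its length in 1 byte and a 200-byte record in 2 bytes; only records of 256 MB and up would need all 5 bytes.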

Block data may often be 0, or at least < 2^8

Blocks registered with a WAL record often carry little data, or none at all, yet we always store a uint16 holding the amount of block data, regardless of whether any data is present. We could often save one byte, sometimes two, by var-encoding the field with indicator bits in the upper part of e.g. the block ID; a sketch follows the links below.

See Commitfest patch
See -hackers mail thread
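
As an illustration of the indicator-bit idea, here is a hypothetical per-block header encoding (not the committed format; it leans on the fact that ordinary block IDs only run up to XLR_MAX_BLOCK_ID, which is 32, so the upper bits of the ID byte are mostly free):

 #include <stdint.h>
 #include <stddef.h>
 
 #define BLK_ID_MASK    0x3F   /* low 6 bits: the block ID itself */
 #define BLK_LEN_NONE   0x00   /* no data; no length bytes stored */
 #define BLK_LEN_1BYTE  0x40   /* data length < 2^8; 1 length byte */
 #define BLK_LEN_2BYTE  0x80   /* data length < 2^16; 2 length bytes */
 
 /* Write the block ID plus a 0-, 1- or 2-byte length; returns bytes used. */
 static size_t
 encode_block_header(uint8_t block_id, uint16_t data_len, uint8_t *out)
 {
     size_t n = 0;
 
     if (data_len == 0)
         out[n++] = block_id | BLK_LEN_NONE;
     else if (data_len < 256)
     {
         out[n++] = block_id | BLK_LEN_1BYTE;
         out[n++] = (uint8_t) data_len;
     }
     else
     {
         out[n++] = block_id | BLK_LEN_2BYTE;
         out[n++] = (uint8_t) (data_len & 0xFF);
         out[n++] = (uint8_t) (data_len >> 8);
     }
     return n;
 }

Compared to an unconditional uint16, a block with no data saves two bytes, and a block with under 256 bytes of data saves one.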

Improving efficiency of compression

In this discussion on YouTube, Andrey M. Borodin mentioned that compressing multiple full-page images in the WAL could probably benefit from a single compression stream, because logged FPIs often share (some) structure and data. Currently, compression is applied on a page-by-page basis; compressing several pages at once might save bytes overall.
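
A minimal sketch of the difference, using zlib's streaming API purely as an illustration (PostgreSQL's wal_compression actually offers pglz, LZ4 and Zstandard): one compression stream spans all FPIs of a record, so the compressor can reuse matches from earlier pages while compressing later ones.

 #include <string.h>
 #include <zlib.h>
 
 #define BLCKSZ 8192
 
 /* Compress all pages through a single deflate stream; assumes 'out' is
  * large enough for the whole compressed result.  Returns the compressed
  * size, or 0 on error. */
 static size_t
 compress_fpis_one_stream(char pages[][BLCKSZ], int npages,
                          unsigned char *out, size_t outsz)
 {
     z_stream    zs;
 
     memset(&zs, 0, sizeof(zs));         /* use default allocators */
     if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK)
         return 0;
 
     zs.next_out = out;
     zs.avail_out = (uInt) outsz;
 
     for (int i = 0; i < npages; i++)
     {
         zs.next_in = (unsigned char *) pages[i];
         zs.avail_in = BLCKSZ;
 
         /* Z_NO_FLUSH keeps one history window across page boundaries */
         if (deflate(&zs, (i == npages - 1) ? Z_FINISH : Z_NO_FLUSH) == Z_STREAM_ERROR)
         {
             deflateEnd(&zs);
             return 0;
         }
     }
 
     deflateEnd(&zs);
     return (size_t) zs.total_out;
 }

Per-page compression restarts the history window at every page, so structure repeated across pages (headers, item pointers, similar tuples) compresses poorly by comparison.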

Reducing the impact of dirtying pages whilst protecting against torn writes

Some WAL records, such as the VACUUM record used by the heap table access method and the "generic WAL record" from the Generic RMGR, don't modify the page layout but only bytes in predetermined places of the page. Let's call these modifications "precise".

FPIs are emitted primarily to combat torn writes, where only part of the changes to a page are persisted to disk due to an unexpected crash or shutdown. For a "precise" modification, however, no full-page image should be needed: if you _always_ replay the "precise" modification, any torn bytes (i.e., bytes of the old page that should have been updated) are overwritten with the new bytes again, so the end state of the page is the same regardless of which of the original changes reached disk.
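
A minimal sketch of what a "precise" record and its redo could look like (hypothetical structures, not an existing RMGR). The key point is that redo applies the edits unconditionally rather than consulting the page LSN, and is idempotent, so replaying over a torn page repairs it without an FPI:

 #include <stdint.h>
 #include <string.h>
 
 /* One precise edit: exactly these bytes go at exactly this offset. */
 typedef struct PreciseEdit
 {
     uint16_t      offset;     /* where on the page to write */
     uint16_t      len;        /* how many bytes */
     const uint8_t *data;      /* the new bytes, taken from the record */
 } PreciseEdit;
 
 static void
 precise_redo(uint8_t *page, const PreciseEdit *edits, int nedits)
 {
     for (int i = 0; i < nedits; i++)
     {
         /*
          * Overwrite unconditionally: whether the page currently holds
          * the old bytes, the new bytes, or a torn mix of both, the
          * state after redo is identical.
          */
         memcpy(page + edits[i].offset, edits[i].data, edits[i].len);
     }
 }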

This property could significantly reduce the WAL pressure of vacuuming and freezing: those workloads generate a lot of WAL overhead while also being well suited to producing precise descriptions of what to modify on a page.

Note that for a modification to qualify as "precise", system rules must be defined so that replaying it can't accidentally overwrite newer data in a way that makes the page unrecoverable. Precise rules about what may and may not be modified on a page under this scheme still have to be worked out, but as a start, it is a promising avenue for reducing PostgreSQL's WAL volume.

Parallel replay of WAL

Main page: Parallel Recovery

Not all WAL can be replayed in parallel, but some of it can. Parallel replay could speed up recovery and reduce lag spikes on read replicas: a primary can generate WAL with many parallel processes, while a replica is currently limited to a single process working on recovery. Usually one process is enough, but there are workloads where the replica can't catch up with the primary, resulting in permanent performance degradation. One common building block for parallel replay is sketched below.
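
A design sketch of that building block (not PostgreSQL code): route each record to a replay worker by hashing the block it touches, so records for the same page are still applied in order by a single worker. Records touching multiple blocks, or none at all (commits, checkpoints), would still need a barrier that drains all workers before they are applied.

 #include <stddef.h>
 #include <stdint.h>
 
 #define NUM_REPLAY_WORKERS 4  /* hypothetical worker count */
 
 /* Identifies the block a WAL record modifies. */
 typedef struct BlockRef
 {
     uint32_t    spcOid;       /* tablespace */
     uint32_t    dbOid;        /* database */
     uint32_t    relNumber;    /* relation */
     uint32_t    blockNum;     /* block within the relation */
 } BlockRef;
 
 /* Deterministically pick a worker queue for a block (FNV-1a hash). */
 static int
 replay_worker_for(const BlockRef *ref)
 {
     const uint8_t *p = (const uint8_t *) ref;
     uint32_t    h = 2166136261u;
 
     for (size_t i = 0; i < sizeof(*ref); i++)
         h = (h ^ p[i]) * 16777619u;
 
     return (int) (h % NUM_REPLAY_WORKERS);
 }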