PGCon 2021 Fun With WAL
Notes from PGCon 2021 Unconference session "Fun with WAL"
Fun with WAL
- There are some cool things we could do with the WAL
- Faster archive recovery
- Lazy or partial restore, only replay WAL when it's needed
- Push API to replace archive_command
Zenith architecture
                     WAL
    PostgreSQL  ---------------->  Page Server
                <----------------
                   GetPage@LSN
- PostgreSQL streams the WAL to the page server using Streaming Replication.
- Page Server applies the WAL
- No (relation) data is stored in the PostgreSQL data directory
- smgr / md layer has been replaced with calls into the Page Server
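The split above can be sketched as a toy in-memory page server: the compute node keeps no relation data, streams WAL records out, and asks for materialized pages back. All names here are illustrative, not the real Zenith API.

```python
# Toy model of the compute/page-server split (illustrative, not Zenith's
# actual protocol). The page server ingests WAL records and reconstructs
# any page at a requested LSN by replaying the records for that block.

class PageServer:
    def __init__(self):
        self.wal = {}   # (rel, blkno) -> [(lsn, redo_fn), ...] in WAL order
        self.base = {}  # (rel, blkno) -> base page image, if any

    def ingest(self, rel, blkno, lsn, redo_fn):
        # Called as WAL streams in over the replication connection.
        self.wal.setdefault((rel, blkno), []).append((lsn, redo_fn))

    def get_page_at_lsn(self, rel, blkno, lsn):
        # Materialize the page by replaying records up to the requested LSN;
        # this is the GetPage@LSN side of the diagram.
        page = self.base.get((rel, blkno), b"\x00" * 8192)
        for rec_lsn, redo in self.wal.get((rel, blkno), []):
            if rec_lsn <= lsn:
                page = redo(page)
        return page
```

Because pages are addressed by LSN, the same store can also serve historical versions of a page, which the real system exploits for branching and point-in-time queries.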
Faster archive recovery
Perform WAL redo in two phases:
1. Scan the WAL, making note of which records apply to which pages. Whenever you see a full-page image for a block, all the previous records for the same block can be thrown away immediately. (This can be made much more effective by writing an extra full-page image, e.g. every 1000 updates to the same page.)
2. Perform WAL redo separately for each relation or block. That gives a better, more sequential I/O pattern.
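Phase 1 above can be sketched as a single pass that buckets records per block and drops everything made redundant by a full-page image (the record format here is made up for illustration):

```python
# Phase 1 of the two-phase redo: bucket WAL records per block. A full-page
# image (FPI) makes all earlier records for that block redundant, so they
# are discarded on the spot. Record format is illustrative.

def bucket_wal(records):
    """records: iterable of (lsn, blkno, is_fpi, payload) in WAL order."""
    per_block = {}
    for lsn, blkno, is_fpi, payload in records:
        if is_fpi:
            per_block[blkno] = []  # earlier records for this block are dead
        per_block.setdefault(blkno, []).append((lsn, is_fpi, payload))
    return per_block

# Phase 2 would then replay per_block[b] for each block b independently,
# touching each page once instead of seeking back and forth.
```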
Parallel WAL redo:
After the WAL has been split per relation, each relation can be restored in parallel
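Once the WAL is bucketed per relation, the independence of the buckets makes parallelism trivial; a minimal sketch using a thread pool (real redo would of course write 8 kB pages, not concatenate payloads):

```python
from concurrent.futures import ThreadPoolExecutor

# Replay each relation's WAL bucket in parallel. The per-relation replay
# here is a stand-in: it just folds payloads together, where the real
# thing would apply redo routines to pages.

def replay_relation(records):
    state = b""
    for payload in records:
        state += payload
    return state

def parallel_redo(buckets):
    # buckets: {relation_name: [payload, ...]} -- no ordering constraints
    # exist between relations, so each one can be a separate task.
    with ThreadPoolExecutor() as pool:
        return dict(zip(buckets, pool.map(replay_relation, buckets.values())))
```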
Instant Recovery:
You can actually start up the cluster before WAL redo has finished. Whenever a page is accessed, replay all the WAL applicable to that page on demand.
http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publications/2017/PhD_Thesis_Caetano_Sauer.pdf
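The on-demand replay idea can be sketched as a page cache that keeps the pending WAL per page and drains it lazily on first access (names are illustrative):

```python
# Instant-recovery sketch: the cluster "starts" immediately; each page is
# recovered on first read by replaying its pending WAL exactly once.
# Illustrative structure, not PostgreSQL's buffer manager.

class LazyPages:
    def __init__(self, base_pages, per_page_wal):
        self.pages = dict(base_pages)      # blkno -> page bytes
        self.pending = dict(per_page_wal)  # blkno -> [redo_fn, ...]

    def read(self, blkno):
        # Drain this page's pending WAL on first access, then serve
        # the fully recovered page from then on.
        for redo in self.pending.pop(blkno, []):
            self.pages[blkno] = redo(self.pages[blkno])
        return self.pages[blkno]
```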
Lazy Restore
Restoring cluster from base backup requires copying all the data from the backup, and replaying all the WAL.
- If the WAL in the backup were split per relation, it would be possible to restore only what's needed.
- With Lazy Restore, non-relational data, like the clog, would be restored as usual.
- When a relation is accessed for the first time, it is fetched from the backup on demand, and the WAL applicable to that relation is replayed.
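The access path can be sketched as a memoized fetch-and-replay (`fetch_from_backup` and `wal_for` are hypothetical stand-ins for the backup reader and the per-relation WAL split):

```python
# Lazy-restore sketch: a relation is copied from the backup and its WAL
# replayed only on first access, then cached. fetch_from_backup and
# wal_for are stand-ins for real backup/WAL readers.

def make_lazy_restore(fetch_from_backup, wal_for):
    restored = {}

    def get_relation(rel):
        if rel not in restored:
            data = fetch_from_backup(rel)   # pull base copy on demand
            for redo in wal_for(rel):       # replay only this relation's WAL
                data = redo(data)
            restored[rel] = data
        return restored[rel]

    return get_relation
```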
Problems / annoyances
Some things are currently not included in WAL records:
- cmin/cmax
- speculative insertion tokens
These are not needed for crash recovery, but are needed by the primary server
With some WAL records, it's complicated to decipher which blocks are affected. For example:
- The visibility map and FSM updates are implicit in heap WAL records
- XLOG_SMGR_TRUNCATE truncates the heap, the FSM and the VM in one operation
- pg_rewind suffers from these too
Synchronous replication woes
If a WAL record cannot be streamed out, we still write it to local disk.
This was discussed at last year's PGCon...
Push API for WAL
To replace archive_command
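One possible shape for such an API, sketched here with invented names: instead of forking `archive_command` once per segment, the server pushes each finished segment to registered consumers, which can stream, batch, or retry as they see fit.

```python
# Hypothetical push-style archiving interface (not an actual PostgreSQL
# API). A hook returns True once it has durably accepted the segment,
# mirroring archive_command's exit-status contract.

class Archiver:
    def __init__(self):
        self.hooks = []

    def register(self, hook):
        # hook: callable(segment_name, segment_bytes) -> bool
        self.hooks.append(hook)

    def segment_ready(self, name, data):
        # The segment counts as archived only if every consumer accepted it;
        # otherwise the server must keep it and retry later.
        return all(hook(name, data) for hook in self.hooks)
```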
Corrupt WAL and security
- We could use more sanity checks in the WAL redo routines
- Can we make a guarantee that the WAL redo routines can tolerate any corrupt WAL without crashing? (Currently we can't.)
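The kind of check meant here can be illustrated with a trivial example: validate a record's claimed offsets against the page size before touching the page, rather than trusting the WAL (illustrative code, not a real redo routine):

```python
# Defensive redo sketch: reject a record whose claimed write would fall
# outside the page, instead of crashing or corrupting memory on bad WAL.

PAGE_SIZE = 8192

def apply_record(page, offset, payload):
    # Bounds-check everything derived from the (possibly corrupt) record
    # before using it.
    if not (0 <= offset and offset + len(payload) <= PAGE_SIZE):
        raise ValueError("corrupt WAL record: write outside page bounds")
    return page[:offset] + payload + page[offset + len(payload):]
```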