PostgreSQL redo log (WAL) outline
In PostgreSQL, WAL (write-ahead log) is used for recovery and for log-shipping replication. An outline is described in the documentation. In short:
- Every updating backend process must write its redo log records to WAL before it actually updates the data files.
- Each WAL record (describing a piece of updated data) has to reach stable storage before the corresponding data page is synced to storage.
In recovery, WAL is used as follows:
- Each WAL record is read from the WAL segments in the order it was written.
- Each record read is applied to the corresponding data block. This is called "redo" or "replay".
- To maintain consistency, redo is performed in a single thread.
In the case of log-shipping replication, each WAL record is:
- Transferred to each replica in the order it was written, and
- Replayed at the replica in a single thread, just as in recovery.
- Once replay reaches a consistent state, the postmaster is allowed to begin accepting read-only transactions while subsequent WAL records continue to arrive and be replayed.
Issues around serialized replay
At present, WAL records are replayed strictly in the order they were written, in a single thread, even though they are generated by multiple backend processes in parallel. The issues are:
- Recovery/replay performance is much worse than WAL write performance; recovery can take longer than the period during which the WAL records were written.
- Replication lag can grow large when the master has a heavy update workload.
Can replay be done in parallel?
Yes, if we enforce several rules about replay order. The rules are:
- For a given data block, WAL records must be applied in the order they were written.
- For a given transaction, before applying the WAL record that terminates the transaction (commit/abort/prepare for commit), all the transaction's other WAL records must have been replayed.
- A multi-block WAL record should be applied by only one worker. For this purpose, such a record is assigned to all the block workers responsible for the blocks involved. When a block worker picks up a multi-block record and is not the last worker to do so, it waits until the last worker actually replays it. The last worker to pick up the record replays it and then sends a sync message to the other waiting workers.
- Certain WAL records, such as timeline updates, should be replayed only after all preceding WAL records have been replayed. This may not be strictly necessary, but at present it looks safer to impose this condition on such records.
- The replayed LSN reported by pg_controldata is updated only when all preceding WAL records have been replayed.
- For some storage-level WAL records (create, etc.), we need to wait until they are replayed before assigning subsequent WAL records.
- For some storage-level WAL records (remove, etc.), we need to wait until all previously assigned WAL records have been replayed before applying them.
- Because current recovery is done in the startup process, the parallel replay workers are implemented as its child processes.
- The following processes serve this purpose:
- Startup process: reads WAL records from the WAL segments/walsender and performs overall control.
- Dispatching worker: analyzes and dispatches WAL records.
- Error page registration worker: collects error pages from the other workers. The current implementation uses hash functions based on memory contexts, which are not simple to port to shared memory; this could be replaced with a simpler structure if the information were managed in shared memory.
- Transaction worker: replays WAL records that carry no block information to update.
- Block worker: replays block data (mostly HEAP/HEAP2).
There can be more than one block worker; each of the other roles consists of a single process.
To pass and share information such as WAL records and worker status, dynamic shared memory defined in storage/dsm.h is used. To protect these data, the spinlocks defined in storage/spin.h are used.
Synchronization among workers
For light-weight synchronization among workers, a unix-domain UDP (datagram) socket is used. This could later be replaced with latches, as suggested at PGCon 2023 (see below).
The following GUC parameters are added:
- parallel_replay_test (bool, internal only): enables additional code that allows workers to be attached with gdb for testing.
- num_preplay_workers (int): total number of replay workers.
- num_preplay_queue (int): number of queues holding outstanding WAL records.
- num_preplay_max_txn (int): maximum number of outstanding transactions allowed during recovery. Must be max_connections or larger; on a replication standby, it must be the master's max_connections or larger.
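For illustration, the parameters might appear in postgresql.conf like this (the values are made up; only the parameter names come from this page):

```
# Hypothetical parallel-replay settings; values are examples only.
parallel_replay_test = off    # internal: gdb attach support for testing
num_preplay_workers  = 8      # total replay workers
num_preplay_queue    = 1024   # queues for outstanding WAL records
num_preplay_max_txn  = 200    # >= max_connections (master's, on a standby)
```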
Connection to GDB
Unlike a usual backend, the recovery code runs as the startup process (and as workers forked from the startup process). Because the startup process terminates when the postmaster begins to accept connections (unless hot standby is enabled), some extra machinery is needed to connect GDB to the parallel replay workers. Here's how it works:
- A dummy function is implemented to serve as a gdb break point.
- When the startup process begins to fork workers, it writes messages to a debug file.
- The message describes what to do in a separate shell:
- Start gdb
- Attach the startup process and the worker processes respectively,
- Set up a temporary break function,
- Create the signal file.
- The worker (and the startup process) then waits for the signal file to be created and calls the break function.
In this way, we can attach gdb to the startup process and each worker process without pain.
Current code is available from Koichi's GitHub repository.
For more information about the tests, you can visit Koichi's other GitHub repository; use the master branch. Please understand that this page is not yet complete and needs more updates for general use.
Current code status
No tests yet. The code just passes the build and is under review before testing. A couple of functions still need improvement/development, and some others are affected by this change.
Completed features are:
- Forking additional worker processes from the startup process,
- Various synchronization mechanisms to maintain the semantics of the global WAL replay order.
Remaining issues are:
- More testing of shared buffer allocation/release, to check whether the implementation assumption (if there is no space, just wait until the other workers finish and free memory) is correct.
- The recovery code sometimes breaks (mostly due to wrongly decoded information passed from the startup process).
- A good benchmark to show the potential performance gain.
Any cooperation in discussion/testing/development is very welcome. You can contact koichi.dbms_at_gmail.com.
Please understand that the repo can be incomplete because work on it is occasional.
A further addition is:
- Code to obtain a WAL record in text format (like pg_waldump) and keep it in some portion of memory to help testing.
xlog_outrec() looks very useful for this purpose. At present, this is enabled with
Discussion in PGCon 2023
Slide deck is available at File:Parallel Recovery in PostgreSQL.pdf
Here are some of the inputs from PGCon 2023 presentation.
- Parallel recovery on a standby may break if an index page redo is done before the corresponding data page redo.
- In that case, an index entry will point to a vacant page (not in use). In the current implementation of indexes, this causes an error.
- One idea: use parallel replay only when hot standby is disabled. In parallel, we could change the index code to simply ignore such dangling references.
- [Mmeent] Dangling references from indexes are signs of corruption for most indexes. Disabling the integrity checks only on replicas could be possible, but would result in a reduced ability to detect integrity violations.
- Can workers be child processes of the postmaster?
- A comment suggested it is possible. Keeping them as the startup process's children looks simpler for inheriting the startup process's resources.
- The unix-domain socket can be replaced with a Latch.
- This should be considered in the upcoming improvements.
I didn't get feedback on the use of POSIX message queues. However, POSIX mqueues are not very flexible to configure: every change needs kernel configuration, and it looks better to implement this with shared memory and a socket/latch.
[Mmeent] After thinking about it a little more, we might be able to use parallel recovery on hot standby nodes:
- Normal backend buffer accesses would need to conflict with redo process buffer accesses: we must not let redo processes modify pages and update those pages' LSNs without the backends that use those pages taking notice.
- Each backend registers the highest LSN it has seen on any page, e.g. in MyLsnHighWaterMark.
- Any time a non-redo process tries to access a page on a hot standby, it compares MyLsnHighWaterMark to the buffer's redo worker's current redo LSN.
- If the redo worker has not yet replayed up to that LSN, the backend waits for the redo worker to finish replaying up to that LSN.
This way, any ReadBuffer calls read page modifications in a serialized order, even if the ordering is not strictly serialized across the redo processes. As an additional benefit, this doesn't require expensive relfilenode and transaction dependency tracking, at the cost of more expensive page lookups.