Parallel Recovery

From PostgreSQL wiki

My Wiki home: Koichi


Other topics: Distributed deadlock detection


PostgreSQL redo log (WAL) outline

In PostgreSQL, WAL (write-ahead log) is used for recovery and for log-shipping replication. An outline is given in the PostgreSQL documentation. In brief:

  • Every updating backend process must write its redo log to WAL before it actually updates the data files.
  • Each WAL record (a piece of update data) must reach stable storage before the corresponding data page is synced to storage.
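The write-ahead rule can be sketched as follows (illustrative Python, not PostgreSQL code; all class and function names here are hypothetical):

```python
# Minimal sketch of the write-ahead rule: an update first appends a WAL
# record, and a dirty page may be flushed only after the WAL covering it
# has reached stable storage. All names here are hypothetical.

class MiniWal:
    def __init__(self):
        self.records = []        # in-memory WAL buffer
        self.flushed_lsn = -1    # highest LSN known to be on stable storage

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # the new record's LSN

    def flush(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

class MiniBufferPool:
    def __init__(self, wal):
        self.wal = wal
        self.pages = {}       # page id -> current value
        self.page_lsn = {}    # page id -> LSN of last record touching it
        self.storage = {}     # the "on disk" copy

    def update(self, page, value):
        # Rule 1: write the redo record before touching the page.
        lsn = self.wal.append((page, value))
        self.pages[page] = value
        self.page_lsn[page] = lsn

    def flush_page(self, page):
        # Rule 2: WAL up to the page's LSN must be durable first.
        self.wal.flush(self.page_lsn[page])
        self.storage[page] = self.pages[page]

wal = MiniWal()
pool = MiniBufferPool(wal)
pool.update("blk0", "A")
pool.flush_page("blk0")
assert wal.flushed_lsn >= pool.page_lsn["blk0"]
```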

WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:

  • Each WAL record is read from the WAL segments in the order it was written,
  • The record just read is applied to the corresponding data block. This is called "redo" or "replay".
  • To maintain redo consistency, redo is performed in a single thread.
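The serial replay described above amounts to a simple loop (illustrative Python; the record shape is hypothetical):

```python
# Sketch of single-threaded redo: records are read strictly in written
# order and applied to the block they name. The (block, value) record
# shape is made up for illustration.

def replay_serial(wal_records, blocks):
    for block, value in wal_records:   # strictly in written order
        blocks[block] = value          # "redo"/"replay" of one record
    return blocks

state = replay_serial([("b1", 1), ("b2", 7), ("b1", 2)], {})
assert state == {"b1": 2, "b2": 7}
```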

In the case of log-shipping replication, each WAL record is:

  • Transferred to each replica in the order it was written, and
  • Replayed at the replica in a single thread, just as in recovery.
  • Once replay reaches a consistent state, the postmaster is allowed to begin read-only transactions while subsequent WAL records continue to arrive and be replayed.

Issues around serialized replay

At present, WAL records are replayed strictly in written order in a single thread, even though they are generated by multiple backend processes in parallel. The issues are:

  • Recovery/replay performance is much lower than the performance of writing WAL. Recovery can take longer than the period during which the WAL records were written.
  • Replication lag can become long when the master has a heavy updating workload.

Can replay be done in parallel?

If we enforce several rules about replay order, yes. The rules are:

  • For a given data block, WAL records must be applied in the order they were written.
  • For a given transaction, the WAL record terminating the transaction (commit/abort/prepare for commit) may be applied only after all the transaction's other WAL records have been replayed.
  • A multi-block WAL record must be applied by only one worker. For this purpose, such a record is assigned to all the block workers responsible for the blocks involved. When a block worker picks up a multi-block record and it is not the last worker to do so, it waits until the last worker actually replays it. The last worker to pick up the record replays it and sends a sync to the other waiting workers.
  • Specific WAL records, such as a timeline update, are replayed only after all the preceding WAL records have been replayed. This may not be strictly necessary, but at present it looks safer to impose this condition on certain WAL records.
  • For storage-level WAL records that create objects (create, etc.), subsequent WAL records are not assigned until these records have been replayed.
  • For storage-level WAL records that remove objects (remove, etc.), replay waits until all the WAL records assigned so far have been replayed.
  • The replayed LSN reported by pg_controldata is advanced only when all the preceding WAL records have been replayed.
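The per-block ordering and multi-block rules can be sketched as follows (illustrative Python; it simulates the ordering constraints sequentially instead of with real worker processes, and the record shape and block-to-worker assignment are hypothetical):

```python
from collections import defaultdict

# Sketch of the dispatch rules: records for one block stay in written
# order because a block always maps to the same worker; a multi-block
# record is queued to every involved worker but replayed only once, by
# the last worker to reach it.

NUM_WORKERS = 2

def worker_for(block):
    return block % NUM_WORKERS          # hypothetical assignment function

def dispatch(wal_records):
    queues = defaultdict(list)          # worker id -> list of records
    for lsn, (blocks, value) in enumerate(wal_records):
        workers = {worker_for(b) for b in blocks}
        for w in workers:               # multi-block: queued to all involved
            queues[w].append((lsn, blocks, value, len(workers)))
    return queues

def replay(queues):
    state = {}
    arrived = defaultdict(set)          # lsn -> workers that reached it
    done = set()                        # lsns already replayed
    idx = {w: 0 for w in queues}
    progressed = True
    while progressed:
        progressed = False
        for w, q in queues.items():     # interleave the workers
            while idx[w] < len(q):
                lsn, blocks, value, nworkers = q[idx[w]]
                if lsn not in done:
                    arrived[lsn].add(w)
                    if len(arrived[lsn]) < nworkers:
                        break           # not the last worker: wait here
                    for b in blocks:    # last worker replays the record
                        state[b] = (lsn, value)
                    done.add(lsn)
                idx[w] += 1
                progressed = True
    return state

records = [({0}, "A"), ({1}, "B"), ({0, 1}, "C"), ({0}, "D")]
parallel = replay(dispatch(records))

serial = {}                             # reference: strict written order
for lsn, (blocks, value) in enumerate(records):
    for b in blocks:
        serial[b] = (lsn, value)
assert parallel == serial
```

The final state matches what strict serial replay would produce, even though records for different blocks are processed by different queues.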

Implementation Note

Worker configuration

  • Because current recovery is done in the startup process, the parallel replay workers are implemented as its child processes.
  • The following processes are used for this purpose:
    • Startup process: reads WAL records from WAL segments or the walreceiver and performs overall control,
    • Dispatching worker: analyzes and dispatches WAL records,
    • Error page registration worker: collects error pages from the other workers. The current implementation uses hash functions based on memory contexts, which are not simple to port to shared memory. This can be replaced with a simpler design if the information is managed in shared memory.
    • Transaction worker: replays WAL records which have no block information to update.
    • Block worker: replays block-level records (mostly HEAP/HEAP2).

We can have more than one block worker; the other roles consist of a single process each.
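Record routing among these workers might look like this (illustrative Python; the record shape and worker names are hypothetical):

```python
# Sketch of routing: records without block information go to the
# transaction worker; block records go to one of several block workers
# chosen from the block number. All names here are made up.

NUM_BLOCK_WORKERS = 3

def route(record):
    blocks = record.get("blocks")
    if not blocks:
        return "txn_worker"            # no block to update: commit, abort, ...
    return f"block_worker_{blocks[0] % NUM_BLOCK_WORKERS}"

assert route({"xid": 10}) == "txn_worker"
assert route({"xid": 10, "blocks": [7]}) == "block_worker_1"
```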

Shared information and locks

To pass and share information such as WAL records and worker status, dynamic shared memory defined in storage/dsm.h is used.

To protect this data, spinlocks defined in storage/spin.h are used.

Synchronization among workers

For lightweight synchronization among workers, Unix-domain UDP sockets are used. This could be replaced with storage/latch.h.
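The datagram-based sync can be sketched with a Unix-domain socket pair (illustrative Python; in the real patch each worker presumably has its own socket, so this pair is only a stand-in):

```python
import socket

# Sketch of the synchronization primitive: a one-byte datagram on a
# Unix-domain socket pair acts as a wakeup, much like a latch would.

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

def pr_send_sync(sock):
    sock.send(b"s")                  # wake up the peer worker

def pr_wait_sync(sock):
    return sock.recv(1)              # blocks until a sync arrives

pr_send_sync(a)
assert pr_wait_sync(b) == b"s"
a.close(); b.close()
```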


The following GUC parameters are added:

  • parallel_replay (bool)
  • parallel_replay_test (bool, internal only): enables additional code that lets workers be attached with gdb for testing.
  • num_preplay_workers (int): total number of replay workers.
  • num_preplay_queue (int): number of queues holding outstanding WAL records.
  • num_preplay_max_txn (int): maximum number of outstanding transactions allowed during recovery; must be max_connections or larger. On a replication standby, it must be the master's max_connections or larger.
  • preplay_buffers (int)
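As an illustration only, the parameters might be set in postgresql.conf like this (the values are made up; the page does not specify defaults):

```
parallel_replay = on
parallel_replay_test = off       # internal: gdb attach support only
num_preplay_workers = 8          # total replay workers
num_preplay_queue = 1024         # queues for outstanding WAL records
num_preplay_max_txn = 100        # must be >= (master's) max_connections
```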

Connection to GDB

Unlike a usual backend, the recovery code runs in the startup process (and in workers forked from it). Because the startup process terminates when the postmaster begins to accept connections (unless hot standby is enabled), some extra steps are needed to connect gdb to the parallel replay workers. Here's how to do it:

  • Implement a dummy function to serve as a gdb breakpoint (PRDebug_sync()).
  • When the startup process begins to fork workers, it writes messages to the debug file (located in pr_debug under PGDATA).
  • The message describes what to do in a separate shell:
    • Start gdb,
    • Attach the startup process and each worker process,
    • Set up the temporary break function,
    • Create a signal file.
  • The worker (and the startup process) then waits for the signal file to be created and calls the break function.

In this way, we can connect startup process and each worker process to gdb without pain.
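The handshake can be sketched as follows (illustrative Python; the real PRDebug_sync() is C code inside the startup process, and the file name, polling interval, and timeout here are made up):

```python
import os
import tempfile
import time

def pr_debug_break():
    pass          # dummy function: set the gdb breakpoint here

def pr_debug_sync(signal_file, poll_interval=0.01, timeout=5.0):
    # Wait until the operator's shell creates the signal file, then call
    # the break function so an attached gdb stops here.
    deadline = time.monotonic() + timeout
    while not os.path.exists(signal_file):
        if time.monotonic() > deadline:
            raise TimeoutError("no signal file")
        time.sleep(poll_interval)
    pr_debug_break()
    return True

with tempfile.TemporaryDirectory() as d:
    sig = os.path.join(d, "worker_0.signal")
    open(sig, "w").close()                    # the operator creates the file
    assert pr_debug_sync(sig)
```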

Current code

The current code is available from Koichi's GitHub repository; the branch is parallel_replay_14_6.

For more information about the tests, you can visit another of Koichi's GitHub repositories; use the master branch. Please understand that this page is not complete yet and needs more updates for general use.

Current code status

Not tested yet. The code just passes the build and is under review before testing. A couple of functions still need improvement or development, and some others are affected by this change.

Complete features are:

  • Fork additional worker processes from the startup process,
  • Assign XLogRecord entries to workers,
  • Various synchronization mechanisms to maintain the semantics of the global WAL replay order.

Remaining issues are:

  • More testing of shared buffer allocation/release, to verify the implementation assumption (if no space is available, just wait until other workers finish and free memory).
  • The recovery code still breaks (mostly due to wrongly decoded information passed from the startup process).
  • A good benchmark to show the potential performance gain.

Any cooperation on discussion, testing, or development is very welcome. You can contact:

Please understand that the repo may be incomplete because work on it is occasional.

A further addition is:

  • Add code to render WAL records in text format (like pg_waldump) and keep them in some portion of memory to help testing. xlog_outrec() looks very useful for this purpose. At present, this is enabled with the WAL_DEBUG flag.

Discussion in PGCon 2023

Slide deck is available at File:Parallel Recovery in PostgreSQL.pdf

Here are some of the inputs from PGCon 2023 presentation.

  • Parallel recovery at a standby may break if an index page redo is done before the corresponding data page redo.
    • In this case, the index record will point to a vacant page (not in use). In the current implementation of indexes, this causes an error.
    • The idea is: use parallel recovery only when hot standby is disabled. In parallel, we can change the index code to simply ignore such dangling references.
      • [Mmeent] Dangling references from indexes are signs of corruption for most indexes. Disabling the integrity checks only on replicas could be possible, but would result in a reduced ability to detect integrity violations.
  • Can workers be child processes of the postmaster?
    • A comment suggested it is possible. Making them children of the startup process looks simpler for inheriting its resources.
  • The Unix-domain socket can be replaced with a Latch.
    • This should be considered in the upcoming improvement.

I didn't receive feedback on the use of POSIX mqueue. However, POSIX mqueue is not very flexible to configure: every change needs kernel configuration, so it looks better to implement this with shared memory and sockets/latches.

Post-PGCon ideas

[Mmeent] After thinking about it a little more, we might be able to use parallel recovery on hot standby nodes:

  1. Monitor the relfilenode of the WAL records of each transaction.
  2. If the relfilenode changes, the worker waits for all the transaction's other WAL records to be replayed. This maintains the update order between data and indexes.
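This check can be sketched as (illustrative Python; the function name and relfilenode values are hypothetical):

```python
# Sketch of the post-PGCon idea: per transaction, track the last
# relfilenode seen; when it changes (e.g. heap -> index), the worker must
# first wait for the transaction's earlier records to be replayed.

last_relfilenode = {}     # xid -> relfilenode of the previous record

def needs_wait(xid, relfilenode):
    prev = last_relfilenode.get(xid)
    last_relfilenode[xid] = relfilenode
    return prev is not None and prev != relfilenode

assert needs_wait(1, 1001) is False   # first record of xid 1
assert needs_wait(1, 1001) is False   # same relfilenode: no barrier
assert needs_wait(1, 2002) is True    # relfilenode changed: wait
```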

[Save memory allocation]

  1. Most WAL records turn out to be very short, less than 128 bytes.
  2. We can add a payload area to each queue element so that such short WAL records are stored directly in the queue. Queues are allocated at the beginning of replay, so most WAL records would not need dynamic memory allocation in shared memory.
  3. This can considerably reduce the number of dynamic memory allocations, for example for HOT UPDATE, VACUUM, COMMIT/ABORT, VISIBLE, INSERT/DELETE, PRUNE, etc.

[Consistent recovery point]

  1. We can leave this to the reader worker. When the consistent recovery point is reached, we just need to synchronize so that all the queues assigned to workers have been replayed.

[Enqueue and dequeue WAL]

  1. We may need to start with a dedicated implementation, based on shared memory, spinlocks, and local sockets.