Persistent Memory for WAL

This page describes the persistent-memory-backed WAL designs and implementations proposed on pgsql-hackers.

Overview of persistent memory (PMEM)

(All subsections TBD)

Programming model

PMEM modules

Filesystem DAX

Persistent Memory Development Kit (PMDK)

Basic performance

Designs of PMEM-backed WAL

Buffers on DRAM; Segments on disk (Current non-PMEM design)

Allocates WAL buffers on DRAM
Allocates WAL segment files on disk
Opens the segment files
Inserts (memory-copies) WAL records into the buffers
Writes the records out of the buffers to the segment files
At end of checkpoint, recycles old segment files
During recovery, reads the records from the segment files

Buffers on DRAM; Segments on PMEM but not mapped

Same as the "Buffers on DRAM; Segments on disk" design, except that the segment files are placed on PMEM instead of disk. The files are not memory-mapped.

Buffers on DRAM; Segments on PMEM and mapped

Allocates WAL buffers on DRAM
Allocates WAL segment files on PMEM
Memory-maps the segment files on PMEM
Inserts (memory-copies) WAL records into the buffers
Memory-copies the records from the buffers to the mapped segment files
At end of checkpoint, recycles old segment files
During recovery, reads the records from the segment files
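
The mapped-segment steps above can be sketched as follows. This is a minimal Python simulation on an ordinary file, not PostgreSQL code: the function names and the small SEGMENT_SIZE are hypothetical, and mmap.flush() (an msync on the mapping) only stands in for the CPU cache flush that an implementation on real PMEM would issue, for example through PMDK.

```python
import mmap

# Hypothetical small size for the sketch; PostgreSQL's default segment is 16 MiB.
SEGMENT_SIZE = 16 * 1024


def create_segment(path, size=SEGMENT_SIZE):
    """Allocate a zero-filled segment file of a fixed size."""
    with open(path, "wb") as f:
        f.truncate(size)


def copy_records_to_segment(path, offset, data):
    """Map the segment, memory-copy `data` at `offset`, and flush the mapping.

    Returns the offset just past the copied bytes.
    """
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapped:
            # Plain memory copy into the mapping; no write() system call.
            mapped[offset:offset + len(data)] = data
            # Stand-in for making the copied bytes durable.
            mapped.flush()
    return offset + len(data)
```

The point of the design is that the copy into the mapping replaces the write() call of the non-PMEM design, while the DRAM buffers stay as they are.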

Buffers on PMEM; Segments on disk; Asynchronous write to disk

Allocates WAL buffers on PMEM
Allocates WAL segment files on disk
Opens the segment files
Inserts (memory-copies) WAL records into the buffers
Flushes the records out of CPU caches to PMEM, then asynchronously writes the records from the buffers to the segment files
At end of checkpoint, recycles old segment files
During recovery, reads the records from both the segment files and the buffers

Segments on PMEM mapped for buffers

Does NOT allocate WAL buffer pages on DRAM (note that the xlblocks array is still allocated on DRAM)
Allocates WAL segment files on PMEM
Memory-maps the segment files on PMEM for WAL buffers
Inserts (memory-copies) WAL records into the PMEM-backed buffers
Flushes the records out of CPU caches to the segment files
At end of checkpoint, recycles old segment files
During recovery, reads the records from the segment files

No segment; Single file on PMEM mapped for buffers

Does NOT allocate WAL buffer pages on DRAM (note that the xlblocks array is still allocated on DRAM)
Allocates a single (typically large) file on PMEM
Memory-maps the single file on PMEM for WAL buffers
Inserts (memory-copies) WAL records into the PMEM-backed buffers
Flushes the records out of CPU caches to the single file; does not write them out to WAL segment files
At end of checkpoint, recycles old segment units in the buffers
During recovery, reads the records from the mapped single file
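
One way to picture the single-file design is as a ring buffer indexed by LSN. The sketch below is an assumption made for illustration and is not taken from any patchset: lsn_to_offset, can_insert, and their parameters are hypothetical names, and real WAL also carries page and segment headers that are ignored here.

```python
def lsn_to_offset(lsn, buffer_size):
    """Map a byte position in the WAL stream to an offset in the mapped file."""
    return lsn % buffer_size


def can_insert(insert_lsn, record_len, oldest_needed_lsn, buffer_size):
    """Return True if a record fits without overwriting records still needed.

    `oldest_needed_lsn` is the start of the oldest unit that has not been
    recycled yet (hypothetical bookkeeping for this sketch).
    """
    return insert_lsn + record_len - oldest_needed_lsn <= buffer_size
```

Under this model, recycling old units at the end of a checkpoint advances oldest_needed_lsn, which is what frees room for new records.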

Proposed designs and patchsets for PMEM-backed WAL

Use WAL segments as WAL buffers

Discussion: Here
- An old discussion is here (search for "Use-WAL-segments-as-WAL-buffers")
Design: Segments on PMEM mapped for buffers
Implementation:
- At end of recovery, the startup process allocates WAL segment files and initializes segment/page headers on them. This takes some time. How many segment files are allocated and initialized depends on min_wal_size.
- The walwriter process periodically allocates the segment files and initializes the headers on them to advance WAL buffers.
- Each backend/checkpointer process maps the segment files for the buffers when it inserts records into the buffers. If the segment file that the process needs has not been allocated or initialized yet, the process allocates and initializes it.
- On commit or at end of checkpoint, each backend/checkpointer process flushes the records that have not been flushed yet out of CPU caches to PMEM.
  - The existing struct XLogCtl->LogwrtResult remembers how far the records have been flushed out of CPU caches (.Write) and have reached PMEM (.Flush).
  - The records that a certain backend/checkpointer process is flushing may have been inserted by other backend/checkpointer processes.

(Abandoned) Non-volatile WAL buffer

Discussion: Here (Search for "Non-volatile-WAL-buffer")
Design: No segment; Single file on PMEM mapped for buffers
Implementation:
- The postmaster process maps a single (typically large) file (nvwal_path in postgresql.conf) on PMEM for WAL buffers. Every child process uses it.
  - The total size of the buffers (nvwal_size in postgresql.conf) is a hard limit. The buffers neither grow nor shrink.
- The startup process initializes the WAL buffers at end of recovery. This takes some time.
- On commit or at end of checkpoint, a backend/checkpointer process flushes the records that have not been flushed yet out of CPU caches to PMEM.
  - A new field XLogCtl->flushedUpTo is used to remember the LSN up to which records have been flushed.
  - The records a backend/checkpointer process is flushing may have been inserted by that process or other backend/checkpointer processes.
  - The WALWriteLock is not used because inserted records are already "on PMEM" and do not need to be moved anywhere else. All that is needed is a CPU cache flush, which does not require such a lock.
- WAL segment files are used only in the following exceptional cases:
  - When all the WAL buffers are filled, a backend/checkpointer process inserting records writes the oldest segment unit in the buffers to a segment file, then clears the unit for new records. (Recall that the total size of the buffers is a hard limit.) All other backend/checkpointer processes block until the write and the clear finish.
  - In WAL archive mode, the walwriter process writes the fixed segment units on the buffers to the segment files.
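
The commit-time flush described above can be modeled in a few lines. This is a single-process Python sketch under stated assumptions, not the patch's code: FlushTracker and flush_up_to are hypothetical names, only the flushed_up_to attribute mirrors the XLogCtl->flushedUpTo field, and the concurrency control that a shared field needs across processes is left out.

```python
class FlushTracker:
    """Tracks how far WAL records have been flushed out of CPU caches."""

    def __init__(self):
        self.flushed_up_to = 0  # models XLogCtl->flushedUpTo

    def flush_up_to(self, commit_lsn):
        """Return the (start, end) LSN range that needs a cache flush.

        Returns None when an earlier flush already covered `commit_lsn`.
        The range may include records inserted by other processes.
        """
        if commit_lsn <= self.flushed_up_to:
            return None
        flush_range = (self.flushed_up_to, commit_lsn)
        # A real implementation would flush CPU caches for this range here.
        self.flushed_up_to = commit_lsn
        return flush_range
```

For example, after one process flushes up to LSN 100, another process committing at LSN 50 has nothing left to flush.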

(Abandoned) Applying PMDK to WAL operations for persistent memory

Discussion: Here
Design: Buffers on DRAM; Segments on PMEM and mapped
Implementation:
- Each backend/checkpointer process maps WAL segment files when it is writing (memory-copying) buffer pages to the segment files.

Non-volatile Memory Logging

Slides: Here (PGCon 2016)
Design: Buffers on PMEM; Segments on disk; Asynchronous write to disk

Performance tips

DO NOT use kernel versions with known issues

Recommended: Use Linux kernel 5.4.0 or later.

DO NOT use Linux kernel 4.20.0, 5.0.0, 5.1.0, or 5.2.0: These versions have the "RocksDB can hang indefinitely when using a DAX file" issue, and PostgreSQL may also encounter it. Bad commit: dax: Convert page fault handlers to XArray; fixing commit: dax: Fix missed wakeup with PMD faults.
DO NOT use Linux kernel 5.3.0: This version has a performance regression. Bad commit: dax: Fix missed wakeup with PMD faults; fixing commit: fs/dax: Fix pmd vs pte conflict detection.

If you use an old stable kernel < 5.4.0 or a custom-patched kernel < 5.4.0, please verify whether your kernel includes the fixing commits for the known bad commits.
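
As a rough first check, the version list above can be turned into a small classifier. This is a hedged sketch: classify_kernel and its return labels are hypothetical, and it looks only at the major and minor version numbers, so it cannot tell whether a stable or custom-patched kernel has the fixing commits backported.

```python
import re

# Versions taken from the list above, as (major, minor) pairs.
HANG_VERSIONS = {(4, 20), (5, 0), (5, 1), (5, 2)}
REGRESSION_VERSIONS = {(5, 3)}
RECOMMENDED_MIN = (5, 4)


def classify_kernel(release):
    """Classify a kernel release string such as '5.4.0-42-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    version = (int(match.group(1)), int(match.group(2)))
    if version >= RECOMMENDED_MIN:
        return "recommended"
    if version in HANG_VERSIONS:
        return "dax-hang"
    if version in REGRESSION_VERSIONS:
        return "dax-regression"
    # Not named on this page; check the commits in the kernel source instead.
    return "unlisted"
```

On Linux, the release string of the running kernel is available from platform.release() or uname -r.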

Configure and verify DAX hugepage faults

Filesystem DAX supports 2 MiB hugepage faults, which reduce the number of page faults and TLB (Translation Lookaside Buffer) misses and so increase PMEM performance. If performance is lower than expected, regular 4 KiB faults may be occurring instead of hugepage faults.

It is recommended to create multiple namespaces and use raw PMEM devices pmem0, pmem1, and so on. In contrast, it is not recommended to create any partitions on top of PMEM such as pmem0p1. If you have to create such partitions, you must ensure that those partitions are 2 MiB-aligned.

See https://nvdimm.wiki.kernel.org/2mib_fs_dax for how to configure and verify DAX hugepage faults.
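
If you do have to partition a PMEM device, the alignment requirement is simple arithmetic on the partition's start offset. The helper below is a minimal sketch: is_2mib_aligned is a hypothetical name, it checks only the start offset, and the 512-byte sector size is an assumption that you should match to the unit your partitioning tool reports.

```python
TWO_MIB = 2 * 1024 * 1024


def is_2mib_aligned(start_sector, sector_size=512):
    """Return True if a partition starting at `start_sector` is 2 MiB-aligned."""
    return (start_sector * sector_size) % TWO_MIB == 0
```

With 512-byte sectors, a partition starting at sector 4096 begins exactly 2 MiB into the device and is aligned, while one starting at sector 2048 (1 MiB) is not.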

Run postgres server processes on the same NUMA node as PMEM

Intra-NUMA access is faster than inter-NUMA access. Because PMEM itself is very fast, the extra latency of crossing NUMA nodes is noticeable, so controlling NUMA placement is important for stable performance that reaches the hardware's potential.

For example, if the PMEM that postgres server processes will use is on NUMA node 0, pin the server processes to node 0 as follows:

$ numactl -N 0 -m 0 -- pg_ctl start

To show your machine's NUMA nodes, run numactl --hardware. To show which node your PMEM is on, run ndctl list -v.

Verify no segment file is used (only for "Non-volatile WAL buffer")

If all the WAL buffers fill up and segment files are used as the exceptional fallback, performance degrades significantly.

To detect this, watch the pg_wal directory for any newly created segment file, or watch the server log files for WARNING messages of the form "old segment written to file: up to %X/%X".

To avoid this, configure checkpoints to run and clear the buffers before they fill up, or make the buffers large enough that they do not fill up in a short time.
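
Watching pg_wal can be scripted. The sketch below is one hedged way to do it: find_segment_files is a hypothetical helper, and it relies on WAL segment files having names made of 24 uppercase hexadecimal characters, which also excludes entries such as the archive_status directory and timeline history files.

```python
import os
import re

# WAL segment file names are 24 uppercase hexadecimal characters.
SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")


def find_segment_files(pg_wal_dir):
    """Return the sorted names of WAL segment files found in `pg_wal_dir`."""
    return sorted(
        name
        for name in os.listdir(pg_wal_dir)
        if SEGMENT_NAME.match(name)
        and os.path.isfile(os.path.join(pg_wal_dir, name))
    )
```

A non-empty result during a benchmark run suggests that the buffers filled up and the fallback to segment files was taken.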

Other DBMSes using PMEM for WAL

Microsoft SQL Server

Microsoft SQL Server 2016 SP1 or later has a PMEM feature called "Persistent Log Buffer" or "Tail-of-Log Caching." Its design looks similar to "Buffers on PMEM; Segments on disk; Asynchronous write to disk."

Oracle Database

The Oracle Exadata X8M family has a feature called "Exadata Smart PMEM Log." The Storage Server has a "PMEM Log Buffer." The Database Server writes log records to this buffer via RDMA, and the Storage Server writes them to disk in the background. This design looks similar to "Buffers on PMEM; Segments on disk; Asynchronous write to disk."

Exadata with Persistent Memory: An Epic Journey (PDF; See pp.23-27; Linked from here)
Exadata uses Persistent Memory for Fast Transactions
Datasheet: Oracle Exadata Database Machine X8M-2 (PDF; Linked from here)

MariaDB

TBD