Left-Right Lock (LRLock)

Request for Comments [RFC] NOT MERGED

Summary

Left-right lock (LRLock) is a concurrency primitive for PostgreSQL providing wait-free reads at the cost of maintaining two copies of the protected data structure. Readers proceed without acquiring any blocking lock -- the read path consists of a single atomic epoch counter increment, a full memory fence, and a pointer load. Writers modify one copy, publish via an atomic pointer swap, wait for pre-swap readers to depart, then replay queued operations on the stale copy to bring it up to date. The algorithm is based on the left-right concurrency primitive described by Pedro Ramalhete and Andreia Correia, and as implemented in Jon Gjengset's Rust left-right crate. Three use cases are provided: ProcArray snapshot (wait-free GetSnapshotData), replication slot xmin computation, and buffer mapping hash table.

Status

This is an early-stage proposal. The patch has not yet been posted to pgsql-hackers or submitted to a CommitFest.

Patch status: Work in progress (RFC)
Target: PostgreSQL 19 (speculative)
Source branch: gburd/postgres, branch lrlck (4 commits atop master)
Tests: All 245 core regression subtests pass; dedicated test_lrlock module passes under concurrent load. Valgrind clean (67 processes, 0 errors).

Background and Motivation

LWLock Scalability (2005-present)

PostgreSQL's LWLock infrastructure has been the standard reader-writer lock since Tom Lane's 2005 redesign introduced the current shared/exclusive model. LWLock acquisition in shared mode involves an atomic compare-and-swap on the lock state word, which scales well at low core counts but becomes a bottleneck when many backends simultaneously contend for shared access to the same lock -- the cache line containing the lock state bounces between cores.

Robert Haas and others identified ProcArrayLock shared-mode contention as a scalability issue on large systems (64+ cores), where GetSnapshotData() is called per-statement and hundreds of backends simultaneously attempt to acquire ProcArrayLock in shared mode.

ProcArray performance issues (Robert Haas, 2020)
Snapshot scalability (Andres Freund, 2020)

Epoch-Based Reclamation and Read-Copy-Update

The Linux kernel's RCU (Read-Copy-Update) demonstrated that separating read and write paths -- allowing readers to proceed without synchronization while writers defer cleanup until all readers have departed -- can dramatically improve scalability for read-dominated data structures. However, RCU requires a kernel-level quiescent state detection mechanism that is not available to user-space processes communicating through shared memory.

The left-right pattern achieves a similar separation using per-reader epoch counters in shared memory as the quiescent state signal. When a reader's epoch changes, it has exited (or re-entered with the new pointer), so the writer knows the old copy is no longer referenced.

Left-Right Concurrency Primitive

Ramalhete and Correia formalized the left-right pattern in their 2015 paper, proving that the reader path is population-oblivious and wait-free: its progress does not depend on the number or behavior of other threads. The key insight is that maintaining two copies allows readers to always have a consistent view without coordinating with the writer, while the writer's cost (applying each mutation twice) is acceptable when reads vastly outnumber writes.

Jon Gjengset's Rust left-right crate provided a practical implementation demonstrating the pattern's applicability to database-style workloads (concurrent hash maps, caches). This PostgreSQL implementation adapts the same protocol to C and shared memory.

Ramalhete & Correia, "Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads", 2015
Rust left-right crate documentation
Rust left-right source repository

Algorithm Overview

 Reader (wait-free):                    Writer (single, serialized):
 ┌────────────────────────┐             ┌─────────────────────────────────┐
 │ 1. Set bitmask bit     │             │ 1. Acquire writer mutex         │
 │ 2. epoch++ (even->odd)  │             │ 2. Apply ops to write copy      │
 │    [AcqRel]            │             │ 3. Memory barrier               │
 │ 3. SeqCst fence        │             │ 4. Swap read_idx (0<->1)          │
 │ 4. Load read_idx       │             │ 5. SeqCst fence                 │
 │    [Acquire]           │             │ 6. Snapshot epoch counters       │
 │ 5. Read from           │             │    (bitmask-guided)             │
 │    data[read_idx]      │             │ 7. Wait for pre-swap readers    │
 │ 6. epoch++ (odd->even)  │             │    to depart                    │
 │    [AcqRel]            │             │ 8. Replay oplog on stale copy   │
 │ 7. Clear bitmask bit   │             │    (or full sync)               │
 └────────────────────────┘             │ 9. Clear oplog                  │
                                        │10. Release writer mutex          │
                                        └─────────────────────────────────┘

 Shared state:
 ┌───────────────────────────────────────────────────────────────────────┐
 │ data[0]         data[1]         read_idx (0 or 1)                    │
 │ ┌─────────┐    ┌─────────┐                                          │
 │ │ Copy A  │    │ Copy B  │    epochs[MaxBackends] (cache-line padded)│
 │ └─────────┘    └─────────┘    active_readers_mask[ceil(N/64)]        │
 │                               last_seen_epochs[MaxBackends]          │
 │                               oplog (growable buffer)                │
 └───────────────────────────────────────────────────────────────────────┘

Key Properties

Property	LWLock (shared mode)	LRLock (read path)
Blocking	Yes (CAS on shared state word)	No (wait-free)
Cache-line contention	All readers contend on lock word	Each reader touches only its own epoch line
Memory cost	1x data structure	2x data structure + epoch array
Write cost	Single write under exclusive lock	Write applied twice (both copies) + drain wait
Concurrent writers	Serialized by exclusive mode	Serialized by spinlock (or external lock)
Nested reads	Automatic (shared count)	Supported (nesting counter, skip fence on re-entry)

Active-Reader Bitmask

A naive implementation scans all max_backends epoch counters (each on its own 64-byte cache line) during the writer's drain-wait loop. With 1000+ backends, this means touching 64+ KB of memory per publish. The active-reader bitmask reduces this to O(active_readers) cache line accesses:

One pg_atomic_uint64 word per 64 backends (typically 2-4 words total)
Reader sets its bit BEFORE incrementing its epoch (AcqRel ordering)
Reader clears its bit AFTER its epoch returns to even
Writer's lrlock_snapshot_epochs() reads bitmask words first, then only fetches epoch values for backends with bits set
Idle backends (bit clear) get last_seen_epochs[i] = 0 (even), which the drain-wait loop skips without touching their 64-byte epoch cache line

This reduces cache pressure from O(max_backends x 64B) per drain iteration to O(active_readers x 64B + max_backends/64 x 8B).

Per-Operation Log (Oplog)

Instead of requiring the writer to maintain both copies manually, the LRLock provides an operation log:

Writer calls LRLockApplyOp(lock, &op, sizeof(op))
The operation is immediately applied to the write copy via apply_fn
The operation is serialized into the oplog buffer
On LRLockPublish(), after the pointer swap and drain-wait, the oplog is replayed on the now-stale copy
Both copies are now in sync; the oplog is cleared

This allows incremental updates (40 bytes for a single xid change) instead of full copies (6 KB for the entire ProcArray snapshot), reducing write-path overhead when the data structure is large but individual mutations are small.

When the oplog accumulates many entries (heuristic: oplog_count * 256 > data_size), the writer falls back to a full sync_fn copy instead of replaying individual operations, as sequential memcpy is faster than random-access replay for large operation counts.

Use Cases

ProcArray Snapshot

The primary motivating use case. GetSnapshotData() is called per-statement on every backend to construct a transaction visibility snapshot. It reads four parallel arrays from ProcGlobal (xids, subxidStates, statusFlags, pgprocnos) plus scalar fields (latestCompletedXid, xactCompletionCount, numProcs).

Before (LWLock path):
  GetSnapshotData:
    LWLockAcquire(ProcArrayLock, LW_SHARED)    <-- contention point at high core counts
    read numProcs, arrays, scalars
    LWLockRelease(ProcArrayLock)

After (LRLock path):
  GetSnapshotData:
    snap = LRLockReadBegin(ProcArrayLRLock)     <-- wait-free: epoch++ + fence + load
    read numProcs, arrays, scalars from snap
    LRLockReadEnd(ProcArrayLRLock)              <-- epoch++ only

Write path: All ProcArray mutations (ProcArrayAdd, ProcArrayRemove, ProcArrayEndTransaction, ProcArrayGroupClearXid) continue to hold ProcArrayLock exclusively. After updating the live arrays, they publish to the LRLock:

Structural changes (add/remove backend, group clear): call ProcArrayPublishSnapshot() which copies all four arrays into the write copy and calls LRLockPublishFullSync().
Single-slot XID changes (ordinary commit/abort): call ProcArrayPublishXidChange(pgxactoff) which records a 40-byte ProcArrayXidOp in the oplog, avoiding the 6 KB full copy.

The LRLock snapshot is a contiguous shared memory region containing:

typedef struct ProcArraySnapshotHdr
{
    FullTransactionId  latestCompletedXid;
    uint64             xactCompletionCount;
    int32              numProcs;
    int32              maxProcs;
    /* followed by 4 parallel arrays of maxProcs entries each: */
    /* TransactionId    xids[maxProcs]          */
    /* XidCacheStatus   subxidStates[maxProcs]  */
    /* uint8            statusFlags[maxProcs]   */
    /* int              pgprocnos[maxProcs]     */
} ProcArraySnapshotHdr;

Total size: ~6 KB for typical max_connections (header + 4 arrays x PROCARRAY_MAXPROCS).
Two copies maintained by LRLock: ~12 KB total shared memory.

The hot standby (recovery) path retains the existing ProcArrayLock shared approach unchanged, as it involves KnownAssignedXids bookkeeping that requires the traditional lock protocol.

Replication Slot Xmin Array

ReplicationSlotsComputeRequiredXmin() computes the oldest effective_xmin and effective_catalog_xmin across all replication slots. It is called from 14+ sites per-transaction. The existing path acquires ReplicationSlotControlLock shared and iterates all slots under per-slot spinlocks.

The LRLock replaces this with a compact per-slot snapshot array:

typedef struct ReplicationSlotXminEntry
{
    TransactionId  effective_xmin;
    TransactionId  effective_catalog_xmin;
    bool           in_use;
    bool           invalidated;
} ReplicationSlotXminEntry;

Data size: max_replication_slots x sizeof(ReplicationSlotXminEntry)
         = 15 slots x 12 bytes = 180 bytes per copy

The write path (ReplicationSlotPublishXmin()) is called after any change to slot xmin state (create, drop, release, invalidate, xmin advance). It records a single SlotXminOp (slot index + new entry) via LRLockApplyOp.

When already_locked is false (the common case), ReplicationSlotsComputeRequiredXmin reads from the LRLock without any lock acquisition. The already_locked=true path (caller holds exclusive locks) retains the existing spinlock-based iteration.

Buffer Mapping Hash Table

Replaces the 128-way partitioned LWLock array with a single LRLock protecting a unified open-addressing hash table. Readers (BufTableLookup) use LRLock read epochs; writers (BufTableInsert, BufTableDelete) use the LRLock write path with oplog-recorded insert/delete operations.

The BufTableReadBegin/BufTableReadEnd pair allows holding the read guard across a lookup + PinBuffer sequence, eliminating a TOCTOU race where a buffer could be invalidated between lookup and pin.

API

Creation

/* Simple creation (allocates sub-structures from ShmemAlloc) */
LRLock *LRLockCreate(Size data_size, LRLockApplyFn apply_fn,
                     LRLockSyncFn sync_fn, const char *name);

/* In-place initialization from a pre-allocated contiguous block */
LRLock *LRLockInitInPlace(void *block, Size data_size,
                          LRLockApplyFn apply_fn, LRLockSyncFn sync_fn,
                          int max_backends, Size oplog_capacity,
                          const char *name);

/* Shared memory size calculation */
Size LRLockShmemSize(Size data_size, int max_backends, Size oplog_capacity);

Reader API (wait-free)

const void *LRLockReadBegin(LRLock *lock);
void        LRLockReadEnd(LRLock *lock);

Readers call LRLockReadBegin to obtain a read-only pointer to the current read copy. The pointer is valid until LRLockReadEnd. Nested reads are supported: inner calls skip the expensive epoch bump and fence, performing only an Acquire pointer load.

Writer API (single writer)

void       *LRLockWriteBegin(LRLock *lock);
void        LRLockApplyOp(LRLock *lock, const void *operation, Size op_size);
void        LRLockPublish(LRLock *lock);
void        LRLockPublishFullSync(LRLock *lock);
void        LRLockWriteEnd(LRLock *lock);

LRLockWriteBegin: acquires write access, returns mutable pointer to write copy
LRLockApplyOp: records an operation in the oplog and applies it to the write copy immediately
LRLockPublish: swaps pointers, waits for pre-swap readers to depart, replays oplog on stale copy
LRLockPublishFullSync: like Publish, but unconditionally calls sync_fn on the stale copy (use when write copy was modified directly without oplog)
LRLockWriteEnd: releases write access

Callbacks

/* Apply one operation to a data copy (must be deterministic) */
typedef void (*LRLockApplyFn)(void *data, const void *operation, Size op_size);

/* Full synchronization: byte-for-byte copy from src to dst */
typedef void (*LRLockSyncFn)(void *dst, const void *src, Size data_size);

Correctness Argument

The protocol's correctness rests on four ordering guarantees:

Reader epoch visibility: The reader's epoch increment (even to odd) uses AcqRel ordering, followed by a SeqCst fence before loading read_idx. This ensures the writer sees the reader's odd epoch before the reader loads the pointer. If the writer observes an even epoch, the reader has either not entered yet or has already departed -- either way, the reader will (re-)load the correct pointer.

Writer swap visibility: The writer's pg_atomic_exchange_u32(&read_idx, new_value) is followed by a SeqCst fence before snapshotting epochs. This ensures any reader that loads the new pointer does so after the swap is globally visible -- the writer won't mistake it for a pre-swap reader.

Drain completeness: The writer waits until every epoch that was odd at snapshot time has changed. Since an epoch changes only when the reader exits (odd to even) or re-enters (entering reads the new pointer), all pre-swap readers have either departed or re-entered with the new pointer. The old copy is guaranteed to have no readers.

Oplog replay ordering: The oplog is replayed on the stale copy after all pre-swap readers have departed. Since no reader references the stale copy at this point, replay races with no one. After replay, both copies are identical, maintaining the invariant for the next publish.

Performance

Methodology

pgbench with three workloads (SELECT-only simple protocol, TPC-B default, SELECT-only prepared protocol), 7 client counts (1, 2, 4, 8, 16, 32, 64), 3 iterations of 30 seconds each per data point, scale factor 50. Configuration: fsync=off, synchronous_commit=off, shared_buffers=256MB, max_connections=100. Both branches built from the same base (PostgreSQL 19devel) with identical compiler flags.

Results (RISC-V 8-core)

Hardware: SiFive 8-core RISC-V, 8 GB RAM, Linux. Representative of an in-order, cache-coherent multiprocessor where atomic contention costs are high relative to x86:

Workload	Clients	master TPS	lrlck TPS	Delta
SELECT-only (simple)	2	14,820	15,180	+2.4%
SELECT-only (simple)	4	27,530	28,250	+2.6%
SELECT-only (simple)	8	48,960	49,100	+0.3%
TPC-B (mixed R/W)	2	3,420	3,510	+2.6%
TPC-B (mixed R/W)	8	6,180	6,270	+1.5%
TPC-B (mixed R/W)	64	5,890	6,070	+3.0%
SELECT-only (prepared)	4	31,200	32,050	+2.7%
SELECT-only (prepared)	8	55,400	57,170	+3.2%
SELECT-only (prepared)	16	79,800	80,820	+1.3%

The largest gains appear in read-heavy workloads at moderate concurrency (4-8 clients), where ProcArrayLock shared-mode contention is measurable but the system is not yet saturated on other resources. At high client counts (32-64), gains are smaller as other bottlenecks (buffer contention, scheduling) dominate.

No regressions were observed at any client count across all workloads.

Trade-offs and Limitations

When LRLock is appropriate

Read-to-write ratio is high (10:1 or greater)
Protected data structure is small enough that 2x memory is acceptable
Reads are latency-sensitive and occur on every transaction/statement
Write operations can tolerate being applied twice (deterministic)
A single writer (or external writer serialization) is acceptable

When LRLock is NOT appropriate

Write-heavy workloads (writer path is strictly more expensive than LWLock exclusive)
Large data structures where 2x memory is unacceptable
Multiple concurrent writers without external serialization
Non-deterministic operations that cannot be replayed identically

Memory Overhead

Component	Size (typical max_connections=100)
LRLock struct	128 bytes (MAXALIGN'd)
Two data copies	2 x data_size
Epoch array	max_backends x 64 bytes (cache-line padded)
Active-reader bitmask	ceil(max_backends/64) x 8 bytes
Last-seen epochs	max_backends x 4 bytes
Operation log	Configurable (256 bytes for ProcArray)

For the ProcArray use case with 100 backends: ~6 KB x 2 (data) + 6.4 KB (epochs) + 400 bytes (last_seen) + 16 bytes (bitmask) + 256 bytes (oplog) = ~19 KB total shared memory.

Implementation Files

File	Description
`src/include/storage/lrlock.h`	Public API: LRLock type, callback typedefs, function declarations
`src/backend/storage/lmgr/lrlock.c`	Full implementation: LRLockEpoch, LRLockOpHeader, struct LRLock, reader/writer paths, epoch snapshot, drain-wait, oplog replay
`src/backend/storage/ipc/procarray.c`	ProcArray LRLock: ProcArraySnapshotHdr, ProcArrayXidOp, apply/sync callbacks, ProcArrayPublishSnapshot, ProcArrayPublishXidChange, LRLock fast paths in GetSnapshotData/ComputeXidHorizons/TransactionIdIsInProgress
`src/backend/storage/buffer/buf_table.c`	Buffer mapping LRLock: BufLRHashTable, BufTableReadBegin/End, BufTableWriteBegin/Publish/End, open-addressing hash with oplog
`src/backend/storage/buffer/bufmgr.c`	Buffer manager integration: LRLock read guards around buffer lookups and pins
`src/backend/replication/slot.c`	Replication slot xmin LRLock: ReplicationSlotXminEntry, SlotXminOp, ReplicationSlotPublishXmin, wait-free ReplicationSlotsComputeRequiredXmin
`src/include/replication/slot.h`	ReplicationSlotXminEntry struct, ReplicationSlotXminLock extern
`src/test/modules/test_lrlock/`	Regression test module exercising correctness under concurrent load

Commit Series

The implementation is organized as four incremental commits on the lrlck branch:

Introduce left-right lock (LRLock) concurrency primitive -- Core implementation: lrlock.c, lrlock.h, test_lrlock module. Provides wait-free reader path, active-reader bitmask, per-operation oplog, and cache-line-padded epoch array.
Use left-right lock for buffer mapping hash table -- Replaces 128-way partitioned LWLock array with single LRLock. Open-addressing hash table with oplog-recorded insert/delete operations.
Use left-right lock for replication slot xmin array -- Wait-free ReplicationSlotsComputeRequiredXmin via compact per-slot xmin snapshot under LRLock.
Use left-right lock when accessing the ProcArray -- Wait-free GetSnapshotData via ProcArraySnapshotHdr under LRLock. Incremental 40-byte oplog for single-commit fast path; full sync for structural changes.

Open Issues

Known design limitations

Writer spinlock held across publish: The writer mutex is a slock_t (spinlock) held during the entire write-publish cycle, including the drain-wait loop. If the writer is descheduled while spinning, other writers burn CPU. The ProcArray use case mitigates this because writers already hold ProcArrayLock exclusive (making the inner spinlock redundant in practice), but the generic primitive should use an LWLock or semaphore.
oplog_grow calls ShmemAlloc at runtime: The oplog growth path allocates from shared memory after postmaster startup, which is architecturally problematic (shared memory is fixed-size on most platforms). The leaked old allocation cannot be freed. In practice the ProcArray oplog is pre-sized to 256 bytes and should never grow, but the generic code path should either pre-allocate or hard-fail.
No wait event reporting: The writer drain-wait loop (lrlock_wait_for_readers) does not call pgstat_report_wait_start/end, making writer stalls invisible to pg_stat_activity.
pg_usleep(1) in drain-wait: After the spin limit, the writer calls pg_usleep(1) which on Linux with HZ=250 sleeps for at least 4ms. A condition-variable or futex-based wait would provide better latency.

Needed before submission

Large-machine benchmarking: Demonstrate ProcArrayLock shared-mode contention on 64+ core hardware with 256+ clients, where the wait-free reader path provides the most benefit.
perf profile evidence: Capture flamegraphs showing time spent in LWLockAcquire(ProcArrayLock, LW_SHARED) before vs after on large systems.
Wait event instrumentation: Add pgstat_report_wait_start/end to the writer drain-wait loop.
Writer mutex upgrade: Replace the spinlock-based writer serialization with an LWLock or semaphore.

Future extensions

CLOG and SUBTRANS: Evaluate whether CLOG lookups (TransactionIdGetStatus) and SUBTRANS page reads benefit from LRLock protection.
Buffer mapping evaluation: The buffer mapping use case (commit 2/4) replaces a well-tuned 128-partition LWLock scheme and has shown no measurable benefit on 8-core hardware. It may need to be dropped or justified with larger-machine profiles.

Credits

Author: Greg Burd
Reviewers: (seeking review)

References

Research

P. Ramalhete and A. Correia, "Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads", 2015.
J. Gjengset, "left-right: A lock-free, read-optimized, concurrency primitive", Rust crate, 2018-present.
M. Herlihy, "Wait-Free Synchronization", ACM TOPLAS 13(1), 1991.
P. McKenney and J. Slingwine, "Read-Copy Update: Using Execution History to Solve Concurrency Problems", PDCS 1998.

PostgreSQL Mailing List

ProcArray performance issues (Robert Haas, 2020)
Snapshot scalability (Andres Freund, 2020)

Source Code

Left Right Lock Primitive