Left Right Lock Primitive
Left-Right Lock (LRLock)
Request for Comments [RFC] NOT MERGED
Summary
Left-right lock (LRLock) is a concurrency primitive for PostgreSQL providing wait-free reads at the cost of maintaining two copies of the protected data structure. Readers proceed without acquiring any blocking lock -- the read path consists of a single atomic epoch counter increment, a full memory fence, and a pointer load. Writers modify one copy, publish via an atomic pointer swap, wait for pre-swap readers to depart, then replay queued operations on the stale copy to bring it up to date. The algorithm is based on the left-right concurrency primitive described by Pedro Ramalhete and Andreia Correia, and as implemented in Jon Gjengset's Rust left-right crate. Three use cases are provided: ProcArray snapshot (wait-free GetSnapshotData), replication slot xmin computation, and buffer mapping hash table.
Status
This is an early-stage proposal. The patch has not yet been posted to pgsql-hackers or submitted to a CommitFest.
- Patch status: Work in progress (RFC)
- Target: PostgreSQL 19 (speculative)
- Source branch: gburd/postgres, branch lrlck (4 commits atop master)
- Tests: All 245 core regression subtests pass; dedicated
test_lrlockmodule passes under concurrent load. Valgrind clean (67 processes, 0 errors).
Background and Motivation
LWLock Scalability (2005-present)
PostgreSQL's LWLock infrastructure has been the standard reader-writer lock since Tom Lane's 2005 redesign introduced the current shared/exclusive model. LWLock acquisition in shared mode involves an atomic compare-and-swap on the lock state word, which scales well at low core counts but becomes a bottleneck when many backends simultaneously contend for shared access to the same lock -- the cache line containing the lock state bounces between cores.
Robert Haas and others identified ProcArrayLock shared-mode contention as a scalability issue on large systems (64+ cores), where GetSnapshotData() is called per-statement and hundreds of backends simultaneously attempt to acquire ProcArrayLock in shared mode.
- ProcArray performance issues (Robert Haas, 2020)
- Snapshot scalability (Andres Freund, 2020)
Epoch-Based Reclamation and Read-Copy-Update
The Linux kernel's RCU (Read-Copy-Update) demonstrated that separating read and write paths -- allowing readers to proceed without synchronization while writers defer cleanup until all readers have departed -- can dramatically improve scalability for read-dominated data structures. However, RCU requires a kernel-level quiescent state detection mechanism that is not available to user-space processes communicating through shared memory.
The left-right pattern achieves a similar separation using per-reader epoch counters in shared memory as the quiescent state signal. When a reader's epoch changes, it has exited (or re-entered with the new pointer), so the writer knows the old copy is no longer referenced.
Left-Right Concurrency Primitive
Ramalhete and Correia formalized the left-right pattern in their 2015 paper, proving that the reader path is population-oblivious and wait-free: its progress does not depend on the number or behavior of other threads. The key insight is that maintaining two copies allows readers to always have a consistent view without coordinating with the writer, while the writer's cost (applying each mutation twice) is acceptable when reads vastly outnumber writes.
Jon Gjengset's Rust left-right crate provided a practical implementation demonstrating the pattern's applicability to database-style workloads (concurrent hash maps, caches). This PostgreSQL implementation adapts the same protocol to C and shared memory.
- Ramalhete & Correia, "Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads", 2015
- Rust left-right crate documentation
- Rust left-right source repository
Algorithm Overview
Reader (wait-free): Writer (single, serialized):
┌────────────────────────┐ ┌─────────────────────────────────┐
│ 1. Set bitmask bit │ │ 1. Acquire writer mutex │
│ 2. epoch++ (even->odd) │ │ 2. Apply ops to write copy │
│ [AcqRel] │ │ 3. Memory barrier │
│ 3. SeqCst fence │ │ 4. Swap read_idx (0<->1) │
│ 4. Load read_idx │ │ 5. SeqCst fence │
│ [Acquire] │ │ 6. Snapshot epoch counters │
│ 5. Read from │ │ (bitmask-guided) │
│ data[read_idx] │ │ 7. Wait for pre-swap readers │
│ 6. epoch++ (odd->even) │ │ to depart │
│ [AcqRel] │ │ 8. Replay oplog on stale copy │
│ 7. Clear bitmask bit │ │ (or full sync) │
└────────────────────────┘ │ 9. Clear oplog │
│10. Release writer mutex │
└─────────────────────────────────┘
Shared state:
┌───────────────────────────────────────────────────────────────────────┐
│ data[0] data[1] read_idx (0 or 1) │
│ ┌─────────┐ ┌─────────┐ │
│ │ Copy A │ │ Copy B │ epochs[MaxBackends] (cache-line padded)│
│ └─────────┘ └─────────┘ active_readers_mask[ceil(N/64)] │
│ last_seen_epochs[MaxBackends] │
│ oplog (growable buffer) │
└───────────────────────────────────────────────────────────────────────┘
Key Properties
| Property | LWLock (shared mode) | LRLock (read path) |
|---|---|---|
| Blocking | Yes (CAS on shared state word) | No (wait-free) |
| Cache-line contention | All readers contend on lock word | Each reader touches only its own epoch line |
| Memory cost | 1x data structure | 2x data structure + epoch array |
| Write cost | Single write under exclusive lock | Write applied twice (both copies) + drain wait |
| Concurrent writers | Serialized by exclusive mode | Serialized by spinlock (or external lock) |
| Nested reads | Automatic (shared count) | Supported (nesting counter, skip fence on re-entry) |
Active-Reader Bitmask
A naive implementation scans all max_backends epoch counters (each on its own 64-byte cache line) during the writer's drain-wait loop. With 1000+ backends, this means touching 64+ KB of memory per publish. The active-reader bitmask reduces this to O(active_readers) cache line accesses:
- One
pg_atomic_uint64word per 64 backends (typically 2-4 words total) - Reader sets its bit BEFORE incrementing its epoch (AcqRel ordering)
- Reader clears its bit AFTER its epoch returns to even
- Writer's
lrlock_snapshot_epochs()reads bitmask words first, then only fetches epoch values for backends with bits set - Idle backends (bit clear) get
last_seen_epochs[i] = 0(even), which the drain-wait loop skips without touching their 64-byte epoch cache line
This reduces cache pressure from O(max_backends x 64B) per drain iteration to O(active_readers x 64B + max_backends/64 x 8B).
Per-Operation Log (Oplog)
Instead of requiring the writer to maintain both copies manually, the LRLock provides an operation log:
- Writer calls
LRLockApplyOp(lock, &op, sizeof(op)) - The operation is immediately applied to the write copy via
apply_fn - The operation is serialized into the oplog buffer
- On
LRLockPublish(), after the pointer swap and drain-wait, the oplog is replayed on the now-stale copy - Both copies are now in sync; the oplog is cleared
This allows incremental updates (40 bytes for a single xid change) instead of full copies (6 KB for the entire ProcArray snapshot), reducing write-path overhead when the data structure is large but individual mutations are small.
When the oplog accumulates many entries (heuristic: oplog_count * 256 > data_size), the writer falls back to a full sync_fn copy instead of replaying individual operations, as sequential memcpy is faster than random-access replay for large operation counts.
Use Cases
ProcArray Snapshot
The primary motivating use case. GetSnapshotData() is called per-statement on every backend to construct a transaction visibility snapshot. It reads four parallel arrays from ProcGlobal (xids, subxidStates, statusFlags, pgprocnos) plus scalar fields (latestCompletedXid, xactCompletionCount, numProcs).
Before (LWLock path):
GetSnapshotData:
LWLockAcquire(ProcArrayLock, LW_SHARED) <-- contention point at high core counts
read numProcs, arrays, scalars
LWLockRelease(ProcArrayLock)
After (LRLock path):
GetSnapshotData:
snap = LRLockReadBegin(ProcArrayLRLock) <-- wait-free: epoch++ + fence + load
read numProcs, arrays, scalars from snap
LRLockReadEnd(ProcArrayLRLock) <-- epoch++ only
Write path: All ProcArray mutations (ProcArrayAdd, ProcArrayRemove, ProcArrayEndTransaction, ProcArrayGroupClearXid) continue to hold ProcArrayLock exclusively. After updating the live arrays, they publish to the LRLock:
- Structural changes (add/remove backend, group clear): call
ProcArrayPublishSnapshot()which copies all four arrays into the write copy and callsLRLockPublishFullSync(). - Single-slot XID changes (ordinary commit/abort): call
ProcArrayPublishXidChange(pgxactoff)which records a 40-byteProcArrayXidOpin the oplog, avoiding the 6 KB full copy.
The LRLock snapshot is a contiguous shared memory region containing:
typedef struct ProcArraySnapshotHdr
{
FullTransactionId latestCompletedXid;
uint64 xactCompletionCount;
int32 numProcs;
int32 maxProcs;
/* followed by 4 parallel arrays of maxProcs entries each: */
/* TransactionId xids[maxProcs] */
/* XidCacheStatus subxidStates[maxProcs] */
/* uint8 statusFlags[maxProcs] */
/* int pgprocnos[maxProcs] */
} ProcArraySnapshotHdr;
Total size: ~6 KB for typical max_connections (header + 4 arrays x PROCARRAY_MAXPROCS).
Two copies maintained by LRLock: ~12 KB total shared memory.
The hot standby (recovery) path retains the existing ProcArrayLock shared approach unchanged, as it involves KnownAssignedXids bookkeeping that requires the traditional lock protocol.
Replication Slot Xmin Array
ReplicationSlotsComputeRequiredXmin() computes the oldest effective_xmin and effective_catalog_xmin across all replication slots. It is called from 14+ sites per-transaction. The existing path acquires ReplicationSlotControlLock shared and iterates all slots under per-slot spinlocks.
The LRLock replaces this with a compact per-slot snapshot array:
typedef struct ReplicationSlotXminEntry
{
TransactionId effective_xmin;
TransactionId effective_catalog_xmin;
bool in_use;
bool invalidated;
} ReplicationSlotXminEntry;
Data size: max_replication_slots x sizeof(ReplicationSlotXminEntry)
= 15 slots x 12 bytes = 180 bytes per copy
The write path (ReplicationSlotPublishXmin()) is called after any change to slot xmin state (create, drop, release, invalidate, xmin advance). It records a single SlotXminOp (slot index + new entry) via LRLockApplyOp.
When already_locked is false (the common case), ReplicationSlotsComputeRequiredXmin reads from the LRLock without any lock acquisition. The already_locked=true path (caller holds exclusive locks) retains the existing spinlock-based iteration.
Buffer Mapping Hash Table
Replaces the 128-way partitioned LWLock array with a single LRLock protecting a unified open-addressing hash table. Readers (BufTableLookup) use LRLock read epochs; writers (BufTableInsert, BufTableDelete) use the LRLock write path with oplog-recorded insert/delete operations.
The BufTableReadBegin/BufTableReadEnd pair allows holding the read guard across a lookup + PinBuffer sequence, eliminating a TOCTOU race where a buffer could be invalidated between lookup and pin.
API
Creation
/* Simple creation (allocates sub-structures from ShmemAlloc) */
LRLock *LRLockCreate(Size data_size, LRLockApplyFn apply_fn,
LRLockSyncFn sync_fn, const char *name);
/* In-place initialization from a pre-allocated contiguous block */
LRLock *LRLockInitInPlace(void *block, Size data_size,
LRLockApplyFn apply_fn, LRLockSyncFn sync_fn,
int max_backends, Size oplog_capacity,
const char *name);
/* Shared memory size calculation */
Size LRLockShmemSize(Size data_size, int max_backends, Size oplog_capacity);
Reader API (wait-free)
const void *LRLockReadBegin(LRLock *lock); void LRLockReadEnd(LRLock *lock);
Readers call LRLockReadBegin to obtain a read-only pointer to the current read copy. The pointer is valid until LRLockReadEnd. Nested reads are supported: inner calls skip the expensive epoch bump and fence, performing only an Acquire pointer load.
Writer API (single writer)
void *LRLockWriteBegin(LRLock *lock); void LRLockApplyOp(LRLock *lock, const void *operation, Size op_size); void LRLockPublish(LRLock *lock); void LRLockPublishFullSync(LRLock *lock); void LRLockWriteEnd(LRLock *lock);
LRLockWriteBegin: acquires write access, returns mutable pointer to write copyLRLockApplyOp: records an operation in the oplog and applies it to the write copy immediatelyLRLockPublish: swaps pointers, waits for pre-swap readers to depart, replays oplog on stale copyLRLockPublishFullSync: like Publish, but unconditionally callssync_fnon the stale copy (use when write copy was modified directly without oplog)LRLockWriteEnd: releases write access
Callbacks
/* Apply one operation to a data copy (must be deterministic) */ typedef void (*LRLockApplyFn)(void *data, const void *operation, Size op_size); /* Full synchronization: byte-for-byte copy from src to dst */ typedef void (*LRLockSyncFn)(void *dst, const void *src, Size data_size);
Correctness Argument
The protocol's correctness rests on four ordering guarantees:
- Reader epoch visibility: The reader's epoch increment (even to odd) uses AcqRel ordering, followed by a SeqCst fence before loading
read_idx. This ensures the writer sees the reader's odd epoch before the reader loads the pointer. If the writer observes an even epoch, the reader has either not entered yet or has already departed -- either way, the reader will (re-)load the correct pointer.
- Writer swap visibility: The writer's
pg_atomic_exchange_u32(&read_idx, new_value)is followed by a SeqCst fence before snapshotting epochs. This ensures any reader that loads the new pointer does so after the swap is globally visible -- the writer won't mistake it for a pre-swap reader.
- Drain completeness: The writer waits until every epoch that was odd at snapshot time has changed. Since an epoch changes only when the reader exits (odd to even) or re-enters (entering reads the new pointer), all pre-swap readers have either departed or re-entered with the new pointer. The old copy is guaranteed to have no readers.
- Oplog replay ordering: The oplog is replayed on the stale copy after all pre-swap readers have departed. Since no reader references the stale copy at this point, replay races with no one. After replay, both copies are identical, maintaining the invariant for the next publish.
Performance
Methodology
pgbench with three workloads (SELECT-only simple protocol, TPC-B default, SELECT-only prepared protocol), 7 client counts (1, 2, 4, 8, 16, 32, 64), 3 iterations of 30 seconds each per data point, scale factor 50. Configuration: fsync=off, synchronous_commit=off, shared_buffers=256MB, max_connections=100. Both branches built from the same base (PostgreSQL 19devel) with identical compiler flags.
Results (RISC-V 8-core)
Hardware: SiFive 8-core RISC-V, 8 GB RAM, Linux. Representative of an in-order, cache-coherent multiprocessor where atomic contention costs are high relative to x86:
| Workload | Clients | master TPS | lrlck TPS | Delta |
|---|---|---|---|---|
| SELECT-only (simple) | 2 | 14,820 | 15,180 | +2.4% |
| SELECT-only (simple) | 4 | 27,530 | 28,250 | +2.6% |
| SELECT-only (simple) | 8 | 48,960 | 49,100 | +0.3% |
| TPC-B (mixed R/W) | 2 | 3,420 | 3,510 | +2.6% |
| TPC-B (mixed R/W) | 8 | 6,180 | 6,270 | +1.5% |
| TPC-B (mixed R/W) | 64 | 5,890 | 6,070 | +3.0% |
| SELECT-only (prepared) | 4 | 31,200 | 32,050 | +2.7% |
| SELECT-only (prepared) | 8 | 55,400 | 57,170 | +3.2% |
| SELECT-only (prepared) | 16 | 79,800 | 80,820 | +1.3% |
The largest gains appear in read-heavy workloads at moderate concurrency (4-8 clients), where ProcArrayLock shared-mode contention is measurable but the system is not yet saturated on other resources. At high client counts (32-64), gains are smaller as other bottlenecks (buffer contention, scheduling) dominate.
No regressions were observed at any client count across all workloads.
Trade-offs and Limitations
When LRLock is appropriate
- Read-to-write ratio is high (10:1 or greater)
- Protected data structure is small enough that 2x memory is acceptable
- Reads are latency-sensitive and occur on every transaction/statement
- Write operations can tolerate being applied twice (deterministic)
- A single writer (or external writer serialization) is acceptable
When LRLock is NOT appropriate
- Write-heavy workloads (writer path is strictly more expensive than LWLock exclusive)
- Large data structures where 2x memory is unacceptable
- Multiple concurrent writers without external serialization
- Non-deterministic operations that cannot be replayed identically
Memory Overhead
| Component | Size (typical max_connections=100) |
|---|---|
| LRLock struct | 128 bytes (MAXALIGN'd) |
| Two data copies | 2 x data_size |
| Epoch array | max_backends x 64 bytes (cache-line padded) |
| Active-reader bitmask | ceil(max_backends/64) x 8 bytes |
| Last-seen epochs | max_backends x 4 bytes |
| Operation log | Configurable (256 bytes for ProcArray) |
For the ProcArray use case with 100 backends: ~6 KB x 2 (data) + 6.4 KB (epochs) + 400 bytes (last_seen) + 16 bytes (bitmask) + 256 bytes (oplog) = ~19 KB total shared memory.
Implementation Files
| File | Description |
|---|---|
src/include/storage/lrlock.h |
Public API: LRLock type, callback typedefs, function declarations |
src/backend/storage/lmgr/lrlock.c |
Full implementation: LRLockEpoch, LRLockOpHeader, struct LRLock, reader/writer paths, epoch snapshot, drain-wait, oplog replay |
src/backend/storage/ipc/procarray.c |
ProcArray LRLock: ProcArraySnapshotHdr, ProcArrayXidOp, apply/sync callbacks, ProcArrayPublishSnapshot, ProcArrayPublishXidChange, LRLock fast paths in GetSnapshotData/ComputeXidHorizons/TransactionIdIsInProgress |
src/backend/storage/buffer/buf_table.c |
Buffer mapping LRLock: BufLRHashTable, BufTableReadBegin/End, BufTableWriteBegin/Publish/End, open-addressing hash with oplog |
src/backend/storage/buffer/bufmgr.c |
Buffer manager integration: LRLock read guards around buffer lookups and pins |
src/backend/replication/slot.c |
Replication slot xmin LRLock: ReplicationSlotXminEntry, SlotXminOp, ReplicationSlotPublishXmin, wait-free ReplicationSlotsComputeRequiredXmin |
src/include/replication/slot.h |
ReplicationSlotXminEntry struct, ReplicationSlotXminLock extern |
src/test/modules/test_lrlock/ |
Regression test module exercising correctness under concurrent load |
Commit Series
The implementation is organized as four incremental commits on the lrlck branch:
- Introduce left-right lock (LRLock) concurrency primitive -- Core implementation: lrlock.c, lrlock.h, test_lrlock module. Provides wait-free reader path, active-reader bitmask, per-operation oplog, and cache-line-padded epoch array.
- Use left-right lock for buffer mapping hash table -- Replaces 128-way partitioned LWLock array with single LRLock. Open-addressing hash table with oplog-recorded insert/delete operations.
- Use left-right lock for replication slot xmin array -- Wait-free ReplicationSlotsComputeRequiredXmin via compact per-slot xmin snapshot under LRLock.
- Use left-right lock when accessing the ProcArray -- Wait-free GetSnapshotData via ProcArraySnapshotHdr under LRLock. Incremental 40-byte oplog for single-commit fast path; full sync for structural changes.
Open Issues
Known design limitations
- Writer spinlock held across publish: The writer mutex is a
slock_t(spinlock) held during the entire write-publish cycle, including the drain-wait loop. If the writer is descheduled while spinning, other writers burn CPU. The ProcArray use case mitigates this because writers already hold ProcArrayLock exclusive (making the inner spinlock redundant in practice), but the generic primitive should use an LWLock or semaphore. - oplog_grow calls ShmemAlloc at runtime: The oplog growth path allocates from shared memory after postmaster startup, which is architecturally problematic (shared memory is fixed-size on most platforms). The leaked old allocation cannot be freed. In practice the ProcArray oplog is pre-sized to 256 bytes and should never grow, but the generic code path should either pre-allocate or hard-fail.
- No wait event reporting: The writer drain-wait loop (
lrlock_wait_for_readers) does not callpgstat_report_wait_start/end, making writer stalls invisible topg_stat_activity. - pg_usleep(1) in drain-wait: After the spin limit, the writer calls
pg_usleep(1)which on Linux with HZ=250 sleeps for at least 4ms. A condition-variable or futex-based wait would provide better latency.
Needed before submission
- Large-machine benchmarking: Demonstrate ProcArrayLock shared-mode contention on 64+ core hardware with 256+ clients, where the wait-free reader path provides the most benefit.
- perf profile evidence: Capture flamegraphs showing time spent in
LWLockAcquire(ProcArrayLock, LW_SHARED)before vs after on large systems. - Wait event instrumentation: Add
pgstat_report_wait_start/endto the writer drain-wait loop. - Writer mutex upgrade: Replace the spinlock-based writer serialization with an LWLock or semaphore.
Future extensions
- CLOG and SUBTRANS: Evaluate whether CLOG lookups (TransactionIdGetStatus) and SUBTRANS page reads benefit from LRLock protection.
- Buffer mapping evaluation: The buffer mapping use case (commit 2/4) replaces a well-tuned 128-partition LWLock scheme and has shown no measurable benefit on 8-core hardware. It may need to be dropped or justified with larger-machine profiles.
Credits
- Author: Greg Burd
- Reviewers: (seeking review)
References
Research
- P. Ramalhete and A. Correia, "Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads", 2015.
- J. Gjengset, "left-right: A lock-free, read-optimized, concurrency primitive", Rust crate, 2018-present.
- M. Herlihy, "Wait-Free Synchronization", ACM TOPLAS 13(1), 1991.
- P. McKenney and J. Slingwine, "Read-Copy Update: Using Execution History to Solve Concurrency Problems", PDCS 1998.
PostgreSQL Mailing List
- ProcArray performance issues (Robert Haas, 2020)
- Snapshot scalability (Andres Freund, 2020)