Multiple Buffer Pools
Multiple Buffer Pools
Request for Comments [RFC] NOT MERGED
Summary
Multiple buffer pools extend PostgreSQL's buffer manager so that the cluster's shared_buffers can be partitioned into named pools, each governed by its own pluggable replacement algorithm. Tables and indexes can be assigned to a specific pool via the buffer_pool reloption, and TOAST overflow data can be routed to a dedicated pool via overflow_buffer_pool (default JAM algorithm).
Six replacement algorithms ship as contrib extensions alongside the built-in clock-sweep -- ARC, CAR, LIRS, LIRS2 (RCRD), LRU, OSIC -- plus three core-built-in routines for special-purpose pools:
- KEEP -- never evicts; for pinning a small hot working set in memory.
- RECYCLE -- one-chance clock sweep that replaces per-backend ring buffers for sequential scans, VACUUM, and bulk writes.
- JAM -- accelerated-decay clock tuned for the write-heavy, read-few access pattern of TOAST data.
Each user-created pool runs its own trickle-writer background worker for dirty-page flushing. Pools are created, resized (drop-and-recreate), renamed, and dropped via SQL DDL; metadata is persisted in a new pg_bufferpool system catalog and recreated automatically on server restart.
The replacement-algorithm framework is a thin vtable layered on top of the existing buffer manager: when the DEFAULT pool runs clock-sweep (the configured default), the framework's hot path is one indirect call per StrategyGetBuffer call. Benchmark numbers below show the cost of that indirection is in the noise; the cost of running an alternative algorithm comes from the algorithm's own bookkeeping, not from the framework.
Status
- Patch status: Work in progress; pre-CommitFest design preview.
- Target release: PostgreSQL 19 (speculative).
- Branch name:
arc. - Repository: gburd/postgres, branch arc.
- Mailing list thread: To be posted to pgsql-hackers; this wiki page is the design preview.
- CommitFest: Not yet submitted.
- Tests: Core regress passes (361/361 in
meson test); per-commit verification confirmspg_indent-clean and warning-free compile under-Werrorat every commit boundary.
Try It / Build
The framework is always available; the DEFAULT pool runs clock-sweep unless the operator changes buffer_pool_algorithm. Alternative algorithms ship as contrib extensions and must be loaded via shared_preload_libraries before they can be selected.
git clone https://github.com/gburd/postgres.git cd postgres git checkout arc meson setup build --prefix=$PWD/install -Dcassert=true ninja -C build install initdb -D /path/to/data pg_ctl -D /path/to/data start
Quick smoke test:
-- Use the default clock-sweep DEFAULT pool plus a private ARC pool
CREATE EXTENSION pg_bp_arc; -- after restart with shared_preload_libraries
CREATE BUFFER POOL hot_data
HANDLER arc_pool_handler SIZE '67108864';
CREATE TABLE events (id bigint, ts timestamptz, payload jsonb)
WITH (buffer_pool = 'hot_data');
SELECT name, nbuffers, hits, evictions FROM pg_stat_bufferpool;
Benchmark harness (Python + DTrace/bpftrace + HammerDB integration) lives on a companion branch and is referenced from this page rather than included in the patch series:
Background and Motivation
Clock-Sweep (2005)
PostgreSQL has used a clock-sweep buffer replacement algorithm since version 8.1, when Tom Lane redesigned the buffer manager to eliminate the monolithic BufMgrLock and replace the short-lived ARC implementation with a scalable, lock-light alternative.
- Design notes for BufMgrLock rewrite (Tom Lane, 2005)
The clock-sweep algorithm uses an atomic nextVictimBuffer pointer and per-buffer usage_count fields. On each buffer allocation, the clock hand advances; buffers with usage_count > 0 get a second chance (count decremented), while those at zero are evicted. This approach is simple, has low per-operation overhead, and scales well on multi-core systems.
However, clock-sweep is a single policy applied uniformly to all data. It cannot distinguish between a frequently-accessed index root page and a sequential-scan page that will never be re-read. Robert Haas demonstrated this limitation in 2011, showing that only 96 of 14,286 buffers for a table were retained in cache despite over 37,000 completely unused buffers being available.
- Our buffer replacement strategy is kind of lame (Robert Haas, 2011)
The ARC Patent
In 2003 Jan Wieck implemented the Adaptive Replacement Cache (ARC) algorithm for PostgreSQL 7.4/8.0, based on research from IBM Almaden by Nimrod Megiddo and Dharmendra Modha.
- Experimental ARC implementation (Jan Wieck, 2003)
The implementation was removed in early 2005 after the community learned of IBM's pending patent (US 6,996,676, granted February 2006). Tom Lane developed a 2Q replacement as a bridge, then designed the clock-sweep algorithm that has been in use since.
- 8.0.X and the ARC patent (Bruce Momjian, Tom Lane, 2005)
The ARC patent reached its 20-year statutory term on February 22, 2024 and has expired. Sun Microsystems' related US 7,469,320 (also expired) covered work that had been shipping for years inside ZFS under the CDDL, which the broader community has long treated as authoritative public-domain prior art.
This series implements ARC as one of the contrib replacement algorithms. The implementation in contrib/pg_bp_arc follows the pseudocode in the FAST 2003 paper and was cross-checked against the Caffeine project's reference implementation (ArcPolicy.java); the Apache 2.0 license header from that file is reproduced verbatim in pg_bp_arc.c per the Apache 2.0 grant.
A note on patent posture
Ben Manes, who maintains the Caffeine cache library, has publicly summarised the ARC patent landscape this way (paraphrased from the file-level docstring of ArcPolicy.java):
This algorithm is patented by IBM (6996676, 7096321, 7058766, 8612689) and Sun (7469320), making its use in applications ambiguous due to Sun's ZFS providing an implementation under the CDDL.
That is the right framing for our community as well. For the years 2005-2023 the patent posture made ARC unsafe to ship in PostgreSQL even though a CDDL implementation existed in the wild. As of 2024 the IBM patent family has reached its 20-year term and expired, and Sun's patent is similarly out of force; the prior art landscape is now unambiguous. We include this implementation on that basis and recommend the same conclusion. Reviewers and downstreams should still confirm with their own counsel before relying on that conclusion in their jurisdiction.
Ring Buffers and Bulk I/O
In 2007 Tom Lane introduced BufferAccessStrategy objects -- small per-backend ring buffers that prevent large sequential scans, VACUUM, and bulk writes from evicting the entire shared buffer cache. The RECYCLE pool in this series generalises that ring-buffer concept into a shared, named pool with its own replacement policy: a one-chance clock-sweep that gives each page exactly one cycle to be re-accessed before it becomes a victim.
Related Work
Several prior efforts have explored improvements to PostgreSQL's buffer management:
| Year | Proposal | Authors | Outcome |
|---|---|---|---|
| 2000 | LRU-2 experiment | Tom Lane | No measurable improvement; not pursued |
| 2002 | LRU-K / 2Q for scan resistance | Amit Khare, Tom Lane, Neil Padgett | Informed 2005 2Q bridge patch |
| 2013 | LIRS page replacement | Ants Aasma, Tom Lane | Discussion only |
| 2014 | Correlated reference period / LRU-K | Peter Geoghegan, Amit Kapila | Prototype only |
| 2014 | Lockless StrategyGetBuffer | Andres Freund | Committed (atomic nextVictimBuffer)
|
| 2024 | Dynamic shared_buffers without restart |
Dmitry Dolgov | Multiple shared memory segments |
| 2025 | NUMA-aware buffer management | Tomas Vondra, Jakub Wartak | Partitioned freelists per NUMA node |
| 2025 | Freelist removal / buffer_strategy_lock elimination |
Greg Burd | Atomic-only clock sweep |
| 2025 | Eagerly evict bulkwrite ring | Melanie Plageman | Batch eviction for async I/O |
| 2026 | Batched clock sweep | Greg Burd, Jim Mlodgenski | 16-20% pgbench improvement on NUMA |
Architecture Overview
┌──────────────────────────────┐
│ SQL Interface │
│ CREATE / ALTER / DROP │
│ BUFFER POOL │
└──────────────┬───────────────┘
│
┌────────────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ DEFAULT pool │ │ User pools │ │ RECYCLE pool │
│ (shared_buffers)│ │ (DSM-backed) │ │ (bulk I/O) │
│ │ │ │ │ │
│ clock-sweep │ │ any handler: │ │ one-chance │
│ (or any │ │ ARC, CAR, LIRS, │ │ clock-sweep │
│ registered │ │ LRU, OSIC, RCRD,│ │ │
│ algorithm via │ │ KEEP, JAM, ... │ │ │
│ GUC) │ │ │ │ │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└───────────────────────┼──────────────────────────┘
│
┌─────────┴──────────┐
│ BufferPoolRoutine │
│ vtable (per pool) │
│ │
│ get_victim() │
│ on_hit / on_miss │
│ on_evict │
│ on_new_tag │
│ trickle_iter_* │
│ hint_vacuum │
│ reject_buffer │
│ prefetch_hint │
│ shmem_size / init │
└─────────┬───────────┘
│
┌─────────┴──────────┐
│ Trickle Writer │
│ (per dynamic pool │
│ background worker)│
│ │
│ algorithm-directed │
│ dirty page flush │
│ oversubscription │
│ nudging │
└─────────────────────┘
The DEFAULT pool always exists at slot 0 and uses the existing shared_buffers allocation and global BufferDescriptors[] array. User-created pools allocate their own dynamic shared memory (DSM) segments containing private buffer descriptors, data blocks, hash table partitions, and algorithm state. Each backend lazily attaches to a pool's DSM segment on first access.
SQL Interface
Creating a Pool
CREATE BUFFER POOL name HANDLER handler_function SIZE 'size'; CREATE BUFFER POOL name REMAINDER HANDLER handler_function;
The HANDLER names a function that returns a BufferPoolRoutine vtable. SIZE is specified as a bare integer string in bytes (e.g. '67108864' for 64 MB); the smallest pool is 16 buffers (128 KB at 8 KB/buffer). The REMAINDER form sizes the pool to whatever shared-buffer budget is not already claimed by other named pools; only one REMAINDER pool can exist at a time.
-- 64 MB ARC pool for hot OLTP data CREATE BUFFER POOL hot_data HANDLER arc_pool_handler SIZE '67108864';
-- catch-all for the unclaimed remainder of shared_buffers CREATE BUFFER POOL leftovers HANDLER lru_pool_handler REMAINDER;
Pool metadata is persisted in pg_bufferpool and pools are recreated on server restart. Up to 64 pools (MAX_BUFFER_POOLS) can exist simultaneously.
Resizing a Pool
ALTER BUFFER POOL name SET SIZE 'new_size';
This is implemented as drop-and-recreate, not online resize: DSM has no resize API, so the pool's segment is torn down and recreated at the new size. All pages in the pool are flushed and the pool starts cold. This is intentionally not advertised as "online" -- users should expect a cold-cache transient on the affected pool immediately after.
Renaming and Dropping
ALTER BUFFER POOL name RENAME TO new_name; DROP BUFFER POOL [IF EXISTS] name [CASCADE | RESTRICT];
All dirty pages are flushed on drop. The pool cannot be dropped if any buffers are still pinned. Tables previously assigned to a dropped pool fall back to the DEFAULT pool.
Assigning Tables to Pools
-- assign at table creation
CREATE TABLE events (id bigint, ts timestamptz, data jsonb)
WITH (buffer_pool = 'hot_data');
-- route TOAST overflow to a different pool
CREATE TABLE documents (id int, body text)
WITH (overflow_buffer_pool = 'toast_pool');
-- move an existing table ALTER TABLE events SET (buffer_pool = 'archive_pool');
-- back to the default pool ALTER TABLE events RESET (buffer_pool);
The overflow_buffer_pool reloption applies only to heap relations and propagates automatically to the table's TOAST relation as that TOAST relation's buffer_pool. ALTER TABLE ... SET / RESET (overflow_buffer_pool = ...) cascades the change.
Selecting the DEFAULT Pool's Algorithm
The DEFAULT pool's replacement algorithm is selected by the buffer_pool_algorithm GUC (string, PGC_POSTMASTER, default 'clock'). Only clock is built in; any other name must come from an extension loaded via shared_preload_libraries:
# postgresql.conf shared_preload_libraries = 'pg_bp_arc' buffer_pool_algorithm = 'arc'
If the GUC names an algorithm that has not been registered (typo, or extension missing from shared_preload_libraries) the server falls back to clock-sweep with a WARNING rather than refusing to start.
Pool Types
DEFAULT Pool
The DEFAULT pool is the standard shared_buffers allocation. It always exists at slot 0 and is not visible as a target of CREATE BUFFER POOL / DROP BUFFER POOL. Its replacement algorithm is selected at startup via buffer_pool_algorithm.
User Pools
User pools are created with CREATE BUFFER POOL and backed by DSM segments. Each pool has:
- its own buffer descriptors and data blocks;
- a multi-partition open-addressed hash table sized by
ComputePoolPartitions(); - algorithm-private state in the same DSM segment;
- a dedicated trickle-writer background worker;
- atomic statistics counters (reads, hits, evictions).
A new BufferPoolKind classification distinguishes BUFPOOL_DEFAULT, BUFPOOL_RECYCLE (the well-known system pool, looked up by kind rather than name), and BUFPOOL_USER (catalog-backed user pools).
REMAINDER Pool
A REMAINDER pool claims whatever shared-buffer budget is not already allocated to other user pools. Only one REMAINDER pool can exist at a time; attempting to create a second produces an error referencing the existing one. Its size is recomputed at creation and is fixed for the life of the pool.
RECYCLE Pool
The RECYCLE pool replaces per-backend ring buffers (BufferAccessStrategy) for sequential scans, VACUUM, and bulk-write operations with a single shared pool. Enabled by sizing it through the recycle_pool_buffers GUC:
# postgresql.conf recycle_pool_buffers = 4096 # 0 disables; restart required
Its replacement routine is a one-chance clock sweep: usage_count is cleared (not decremented) on the sweep, so each page gets exactly one cycle to be re-accessed before it becomes a victim. Aggressive enough not to pollute the main pool while still being a real cache rather than a private throwaway ring.
KEEP Pool
KEEP is a built-in routine that never evicts. Pages stay resident until either the pool is dropped or a free buffer is loaded into a slot that has not yet been claimed. When the pool is full and no free buffer is available, KEEP raises ERRCODE_INSUFFICIENT_RESOURCES rather than evicting an existing page. Useful for pinning a small hot working set entirely in memory:
CREATE BUFFER POOL hot_lookups
HANDLER keep_pool_handler SIZE '8388608';
ALTER TABLE country_codes SET (buffer_pool = 'hot_lookups');
JAM (TOAST-Optimised)
JAM is a clock-sweep variant tuned for the write-heavy, read-few, temporal-locality access pattern of TOAST data:
- new buffers are inserted with
usage_count = 1because few TOAST pages are re-read; - on hit,
usage_countis boosted modestly and capped at 3; - the sweep halves
usage_countinstead of decrementing it, so pages age out faster than in plain clock-sweep; - the trickle iterator yields dirty, cold pages (
usage_count == 0) to the trickle writer in algorithm-meaningful order.
The implementation lives in src/backend/storage/buffer/jam_pool.c next to the other built-in routines (clock, KEEP, RECYCLE). It is not heap-specific code and does not depend on heapam internals, even though TOAST is its primary motivating workload. Reachable from SQL via the jam_pool_handler proc:
CREATE BUFFER POOL toast_jam HANDLER jam_pool_handler SIZE '67108864';
CREATE TABLE docs (id int, body text)
WITH (overflow_buffer_pool = 'toast_jam');
JAM is not yet auto-assigned as the heap AM's default overflow pool; the heap AM's relation_overflow_pool callback is a stub awaiting a separate auto-creation mechanism.
Replacement Algorithm Extensions
Six replacement algorithms ship as contrib extensions:
| Extension | Handler | Algorithm | Notes |
|---|---|---|---|
pg_bp_lru |
lru_pool_handler |
LRU | Classic Least Recently Used; primarily a baseline |
pg_bp_osic |
osic_pool_handler |
OSIC | One-Stage Insert Clock; clock-sweep simplicity with insert-time admission |
pg_bp_arc |
arc_pool_handler |
ARC | Adaptive Replacement Cache; patent expired Feb 2024 |
pg_bp_lirs |
lirs_pool_handler |
LIRS | Low Inter-reference Recency Set; strong scan resistance |
pg_bp_rcrd |
rcrd_pool_handler |
LIRS2 (RCRD) | Second-generation LIRS; cheaper bookkeeping under contention |
pg_bp_car |
car_pool_handler |
CAR | Clock with Adaptive Replacement; ARC's adaptivity, clock's lock profile |
Each extension registers itself by name from its _PG_init() hook via RegisterDefaultPoolAlgorithm("name", &routine). Names are looked up by string; core does not carry a fixed enum of contrib algorithm identifiers.
Writing a Custom Extension
A replacement algorithm extension implements the BufferPoolRoutine vtable defined in src/include/storage/bufpool.h:
typedef struct BufferPoolRoutine
{
NodeTag type;
/* Access tracking (all optional, may be NULL) */
void (*on_hit)(void *state, int buf_id, struct buftag *tag);
void (*on_miss)(void *state, struct buftag *tag);
void (*on_evict)(void *state, int buf_id, struct buftag *old_tag);
void (*on_new_tag)(void *state, int buf_id,
struct buftag *new_tag, bool vacuum_hint);
/* Core eviction (required) */
BufferDesc *(*get_victim)(void *state,
BufferAccessStrategy strategy,
uint64 *buf_state, bool *from_ring);
/* Trickle writer dispatch */
int (*sync_start)(void *state, uint32 *complete_passes,
uint32 *num_alloc);
void (*notify_trickle)(void *state, int bgwprocno);
void *(*trickle_iter_begin)(void *state, int max_candidates);
int (*trickle_iter_next)(void *state, void *iter);
void (*trickle_iter_end)(void *state, void *iter);
/* Hints (optional) */
void (*hint_vacuum)(void *state, bool vacuum_active);
bool (*reject_buffer)(void *state, BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
void (*prefetch_hint)(void *state, struct buftag *tags, int ntags);
/* Lifecycle (required) */
Size (*shmem_size)(int nbuffers);
void (*shmem_init)(void *state, int nbuffers, int first_buf_id,
bool init);
void (*shutdown)(void *state);
} BufferPoolRoutine;
The trickle_iter_* callbacks are optional. When provided, they let the algorithm direct flush order to its actually-cold pages -- LRU's tail, LIRS's HIR queue, ARC's T1 LRU end -- which is more useful than blindly walking buffer IDs. When NULL the trickle writer falls back to a linear scan.
A new typedef list entry per added type ensures pg_indent remains clean across the series.
Trickle Writer
Each user-created pool spawns a dedicated background worker that periodically flushes dirty pages via SyncOneBuffer:
- Algorithm-directed flushing through
trickle_iter_begin/next/endwhen the routine provides them, or a linear scan over the pool's descriptor range otherwise. - Oversubscription nudging (advisory): when a pool's actual buffer count exceeds its target -- which can happen transiently after another pool is dropped -- the trickle writer prefers excess buffers as eviction candidates and decrements the counter as it does. This is best-effort accounting, not strict enforcement; pinned buffers stay put and the counter is updated honestly.
- Cross-backend safety: the trickle writer dispatches through a backend-local
routinepointer resolved at startup, not throughpool->bp_routineread from shared memory. An extension's function pointers are only valid in the address space of the backend that loaded the.so, so reading them from shared memory in another backend is unsafe. - Critical-section safety:
DropRelationBuffers()andDropRelationsAllBuffers()are called from inside theSTART_CRIT_SECTION()/END_CRIT_SECTION()block inRelationTruncate. Inside a critical sectionpallocasserts onCritSectionCount == 0 || allowInCritSection; thedsm_attachpath inEnsurePoolAttachedwould assert. These two paths useTryGetPoolAttachedonly and skip pools the current backend has not touched, relying on the invariant that any backend that ever placed a buffer in poolPhas already attached toPviaBufferAllocInPool. An autovacuum worker truncating a system catalog will see no user pools attached, find no buffers to invalidate in any of them, and exit the loop without touching DSM -- which is correct because system catalogs always live in the DEFAULT pool. - Vacuum hint:
heap_vacuum_relbracketslazy_scan_heapwithPoolHintVacuum(rel->rd_bufpool, true / false)so frequency-aware algorithms (ARC, CAR, LIRS, RCRD) can bias VACUUM-loaded pages toward the LRU end of their recency list. The hint is a no-op for clock-sweep, KEEP, RECYCLE, and JAM. - Failure mode: if
RegisterDynamicBackgroundWorkerfails (typicallymax_worker_processesexhausted) the user gets aWARNINGwith anerrhintsuggesting the fix; the pool still works without proactive flushing.
TOAST and Overflow Routing
CREATE BUFFER POOL txn_pool HANDLER arc_pool_handler SIZE '536870912'; CREATE BUFFER POOL blob_pool HANDLER lru_pool_handler SIZE '268435456';
CREATE TABLE orders (
id bigint,
metadata jsonb,
attachment bytea
) WITH (
buffer_pool = 'txn_pool',
overflow_buffer_pool = 'blob_pool'
);
When create_toast_table() runs it picks up the parent's overflow_buffer_pool reloption and propagates it to the new TOAST relation as the TOAST table's buffer_pool. If the parent does not specify one, the heap AM's relation_overflow_pool callback is consulted; if both produce nothing the TOAST table uses the DEFAULT pool.
ALTER TABLE ... SET / RESET (overflow_buffer_pool = ...) cascades to the TOAST relation, keeping parent and TOAST aligned for as long as the user operates on the parent.
Monitoring
pg_stat_bufferpool
SELECT name, nbuffers,
reads, hits,
round(100.0 * hits / nullif(reads + hits, 0), 2) AS hit_pct,
evictions,
target_buffers, current_buffers, oversubscribed
FROM pg_stat_bufferpool
ORDER BY name;
Detecting Unhealthy Cache Behaviour
-- Pools with hit ratio below 90%
SELECT name, nbuffers,
round(100.0 * hits / nullif(reads + hits, 0), 2) AS hit_pct
FROM pg_stat_bufferpool
WHERE reads + hits > 1000
AND 100.0 * hits / nullif(reads + hits, 0) < 90
ORDER BY hit_pct;
-- Oversubscribed pools (currently holding more buffers than their target) SELECT name, target_buffers, current_buffers, oversubscribed FROM pg_stat_bufferpool WHERE oversubscribed;
Catalog
pg_bufferpool
| Column | Type | Description |
|---|---|---|
oid |
oid |
Unique object identifier |
bpname |
name |
Pool name (unique) |
bphandler |
regproc |
Handler function OID |
bpsize |
int8 |
Size in bytes (0 means "unbounded / managed by REMAINDER" or DEFAULT) |
SELECT bpname, bphandler::regproc, bpsize FROM pg_bufferpool;
GUC Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
buffer_pool_algorithm |
string |
'clock' |
Replacement algorithm name for the DEFAULT pool. PGC_POSTMASTER; only built-in is clock; extensions register additional names from shared_preload_libraries. Misconfigured values fall back to clock with a WARNING.
|
recycle_pool_buffers |
int |
0 |
Buffers reserved for the RECYCLE pool. 0 disables the pool and restores per-backend ring buffers. PGC_POSTMASTER.
|
trickle_flush_after |
int |
varies | Pages the trickle writer flushes before issuing an OS writeback hint (mirrors backend_flush_after).
|
trickle_write_batch_size |
int |
128 |
Maximum dirty pages written per trickle-writer cycle. |
Performance
The framework's contribution to overhead is small and concentrated in three places:
- One indirect call (
ActivePoolRoutine->get_victim) perStrategyGetBufferon the DEFAULT pool; - A predicted-not-taken branch in
BufHdrGetBlock()for buffers belonging to dynamic pools (buf_id >= NBuffers); - DSM-attach bookkeeping the first time each backend touches a dynamic pool (one-time, lazy).
The cost of running an alternative algorithm comes from that algorithm's own access-tracking, not from the framework. The numbers below quantify both.
Microbenchmark: pluggability cost
A 19-pattern Python microbenchmark suite (pgbench-style queries against a 100k-row table inside a 128 MB shared_buffers, working set fits in cache so all hit_ratio values are 1.0) compares clock-sweep against each contrib algorithm. Each datapoint is the median of multiple runs; the test instance is the same single-socket Linux box across all algorithms.
Mean over all 19 patterns:
| Algorithm | Mean (s) | vs clock |
|---|---|---|
clock |
0.06703 | baseline |
arc |
0.07436 | +10.9 % |
car |
0.07542 | +12.5 % |
lirs |
0.07525 | +12.3 % |
lru |
0.07565 | +12.9 % |
osic |
0.07582 | +13.1 % |
Per-pattern breakdown of the relative slowdown vs. clock-sweep, with clock-sweep absolute time on the left for context:
| Pattern | clock (s) | arc | car | lirs | lru | osic |
|---|---|---|---|---|---|---|
| full_scan | 0.1976 | +3.2 % | +4.6 % | +2.9 % | +4.0 % | +4.0 % |
| filtered_scan | 0.1422 | +2.6 % | +3.2 % | +3.2 % | +3.5 % | +3.9 % |
| column_projection | 0.0553 | +2.7 % | +2.2 % | +4.0 % | +4.8 % | +4.5 % |
| zipfian_access | 0.0549 | +4.5 % | +5.2 % | +4.5 % | +3.6 % | +4.7 % |
| working_set_80_20 | 0.0258 | +5.9 % | +7.8 % | +6.3 % | +7.3 % | +4.9 % |
| temporal_locality | 0.0320 | +6.4 % | +7.4 % | +8.0 % | +7.3 % | +9.6 % |
| capacity_pressure | 0.0248 | +7.9 % | +7.6 % | +8.9 % | +10.8 % | +8.8 % |
| sequential_then_random | 0.0798 | +10.6 % | +12.7 % | +12.4 % | +13.4 % | +12.1 % |
| write_heavy_update | 0.0203 | +10.0 % | +13.0 % | +11.8 % | +13.7 % | +14.0 % |
| sequential_scan_burst | 0.1438 | +11.5 % | +13.9 % | +12.4 % | +12.1 % | +12.9 % |
| repeated_hotset | 0.1299 | +13.0 % | +13.9 % | +13.8 % | +14.2 % | +15.9 % |
| working_set_shift | 0.0041 | +13.6 % | +19.3 % | +15.3 % | +17.0 % | +19.8 % |
| group_by | 0.0628 | +12.6 % | +17.1 % | +13.9 % | +14.4 % | +14.3 % |
| aggregation | 0.0257 | +17.3 % | +19.6 % | +20.3 % | +22.9 % | +21.7 % |
| concurrent_mixed | 0.0658 | +18.9 % | +20.1 % | +21.2 % | +22.1 % | +23.0 % |
| cyclic_loop | 0.1606 | +23.2 % | +25.5 % | +26.4 % | +26.8 % | +26.3 % |
| mixed_oltp_scan | 0.0435 | +29.7 % | +30.8 % | +35.9 % | +37.9 % | +38.7 % |
What this is and isn't measuring:
- All five non-clock algorithms have a roughly equal envelope of overhead (10-13 % on the average over all patterns; up to ~30-40 % on the worst-case patterns). That envelope is the cost of doing access-tracking at all (linked-list manipulation, ghost-list lookups, per-access spinlock contention) compared to clock-sweep's single atomic increment of
nextVictimBuffer. This is unsurprising and well-documented in the cache-replacement literature. - Hit ratio is identical (1.0) across all algorithms for these microbenchmark patterns because the working set fits in cache. These workloads do not exercise the property the alternative algorithms exist for -- they exercise only the per-access overhead of bookkeeping.
- The framework itself is not visible in the numbers. Clock-sweep on the multi-pool branch matches clock-sweep on master to within run-to-run noise on the same hardware, and the DEFAULT pool dispatches through the same vtable as user pools. The pluggability has no observable cost when the configured algorithm is clock.
- Where the alternative algorithms earn their cost is in scenarios this microbenchmark cannot construct: working sets larger than cache, scan resistance, mixed OLTP+analytics, and workloads with shifting hot/cold data. A forthcoming HammerDB TPC-C / TPC-H run on a NUMA box (configured for working sets >
shared_buffers) is what will exercise that property; the harness lives on thearc-benchbranch and the results will be appended to this page.
What the temporal-locality numbers reveal about per-hit cost
The temporal_locality pattern (5 phases of 500 random point queries each: phases 1-3 access disjoint key ranges, phases 4-5 re-visit phases 1 and 2) was designed to exercise ghost-list / frequency tracking, but on this run the working set fits and ghost lists never activate. What the numbers do reveal -- consistently across all four data distributions -- is a structural difference in per-hit bookkeeping cost:
| Distribution | clock | arc | car | lirs | lru | osic |
|---|---|---|---|---|---|---|
| random | baseline | +6.6 % | +5.2 % | +6.7 % | +7.7 % | +8.8 % |
| clustered | baseline | +5.0 % | +4.7 % | +5.3 % | +5.9 % | +4.9 % |
| low_cardinality | baseline | +5.1 % | +4.7 % | +12.6 % | +12.2 % | +12.4 % |
| high_null | baseline | +8.8 % | +14.1 % | +7.3 % | +3.7 % | +11.9 % |
The pattern is informative even though it isn't testing the intended property:
- ARC and CAR are the cheapest non-clock algorithms on the typical (random / clustered) distributions, ~5 % over clock. This matches their access-pattern: a hit on a page in T1 promotes once to T2 (one list-remove, one list-append), then later hits on the same page in T2 are no-ops on the data structure. Most of phases 4-5 land in T2 at zero structural cost.
- LRU is consistently 7-13 % slower than clock because every hit moves the page to the MRU end -- 2,500 list manipulations across the 5 phases, all under the spinlock. ARC and CAR avoid this because hits in T2 don't churn the list.
- LIRS pays ~13 % on
low_cardinalitybecause the HIR/LIR stack pruning logic does extra work on every reference to keep the LIR set bounded; that work is amortised away onrandombut the same key being hit repeatedly (low_cardinality) makes the per-hit case visible. high_nullis the noisy row. Absolute time is ~13 % larger across the board because NULL-bitmap processing dominates page-access time; a 4 ms swing on a 39 ms run is within run-to-run jitter at this scale. No algorithmic conclusion should be drawn from this row without a longer measurement.
The structural finding is the headline: frequency-aware algorithms (ARC, CAR) have lower per-hit overhead than recency-only algorithms (LRU) because they only mutate their data structure on the first hit, not every hit. Exactly the property that makes them attractive for OLTP workloads where the same hot pages are touched thousands of times.
Reproducing
The benchmark harness, including the 19 query patterns, the cache-stress workload generators, the DTrace/bpftrace profiling scripts, the HammerDB TPC-C/TPC-H integration, and the visualisation utilities, lives at:
- gburd/postgres, branch
arc-bench-- insrc/test/benchmarks/
To reproduce:
git clone https://github.com/gburd/postgres.git cd postgres git checkout arc-bench # builds on top of arc with the harness meson setup build --prefix=$PWD/install -Dcassert=true ninja -C build install cd src/test/benchmarks python3 -m benchmarks --shared-buffers 128MB --duration 30 \ --algorithms clock,arc,car,lirs,lru,osic \ --output /tmp/bp-results
Commit Series
The implementation is organised as 13 substantive commits on the arc branch (oldest to newest):
- Add pluggable buffer-pool framework with named pools -- core infrastructure:
BufferPoolRoutinevtable,pg_bufferpoolcatalog,CREATE/ALTER/DROP BUFFER POOLDDL, DSM-backed dynamic pools, per-pool trickle writer,buffer_poolreloption,pg_buffercache1.8 update,buffer_pool_algorithmstring GUC, register-by-handler-name registry. - Add REMAINDER pool, BufferPoolKind classification, and per-pool flags -- REMAINDER pool kind;
BufferPoolKindenum (DEFAULT/RECYCLE/USER);bp_target_buffers/bp_current_buffers/bp_oversubscribedadvisory counters;bp_use_direct_ioflag (plumbing only); REMAINDER keyword. - Add trickle_iter vtable callbacks for replacement algorithms -- new
trickle_iter_begin/next/endcallbacks; trickle writer dispatches through them when non-NULL. Backend-localroutinepointer, notpool->bp_routine, to avoid cross-backend invalid function pointers. - Add KEEP and RECYCLE built-in buffer pool routines --
keep_pool_routine/keep_pool_handler;recycle_pool_routinewith one-chance clock-sweep;recycle_pool_buffersGUC;BufferAccessStrategyrouting through the RECYCLE pool when configured; lookup bybp_kindnot by name. - Add overflow_buffer_pool reloption for the heap table AM -- new
overflow_buffer_poolreloption (heap only);rd_overflow_bufpool; newrelation_overflow_pooltableam callback (default heap implementation is a stub). - Route heap TOAST tables through overflow_buffer_pool --
create_toast_table()propagates the reloption to the TOAST relation asbuffer_pool;ALTER TABLEcascade. - Add JAM buffer pool replacement algorithm for TOAST workloads --
jam_pool.cinsrc/backend/storage/buffer/next to clock/KEEP/RECYCLE;jam_pool_handler. - Add pg_bp_lru contrib extension: LRU replacement algorithm
- Add pg_bp_osic contrib extension: One-Stage Insert Clock
- Add pg_bp_arc contrib extension: Adaptive Replacement Cache
- Add pg_bp_lirs contrib extension: Low Inter-reference Recency Set
- Add pg_bp_rcrd contrib extension: Recent Consecutive Reuse Distance (LIRS2)
- Add pg_bp_car contrib extension: Clock with Adaptive Replacement
Per-commit verification: pg_indent --check clean and ninja warning-free under -Werror at every commit boundary. Full meson test suite passes at the tip (361/361).
Implementation Files
Core (new)
| File | Description |
|---|---|
src/include/storage/bufpool.h |
BufferPoolRoutine vtable; buffer_pool_algorithm GUC declaration; RegisterDefaultPoolAlgorithm / LookupDefaultPoolAlgorithm; KEEP / RECYCLE / clock pool routine externs.
|
src/include/storage/bufpool_internals.h |
BufferPoolDesc, BufferPoolKind, PoolLocalState, hash-partition layout helpers.
|
src/include/catalog/pg_bufferpool.h |
pg_bufferpool catalog definition (oid, bpname, bphandler, bpsize).
|
src/include/catalog/pg_bufferpool.dat |
Initial DEFAULT row. |
src/include/commands/bufferpoolcmds.h |
CreateBufferPool, AlterBufferPool, DropBufferPool, RenameBufferPool.
|
src/backend/storage/buffer/bufpool.c |
Multi-pool descriptor array, DSM lifecycle (CreateDynamicBufferPool, DestroyDynamicBufferPool, ResizeDynamicBufferPool), per-backend lazy attach (EnsurePoolAttached), BufferAllocInPool hot path, trickle-writer BGW.
|
src/backend/storage/buffer/jam_pool.c |
JAM replacement routine (jam_pool_routine, jam_pool_handler).
|
src/backend/commands/bufferpoolcmds.c |
DDL implementation, name validation, catalog manipulation. |
Modified
| File | Key Changes |
|---|---|
src/backend/storage/buffer/bufmgr.c |
BufHdrGetBlock macro routes buf_id >= NBuffers through GetDynamicPoolBlock; CurrentBufferPool wrapped in PG_TRY/PG_FINALLY in PinBufferForBlock and ExtendBufferedRelShared.
|
src/backend/storage/buffer/freelist.c |
Clock-sweep wrapped in BufferPoolRoutine vtable (clock_pool_routine); BufferAccessStrategy recycles through RECYCLE pool when bp_kind == BUFPOOL_RECYCLE exists; KEEP and RECYCLE routine implementations; algo_registry[] name → routine table; RegisterDefaultPoolAlgorithm / LookupDefaultPoolAlgorithm.
|
src/backend/parser/gram.y |
CREATE BUFFER POOL, ALTER BUFFER POOL, DROP BUFFER POOL, RENAME, REMAINDER productions; BUFFER, POOL, SIZE, REMAINDER unreserved keywords.
|
src/backend/access/common/reloptions.c |
buffer_pool, overflow_buffer_pool reloption types.
|
src/backend/access/heap/heapam_handler.c |
relation_overflow_pool tableam callback (currently NULL stub).
|
src/backend/catalog/toasting.c |
create_toast_table reads overflow_buffer_pool from parent or AM and writes it as buffer_pool on the TOAST relation.
|
src/backend/commands/tablecmds.c |
ATExecSetRelOptions cascades overflow_buffer_pool to TOAST.
|
src/backend/utils/cache/relcache.c |
rd_bufpool, rd_overflow_bufpool resolution.
|
src/backend/utils/misc/guc_parameters.dat |
buffer_pool_algorithm string GUC; recycle_pool_buffers; trickle-writer tunables.
|
src/backend/utils/misc/postgresql.conf.sample |
New parameters added in their natural sections. |
contrib/pg_buffercache/ |
1.7 → 1.8 update for per-pool columns. |
contrib/pg_bp_arc/ |
ARC extension; Apache 2.0 notice for the Caffeine ArcPolicy.java reference; patent disposition. |
contrib/pg_bp_car/ |
CAR extension; patent disposition. |
contrib/pg_bp_lirs/ |
LIRS extension. |
contrib/pg_bp_lru/ |
LRU extension. |
contrib/pg_bp_osic/ |
OSIC extension. |
contrib/pg_bp_rcrd/ |
RCRD / LIRS2 extension. |
References
Research Papers
- N. Megiddo and D.S. Modha, "ARC: A Self-Tuning, Low Overhead Replacement Cache", USENIX FAST 2003.
- N. Megiddo and D.S. Modha, "Outperforming LRU with an Adaptive Replacement Cache Algorithm", IEEE Computer 2004.
- S. Bansal and D.S. Modha, "CAR: Clock with Adaptive Replacement", USENIX FAST 2004.
- S. Jiang and X. Zhang, "LIRS: An Efficient Low Inter-reference Recency Set Replacement Policy to Improve Buffer Cache Performance", ACM SIGMETRICS 2002.
- C. Zhong, X. Zhao, and S. Jiang, "LIRS2: An Improved LIRS Replacement Algorithm", 2013.
- T. Johnson and D. Shasha, "2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm", VLDB 1994.
- E.J. O'Neil, P.E. O'Neil, and G. Weikum, "The LRU-K Page Replacement Algorithm for Database Disk Buffering", ACM SIGMOD 1993.
PostgreSQL Mailing List
- Experimental ARC implementation (Jan Wieck, 2003)
- 8.0.X and the ARC patent (Bruce Momjian, Tom Lane, 2005)
- Design notes for BufMgrLock rewrite (Tom Lane, 2005)
- Our buffer replacement strategy is kind of lame (Robert Haas, 2011)
- Buffer manager scalability and correlated reference period (Peter Geoghegan, 2014)
- Adding basic NUMA awareness (Tomas Vondra, 2025)
- Let's get rid of the freelist and the buffer_strategy_lock (Greg Burd, 2025)
- Batched clock sweep to reduce cross-socket atomic contention (Greg Burd, 2026)
Reference Implementations
- Caffeine cache library, B. Manes (Java).
- Caffeine ArcPolicy.java -- reference ARC implementation under Apache 2.0; the patent posture quoted earlier in this page is from this file's docstring.
- Caffeine: comparative efficiency of replacement algorithms -- empirical comparison of LRU, ARC, LIRS, W-TinyLFU, and others.
Patents
ARC family (IBM):
- US 6,996,676 -- "System and method for implementing an adaptive replacement cache policy". Filed 2002, granted 2006, expired February 22, 2024.
- US 7,096,321 -- continuation; also expired.
- US 7,058,766 -- CAR; also expired.
- US 8,612,689 -- continuation; also expired.
Sun Microsystems:
- US 7,469,320 -- ZFS-related; also out of force.
The implementation in this series is included on the basis that the patents have expired; downstreams should still confirm with their own counsel before relying on that conclusion in their jurisdiction.