Multiple Buffer Pools

From PostgreSQL wiki
Jump to navigationJump to search

Multiple Buffer Pools

Request for Comments [RFC] NOT MERGED

Summary

Multiple buffer pools extend PostgreSQL's buffer manager so that the cluster's shared_buffers can be partitioned into named pools, each governed by its own pluggable replacement algorithm. Tables and indexes can be assigned to a specific pool via the buffer_pool reloption, and TOAST overflow data can be routed to a dedicated pool via overflow_buffer_pool (default JAM algorithm).

Six replacement algorithms ship as contrib extensions alongside the built-in clock-sweep -- ARC, CAR, LIRS, LIRS2 (RCRD), LRU, OSIC -- plus three core-built-in routines for special-purpose pools:

  • KEEP -- never evicts; for pinning a small hot working set in memory.
  • RECYCLE -- one-chance clock sweep that replaces per-backend ring buffers for sequential scans, VACUUM, and bulk writes.
  • JAM -- accelerated-decay clock tuned for the write-heavy, read-few access pattern of TOAST data.

Each user-created pool runs its own trickle-writer background worker for dirty-page flushing. Pools are created, resized (drop-and-recreate), renamed, and dropped via SQL DDL; metadata is persisted in a new pg_bufferpool system catalog and recreated automatically on server restart.

The replacement-algorithm framework is a thin vtable layered on top of the existing buffer manager: when the DEFAULT pool runs clock-sweep (the configured default), the framework's hot path is one indirect call per StrategyGetBuffer call. Benchmark numbers below show the cost of that indirection is in the noise; the cost of running an alternative algorithm comes from the algorithm's own bookkeeping, not from the framework.

Status

  • Patch status: Work in progress; pre-CommitFest design preview.
  • Target release: PostgreSQL 19 (speculative).
  • Branch name: arc.
  • Repository: gburd/postgres, branch arc.
  • Mailing list thread: To be posted to pgsql-hackers; this wiki page is the design preview.
  • CommitFest: Not yet submitted.
  • Tests: Core regress passes (361/361 in meson test); per-commit verification confirms pg_indent-clean and warning-free compile under -Werror at every commit boundary.

Try It / Build

The framework is always available; the DEFAULT pool runs clock-sweep unless the operator changes buffer_pool_algorithm. Alternative algorithms ship as contrib extensions and must be loaded via shared_preload_libraries before they can be selected.

 git clone https://github.com/gburd/postgres.git
 cd postgres
 git checkout arc
 meson setup build --prefix=$PWD/install -Dcassert=true
 ninja -C build install
 initdb -D /path/to/data
 pg_ctl -D /path/to/data start

Quick smoke test:

 -- Use the default clock-sweep DEFAULT pool plus a private ARC pool
 CREATE EXTENSION pg_bp_arc;          -- after restart with shared_preload_libraries
 CREATE BUFFER POOL hot_data
     HANDLER arc_pool_handler SIZE '67108864';
 CREATE TABLE events (id bigint, ts timestamptz, payload jsonb)
     WITH (buffer_pool = 'hot_data');
 SELECT name, nbuffers, hits, evictions FROM pg_stat_bufferpool;

Benchmark harness (Python + DTrace/bpftrace + HammerDB integration) lives on a companion branch and is referenced from this page rather than included in the patch series:

Background and Motivation

Clock-Sweep (2005)

PostgreSQL has used a clock-sweep buffer replacement algorithm since version 8.1, when Tom Lane redesigned the buffer manager to eliminate the monolithic BufMgrLock and replace the short-lived ARC implementation with a scalable, lock-light alternative.

The clock-sweep algorithm uses an atomic nextVictimBuffer pointer and per-buffer usage_count fields. On each buffer allocation, the clock hand advances; buffers with usage_count > 0 get a second chance (count decremented), while those at zero are evicted. This approach is simple, has low per-operation overhead, and scales well on multi-core systems.

However, clock-sweep is a single policy applied uniformly to all data. It cannot distinguish between a frequently-accessed index root page and a sequential-scan page that will never be re-read. Robert Haas demonstrated this limitation in 2011, showing that only 96 of 14,286 buffers for a table were retained in cache despite over 37,000 completely unused buffers being available.

The ARC Patent

In 2003 Jan Wieck implemented the Adaptive Replacement Cache (ARC) algorithm for PostgreSQL 7.4/8.0, based on research from IBM Almaden by Nimrod Megiddo and Dharmendra Modha.

The implementation was removed in early 2005 after the community learned of IBM's pending patent (US 6,996,676, granted February 2006). Tom Lane developed a 2Q replacement as a bridge, then designed the clock-sweep algorithm that has been in use since.

The ARC patent reached its 20-year statutory term on February 22, 2024 and has expired. Sun Microsystems' related US 7,469,320 (also expired) covered work that had been shipping for years inside ZFS under the CDDL, which the broader community has long treated as authoritative public-domain prior art.

This series implements ARC as one of the contrib replacement algorithms. The implementation in contrib/pg_bp_arc follows the pseudocode in the FAST 2003 paper and was cross-checked against the Caffeine project's reference implementation (ArcPolicy.java); the Apache 2.0 license header from that file is reproduced verbatim in pg_bp_arc.c per the Apache 2.0 grant.

A note on patent posture

Ben Manes, who maintains the Caffeine cache library, has publicly summarised the ARC patent landscape this way (paraphrased from the file-level docstring of ArcPolicy.java):

This algorithm is patented by IBM (6996676, 7096321, 7058766, 8612689) and Sun (7469320), making its use in applications ambiguous due to Sun's ZFS providing an implementation under the CDDL.

That is the right framing for our community as well. For the years 2005-2023 the patent posture made ARC unsafe to ship in PostgreSQL even though a CDDL implementation existed in the wild. As of 2024 the IBM patent family has reached its 20-year term and expired, and Sun's patent is similarly out of force; the prior art landscape is now unambiguous. We include this implementation on that basis and recommend the same conclusion. Reviewers and downstreams should still confirm with their own counsel before relying on that conclusion in their jurisdiction.

Ring Buffers and Bulk I/O

In 2007 Tom Lane introduced BufferAccessStrategy objects -- small per-backend ring buffers that prevent large sequential scans, VACUUM, and bulk writes from evicting the entire shared buffer cache. The RECYCLE pool in this series generalises that ring-buffer concept into a shared, named pool with its own replacement policy: a one-chance clock-sweep that gives each page exactly one cycle to be re-accessed before it becomes a victim.

Related Work

Several prior efforts have explored improvements to PostgreSQL's buffer management:

Year Proposal Authors Outcome
2000 LRU-2 experiment Tom Lane No measurable improvement; not pursued
2002 LRU-K / 2Q for scan resistance Amit Khare, Tom Lane, Neil Padgett Informed 2005 2Q bridge patch
2013 LIRS page replacement Ants Aasma, Tom Lane Discussion only
2014 Correlated reference period / LRU-K Peter Geoghegan, Amit Kapila Prototype only
2014 Lockless StrategyGetBuffer Andres Freund Committed (atomic nextVictimBuffer)
2024 Dynamic shared_buffers without restart Dmitry Dolgov Multiple shared memory segments
2025 NUMA-aware buffer management Tomas Vondra, Jakub Wartak Partitioned freelists per NUMA node
2025 Freelist removal / buffer_strategy_lock elimination Greg Burd Atomic-only clock sweep
2025 Eagerly evict bulkwrite ring Melanie Plageman Batch eviction for async I/O
2026 Batched clock sweep Greg Burd, Jim Mlodgenski 16-20% pgbench improvement on NUMA

Architecture Overview

                        ┌──────────────────────────────┐
                        │        SQL Interface          │
                        │  CREATE / ALTER / DROP        │
                        │  BUFFER POOL                  │
                        └──────────────┬───────────────┘
                                       │
              ┌────────────────────────┼───────────────────────────┐
              │                        │                           │
              ▼                        ▼                           ▼
   ┌──────────────────┐    ┌──────────────────┐       ┌──────────────────┐
   │  DEFAULT pool    │    │  User pools      │       │  RECYCLE pool    │
   │  (shared_buffers)│    │  (DSM-backed)    │       │  (bulk I/O)      │
   │                  │    │                  │       │                  │
   │  clock-sweep     │    │  any handler:    │       │  one-chance      │
   │  (or any         │    │  ARC, CAR, LIRS, │       │  clock-sweep     │
   │  registered      │    │  LRU, OSIC, RCRD,│       │                  │
   │  algorithm via   │    │  KEEP, JAM, ...  │       │                  │
   │  GUC)            │    │                  │       │                  │
   └────────┬─────────┘    └────────┬─────────┘       └────────┬─────────┘
            │                       │                          │
            └───────────────────────┼──────────────────────────┘
                                    │
                          ┌─────────┴──────────┐
                          │ BufferPoolRoutine   │
                          │ vtable (per pool)   │
                          │                     │
                          │  get_victim()       │
                          │  on_hit / on_miss   │
                          │  on_evict           │
                          │  on_new_tag         │
                          │  trickle_iter_*     │
                          │  hint_vacuum        │
                          │  reject_buffer      │
                          │  prefetch_hint      │
                          │  shmem_size / init  │
                          └─────────┬───────────┘
                                    │
                          ┌─────────┴──────────┐
                          │  Trickle Writer    │
                          │  (per dynamic pool │
                          │  background worker)│
                          │                     │
                          │  algorithm-directed │
                          │  dirty page flush   │
                          │  oversubscription   │
                          │  nudging            │
                          └─────────────────────┘

The DEFAULT pool always exists at slot 0 and uses the existing shared_buffers allocation and global BufferDescriptors[] array. User-created pools allocate their own dynamic shared memory (DSM) segments containing private buffer descriptors, data blocks, hash table partitions, and algorithm state. Each backend lazily attaches to a pool's DSM segment on first access.

SQL Interface

Creating a Pool

 CREATE BUFFER POOL name HANDLER handler_function SIZE 'size';
 CREATE BUFFER POOL name REMAINDER HANDLER handler_function;

The HANDLER names a function that returns a BufferPoolRoutine vtable. SIZE is specified as a bare integer string in bytes (e.g. '67108864' for 64 MB); the smallest pool is 16 buffers (128 KB at 8 KB/buffer). The REMAINDER form sizes the pool to whatever shared-buffer budget is not already claimed by other named pools; only one REMAINDER pool can exist at a time.

 -- 64 MB ARC pool for hot OLTP data
 CREATE BUFFER POOL hot_data HANDLER arc_pool_handler SIZE '67108864';
 -- catch-all for the unclaimed remainder of shared_buffers
 CREATE BUFFER POOL leftovers HANDLER lru_pool_handler REMAINDER;

Pool metadata is persisted in pg_bufferpool and pools are recreated on server restart. Up to 64 pools (MAX_BUFFER_POOLS) can exist simultaneously.

Resizing a Pool

 ALTER BUFFER POOL name SET SIZE 'new_size';

This is implemented as drop-and-recreate, not online resize: DSM has no resize API, so the pool's segment is torn down and recreated at the new size. All pages in the pool are flushed and the pool starts cold. This is intentionally not advertised as "online" -- users should expect a cold-cache transient on the affected pool immediately after.

Renaming and Dropping

 ALTER BUFFER POOL name RENAME TO new_name;
 DROP BUFFER POOL [IF EXISTS] name [CASCADE | RESTRICT];

All dirty pages are flushed on drop. The pool cannot be dropped if any buffers are still pinned. Tables previously assigned to a dropped pool fall back to the DEFAULT pool.

Assigning Tables to Pools

 -- assign at table creation
 CREATE TABLE events (id bigint, ts timestamptz, data jsonb)
     WITH (buffer_pool = 'hot_data');
 -- route TOAST overflow to a different pool
 CREATE TABLE documents (id int, body text)
     WITH (overflow_buffer_pool = 'toast_pool');
 -- move an existing table
 ALTER TABLE events SET (buffer_pool = 'archive_pool');
 -- back to the default pool
 ALTER TABLE events RESET (buffer_pool);

The overflow_buffer_pool reloption applies only to heap relations and propagates automatically to the table's TOAST relation as that TOAST relation's buffer_pool. ALTER TABLE ... SET / RESET (overflow_buffer_pool = ...) cascades the change.

Selecting the DEFAULT Pool's Algorithm

The DEFAULT pool's replacement algorithm is selected by the buffer_pool_algorithm GUC (string, PGC_POSTMASTER, default 'clock'). Only clock is built in; any other name must come from an extension loaded via shared_preload_libraries:

 # postgresql.conf
 shared_preload_libraries = 'pg_bp_arc'
 buffer_pool_algorithm    = 'arc'

If the GUC names an algorithm that has not been registered (typo, or extension missing from shared_preload_libraries) the server falls back to clock-sweep with a WARNING rather than refusing to start.

Pool Types

DEFAULT Pool

The DEFAULT pool is the standard shared_buffers allocation. It always exists at slot 0 and is not visible as a target of CREATE BUFFER POOL / DROP BUFFER POOL. Its replacement algorithm is selected at startup via buffer_pool_algorithm.

User Pools

User pools are created with CREATE BUFFER POOL and backed by DSM segments. Each pool has:

  • its own buffer descriptors and data blocks;
  • a multi-partition open-addressed hash table sized by ComputePoolPartitions();
  • algorithm-private state in the same DSM segment;
  • a dedicated trickle-writer background worker;
  • atomic statistics counters (reads, hits, evictions).

A new BufferPoolKind classification distinguishes BUFPOOL_DEFAULT, BUFPOOL_RECYCLE (the well-known system pool, looked up by kind rather than name), and BUFPOOL_USER (catalog-backed user pools).

REMAINDER Pool

A REMAINDER pool claims whatever shared-buffer budget is not already allocated to other user pools. Only one REMAINDER pool can exist at a time; attempting to create a second produces an error referencing the existing one. Its size is recomputed at creation and is fixed for the life of the pool.

RECYCLE Pool

The RECYCLE pool replaces per-backend ring buffers (BufferAccessStrategy) for sequential scans, VACUUM, and bulk-write operations with a single shared pool. Enabled by sizing it through the recycle_pool_buffers GUC:

 # postgresql.conf
 recycle_pool_buffers = 4096    # 0 disables; restart required

Its replacement routine is a one-chance clock sweep: usage_count is cleared (not decremented) on the sweep, so each page gets exactly one cycle to be re-accessed before it becomes a victim. Aggressive enough not to pollute the main pool while still being a real cache rather than a private throwaway ring.

KEEP Pool

KEEP is a built-in routine that never evicts. Pages stay resident until either the pool is dropped or a free buffer is loaded into a slot that has not yet been claimed. When the pool is full and no free buffer is available, KEEP raises ERRCODE_INSUFFICIENT_RESOURCES rather than evicting an existing page. Useful for pinning a small hot working set entirely in memory:

 CREATE BUFFER POOL hot_lookups
     HANDLER keep_pool_handler SIZE '8388608';
 ALTER TABLE country_codes SET (buffer_pool = 'hot_lookups');

JAM (TOAST-Optimised)

JAM is a clock-sweep variant tuned for the write-heavy, read-few, temporal-locality access pattern of TOAST data:

  • new buffers are inserted with usage_count = 1 because few TOAST pages are re-read;
  • on hit, usage_count is boosted modestly and capped at 3;
  • the sweep halves usage_count instead of decrementing it, so pages age out faster than in plain clock-sweep;
  • the trickle iterator yields dirty, cold pages (usage_count == 0) to the trickle writer in algorithm-meaningful order.

The implementation lives in src/backend/storage/buffer/jam_pool.c next to the other built-in routines (clock, KEEP, RECYCLE). It is not heap-specific code and does not depend on heapam internals, even though TOAST is its primary motivating workload. Reachable from SQL via the jam_pool_handler proc:

 CREATE BUFFER POOL toast_jam HANDLER jam_pool_handler SIZE '67108864';
 CREATE TABLE docs (id int, body text)
     WITH (overflow_buffer_pool = 'toast_jam');

JAM is not yet auto-assigned as the heap AM's default overflow pool; the heap AM's relation_overflow_pool callback is a stub awaiting a separate auto-creation mechanism.

Replacement Algorithm Extensions

Six replacement algorithms ship as contrib extensions:

Extension Handler Algorithm Notes
pg_bp_lru lru_pool_handler LRU Classic Least Recently Used; primarily a baseline
pg_bp_osic osic_pool_handler OSIC One-Stage Insert Clock; clock-sweep simplicity with insert-time admission
pg_bp_arc arc_pool_handler ARC Adaptive Replacement Cache; patent expired Feb 2024
pg_bp_lirs lirs_pool_handler LIRS Low Inter-reference Recency Set; strong scan resistance
pg_bp_rcrd rcrd_pool_handler LIRS2 (RCRD) Second-generation LIRS; cheaper bookkeeping under contention
pg_bp_car car_pool_handler CAR Clock with Adaptive Replacement; ARC's adaptivity, clock's lock profile

Each extension registers itself by name from its _PG_init() hook via RegisterDefaultPoolAlgorithm("name", &routine). Names are looked up by string; core does not carry a fixed enum of contrib algorithm identifiers.

Writing a Custom Extension

A replacement algorithm extension implements the BufferPoolRoutine vtable defined in src/include/storage/bufpool.h:

typedef struct BufferPoolRoutine
{
    NodeTag     type;

    /* Access tracking (all optional, may be NULL) */
    void       (*on_hit)(void *state, int buf_id, struct buftag *tag);
    void       (*on_miss)(void *state, struct buftag *tag);
    void       (*on_evict)(void *state, int buf_id, struct buftag *old_tag);
    void       (*on_new_tag)(void *state, int buf_id,
                             struct buftag *new_tag, bool vacuum_hint);

    /* Core eviction (required) */
    BufferDesc *(*get_victim)(void *state,
                              BufferAccessStrategy strategy,
                              uint64 *buf_state, bool *from_ring);

    /* Trickle writer dispatch */
    int         (*sync_start)(void *state, uint32 *complete_passes,
                              uint32 *num_alloc);
    void        (*notify_trickle)(void *state, int bgwprocno);
    void       *(*trickle_iter_begin)(void *state, int max_candidates);
    int         (*trickle_iter_next)(void *state, void *iter);
    void        (*trickle_iter_end)(void *state, void *iter);

    /* Hints (optional) */
    void        (*hint_vacuum)(void *state, bool vacuum_active);
    bool        (*reject_buffer)(void *state, BufferAccessStrategy strategy,
                                 BufferDesc *buf, bool from_ring);
    void        (*prefetch_hint)(void *state, struct buftag *tags, int ntags);

    /* Lifecycle (required) */
    Size        (*shmem_size)(int nbuffers);
    void        (*shmem_init)(void *state, int nbuffers, int first_buf_id,
                              bool init);
    void        (*shutdown)(void *state);
} BufferPoolRoutine;

The trickle_iter_* callbacks are optional. When provided, they let the algorithm direct flush order to its actually-cold pages -- LRU's tail, LIRS's HIR queue, ARC's T1 LRU end -- which is more useful than blindly walking buffer IDs. When NULL the trickle writer falls back to a linear scan.

A new typedef list entry per added type ensures pg_indent remains clean across the series.

Trickle Writer

Each user-created pool spawns a dedicated background worker that periodically flushes dirty pages via SyncOneBuffer:

  • Algorithm-directed flushing through trickle_iter_begin/next/end when the routine provides them, or a linear scan over the pool's descriptor range otherwise.
  • Oversubscription nudging (advisory): when a pool's actual buffer count exceeds its target -- which can happen transiently after another pool is dropped -- the trickle writer prefers excess buffers as eviction candidates and decrements the counter as it does. This is best-effort accounting, not strict enforcement; pinned buffers stay put and the counter is updated honestly.
  • Cross-backend safety: the trickle writer dispatches through a backend-local routine pointer resolved at startup, not through pool->bp_routine read from shared memory. An extension's function pointers are only valid in the address space of the backend that loaded the .so, so reading them from shared memory in another backend is unsafe.
  • Critical-section safety: DropRelationBuffers() and DropRelationsAllBuffers() are called from inside the START_CRIT_SECTION()/END_CRIT_SECTION() block in RelationTruncate. Inside a critical section palloc asserts on CritSectionCount == 0 || allowInCritSection; the dsm_attach path in EnsurePoolAttached would assert. These two paths use TryGetPoolAttached only and skip pools the current backend has not touched, relying on the invariant that any backend that ever placed a buffer in pool P has already attached to P via BufferAllocInPool. An autovacuum worker truncating a system catalog will see no user pools attached, find no buffers to invalidate in any of them, and exit the loop without touching DSM -- which is correct because system catalogs always live in the DEFAULT pool.
  • Vacuum hint: heap_vacuum_rel brackets lazy_scan_heap with PoolHintVacuum(rel->rd_bufpool, true / false) so frequency-aware algorithms (ARC, CAR, LIRS, RCRD) can bias VACUUM-loaded pages toward the LRU end of their recency list. The hint is a no-op for clock-sweep, KEEP, RECYCLE, and JAM.
  • Failure mode: if RegisterDynamicBackgroundWorker fails (typically max_worker_processes exhausted) the user gets a WARNING with an errhint suggesting the fix; the pool still works without proactive flushing.

TOAST and Overflow Routing

 CREATE BUFFER POOL txn_pool  HANDLER arc_pool_handler  SIZE '536870912';
 CREATE BUFFER POOL blob_pool HANDLER lru_pool_handler  SIZE '268435456';
 CREATE TABLE orders (
     id          bigint,
     metadata    jsonb,
     attachment  bytea
 ) WITH (
     buffer_pool          = 'txn_pool',
     overflow_buffer_pool = 'blob_pool'
 );

When create_toast_table() runs it picks up the parent's overflow_buffer_pool reloption and propagates it to the new TOAST relation as the TOAST table's buffer_pool. If the parent does not specify one, the heap AM's relation_overflow_pool callback is consulted; if both produce nothing the TOAST table uses the DEFAULT pool.

ALTER TABLE ... SET / RESET (overflow_buffer_pool = ...) cascades to the TOAST relation, keeping parent and TOAST aligned for as long as the user operates on the parent.

Monitoring

pg_stat_bufferpool

 SELECT name, nbuffers,
        reads, hits,
        round(100.0 * hits / nullif(reads + hits, 0), 2) AS hit_pct,
        evictions,
        target_buffers, current_buffers, oversubscribed
 FROM pg_stat_bufferpool
 ORDER BY name;

Detecting Unhealthy Cache Behaviour

 -- Pools with hit ratio below 90%
 SELECT name, nbuffers,
        round(100.0 * hits / nullif(reads + hits, 0), 2) AS hit_pct
 FROM pg_stat_bufferpool
 WHERE reads + hits > 1000
   AND 100.0 * hits / nullif(reads + hits, 0) < 90
 ORDER BY hit_pct;
 -- Oversubscribed pools (currently holding more buffers than their target)
 SELECT name, target_buffers, current_buffers, oversubscribed
 FROM pg_stat_bufferpool
 WHERE oversubscribed;

Catalog

pg_bufferpool

Column Type Description
oid oid Unique object identifier
bpname name Pool name (unique)
bphandler regproc Handler function OID
bpsize int8 Size in bytes (0 means "unbounded / managed by REMAINDER" or DEFAULT)
 SELECT bpname, bphandler::regproc, bpsize FROM pg_bufferpool;

GUC Parameters

Parameter Type Default Description
buffer_pool_algorithm string 'clock' Replacement algorithm name for the DEFAULT pool. PGC_POSTMASTER; only built-in is clock; extensions register additional names from shared_preload_libraries. Misconfigured values fall back to clock with a WARNING.
recycle_pool_buffers int 0 Buffers reserved for the RECYCLE pool. 0 disables the pool and restores per-backend ring buffers. PGC_POSTMASTER.
trickle_flush_after int varies Pages the trickle writer flushes before issuing an OS writeback hint (mirrors backend_flush_after).
trickle_write_batch_size int 128 Maximum dirty pages written per trickle-writer cycle.

Performance

The framework's contribution to overhead is small and concentrated in three places:

  1. One indirect call (ActivePoolRoutine->get_victim) per StrategyGetBuffer on the DEFAULT pool;
  2. A predicted-not-taken branch in BufHdrGetBlock() for buffers belonging to dynamic pools (buf_id >= NBuffers);
  3. DSM-attach bookkeeping the first time each backend touches a dynamic pool (one-time, lazy).

The cost of running an alternative algorithm comes from that algorithm's own access-tracking, not from the framework. The numbers below quantify both.

Microbenchmark: pluggability cost

A 19-pattern Python microbenchmark suite (pgbench-style queries against a 100k-row table inside a 128 MB shared_buffers, working set fits in cache so all hit_ratio values are 1.0) compares clock-sweep against each contrib algorithm. Each datapoint is the median of multiple runs; the test instance is the same single-socket Linux box across all algorithms.

Mean over all 19 patterns:

Algorithm Mean (s) vs clock
clock 0.06703 baseline
arc 0.07436 +10.9 %
car 0.07542 +12.5 %
lirs 0.07525 +12.3 %
lru 0.07565 +12.9 %
osic 0.07582 +13.1 %

Per-pattern breakdown of the relative slowdown vs. clock-sweep, with clock-sweep absolute time on the left for context:

Pattern clock (s) arc car lirs lru osic
full_scan 0.1976 +3.2 % +4.6 % +2.9 % +4.0 % +4.0 %
filtered_scan 0.1422 +2.6 % +3.2 % +3.2 % +3.5 % +3.9 %
column_projection 0.0553 +2.7 % +2.2 % +4.0 % +4.8 % +4.5 %
zipfian_access 0.0549 +4.5 % +5.2 % +4.5 % +3.6 % +4.7 %
working_set_80_20 0.0258 +5.9 % +7.8 % +6.3 % +7.3 % +4.9 %
temporal_locality 0.0320 +6.4 % +7.4 % +8.0 % +7.3 % +9.6 %
capacity_pressure 0.0248 +7.9 % +7.6 % +8.9 % +10.8 % +8.8 %
sequential_then_random 0.0798 +10.6 % +12.7 % +12.4 % +13.4 % +12.1 %
write_heavy_update 0.0203 +10.0 % +13.0 % +11.8 % +13.7 % +14.0 %
sequential_scan_burst 0.1438 +11.5 % +13.9 % +12.4 % +12.1 % +12.9 %
repeated_hotset 0.1299 +13.0 % +13.9 % +13.8 % +14.2 % +15.9 %
working_set_shift 0.0041 +13.6 % +19.3 % +15.3 % +17.0 % +19.8 %
group_by 0.0628 +12.6 % +17.1 % +13.9 % +14.4 % +14.3 %
aggregation 0.0257 +17.3 % +19.6 % +20.3 % +22.9 % +21.7 %
concurrent_mixed 0.0658 +18.9 % +20.1 % +21.2 % +22.1 % +23.0 %
cyclic_loop 0.1606 +23.2 % +25.5 % +26.4 % +26.8 % +26.3 %
mixed_oltp_scan 0.0435 +29.7 % +30.8 % +35.9 % +37.9 % +38.7 %

What this is and isn't measuring:

  • All five non-clock algorithms have a roughly equal envelope of overhead (10-13 % on the average over all patterns; up to ~30-40 % on the worst-case patterns). That envelope is the cost of doing access-tracking at all (linked-list manipulation, ghost-list lookups, per-access spinlock contention) compared to clock-sweep's single atomic increment of nextVictimBuffer. This is unsurprising and well-documented in the cache-replacement literature.
  • Hit ratio is identical (1.0) across all algorithms for these microbenchmark patterns because the working set fits in cache. These workloads do not exercise the property the alternative algorithms exist for -- they exercise only the per-access overhead of bookkeeping.
  • The framework itself is not visible in the numbers. Clock-sweep on the multi-pool branch matches clock-sweep on master to within run-to-run noise on the same hardware, and the DEFAULT pool dispatches through the same vtable as user pools. The pluggability has no observable cost when the configured algorithm is clock.
  • Where the alternative algorithms earn their cost is in scenarios this microbenchmark cannot construct: working sets larger than cache, scan resistance, mixed OLTP+analytics, and workloads with shifting hot/cold data. A forthcoming HammerDB TPC-C / TPC-H run on a NUMA box (configured for working sets > shared_buffers) is what will exercise that property; the harness lives on the arc-bench branch and the results will be appended to this page.

What the temporal-locality numbers reveal about per-hit cost

The temporal_locality pattern (5 phases of 500 random point queries each: phases 1-3 access disjoint key ranges, phases 4-5 re-visit phases 1 and 2) was designed to exercise ghost-list / frequency tracking, but on this run the working set fits and ghost lists never activate. What the numbers do reveal -- consistently across all four data distributions -- is a structural difference in per-hit bookkeeping cost:

Distribution clock arc car lirs lru osic
random baseline +6.6 % +5.2 % +6.7 % +7.7 % +8.8 %
clustered baseline +5.0 % +4.7 % +5.3 % +5.9 % +4.9 %
low_cardinality baseline +5.1 % +4.7 % +12.6 % +12.2 % +12.4 %
high_null baseline +8.8 % +14.1 % +7.3 % +3.7 % +11.9 %

The pattern is informative even though it isn't testing the intended property:

  • ARC and CAR are the cheapest non-clock algorithms on the typical (random / clustered) distributions, ~5 % over clock. This matches their access-pattern: a hit on a page in T1 promotes once to T2 (one list-remove, one list-append), then later hits on the same page in T2 are no-ops on the data structure. Most of phases 4-5 land in T2 at zero structural cost.
  • LRU is consistently 7-13 % slower than clock because every hit moves the page to the MRU end -- 2,500 list manipulations across the 5 phases, all under the spinlock. ARC and CAR avoid this because hits in T2 don't churn the list.
  • LIRS pays ~13 % on low_cardinality because the HIR/LIR stack pruning logic does extra work on every reference to keep the LIR set bounded; that work is amortised away on random but the same key being hit repeatedly (low_cardinality) makes the per-hit case visible.
  • high_null is the noisy row. Absolute time is ~13 % larger across the board because NULL-bitmap processing dominates page-access time; a 4 ms swing on a 39 ms run is within run-to-run jitter at this scale. No algorithmic conclusion should be drawn from this row without a longer measurement.

The structural finding is the headline: frequency-aware algorithms (ARC, CAR) have lower per-hit overhead than recency-only algorithms (LRU) because they only mutate their data structure on the first hit, not every hit. Exactly the property that makes them attractive for OLTP workloads where the same hot pages are touched thousands of times.

Reproducing

The benchmark harness, including the 19 query patterns, the cache-stress workload generators, the DTrace/bpftrace profiling scripts, the HammerDB TPC-C/TPC-H integration, and the visualisation utilities, lives at:

To reproduce:

 git clone https://github.com/gburd/postgres.git
 cd postgres
 git checkout arc-bench       # builds on top of arc with the harness
 meson setup build --prefix=$PWD/install -Dcassert=true
 ninja -C build install
 cd src/test/benchmarks
 python3 -m benchmarks --shared-buffers 128MB --duration 30 \
                       --algorithms clock,arc,car,lirs,lru,osic \
                       --output /tmp/bp-results

Commit Series

The implementation is organised as 13 substantive commits on the arc branch (oldest to newest):

  1. Add pluggable buffer-pool framework with named pools -- core infrastructure: BufferPoolRoutine vtable, pg_bufferpool catalog, CREATE/ALTER/DROP BUFFER POOL DDL, DSM-backed dynamic pools, per-pool trickle writer, buffer_pool reloption, pg_buffercache 1.8 update, buffer_pool_algorithm string GUC, register-by-handler-name registry.
  2. Add REMAINDER pool, BufferPoolKind classification, and per-pool flags -- REMAINDER pool kind; BufferPoolKind enum (DEFAULT / RECYCLE / USER); bp_target_buffers / bp_current_buffers / bp_oversubscribed advisory counters; bp_use_direct_io flag (plumbing only); REMAINDER keyword.
  3. Add trickle_iter vtable callbacks for replacement algorithms -- new trickle_iter_begin/next/end callbacks; trickle writer dispatches through them when non-NULL. Backend-local routine pointer, not pool->bp_routine, to avoid cross-backend invalid function pointers.
  4. Add KEEP and RECYCLE built-in buffer pool routines -- keep_pool_routine / keep_pool_handler; recycle_pool_routine with one-chance clock-sweep; recycle_pool_buffers GUC; BufferAccessStrategy routing through the RECYCLE pool when configured; lookup by bp_kind not by name.
  5. Add overflow_buffer_pool reloption for the heap table AM -- new overflow_buffer_pool reloption (heap only); rd_overflow_bufpool; new relation_overflow_pool tableam callback (default heap implementation is a stub).
  6. Route heap TOAST tables through overflow_buffer_pool -- create_toast_table() propagates the reloption to the TOAST relation as buffer_pool; ALTER TABLE cascade.
  7. Add JAM buffer pool replacement algorithm for TOAST workloads -- jam_pool.c in src/backend/storage/buffer/ next to clock/KEEP/RECYCLE; jam_pool_handler.
  8. Add pg_bp_lru contrib extension: LRU replacement algorithm
  9. Add pg_bp_osic contrib extension: One-Stage Insert Clock
  10. Add pg_bp_arc contrib extension: Adaptive Replacement Cache
  11. Add pg_bp_lirs contrib extension: Low Inter-reference Recency Set
  12. Add pg_bp_rcrd contrib extension: Recent Consecutive Reuse Distance (LIRS2)
  13. Add pg_bp_car contrib extension: Clock with Adaptive Replacement

Per-commit verification: pg_indent --check clean and ninja warning-free under -Werror at every commit boundary. Full meson test suite passes at the tip (361/361).

Implementation Files

Core (new)

File Description
src/include/storage/bufpool.h BufferPoolRoutine vtable; buffer_pool_algorithm GUC declaration; RegisterDefaultPoolAlgorithm / LookupDefaultPoolAlgorithm; KEEP / RECYCLE / clock pool routine externs.
src/include/storage/bufpool_internals.h BufferPoolDesc, BufferPoolKind, PoolLocalState, hash-partition layout helpers.
src/include/catalog/pg_bufferpool.h pg_bufferpool catalog definition (oid, bpname, bphandler, bpsize).
src/include/catalog/pg_bufferpool.dat Initial DEFAULT row.
src/include/commands/bufferpoolcmds.h CreateBufferPool, AlterBufferPool, DropBufferPool, RenameBufferPool.
src/backend/storage/buffer/bufpool.c Multi-pool descriptor array, DSM lifecycle (CreateDynamicBufferPool, DestroyDynamicBufferPool, ResizeDynamicBufferPool), per-backend lazy attach (EnsurePoolAttached), BufferAllocInPool hot path, trickle-writer BGW.
src/backend/storage/buffer/jam_pool.c JAM replacement routine (jam_pool_routine, jam_pool_handler).
src/backend/commands/bufferpoolcmds.c DDL implementation, name validation, catalog manipulation.

Modified

File Key Changes
src/backend/storage/buffer/bufmgr.c BufHdrGetBlock macro routes buf_id >= NBuffers through GetDynamicPoolBlock; CurrentBufferPool wrapped in PG_TRY/PG_FINALLY in PinBufferForBlock and ExtendBufferedRelShared.
src/backend/storage/buffer/freelist.c Clock-sweep wrapped in BufferPoolRoutine vtable (clock_pool_routine); BufferAccessStrategy recycles through RECYCLE pool when bp_kind == BUFPOOL_RECYCLE exists; KEEP and RECYCLE routine implementations; algo_registry[] name → routine table; RegisterDefaultPoolAlgorithm / LookupDefaultPoolAlgorithm.
src/backend/parser/gram.y CREATE BUFFER POOL, ALTER BUFFER POOL, DROP BUFFER POOL, RENAME, REMAINDER productions; BUFFER, POOL, SIZE, REMAINDER unreserved keywords.
src/backend/access/common/reloptions.c buffer_pool, overflow_buffer_pool reloption types.
src/backend/access/heap/heapam_handler.c relation_overflow_pool tableam callback (currently NULL stub).
src/backend/catalog/toasting.c create_toast_table reads overflow_buffer_pool from parent or AM and writes it as buffer_pool on the TOAST relation.
src/backend/commands/tablecmds.c ATExecSetRelOptions cascades overflow_buffer_pool to TOAST.
src/backend/utils/cache/relcache.c rd_bufpool, rd_overflow_bufpool resolution.
src/backend/utils/misc/guc_parameters.dat buffer_pool_algorithm string GUC; recycle_pool_buffers; trickle-writer tunables.
src/backend/utils/misc/postgresql.conf.sample New parameters added in their natural sections.
contrib/pg_buffercache/ 1.7 → 1.8 update for per-pool columns.
contrib/pg_bp_arc/ ARC extension; Apache 2.0 notice for the Caffeine ArcPolicy.java reference; patent disposition.
contrib/pg_bp_car/ CAR extension; patent disposition.
contrib/pg_bp_lirs/ LIRS extension.
contrib/pg_bp_lru/ LRU extension.
contrib/pg_bp_osic/ OSIC extension.
contrib/pg_bp_rcrd/ RCRD / LIRS2 extension.

References

Research Papers

  • N. Megiddo and D.S. Modha, "ARC: A Self-Tuning, Low Overhead Replacement Cache", USENIX FAST 2003.
  • N. Megiddo and D.S. Modha, "Outperforming LRU with an Adaptive Replacement Cache Algorithm", IEEE Computer 2004.
  • S. Bansal and D.S. Modha, "CAR: Clock with Adaptive Replacement", USENIX FAST 2004.
  • S. Jiang and X. Zhang, "LIRS: An Efficient Low Inter-reference Recency Set Replacement Policy to Improve Buffer Cache Performance", ACM SIGMETRICS 2002.
  • C. Zhong, X. Zhao, and S. Jiang, "LIRS2: An Improved LIRS Replacement Algorithm", 2013.
  • T. Johnson and D. Shasha, "2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm", VLDB 1994.
  • E.J. O'Neil, P.E. O'Neil, and G. Weikum, "The LRU-K Page Replacement Algorithm for Database Disk Buffering", ACM SIGMOD 1993.

PostgreSQL Mailing List

Reference Implementations

Patents

ARC family (IBM):

  • US 6,996,676 -- "System and method for implementing an adaptive replacement cache policy". Filed 2002, granted 2006, expired February 22, 2024.
  • US 7,096,321 -- continuation; also expired.
  • US 7,058,766 -- CAR; also expired.
  • US 8,612,689 -- continuation; also expired.

Sun Microsystems:

  • US 7,469,320 -- ZFS-related; also out of force.

The implementation in this series is included on the basis that the patents have expired; downstreams should still confirm with their own counsel before relying on that conclusion in their jurisdiction.