Heap HOT Selective Index Updates

HOT-Indexed Updates

Request for Comments [RFC] NOT MERGED

Summary

HOT-indexed updates extend PostgreSQL's Heap-Only Tuple (HOT) optimization to the case where an UPDATE modifies indexed columns. Classic HOT eliminates all index maintenance when no indexed column changes, but if even one indexed column is modified, every index on the table must receive a new entry. HOT-indexed updates make that decision per index: the chain stays on the same heap page as under HOT, but only the indexes whose key columns actually changed receive a fresh entry. Indexes whose keys are untouched keep their existing entries, which still find the row through the HOT chain. A per-update tombstone line pointer placed beside the new tuple carries a modified-attribute bitmap that readers consult when traversing the chain, and bridge tombstones preserve the walkable-hop invariant when dead chain members are pruned before their stale btree entries have been swept.

The mechanism sits strictly between classic HOT (no index entries written) and a plain non-HOT update (all index entries written), eliminating the binary cliff the 2007 HOT design left in place. On a wide-table workload of random UPDATEs to one of 64 indexed columns (Linux RISC-V, 4 runs * 110 s/cell, 4 pgbench clients, autovacuum on), single-column UPDATEs (wide_1) gain +74.3% TPS while WAL drops 70.3%; the wide_2..wide_8 band averages +68% TPS / -64% WAL; wide_16 still gains +46.6% TPS / -46.1% WAL; wide_32 gains +27.9% TPS / -26.7% WAL. At wide_64 (full index churn) the hot_indexed_update_threshold GUC (default 80%, triggered by 64*100 > 65*80) correctly demotes every UPDATE to non-HOT (n_tup_hot_indexed_upd = 0) and TPS/WAL converge to upstream-master parity. Tepid takes the HOT-indexed path on ~92% of updates across wide_1..wide_32; the chain-cap heuristic fires on the residual ~8%. Per-cell TPS spread stays under 5.5% across the entire sweep. The recovery suite (52/52) passes with wal_consistency_checking=all as of v0.0.4.

Status

  • Patch status: Work in progress; pre-CommitFest design preview.
  • Target release: PostgreSQL 19 (speculative).
  • Branch name: "tepid" (the development codename; the feature is "HOT-indexed updates").
  • Repository: gburd/postgres, branch tepid (tagged at tepid-v0.0.4, a 3-commit squash above upstream/master 27bdae84137: a personal dev-setup commit kept locally for development, the HOT-indexed feature commit, and the benchmark harness commit). The pre-squash 107-commit development history is preserved at tepid-2026-05-14-clean for reference; per-tag backup branches tepid-2026-05-14-v0.0.2, tepid-2026-05-15-v0.0.3, and tepid-2026-05-15-v0.0.4 mirror each tag.
  • Mailing list thread: To be posted to pgsql-hackers; this wiki page is the design preview.
  • CommitFest: Not yet submitted.
  • Tests: Core regress 247/247. Isolation 41/41 (including new hot_indexed_bridge). Subscription 40/40 (including 039_hot_indexed_apply with the per-mode subscriber-INSERT scenario). Recovery 52/52 with injection_points enabled, including new 053_hot_indexed_bridge_recovery and including 027_stream_regress running with wal_consistency_checking=all (see v0.0.3 -> v0.0.4 note below). New regression file src/test/regress/sql/hot_indexed_updates.sql covers the feature surface (basic decisions, multiple indexes, partial indexes, BRIN, TOAST, unique constraints, multi-column indexes, partitioned tables, per-index stats, chain-cap demotion, tombstone reclamation).
  • Bug-fix history (v0.0.1 -> v0.0.4):
    • v0.0.1 -> v0.0.2: Post-squash benchmarking on amd64 and RISC-V with sustained 4-client pgbench on a 65-index table reliably reproduced ERROR: table tid from new index tuple (X,Y) cannot find insert offset between offsets A and B of block C in index "w_c1" on every wide_N variant with N >= 1. Root cause: heap_page_prune_and_freeze's first scan loop fed every HeapTupleHeaderIsHotIndexedTombstone item to prstate->tombstones[] with a target read from HotIndexedTombstoneGetTarget. Bridge tombstones (left-over from a prior prune that converted a dead mid-chain HOT-indexed heap-only tuple in place) zero their entire payload except t_ctid, so the target read came back as 0; prune_handle_tombstones then concluded the target was dead and reclaimed the bridge LP to LP_UNUSED -- exactly the slot-reuse hazard bridges are designed to prevent. A new INSERT placing its tuple at the freed slot inherited the stale btree leaves still pointing at that LP, producing duplicate-TID-in-index leaves that _bt_binsrch_insert rightly rejected. Fix: special-case bridges in the first scan loop and route them to heap_prune_record_unchanged_lp_tombstone; bridges remain LP_NORMAL until lazy_vacuum_heap_page reclaims them after ambulkdelete sweeps the matching stale leaves, which is the design's intent.
    • v0.0.2 -> v0.0.3: Rebased onto upstream/master 27bdae84137 picking up five new upstream commits (ltree/intarray tests, REPACK psql tab completion, PG 19 release notes, refint segfault fix, COPY TO partitioned-table attribute mapping fix 82f0135a263). A clean rebuild with full debug options (cassert=true, injection_points=true, werror=true) exposed three real bugs masked by the previous release-build configuration: a !HeapTupleHeaderIsHeapOnly assertion in pruneheap.c:2794 that rejected tepid's HOT-indexed aborted-orphan LP_DEAD path (relaxed to allow heap-only tuples with HEAP_INDEXED_UPDATED set and natts > 0), an ItemIdIsDead && !ItemIdHasStorage assertion at pruneheap.c:2881 that failed on tepid's bridge-reclaim path (bridges are LP_NORMAL with body, so the assertion now accepts LP_NORMAL bridges via HeapTupleHeaderIsHotIndexedBridge), and a BlessTupleDesc call without prior TupleDescFinalize in hot_indexed_stats.c:157.
    • v0.0.3 -> v0.0.4: 027_stream_regress with wal_consistency_checking=all diverged on Heap/INSERT+INIT and Heap/UPDATE+INIT. Root cause: heap_insert and log_heap_update emit XLOG_HEAP_INIT_PAGE based on a heuristic (offnum == 1 AND PageGetMaxOffsetNumber == 1) that also holds when the page has exactly one LP_UNUSED slot left behind by PRUNE_VACUUM_CLEANUP. That cleanup pass calls PageTruncateLinePointerArray only (no PageRepairFragmentation), so the page can legitimately retain orphan tuple bytes between pd_upper and the LP[1] location. Primary's PageAddItem reuses the LP_UNUSED slot, leaving the orphan bytes untouched, while a standby replaying INSERT+INIT zeros the page (RBM_ZERO_AND_LOCK + PageInit) before adding the tuple -- byte-by-byte mismatch under masking. Fix: tighten both INIT_PAGE checks to also require pd_upper == pd_special - MAXALIGN(t_len), so the flag fires only when the page is structurally a one-tuple page that replay can reconstruct byte-identically. When the predicate fails, the WAL record falls back to a regular INSERT/UPDATE (with the FPI on first touch after a checkpoint), which keeps replay byte-identical without sacrificing crash safety. Recovery suite now passes cleanly with wal_consistency_checking=all.
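The tightened INIT_PAGE predicate from the v0.0.3 -> v0.0.4 fix can be sketched in isolation. This is a minimal illustration, not the patch's code: SketchPage, sketch_can_log_init_page, and the 8-byte MAXALIGN are assumptions made for the example, and max_offset stands in for PageGetMaxOffsetNumber.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 8-byte maximum alignment, as on typical 64-bit builds. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical, trimmed-down view of the page fields the check reads. */
typedef struct
{
    uint16_t pd_upper;    /* start of tuple data */
    uint16_t pd_special;  /* start of special space (end of tuple data) */
    uint16_t max_offset;  /* stand-in for PageGetMaxOffsetNumber() */
} SketchPage;

/* Returns true when replay could rebuild the page byte-identically from INIT. */
static bool
sketch_can_log_init_page(const SketchPage *page, uint16_t offnum, size_t t_len)
{
    assert(page->pd_upper <= page->pd_special);

    /* The earlier heuristic: the new tuple sits in the page's only line pointer. */
    if (offnum != 1 || page->max_offset != 1)
        return false;

    /* The added requirement: the tuple is the only data, with no orphan bytes. */
    return (size_t) page->pd_upper ==
           (size_t) page->pd_special - SKETCH_MAXALIGN(t_len);
}
```

With pd_special = 8192 and a 61-byte tuple, the predicate holds only when pd_upper is 8128; a page that still holds orphan bytes (say pd_upper = 8000) falls back to a regular record.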

Short commit hashes referenced throughout (e.g. 5b798829a0a, c235720a153, d9df800cff9) refer to commits on the tepid development branch unless explicitly labelled as upstream.

Try It / Build

The feature is always-on for ordinary heap relations on the tepid branch; no GUC opt-in is required.

 git clone https://github.com/gburd/postgres.git
 cd postgres
 git checkout tepid
 ./configure --prefix=/path/to/install
 make -j && make install
 initdb /path/to/data
 pg_ctl -D /path/to/data start

Quick smoke test:

 CREATE TABLE t (id int PRIMARY KEY, a int, b int);
 CREATE INDEX t_a ON t(a);
 CREATE INDEX t_b ON t(b);
 INSERT INTO t VALUES (1, 10, 20);
 UPDATE t SET a = 11 WHERE id = 1;          -- HOT-indexed: only t_a gets a new entry
 UPDATE t SET b = 21 WHERE id = 1;          -- HOT-indexed: only t_b gets a new entry
 SELECT n_tup_hot_upd, n_tup_hot_indexed_upd FROM pg_stat_user_tables WHERE relname = 't';
 SELECT indexrelname, n_tup_hot_indexed_upd_skipped, n_tup_hot_indexed_upd_matched FROM pg_stat_user_indexes WHERE relname = 't';

Benchmark harness lives at src/test/benchmarks/tepid/ (not wired into make check; provisions its own pgdata):

 cd src/test/benchmarks/tepid
 REPO=$PWD/../../../.. BENCH=/scratch/tepid-bench bash scripts/build.sh
 REPO=$PWD/../../../.. BENCH=/scratch/tepid-bench DURATION=60 CLIENTS=8 SCALE=10 \
   WIDE_STEPS=0,1,4,8,12,16 bash scripts/run.sh

Background and Motivation

HOT (2007, PostgreSQL 8.3)

HOT landed in commit 282d2a03dd3 by Tom Lane, building on a design from Pavan Deolasee and Gokulakannan Somasundaram. The commit message reads:

HOT updates. When we update a tuple without changing any of its indexed columns, and the new version can be stored on the same heap page, we no longer generate extra index entries for the new version. Instead, index searches follow the HOT-chain links to ensure they find the correct tuple version.

HOT fixed an expensive invariant from the 1996 MVCC design: every UPDATE inserted a fresh heap tuple and a fresh btree entry in every index, even when no indexed column changed. For update-heavy workloads the index bloat was often worse than the heap bloat because one update produced N index inserts and N eventual index deletes for N indexes.

HOT's key insight: when the update touches no indexed column, the existing btree entries already resolve to the correct logical row via the chain walker (heap_hot_search_buffer). No second btree entry is required. The chain itself is a short linked list of heap-only tuples on the same page, rooted at a single LP_REDIRECT once pruning collapses it.

HOT set two invariants every later heap layer has relied on for eighteen years:

  • I1 — HOT is allowed only when no indexed column changes. Any indexed change forces a plain non-HOT update: new tuple on a fresh page, fresh entries in every index.
  • I2 — Every live btree entry resolves, via the chain root, to the one-and-only live tuple for its row. Mid-chain line pointers have no external references, so opportunistic pruning can mark them LP_UNUSED freely.

BRIN and summarizing indexes (2014-2023)

Commit 7516f525941 by Álvaro Herrera introduced BRIN (Block Range Index) in PostgreSQL 9.5. BRIN is structurally different from btree: a BRIN tuple summarizes a range of heap pages rather than naming each row, so most updates inside the range do not invalidate the summary.

Commit 19d8e2308bc by Tomas Vondra (PostgreSQL 16, 2023-03-20), building on work by Matthias van de Meent, Josef Simanek, and Álvaro Herrera, added the amsummarizing flag to IndexAmRoutine:

When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed by block summarizing indexes without references to individual tuples that need to be cleaned up.

This was the first-ever relaxation of I1. An update touching a column indexed only by BRIN takes the HOT path on the heap while still writing the BRIN entry. I2 survived intact: the BRIN entry summarizes ranges, not individual tuples, so the chain's root-pointer-only invariant is unaffected. HOT-indexed updates extend the same relaxation of I1 into per-tuple indexes (btree, hash, gist, gin, spgist).

indexUnchanged hint and bottom-up btree deletion (2021)

Commit d168b666823 by Peter Geoghegan (PostgreSQL 14) introduced the indexUnchanged hint mechanism: the executor signals per-index whether the index's keys were unchanged across an UPDATE, allowing nbtree to make smarter bottom-up deletion decisions. HOT-indexed updates build directly on this plumbing: the per-index RelationGetIndexedAttrs bitmap and the replacement of the old TU_UpdateIndexes enum with a per-index Bitmapset extend the hint infrastructure into a precise decision that is scored independently for every index.

WARM (Pavan Deolasee, EnterpriseDB, 2017)

WARM (Write-Amplification Reduction Method) was the first serious attempt to relax I1 for per-tuple indexes.

WARM preserved I2 strictly: every btree entry, even for a WARM-changed index, points at the chain root LP. Mid-chain heap-only LPs remain externally-unreferenced, so heap_prune_chain's classic logic applies unchanged (WARM's pruneheap patch is nine lines). The costs to preserve I2:

  • Root-offset tracking via a new infomask2 bit HEAP_LATEST_TUPLE that repurposes the last tuple's t_ctid.posid to hold the root offset (instead of the usual self-reference).
  • CLEAR vs WARM index pointers via a new INDEX_WARM_POINTER flag in index tuple headers so vacuum can distinguish pre-WARM entries from WARM-update-inserted entries.
  • One WARM update per chain — WARM cannot tolerate duplicate (key, ctid) btree entries, so the design forbids more than one WARM update on a given chain. This caps the benefit at 50% of eligible updates.
  • Per-AM recheck callback — every supported AM (btree, hash, gist, brin, gin, spgist) implements amrecheck to compare a leaf key against a heap tuple. AMs that do not implement recheck disable WARM on any table touching them.
  • Two-pass WARM-to-HOT conversion during VACUUM to recover from the one-per-chain restriction. Adds autovacuum_warmcleanup_scale_factor and autovacuum_warmcleanup_index_scale_factor GUCs, a candidate-chain tracker in the vacuum work area, and a per-table enable_warm reloption that is irreversible once enabled.

WARM was reviewed at length on pgsql-hackers in 2017-2018 and never accepted. Community pushback centered on the cross-AM surface, CLEAR/WARM pointer distinction inside btree, complexity of WARM-to-HOT conversion, and the 50% ceiling the one-per-chain restriction forces.

PHOT (Nathan Bossart, Amazon, 2021)

PHOT (Partial Heap-Only Tuple) was never posted to pgsql-hackers. The patch set is explicitly labelled in-progress (patches 5 and 6 are titled "prune-in-progress" and "in-prog"). Commit messages in patch 4 state "this does not introduce the logic needed for following PHOT chains after single-page cleanup."

PHOT took the opposite structural position to WARM: btree entries for changed indexes point at mid-chain heap-only TIDs, not the root. This is the position HOT-indexed updates also take. PHOT uses two infomask2 bits (HEAP_PHOT_UPDATED, HEAP_PHOT_TUPLE). The chain walker carries an interesting_attrs Bitmapset and calls HeapDetermineModifiedColumns at each hop to decide whether to continue or stop.

PHOT's distinctive artifact is the RLP_PHOT line-pointer variant. At prune time, when collapsing a dead chain segment, the LP is converted to a RedirectHeaderData-tagged payload that stores a modified-columns bitmap in an on-page sidecar. Readers following the redirect consult the bitmap to decide whether the redirect is "interesting" for their scan key.

What PHOT never solved: chain walking after prune (explicitly deferred), catalog support, exclusion constraints, logical replication, readers dealing with two live entries arriving at the same current tuple via different stale keys, and performance characterization.

PHOT's prune-time bitmap computation also has a subtle weakness: it requires a backward walk through heap tuples to compare pairs, which only works while the old tuple bodies still exist. After a prior prune, they do not.

HOT-indexed updates (tepid, 2026)

HOT-indexed updates sit between PHOT and WARM and try to keep what is best of each:

  • Like PHOT: btree entries for changed indexes point at mid-chain heap-only TIDs. Readers get a direct hit on just-written keys.
  • Like WARM: each update's modified-attrs bitmap is materialized explicitly; readers never do backward walks at read time.
  • Unlike PHOT: the bitmap is written at update time into an adjacent tombstone LP, not at prune time from a backward walk, so it is always available even after a prune.
  • Unlike WARM: there is no one-per-chain restriction. Every HOT-indexed update on a chain adds its own entries to only the changed indexes.
  • Unlike WARM: no CLEAR/WARM pointer distinction, no dedicated root-tracking bit, no AM-wide per-tuple recheck obligation. The existing HOT plumbing is reused and extended where needed.
  • Unlike either: catalog relations take the HOT-indexed path, closing a blind spot neither prior attempt addressed.

Comparison table

 Property | HOT (2007) | WARM (2017) | PHOT (2021) | HOT-indexed (2026)
 relaxes I1 ("no indexed change") | no | yes | yes | yes
 preserves I2 (entries at chain root) | yes | yes | no | no
 where do changed-index btree entries point? | n/a | chain root | mid-chain | mid-chain
 mid-chain LP reclaim at prune | yes (classic) | yes (classic) | redirect-with-data on LP | bridge tombstone, deferred to vacuum
 per-update modified-attrs bitmap | n/a | not materialized | computed at prune | materialized at write
 infomask2 bits used | 1 (HEAP_HOT_UPDATED) | 2 (+HEAP_LATEST_TUPLE, HEAP_WARM_TUPLE) | 2 (HEAP_PHOT_UPDATED, HEAP_PHOT_TUPLE) | 1 (HEAP_INDEXED_UPDATED)
 new LP flavors | LP_REDIRECT | none | RLP_PHOT sidecar | none (tombstones are plain LP_NORMAL)
 new pd_flags bit | none | none | none | 1 (PD_HAS_HOT_INDEXED_BRIDGES)
 SIU/WARM updates per chain | unbounded | 1 | unbounded | unbounded
 per-AM recheck obligation | no | required in all AMs | unspecified | optional callback; fallback permissive drop
 btree pointer type split | no | CLEAR/WARM | no | no
 vacuum cross-pass | no | two-pass WARM-to-HOT conversion | not implemented | ambulkdelete + second-pass bridge reclaim
 catalog support | yes | untested | no | yes
 exclusion constraints | yes | inherited | no | yes (write-side recheck + bridge preservation)
 logical replication | yes | untouched | no | yes (apply-path aware, per-subscription option)
 per-index stats | no | no | no | yes (n_tup_hot_indexed_upd_skipped/matched)
 landed upstream | yes | no (design objections) | no (incomplete) | proposal

A Worked Example

To illustrate the mechanism end-to-end, consider:

 CREATE TABLE t (id int PRIMARY KEY, a int, b int);
 CREATE INDEX t_a ON t(a);
 CREATE INDEX t_b ON t(b);
 INSERT INTO t VALUES (1, 10, 20);

After the INSERT (and a CHECKPOINT for clarity), heap page state:

 LP[1]   live tuple v0 = (1, 10, 20)
 PK leaf  '1'   -> (block, 1)
 t_a leaf '10'  -> (block, 1)
 t_b leaf '20'  -> (block, 1)

Now:

 UPDATE t SET a = 11 WHERE id = 1;

The executor's ExecUpdateModifiedIdxAttrs compares old and new TupleTableSlots and produces modified_idx_attrs = {a}. HeapUpdateHotAllowable sees that some indexed attrs changed, the new tuple fits on the same page, and the relation is HOT-indexed-eligible, so it returns HEAP_HOT_MODE_INDEXED. heap_update leaves three items on the page:

 LP[1]   v0 = (1, 10, 20)   HEAP_HOT_UPDATED + HEAP_INDEXED_UPDATED, t_ctid -> LP[2]
 LP[2]   v1 = (1, 11, 20)   HEAP_ONLY_TUPLE  + HEAP_INDEXED_UPDATED  (live)
 LP[3]   tombstone           natts=0, HEAP_INDEXED_UPDATED, t_ctid=(InvalidBlock, 2)
                             body: bitmap {a}

In ExecInsertIndexTuples, ExecSetIndexUnchanged uses RelationGetIndexedAttrs per index:

 PK  attrs = {id}, modified = {a} -> no overlap -> ii_IndexUnchanged = true, skip.
 t_a attrs = {a},  modified = {a} -> overlap    -> ii_IndexUnchanged = false, insert.
 t_b attrs = {b},  modified = {a} -> no overlap -> ii_IndexUnchanged = true, skip.

After the UPDATE:

 PK leaf  '1'    -> LP[1]   (still valid: id didn't change; reaches v1 through the chain root)
 t_a leaf '10'   -> LP[1]   (stale: now resolves through chain to v1 with a=11)
 t_a leaf '11'   -> LP[2]   (fresh: new entry pointing directly at v1)
 t_b leaf '20'   -> LP[1]   (still valid: t_b's key didn't change)

A reader looking up WHERE a = 11 finds the fresh leaf '11'->LP[2] in the btree and lands directly on v1. The chain walker did not cross a HOT-indexed hop, so xs_hot_indexed_recheck = false and the tuple is returned without recheck.

A reader looking up WHERE a = 10 finds the stale leaf '10'->LP[1] in the btree, follows the HOT chain LP[1] -> LP[2], and reaches v1. The chain walker crossed a HEAP_INDEXED_UPDATED hop, so it sets xs_hot_indexed_recheck = true; nodeIndexscan re-evaluates indexqualorig against v1, finds a = 11 rather than a = 10, and drops the tuple. Correct: there is no row with a = 10.

A reader looking up WHERE b = 20 finds leaf '20'->LP[1] in the btree, follows the HOT chain LP[1] -> LP[2], and reaches v1. The chain walker crossed a HEAP_INDEXED_UPDATED hop, so it sets xs_hot_indexed_recheck = true; the recheck evaluates b = 20 against v1, finds b = 20, and returns the tuple. Correct: v1 has b = 20.

After a vacuum cycle has had a chance to clean up: ambulkdelete sweeps the stale '10'->LP[1] entry from t_a (because it was added to the dead-TID set when v0 finally became dead). The PK leaf and the t_b leaf still both point at LP[1], which is now an LP_REDIRECT to LP[2]. Classic-HOT shape is restored, except that t_a's fresh leaf still points directly at LP[2]. Both '1'->LP[1]->LP[2] via redirect and '11'->LP[2] directly resolve to v1.

On-page Layout

Two kinds of tombstone

A tombstone is an LP_NORMAL line pointer whose HeapTupleHeader has natts=0 and HEAP_INDEXED_UPDATED set. Two variants exist, distinguished by t_ctid.blockno:

  • Adjacent-to-live tombstones are placed by heap_update beside the new HOT-indexed tuple. t_ctid.blockno = InvalidBlockNumber, and t_ctid.offnum is a back-pointer to the live tuple's offset. The payload carries the per-update modified-attrs bitmap.
  • Bridge tombstones are written by heap_prune_chain in the slot a dead mid-chain HOT-indexed heap-only tuple used to occupy. t_ctid.blockno = <current page blockno> and t_ctid.offnum = <next live chain member> — a valid forward link. The payload is empty; the bridge's job is to preserve a walkable hop for btree entries whose stale TIDs still point at this LP.

Discrimination macros (src/include/access/hot_indexed.h):

 #define HeapTupleHeaderIsHotIndexedTombstone(tup) \
     (HeapTupleHeaderGetNatts(tup) == 0 && \
      ((tup)->t_infomask2 & HEAP_INDEXED_UPDATED) != 0)
 #define HeapTupleHeaderIsHotIndexedBridge(tup) \
     (HeapTupleHeaderIsHotIndexedTombstone(tup) && \
      BlockNumberIsValid(ItemPointerGetBlockNumber(&(tup)->t_ctid)))

Both variants set HEAP_XMIN_INVALID | HEAP_XMAX_INVALID, so the standard MVCC visibility routines return false for them.

Tombstone on-disk format

Adjacent-to-live tombstones carry a length-prefixed modified-attrs bitmap after the 24-byte header:

 HeapTupleHeaderData (MAXALIGN(23) = 24 bytes)
   t_infomask    = HEAP_XMIN_INVALID | HEAP_XMAX_INVALID
   t_infomask2   = HEAP_INDEXED_UPDATED, natts = 0
   t_ctid        = (InvalidBlockNumber, live_offnum)
   t_hoff        = 24
 HotIndexedTombstonePayload (starts at byte t_hoff)
   uint16   t_target    -- duplicate of live_offnum, for cheap access
   uint16   t_nbytes    -- bitmap byte count = (natts + 7) / 8
   uint8    t_bitmap[]  -- one bit per heap attribute

HotIndexedTombstoneSize(natts) returns the MAXALIGN'd size, typically 32-40 bytes for a narrow table.

Bridge tombstones are fixed-size at 24 bytes (header only, no payload). HotIndexedBridgeSize() is a constant.
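The size arithmetic above can be checked with a small standalone sketch. It assumes an 8-byte MAXALIGN and a payload of two uint16 fields followed by the bitmap, as in the layout shown; the sketch_* names are illustrative and are not the patch's HotIndexedTombstoneSize or HotIndexedBridgeSize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 8-byte maximum alignment, as on typical 64-bit builds. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* MAXALIGN(23) = 24 bytes of tuple header. */
#define SKETCH_HEADER_SIZE SKETCH_MAXALIGN(23)

/* Bitmap byte count for a table with natts attributes: (natts + 7) / 8. */
static size_t
sketch_bitmap_bytes(int natts)
{
    assert(natts > 0);
    return ((size_t) natts + 7) / 8;
}

/* Header + t_target + t_nbytes + bitmap, rounded up to MAXALIGN. */
static size_t
sketch_tombstone_size(int natts)
{
    return SKETCH_MAXALIGN(SKETCH_HEADER_SIZE
                           + 2 * sizeof(uint16_t)
                           + sketch_bitmap_bytes(natts));
}

/* Bridge tombstones are header-only. */
static size_t
sketch_bridge_size(void)
{
    return SKETCH_HEADER_SIZE;
}
```

Under these assumptions a table with up to 32 attributes gets a 32-byte tombstone, 33 to 96 attributes give 40 bytes, and a bridge is always 24 bytes.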

Page-level flag: PD_HAS_HOT_INDEXED_BRIDGES

One bit in PageHeaderData.pd_flags (0x0008) marks pages that carry one or more bridge tombstones. Set by heap_page_prune_execute when it converts an LP to a bridge; cleared by vacuum's second pass when the last bridge on the page has been reclaimed. Classic HOT paths never look at this bit.

 PD_HAS_FREE_LINES            0x0001
 PD_PAGE_FULL                 0x0002
 PD_ALL_VISIBLE               0x0004
 PD_HAS_HOT_INDEXED_BRIDGES   0x0008  [new]
 PD_VALID_FLAG_BITS           0x000F

Write Path

Per-index modified-attrs tracking

The executor-heap interface is restructured around a per-update modified_idx_attrs Bitmapset:

  • ExecUpdateModifiedIdxAttrs() in nodeModifyTable.c compares the old and new TupleTableSlots to produce the set of indexed attributes that actually changed value.
  • TM_IndexUpdateInfo carries the bitmap through table_tuple_update().
  • ExecSetIndexUnchanged() sets each index's ii_IndexUnchanged based on whether its RelationGetIndexedAttrs() bitmap overlaps modified_idx_attrs.
  • ExecInsertIndexTuples() honors the per-index hint and skips indexes whose keys are unchanged.

RelationGetIndexedAttrs() returns key + expression-input + partial-index-predicate attrs, excluding INCLUDE columns. This is a per-index-exact bitmap, unlike the aggregated INDEX_ATTR_BITMAP_* bitmaps classic HOT used.
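The per-index decision reduces to a bitmap overlap test. The sketch below models it with a 64-bit mask in place of a Bitmapset; AttrMask, attr_bit, index_unchanged, and count_indexes_to_insert are hypothetical names for illustration, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a Bitmapset: bit n represents attribute number n. */
typedef uint64_t AttrMask;

static AttrMask
attr_bit(int attnum)
{
    assert(attnum >= 1 && attnum < 64);
    return (AttrMask) 1 << attnum;
}

/* An index is unchanged when none of its attributes were modified. */
static bool
index_unchanged(AttrMask index_attrs, AttrMask modified_idx_attrs)
{
    return (index_attrs & modified_idx_attrs) == 0;
}

/* Count the indexes that need a fresh entry for this update. */
static int
count_indexes_to_insert(const AttrMask *index_attrs, int nindexes,
                        AttrMask modified_idx_attrs)
{
    int n = 0;

    for (int i = 0; i < nindexes; i++)
    {
        if (!index_unchanged(index_attrs[i], modified_idx_attrs))
            n++;
    }
    return n;
}
```

For a table with indexes on attributes 1, 2, and 3, an update that modifies only attribute 2 yields one index insertion, and the other two indexes are skipped.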

HeapUpdateHotMode tri-state

The old bool hot_allowed becomes HeapUpdateHotMode:

 HEAP_HOT_MODE_NO       -- plain non-HOT update (fresh page, all indexes)
 HEAP_HOT_MODE_CLASSIC  -- classic HOT (no indexed change or summarizing-only)
 HEAP_HOT_MODE_INDEXED  -- HOT-indexed update (at least one non-summarizing index changed)

HeapUpdateHotAllowable() returns HEAP_HOT_MODE_NO for:

  • IsLogicalWorker() with subscriber indexes beyond the primary key (see Logical Replication Apply below)
  • RelationHasExclusionConstraint(rel) (temporal-decoding interaction, separate follow-up)
  • updates that exceed hot_indexed_update_threshold
  • updates that would push the existing chain past its per-relation length cap

Otherwise it returns HEAP_HOT_MODE_INDEXED when at least one non-summarizing indexed attribute changed, else HEAP_HOT_MODE_CLASSIC. System catalogs are included (the previous IsCatalogRelation exemption was lifted in commit 5b798829a0a).
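The threshold demotion can be sketched from the arithmetic quoted in the Summary (64*100 > 65*80). This is a simplified model: it assumes the comparison counts changed indexes against total indexes, and it leaves out the logical-worker, exclusion-constraint, and chain-cap demotions. The sketch_* function names and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    HEAP_HOT_MODE_NO,
    HEAP_HOT_MODE_CLASSIC,
    HEAP_HOT_MODE_INDEXED
} HeapUpdateHotMode;

/* True when the share of changed indexes is above threshold_pct percent. */
static bool
sketch_exceeds_threshold(int changed_indexes, int total_indexes,
                         int threshold_pct)
{
    assert(total_indexes > 0);
    assert(changed_indexes >= 0 && changed_indexes <= total_indexes);
    assert(threshold_pct >= 0 && threshold_pct <= 100);
    return changed_indexes * 100 > total_indexes * threshold_pct;
}

/* Threshold-only mode selection; other demotion reasons are not modelled. */
static HeapUpdateHotMode
sketch_hot_mode(int changed_indexes, int total_indexes, int threshold_pct)
{
    if (sketch_exceeds_threshold(changed_indexes, total_indexes, threshold_pct))
        return HEAP_HOT_MODE_NO;
    if (changed_indexes > 0)
        return HEAP_HOT_MODE_INDEXED;
    return HEAP_HOT_MODE_CLASSIC;
}
```

With 65 indexes and the default 80% threshold, 64 changed indexes demote the update (6400 > 5200), while 32 changed indexes stay on the HOT-indexed path (3200 <= 5200).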

Fit check and chain-length cap

The fit check requires room for two additional LPs plus MAXALIGN(newtup_size) + HotIndexedTombstoneSize(natts) bytes in a single page. At MaxHeapTuplesPerPage, even with adequate byte space, the LP count is the binding constraint: commit c235720a153 re-verifies this after RelationGetBufferForTuple to avoid a PANIC when an opportunistic prune returns the same buffer with only single-LP headroom.
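A minimal sketch of the fit check as described, under stated assumptions: 8 kB pages (so MaxHeapTuplesPerPage is taken as 291), 4-byte line pointers, 8-byte MAXALIGN, and no reuse of existing LP_UNUSED slots. The function and its parameters are hypothetical and do not mirror the patch's signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed constants for a default 8 kB-page, 64-bit build. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_ITEMID_SIZE 4
#define SKETCH_MAX_HEAP_TUPLES_PER_PAGE 291

/*
 * Returns true when the page has both the line-pointer headroom and the
 * byte space for a new heap-only tuple plus its adjacent tombstone.
 */
static bool
sketch_hot_indexed_fits(size_t free_bytes, int used_line_pointers,
                        size_t newtup_size, size_t tombstone_size)
{
    size_t need;

    assert(used_line_pointers >= 0);

    /* Two new LPs: one for the tuple, one for the tombstone. */
    if (used_line_pointers + 2 > SKETCH_MAX_HEAP_TUPLES_PER_PAGE)
        return false;

    need = 2 * SKETCH_ITEMID_SIZE
        + SKETCH_MAXALIGN(newtup_size)
        + tombstone_size;
    return free_bytes >= need;
}
```

The line-pointer test comes first because, near MaxHeapTuplesPerPage, a page can have ample free bytes and still lack room for the second LP.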

The chain-length cap is a per-relation heuristic cached in Relation->rd_hotidx_chainmax:

 fillfactor  = RelationGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR)
 page_budget = BLCKSZ * fillfactor / 100
 overhead    = SizeOfPageHeaderData + 8 * sizeof(ItemIdData)
 avg_tuple   = MAXALIGN(SizeofHeapTupleHeader) + natts * 8
 tombstone   = 64
 cap         = (page_budget - overhead) / (avg_tuple + tombstone)
 cap         = clamp(cap, 1, MaxHeapTuplesPerPage)

Narrow tables get long chains; wide tables get short chains. No pg_class statistics are consulted, so the cap is stable across DDL and does not swing with row counts. The cached value is reset on relcache invalidation.
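The cap formula can be evaluated directly. The sketch below plugs in constants assumed for a default build (8 kB BLCKSZ, 24-byte page header, 4-byte ItemIdData, 24-byte aligned tuple header, MaxHeapTuplesPerPage = 291); sketch_chain_cap is an illustrative name, not the function that fills rd_hotidx_chainmax.

```c
#include <assert.h>

/* Assumed constants for a default 8 kB-page, 64-bit build. */
#define SKETCH_BLCKSZ 8192
#define SKETCH_PAGE_HEADER_SIZE 24
#define SKETCH_ITEMID_SIZE 4
#define SKETCH_TUPLE_HEADER_SIZE 24   /* MAXALIGN(SizeofHeapTupleHeader) */
#define SKETCH_TOMBSTONE_BUDGET 64
#define SKETCH_MAX_HEAP_TUPLES_PER_PAGE 291

/* Chain-length cap for a relation with natts columns and the given fillfactor. */
static int
sketch_chain_cap(int natts, int fillfactor)
{
    int page_budget;
    int overhead;
    int avg_tuple;
    int cap;

    assert(natts > 0);
    assert(fillfactor >= 10 && fillfactor <= 100);

    page_budget = SKETCH_BLCKSZ * fillfactor / 100;
    overhead = SKETCH_PAGE_HEADER_SIZE + 8 * SKETCH_ITEMID_SIZE;
    avg_tuple = SKETCH_TUPLE_HEADER_SIZE + natts * 8;
    cap = (page_budget - overhead) / (avg_tuple + SKETCH_TOMBSTONE_BUDGET);

    /* clamp(cap, 1, MaxHeapTuplesPerPage) */
    if (cap < 1)
        cap = 1;
    if (cap > SKETCH_MAX_HEAP_TUPLES_PER_PAGE)
        cap = SKETCH_MAX_HEAP_TUPLES_PER_PAGE;
    return cap;
}
```

Under these assumptions a 3-column table at fillfactor 100 gets a cap of 72, while a 65-column table gets 13.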

Read Path

The recheck signal

heap_hot_search_buffer accepts a new out-parameter xs_hot_indexed_recheck. It is set true exactly when the chain walk crossed at least one tuple with HEAP_INDEXED_UPDATED. Interpretation: "the leaf entry you used to reach this tuple may be stale for this index." The flag does not say which leaf, which index, or which attribute.

Bridge traversal

When the chain walker encounters a tombstone:

  • If HeapTupleHeaderIsHotIndexedBridge: set xs_hot_indexed_recheck = true, jump to the forward target in t_ctid.offnum, do not advance prev_xmax, continue.
  • If adjacent tombstone (t_ctid.blockno == InvalidBlockNumber): treat as end of chain; the stale entry that led here has no live successor at this LP.
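The walker rules above can be modelled on a toy page. This sketch is not heap_hot_search_buffer: visibility is reduced to a boolean, xmin/xmax chain validation is omitted, and SketchItem and sketch_hot_search are hypothetical names. "Crossing" a hop is modelled as following the forward link of a non-visible tuple that has HEAP_INDEXED_UPDATED set, or jumping through a bridge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    ITEM_UNUSED,
    ITEM_TUPLE,
    ITEM_BRIDGE,
    ITEM_ADJACENT_TOMBSTONE
} ItemKind;

typedef struct
{
    ItemKind kind;
    bool     visible;          /* stand-in for the MVCC visibility check */
    bool     indexed_updated;  /* models HEAP_INDEXED_UPDATED */
    int      next;             /* forward link (offset number), 0 = none */
} SketchItem;

/*
 * Walk the chain starting at offset 'start'. page[0] is unused; valid
 * offsets are 1..nitems. Returns the offset of the visible tuple, or 0.
 */
static int
sketch_hot_search(const SketchItem *page, int nitems, int start, bool *recheck)
{
    int off = start;

    assert(page != NULL && recheck != NULL);
    *recheck = false;

    /* Bound the walk so a corrupted cycle cannot loop forever. */
    for (int hops = 0; hops <= nitems; hops++)
    {
        const SketchItem *item;

        if (off < 1 || off > nitems)
            return 0;
        item = &page[off];

        if (item->kind == ITEM_BRIDGE)
        {
            *recheck = true;      /* the arriving leaf may be stale */
            off = item->next;     /* jump to the forward target */
            continue;
        }
        if (item->kind != ITEM_TUPLE)
            return 0;             /* adjacent tombstone or unused: end of chain */
        if (item->visible)
            return off;
        if (item->next == 0)
            return 0;
        if (item->indexed_updated)
            *recheck = true;      /* crossing a HOT-indexed hop */
        off = item->next;
    }
    return 0;
}
```

On the worked example's page (v0 at offset 1, v1 at offset 2, tombstone at offset 3), starting at offset 1 returns 2 with the recheck flag set, and starting at offset 2 returns 2 with the flag clear.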

Per-consumer recheck strategies

Seven read paths participate:

  1. Visibility (xmin/xmax). Unchanged. Tombstones have HEAP_XMIN_INVALID and are filtered by standard MVCC.
  2. BitmapHeapScan TID dedup. Merges TIDs across index scans and emits each TID once. Stale and fresh entries that both resolve to the same live TID collapse naturally.
  3. IndexOnlyScan. Compares the leaf tuple's stored key against the live tuple's current index form via the amrecheck_leaf_key indexam callback. Match means the leaf is valid for this index; mismatch means the canonical fresh entry will re-produce the tuple. (nbtree implements the callback; other AMs fall back to permissive drop.)
  4. systable_getnext HeapKeyTest + hash dedup. Re-evaluates scan keys against the visible tuple on chain-crossing walks; a per-scan hash on htup->t_self collapses multiple arrivals at the same TID.
  5. _bt_check_unique tolerance. Same-live-TID candidates reached via chain walks are recognized as the same logical row (handles the RENAME X -> Y -> X cycle).
  6. check_exclusion_or_unique_constraint. On the scan of candidate duplicates, if xs_hot_indexed_recheck is set, the existing index_recheck_constraint path (shared with the lossy-index branch) re-applies the exclusion operator against the live tuple's current index form.
  7. nodeIndexscan indexqualorig re-eval. Re-evaluates the original WHERE clause against the returned tuple when the chain walk crossed a HOT-indexed hop. Works for equality; range and inequality require the FormIndexDatum comparison (tracked as follow-up; see "The range/inequality hole" in README.HOT-INDEXED).

Tombstones and page-level all-visible

A page carrying either kind of tombstone is never marked PD_ALL_VISIBLE. The heap scan fast path in page_collect_tuples bypasses per-tuple MVCC checks when PD_ALL_VISIBLE is set, which would return tombstone bytes as phantom live rows (reading t_target and t_nbytes as user-column data). Commit f6807dd49c8 hardens three code paths to treat any tombstone on a page as a blocker for both all-visible and all-frozen.

Pruning and Vacuum

Prune classifier

heap_prune_chain classifies each chain-member tuple into one of six states:

 CLASSIC_LIVE         LP_NORMAL, not HEAP_INDEXED_UPDATED, still visible
 CLASSIC_DEAD         LP_NORMAL, not HEAP_INDEXED_UPDATED, dead -> LP_UNUSED
 SIU_LIVE             LP_NORMAL, HEAP_INDEXED_UPDATED, still visible
 SIU_DEAD_PRESERVE    LP_NORMAL, HEAP_INDEXED_UPDATED, dead -> bridge tombstone
 TOMBSTONE_ADJACENT   existing adjacent-to-live tombstone
 TOMBSTONE_BRIDGE     bridge tombstone from a prior prune

The distinguishing predicate for preservation is heap_prune_item_preserves_siu(): LP_NORMAL + HEAP_INDEXED_UPDATED + natts > 0 + not aborted.
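The six-way classification can be sketched over a reduced tuple description. The struct fields and sketch_* functions are hypothetical; aborted tuples are excluded because they take the separate orphan path described in the next sections.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    CLASSIC_LIVE,
    CLASSIC_DEAD,
    SIU_LIVE,
    SIU_DEAD_PRESERVE,
    TOMBSTONE_ADJACENT,
    TOMBSTONE_BRIDGE
} PruneClass;

/* Hypothetical summary of one LP_NORMAL chain member. */
typedef struct
{
    int  natts;             /* 0 for tombstones */
    bool indexed_updated;   /* models HEAP_INDEXED_UPDATED */
    bool ctid_block_valid;  /* t_ctid.blockno != InvalidBlockNumber */
    bool dead;
    bool aborted;
} SketchTuple;

/* Preservation predicate: HEAP_INDEXED_UPDATED + natts > 0 + not aborted. */
static bool
sketch_preserves_siu(const SketchTuple *tup)
{
    return tup->indexed_updated && tup->natts > 0 && !tup->aborted;
}

static PruneClass
sketch_classify(const SketchTuple *tup)
{
    /* Aborted orphans are handled elsewhere and are not modelled here. */
    assert(!tup->aborted);

    if (tup->natts == 0 && tup->indexed_updated)
        return tup->ctid_block_valid ? TOMBSTONE_BRIDGE : TOMBSTONE_ADJACENT;
    if (sketch_preserves_siu(tup))
        return tup->dead ? SIU_DEAD_PRESERVE : SIU_LIVE;
    return tup->dead ? CLASSIC_DEAD : CLASSIC_LIVE;
}
```

Only SIU_DEAD_PRESERVE members are rewritten into bridge tombstones; CLASSIC_DEAD members are reclaimed as under classic HOT.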

Bridge tombstones on partial-dead chains

When heap_prune_chain processes a partial-dead chain (ndeadchain < nchain), each intermediate member is routed through the classifier. SIU_DEAD_PRESERVE members become bridge tombstones forwarding to chainitems[ndeadchain] (the first live member). CLASSIC_DEAD members become LP_UNUSED as under classic HOT.

The in-place rewrite happens in heap_page_prune_execute: tuple body shrinks to 24 bytes, LP length updates via ItemIdSetNormal, PD_HAS_HOT_INDEXED_BRIDGES is set. PageRepairFragmentation reclaims the freed tail.

Aborted HOT-indexed orphans

A HOT-indexed update inside a transaction that subsequently aborts leaves a particular hazard: the new heap-only tuple is dead (xmin invalid), but the btree leaf entry that was inserted for the new key value survives the abort until ambulkdelete sweeps it on the next vacuum cycle. If the orphan slot is reclaimed to LP_UNUSED, an unrelated INSERT can reuse it; a subsequent _bt_check_unique scan that follows the still-live stale leaf walks into the unrelated tuple and surfaces it as a spurious duplicate-key violation.

Three fixes (all on the heap-only-items branch of heap_page_prune_and_freeze) cover the cases tepid produces:

  • Single-update aborted leaf (HEAP_INDEXED_UPDATED, !IsHotUpdated): walk back through same-page LPs whose t_ctid points at the dead offset to find the live chain root. When found, write a bridge tombstone forwarding to it. Commit d9df800cff9.
  • Multi-update aborted mid-chain (HEAP_INDEXED_UPDATED, IsHotUpdated, dead): the aborted txn updated R->A1->A2 in sequence; heap_prune_chain stopped at the live R and never walked into A1. A1 falls into the same walk-back logic, becoming a bridge. Commit 7d3328eb46f.
  • Unbridgeable orphan (chain has been HOT-updated again, displacing the orphan from any reachable predecessor): instead of LP_UNUSED, mark the LP LP_DEAD. This pins the slot against reuse and adds the offnum to the page's deadoffsets so ambulkdelete sweeps the matching stale leaves; a subsequent vacuum reclaims the LP. Commit 6cfdfbf6c56.

The third fix is the load-bearing one: it dropped the measured rate of stochastic create_view regression-test failures from ~10% to 0% over 80-run loops. The first two alone helped little because the most common case in catalog churn is the chain-reorganized one.

WAL extensions

The existing xl_heap_prune record gains a new flag and matching sub-record:

 XLHP_HAS_HOT_INDEXED_BRIDGES  (1 << 10)   -- bridge conversions

Sub-record layout (reuses xlhp_prune_items):

 uint16        nbridges
 OffsetNumber  data[2 * nbridges]   -- (offnum, forward) pairs

heap_xlog_prune_freeze replays bridge conversions by invoking heap_page_prune_execute with the deserialized bridges. Tools that predate the change see one more pd_flags bit and occasional LP_NORMAL items with natts=0; pageinspect, amcheck, and pg_waldump --stats=record have been updated.
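
The sub-record layout can be illustrated with a small pack/unpack pair. This is a sketch only: ToyBridge, toy_pack_bridges and toy_unpack_bridges are hypothetical names, and the only things taken from the text are the uint16 count followed by (offnum, forward) pairs of OffsetNumber. Values are copied in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t OffsetNumber;   /* same width as PostgreSQL's OffsetNumber */

/* Hypothetical in-memory form of one bridge conversion. */
typedef struct ToyBridge
{
    OffsetNumber offnum;    /* slot being converted to a bridge */
    OffsetNumber forward;   /* same-page forward link */
} ToyBridge;

/* Bytes needed for: uint16 nbridges + 2 * nbridges OffsetNumbers. */
static size_t
toy_bridge_record_size(uint16_t nbridges)
{
    return sizeof(uint16_t) + (size_t) nbridges * 2 * sizeof(OffsetNumber);
}

/* Serialize; returns bytes written, or 0 if buf is too small. */
static size_t
toy_pack_bridges(const ToyBridge *bridges, uint16_t nbridges,
                 uint8_t *buf, size_t buflen)
{
    size_t   need = toy_bridge_record_size(nbridges);
    uint8_t *p = buf;

    if (buflen < need)
        return 0;
    memcpy(p, &nbridges, sizeof(nbridges));
    p += sizeof(nbridges);
    for (uint16_t i = 0; i < nbridges; i++)
    {
        memcpy(p, &bridges[i].offnum, sizeof(OffsetNumber));
        p += sizeof(OffsetNumber);
        memcpy(p, &bridges[i].forward, sizeof(OffsetNumber));
        p += sizeof(OffsetNumber);
    }
    assert((size_t) (p - buf) == need);
    return need;
}

/* Deserialize; returns the bridge count, or -1 on a truncated or oversized record. */
static int
toy_unpack_bridges(const uint8_t *buf, size_t buflen,
                   ToyBridge *out, uint16_t max_out)
{
    uint16_t       nbridges;
    const uint8_t *p = buf;

    if (buflen < sizeof(nbridges))
        return -1;
    memcpy(&nbridges, p, sizeof(nbridges));
    p += sizeof(nbridges);
    if (nbridges > max_out || buflen < toy_bridge_record_size(nbridges))
        return -1;
    for (uint16_t i = 0; i < nbridges; i++)
    {
        memcpy(&out[i].offnum, p, sizeof(OffsetNumber));
        p += sizeof(OffsetNumber);
        memcpy(&out[i].forward, p, sizeof(OffsetNumber));
        p += sizeof(OffsetNumber);
    }
    return nbridges;
}
```

Two bridges occupy 2 + 2 * 2 * 2 = 10 bytes in this layout.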

An earlier draft also reserved XLHP_HAS_PROMOTIONS (bit 11) for an unimplemented chain-promotion path that would clear HEAP_INDEXED_UPDATED on surviving chain members once all stale btree references were known to have been swept. No safe trigger condition was identified, so the WAL flag, sub-record, replay and pg_waldump support were stripped (commits a55612a64d6, 4bdcfc996ea) before posting. Bit 11 is therefore unassigned again and available for future use.

Adjacent-tombstone reclamation

An adjacent tombstone is read-only and has no independent visibility. It is tied to the chain its t_ctid.offnum points at. When that chain's live tuple is itself pruned (no live transaction can see any member), the tombstone has no remaining readers. prune_handle_tombstones runs after chain processing, iterates every tombstone on the page, and marks it LP_UNUSED if its target LP is now unused or dead.

Regular VACUUM does not look at tombstones specially; it picks them up via the prune machinery on every page it scans.
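
The tombstone sweep can be sketched as a single loop over a toy page. The names here (ToySlot, toy_reclaim_tombstones) are hypothetical and the page model is heavily simplified; the sketch only shows the rule stated above, that a tombstone is freed once its target LP is unused or dead.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_LP_UNUSED, TOY_LP_NORMAL, TOY_LP_REDIRECT, TOY_LP_DEAD } ToyLpState;

/* Hypothetical, simplified line pointer. */
typedef struct ToySlot
{
    ToyLpState state;
    bool       is_adjacent_tombstone;  /* natts=0 + HEAP_INDEXED_UPDATED in the real layout */
    unsigned   target;                 /* 1-based offset the tombstone's t_ctid points at */
} ToySlot;

/*
 * Runs after chain processing: free every adjacent tombstone whose target
 * slot is now unused or dead.  Returns the number of tombstones reclaimed.
 */
static int
toy_reclaim_tombstones(ToySlot *slots, unsigned nslots)
{
    int reclaimed = 0;

    for (unsigned i = 0; i < nslots; i++)
    {
        ToySlot   *s = &slots[i];
        ToyLpState target_state;

        if (s->state != TOY_LP_NORMAL || !s->is_adjacent_tombstone)
            continue;
        assert(s->target >= 1 && s->target <= nslots);
        target_state = slots[s->target - 1].state;
        if (target_state == TOY_LP_UNUSED || target_state == TOY_LP_DEAD)
        {
            s->state = TOY_LP_UNUSED;
            s->is_adjacent_tombstone = false;
            reclaimed++;
        }
    }
    return reclaimed;
}
```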

Bridge reclamation (vacuum cross-pass)

Bridges are reclaimed in the vacuum cycle that follows their creation:

  1. First pass (lazy_scan_prune) observes PD_HAS_HOT_INDEXED_BRIDGES, walks the page, and adds each bridge's TID to the per-page deadoffsets array alongside any genuine LP_DEAD items.
  2. dead_items_add feeds the combined TID set to ambulkdelete.
  3. ambulkdelete scans indexes and removes every btree entry whose TID matches.
  4. Second pass (lazy_vacuum_heap_page) converts each collected LP to LP_UNUSED. For bridges (LP_NORMAL with HeapTupleHeaderIsHotIndexedBridge), it asserts the predicate and calls ItemIdSetUnused.
  5. Page-level cleanup: if PD_HAS_HOT_INDEXED_BRIDGES is still set, the page is walked once more; if no bridge remains, the bit is cleared.

After a full vacuum pass following HOT-indexed activity, the page's state is indistinguishable from a classic-HOT cleanup: no bridges, no stale index entries, LPs compacted.
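
The five steps can be traced on a toy model. This is a sketch under simplifying assumptions: ToyPage and the toy_* functions are hypothetical, an "index" is just an array of target offsets on one page, and real vacuum batches TIDs across many pages.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_SLOTS 16

typedef enum { TOY_UNUSED, TOY_NORMAL, TOY_DEAD } ToyState;

typedef struct ToyPage
{
    bool     has_bridges;               /* stands in for PD_HAS_HOT_INDEXED_BRIDGES */
    unsigned nslots;
    ToyState state[TOY_MAX_SLOTS];
    bool     is_bridge[TOY_MAX_SLOTS];  /* LP_NORMAL item that is a bridge tombstone */
} ToyPage;

/* Step 1: gather LP_DEAD items plus, when the page flag is set, bridges. */
static unsigned
toy_collect_dead(const ToyPage *page, unsigned *dead)
{
    unsigned ndead = 0;

    assert(page->nslots <= TOY_MAX_SLOTS);
    for (unsigned i = 0; i < page->nslots; i++)
    {
        bool bridge = page->has_bridges &&
            page->state[i] == TOY_NORMAL && page->is_bridge[i];

        if (page->state[i] == TOY_DEAD || bridge)
            dead[ndead++] = i + 1;      /* 1-based offset */
    }
    return ndead;
}

/* Steps 2-3: drop every index entry whose target offset was collected. */
static unsigned
toy_bulkdelete(unsigned *entries, unsigned nentries,
               const unsigned *dead, unsigned ndead)
{
    unsigned kept = 0;

    for (unsigned i = 0; i < nentries; i++)
    {
        bool doomed = false;

        for (unsigned j = 0; j < ndead; j++)
            if (entries[i] == dead[j])
                doomed = true;
        if (!doomed)
            entries[kept++] = entries[i];
    }
    return kept;
}

/* Steps 4-5: free the collected slots, then clear the flag if no bridge is left. */
static void
toy_second_pass(ToyPage *page, const unsigned *dead, unsigned ndead)
{
    bool any_bridge = false;

    for (unsigned j = 0; j < ndead; j++)
    {
        page->state[dead[j] - 1] = TOY_UNUSED;
        page->is_bridge[dead[j] - 1] = false;
    }
    for (unsigned i = 0; i < page->nslots; i++)
        if (page->state[i] == TOY_NORMAL && page->is_bridge[i])
            any_bridge = true;
    if (!any_bridge)
        page->has_bridges = false;
}
```

The ordering is the point of the sketch: index entries are removed before the slots they reference become reusable.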

Correctness Invariants We Keep, Stretch, or Relax

Invariant | Classic HOT | Tepid
I1: HOT only when no indexed column changes | strict | relaxed: HOT-indexed when changed attrs fit under threshold
I2: every live btree entry resolves to chain root | strict | stretched: per-update fresh entries point at mid-chain TIDs; the chain walker normalizes via recheck, and bridge tombstones keep the walkable hop while stale entries are outstanding
One live tuple per (key, TID) for exclusion/unique | strict | preserved via recheck: write-side index_recheck_constraint re-applies the operator against the live tuple's current index form
Mid-chain LPs have no external references | strict | preserved via bridges: dead mid-chain LPs with outstanding btree references become bridge tombstones until ambulkdelete sweeps the references, then vacuum's second pass reclaims
Pages with invisible tuples cannot be all-visible | strict | preserved: tombstones disqualify a page from PD_ALL_VISIBLE (commit f6807dd49c8)

Catalog Enablement

System catalogs take the HOT-indexed path on the same rules as user tables (commit 5b798829a0a). Several invariants classic HOT implicitly relied on needed patches:

  • CatalogIndexInsert now mirrors ExecInsertIndexTuples' per-index skip rule: on UPDATEs it consults RelationGetIndexedAttrs() per opened index and only skips when no index attr overlaps the modified-attrs bitmap. The old rule ("heap-only implies skip all non-summarizing indexes") silently missed the HOT-indexed insert into the fresh-key index, so btree lookups by the new key returned zero rows.
  • heap_index_delete_check_htid (bottom-up deletion invariant check) tolerates three HOT-indexed-induced states that would be corruption under classic HOT: LP_UNUSED reached through a stale leaf, heap-only-without-HEAP_INDEXED_UPDATED reached through a chain-pruned leaf, and offsets past the current page maxoff from a leaf whose target page shrank.
  • _bt_check_unique recognizes that two distinct btree entries whose chain walks both land on the same live TID are the same logical row, not a duplicate.
  • systable_getnext dedups multiple btree hits that chain-walk to the same live TID via a small per-scan hash.
  • Index-only scan compares the leaf tuple's stored key against the live tuple's current index form via the amrecheck_leaf_key indexam callback (nbtree implements _bt_heap_keys_equal_leaf).
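
The per-index skip rule in the first item reduces to a set-overlap test. A minimal sketch, assuming attribute sets fit in a 64-bit mask (the real code uses bitmapsets); ToyAttrSet and the toy_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit (attnum - 1) set means attribute attnum is a member; at most 64 attributes. */
typedef uint64_t ToyAttrSet;

static ToyAttrSet
toy_attr(int attnum)
{
    assert(attnum >= 1 && attnum <= 64);
    return (ToyAttrSet) 1 << (attnum - 1);
}

/* An index receives a fresh entry only if one of its attributes was modified. */
static bool
toy_index_needs_entry(ToyAttrSet index_attrs, ToyAttrSet modified_attrs)
{
    return (index_attrs & modified_attrs) != 0;
}

/* Count how many of the opened indexes receive a fresh entry for this update. */
static int
toy_count_fresh_entries(const ToyAttrSet *index_attrs, int nindexes,
                        ToyAttrSet modified_attrs)
{
    int n = 0;

    for (int i = 0; i < nindexes; i++)
        if (toy_index_needs_entry(index_attrs[i], modified_attrs))
            n++;
    return n;
}
```

Under the old "heap-only implies skip all" rule the count would always be zero for a heap-only tuple, which is the missed insert described above.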

A separate audit document (src/backend/access/heap/AUDIT_SEQSCAN.md, included in the patch series) reviewed every systable_beginscan caller with indexOK=false plus every direct heap_beginscan caller for HOT-indexed safety. All paths are SAFE under the all-visible-vs-tombstones invariant.

Logical Replication Apply

A subscriber's schema may add indexes the publisher does not have. When the apply worker calls heap_update for a replicated UPDATE, HeapUpdateHotAllowable might choose HEAP_HOT_MODE_INDEXED on the subscriber (where the extra indexes push the modified share below the threshold) while the publisher took the same UPDATE non-HOT. The subscriber would then build a chain the publisher does not have, and subsequent INSERTs on the subscriber would see spurious duplicate-key violations against stale btree entries.

A new per-subscription option controls apply-side eligibility:

 hot_indexed_on_apply = { off | subset_only | always }

The new column pg_subscription.subhotindexedonapply stores the value as a single character ('o', 's', 'a').

  • off — apply worker forces non-HOT for any update where the subscriber has any indexed attribute beyond the primary key. Conservative.
  • subset_only (default for new subscriptions) — apply worker allows HOT-indexed when the subscriber's INDEX_ATTR_BITMAP_INDEXED is a subset of INDEX_ATTR_BITMAP_PRIMARY_KEY (i.e., subscriber's index set is no broader than its PK). This covers the common replication-ready shape (subscribers carry the same PK as the publisher and no additional indexes).
  • always — apply worker is unconstrained. Operator's responsibility to maintain index parity.

The apply worker reads the value at startup and caches it as a process-local global (hot_indexed_apply_mode). HeapUpdateHotAllowable consults the cached mode via an accessor when IsLogicalWorker() returns true.
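
Two pieces of this wiring are easy to sketch: decoding the single-character catalog code and the subset test that subset_only applies. The names below (ToyApplyMode, toy_decode_apply_mode, toy_indexed_subset_of_pk) are hypothetical, and attribute sets are modelled as 64-bit masks rather than bitmapsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
    TOY_APPLY_OFF,
    TOY_APPLY_SUBSET_ONLY,
    TOY_APPLY_ALWAYS
} ToyApplyMode;

/* Decode the catalog code ('o', 's', 'a'); returns false for anything else. */
static bool
toy_decode_apply_mode(char code, ToyApplyMode *mode)
{
    assert(mode != NULL);
    switch (code)
    {
        case 'o':
            *mode = TOY_APPLY_OFF;
            return true;
        case 's':
            *mode = TOY_APPLY_SUBSET_ONLY;
            return true;
        case 'a':
            *mode = TOY_APPLY_ALWAYS;
            return true;
        default:
            return false;
    }
}

/* subset_only predicate: every indexed attribute is also a primary-key attribute. */
static bool
toy_indexed_subset_of_pk(uint64_t indexed_attrs, uint64_t pk_attrs)
{
    return (indexed_attrs & ~pk_attrs) == 0;
}
```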

ALTER SUBSCRIPTION supports changing the mode; the worker picks up the new value on its next restart cycle.

A TAP test (src/test/subscription/t/039_hot_indexed_apply.pl) covers all three modes and verifies the catalog wiring, parser, and apply behavior.

Statistics and Monitoring

pg_stat_all_tables.n_tup_hot_indexed_upd counts HOT-indexed tuple updates. Every HOT-indexed update is also counted in n_tup_hot_upd; the new column isolates the HOT-indexed share. Classic HOT updates = n_tup_hot_upd - n_tup_hot_indexed_upd.

pg_stat_all_indexes gains two columns:

 n_tup_hot_indexed_upd_skipped   -- updates where this index was skipped because its key was unchanged
 n_tup_hot_indexed_upd_matched   -- updates where this index did receive a fresh entry

Invariant: skipped + matched on each index equals the owning table's n_tup_hot_indexed_upd over any period. Useful for evaluating per-index coverage of the optimization and tuning the threshold.
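
The counter relationships can be written down as a few checks. This is only an arithmetic sketch over values an operator might read from the views; ToyIndexCounters and the toy_* functions are hypothetical and do not query a server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-index counters as read from pg_stat_all_indexes (sketch). */
typedef struct ToyIndexCounters
{
    int64_t skipped;   /* n_tup_hot_indexed_upd_skipped */
    int64_t matched;   /* n_tup_hot_indexed_upd_matched */
} ToyIndexCounters;

/* Classic HOT updates = n_tup_hot_upd - n_tup_hot_indexed_upd. */
static int64_t
toy_classic_hot_updates(int64_t n_tup_hot_upd, int64_t n_tup_hot_indexed_upd)
{
    assert(n_tup_hot_indexed_upd >= 0 && n_tup_hot_indexed_upd <= n_tup_hot_upd);
    return n_tup_hot_upd - n_tup_hot_indexed_upd;
}

/* Invariant: skipped + matched on every index equals the table counter. */
static bool
toy_counters_consistent(const ToyIndexCounters *idx, int nindexes,
                        int64_t n_tup_hot_indexed_upd)
{
    for (int i = 0; i < nindexes; i++)
        if (idx[i].skipped + idx[i].matched != n_tup_hot_indexed_upd)
            return false;
    return true;
}

/* Fraction of HOT-indexed updates for which this index needed no new entry. */
static double
toy_skip_fraction(const ToyIndexCounters *c)
{
    int64_t total = c->skipped + c->matched;

    return (total == 0) ? 0.0 : (double) c->skipped / (double) total;
}
```

A high skip fraction on an index means the optimization is saving most of that index's maintenance.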

A point-in-time inspector pg_relation_hot_indexed_stats(regclass) walks every page of a relation's main fork under AccessShareLock and returns:

 n_tombstones   int8   -- LP_NORMAL items with natts=0 + HEAP_INDEXED_UPDATED
 n_chains       int8   -- distinct HOT chains (counted at their LP_REDIRECT roots)
 avg_chain_len  float8 -- average chain length
 max_chain_len  int8   -- longest chain on the relation

Useful for "what is on disk right now" rather than "how much HOT-indexed activity fired during the stats window".

Configuration

This section is the operator's reference: what an end user or DBA needs to know about HOT-indexed updates. Most workloads should leave the defaults alone. The few cases that need tuning are characterised below with concrete sizing guidance.

GUC Parameters

Parameter | Type | Context | Default | Range | Reload
hot_indexed_update_threshold | integer (percent) | PGC_USERSET | 80 | 0..100 | immediate; per-session override allowed

Introduced in commit 51b9237 on the tepid branch. Exposed via EXPLAIN (SETTINGS) through the GUC_EXPLAIN flag.

What it controls

For an UPDATE that modifies one or more non-summarizing indexed attributes, this GUC is the maximum percentage (modified_idx_attrs / all_idx_attrs) * 100 at which heap_update is still allowed to take the HOT-indexed path. Beyond the threshold, heap_update falls back to the pre-HOT-indexed non-HOT path (new tuple on a fresh page, fresh entries in every index).

Classic HOT (no indexed columns changed) is unaffected by this GUC: it always fires when applicable.
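
The gate can be sketched with integer arithmetic. toy_within_threshold is a hypothetical helper, not the patch's function; it assumes the comparison is "modified * 100 <= total * threshold", which is consistent with the 64*100 > 65*80 demotion example in the summary.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True when an update that modifies `modified` of `total` indexed attributes
 * may still take the HOT-indexed path under `threshold_pct`.  Only meaningful
 * for updates that modify at least one indexed attribute; classic HOT does
 * not consult the threshold.  Cross-multiplication avoids floating point.
 */
static bool
toy_within_threshold(int modified, int total, int threshold_pct)
{
    assert(total > 0 && modified >= 1 && modified <= total);
    assert(threshold_pct >= 0 && threshold_pct <= 100);

    return modified * 100 <= total * threshold_pct;
}
```

With 17 indexed attributes and the default of 80, 13 modified attributes pass (1300 <= 1360) and 14 do not (1400 > 1360).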

Sizing guidance

  • 80 (default). Empirical break-even point in benchmarks: above this ratio the tombstone WAL plus the subset of btree inserts approaches or exceeds the cost of a plain non-HOT update. Recommended for the vast majority of workloads. On a relation with N indexed attributes, updates touching floor(N * 0.8) or fewer attrs take HOT-indexed; updates touching more fall back.
  • 0. Disables HOT-indexed entirely. Classic HOT still applies for updates that touch no indexed attribute. Use this when:
    • Workload profiling shows tombstone/bridge overhead exceeding index-insert savings (rare; usually only when most updates already touch all indexes).
    • Bisecting an apparent regression to confirm whether the HOT-indexed path is responsible.
    • Compatibility testing during a rolling upgrade where you want the apply path to behave like upstream master.
  • 100. Forces HOT-indexed for every otherwise-eligible update. Not recommended in production: wide updates (touching most indexed cols) still emit a tombstone but produce no meaningful index-write savings, so this is a net loss. Useful for benchmarking the extreme case.
  • Intermediate values. No empirical evidence justifies a non-default value yet. If your workload's indexed-attr-modification distribution has a clear knee at a different ratio, set the GUC at that knee.

Consequences and observability

  • Per-relation effects. The decision is per-tuple, not per-statement: a single UPDATE statement may produce a mix of HOT-indexed and non-HOT row updates if different rows hit different chain caps or threshold conditions. pg_stat_all_tables.n_tup_hot_indexed_upd counts HOT-indexed updates; subtract from n_tup_hot_upd to get classic HOT count.
  • Per-index effects. pg_stat_all_indexes.n_tup_hot_indexed_upd_skipped and n_tup_hot_indexed_upd_matched show, per index, how often that index was skipped (key unchanged) versus inserted-into during HOT-indexed updates. skipped + matched per index equals the owning table's n_tup_hot_indexed_upd.
  • Page-level effects. pg_relation_hot_indexed_stats(regclass) walks every page of the relation under AccessShareLock and returns (n_tombstones, n_chains, avg_chain_len, max_chain_len). Useful for spot-checking a specific relation's chain density.
  • No log output. HOT-indexed activity does not log at log_min_messages = NOTICE or below. At DEBUG2, heap_page_prune_and_freeze emits diagnostic information that includes bridge tombstone counts; high-volume, intended for development. In production, monitoring is via the pg_stat columns above.
  • Effect on autovacuum. HOT-indexed updates do not increment autovacuum_update_counter beyond what the corresponding non-HOT update would; autovacuum sees a HOT-indexed update the same way it sees a regular UPDATE. No autovacuum-side tuning specifically for HOT-indexed is required.

Per-session override

 -- Disable for a single bulk-update transaction
 BEGIN;
 SET LOCAL hot_indexed_update_threshold = 0;
 UPDATE bulk_target SET col1 = ... WHERE ...;
 COMMIT;
 -- Always-permissive for a benchmark role
 ALTER ROLE bench_user SET hot_indexed_update_threshold = 100;

No reloption

There is no per-table reloption; the heap AM decides based on the per-update modified-attrs bitmap and the session GUC. Per-table tuning is not currently exposed because tepid's chain-length cap heuristic depends on geometry (fillfactor + tuple size), which already varies per relation; layering a per-table override on top would multiply the configuration surface without a clear win.

Per-Relation Heuristic: Chain-Length Cap

A second decision point lives inside heap_update: how long is the on-page HOT-indexed chain allowed to grow before heap_update demotes to the non-HOT path? This is governed by a per-relation heuristic, not a GUC, because the right answer depends on the relation's geometry (fillfactor and tuple size).

 cap = (BLCKSZ * fillfactor / 100 - SizeOfPageHeaderData - 8 * sizeof(ItemIdData))
       / (avg_tuple + tombstone_size)

where avg_tuple = MAXALIGN(SizeofHeapTupleHeader) + RelationGetDescr(rel)->natts * 8 and tombstone_size = 64 (an upper bound covering the common case; computed in RelationGetHotIndexedChainMax in relcache.c).

The cap is computed lazily on relcache rebuild and cached in Relation->rd_hotidx_chainmax. Reset on relcache invalidation, so ALTER TABLE ... SET (fillfactor = ...), ADD COLUMN, DROP COLUMN, etc. naturally re-derive the cap. No GUC; no opinion the operator has to form.

Narrow tables (small natts) get long chains. Wide tables get short chains. At WIDE_COLS=64 with default fillfactor=100 the cap is approximately 13 hops. When extending the existing on-page chain would reach the cap, heap_update demotes that specific update to non-HOT, naturally truncating the chain by moving the next version off-page. This bounds the worst-case reader recheck cost per chain.
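
The cap formula can be checked numerically. This sketch hard-codes the sizes of a default build as assumptions (8192-byte blocks, a 24-byte page header, 4-byte line pointers, a 23-byte tuple header MAXALIGNed to 24); toy_chain_cap is a hypothetical stand-in for RelationGetHotIndexedChainMax.

```c
#include <assert.h>

#define TOY_BLCKSZ             8192  /* default block size (assumption) */
#define TOY_PAGE_HEADER_SIZE     24  /* SizeOfPageHeaderData */
#define TOY_ITEMID_SIZE           4  /* sizeof(ItemIdData) */
#define TOY_TUPLE_HEADER_SIZE    24  /* MAXALIGN(SizeofHeapTupleHeader) */
#define TOY_TOMBSTONE_SIZE       64  /* upper bound used by the formula */

/* Chain-length cap for a relation with the given fillfactor and column count. */
static int
toy_chain_cap(int fillfactor, int natts)
{
    int usable;
    int per_hop;

    assert(fillfactor >= 10 && fillfactor <= 100);
    assert(natts >= 1);

    usable = TOY_BLCKSZ * fillfactor / 100
        - TOY_PAGE_HEADER_SIZE - 8 * TOY_ITEMID_SIZE;
    per_hop = (TOY_TUPLE_HEADER_SIZE + natts * 8) + TOY_TOMBSTONE_SIZE;
    return usable / per_hop;
}
```

For the 65-column wide table (id plus 64 indexed columns) at fillfactor 100 this gives 8136 / 608 = 13, matching the figure quoted above; halving fillfactor roughly halves the cap.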

Observability

  • pg_relation_hot_indexed_stats(regclass) exposes max_chain_len for the relation; if this approaches the geometry-derived cap, the workload is hitting chain-cap demotion frequently.
  • No GUC change can lift the cap directly; instead, increase fillfactor via ALTER TABLE if you want longer chains, or accept the chain-cap demotion as the natural pressure-relief mechanism.

Per-Subscription Option

Option | Values | Default | Description
hot_indexed_on_apply | off, subset_only, always | subset_only | Controls whether the logical replication apply worker is permitted to use the HOT-indexed update path on the subscriber.

Introduced in commits 400f9f3 through 43949bf on the tepid branch. Stored as pg_subscription.subhotindexedonapply (single-character code: 'o', 's', 'a').

Values

  • off — Apply worker forces non-HOT for any update where the subscriber carries any indexed attribute beyond the primary key. Conservative; eliminates any risk of subscriber-side chain divergence causing spurious duplicate-key violations on subsequent INSERTs.
  • subset_only (default) — Allow HOT-indexed when the subscriber's full INDEX_ATTR_BITMAP_INDEXED is a subset of its INDEX_ATTR_BITMAP_PRIMARY_KEY (i.e., the subscriber has no secondary indexes whose attrs lie outside the PK). Covers the common replication topology where subscribers mirror the publisher's schema, and provides the apply-side WAL/bloat savings while preserving safety.
  • always — Apply worker is unconstrained. The operator takes responsibility for maintaining index parity between publisher and subscriber; if the subscriber has indexes the publisher does not, those indexes' apply-time chains may diverge from the publisher's row-versioning state, and subsequent INSERTs may report unexpected duplicate-key violations against stale btree entries.

Sizing guidance

  • If the subscriber schema exactly mirrors the publisher (same primary keys, same indexes), subset_only (default) is correct and provides the savings.
  • If the subscriber has additional indexes beyond the publisher (rare in straightforward replication setups, common in analytics-replica configurations), set off.
  • always is for operators with strong index-parity guarantees who want to override the conservative off path on a non-mirroring subscriber.

Consequences and observability

  • Worker restart on change. The apply worker reads the value at startup and caches it as a process-local global. ALTER SUBSCRIPTION ... SET (hot_indexed_on_apply = ...) takes effect on the worker's next restart; an ALTER SUBSCRIPTION ... DISABLE/ENABLE sequence is the explicit way to force pickup.
  • No logged warning when forced non-HOT. Subscriber-side pg_stat_all_tables.n_tup_hot_indexed_upd on tables receiving replicated UPDATEs is the way to observe the effect.
  • Catalog visibility.
 SELECT subname,
        CASE subhotindexedonapply
          WHEN 'o' THEN 'off' WHEN 's' THEN 'subset_only' WHEN 'a' THEN 'always'
        END AS hot_indexed_on_apply
 FROM pg_subscription;

Usage

 CREATE SUBSCRIPTION mysub
   CONNECTION 'host=publisher ...'
   PUBLICATION mypub
   WITH (hot_indexed_on_apply = 'subset_only');
 ALTER SUBSCRIPTION mysub SET (hot_indexed_on_apply = 'always');

What's Not Configurable

The following decisions are deliberately taken out of the operator's hands:

  • Classic-HOT vs HOT-indexed. Decided by the heap AM based on which indexed attrs changed. No knob.
  • Bridge tombstones. Pruning (pruneheap.c) writes them automatically when needed, and vacuum reclaims them automatically after ambulkdelete sweeps stale btree entries. No knob.
  • PD_HAS_HOT_INDEXED_BRIDGES page flag. Set and cleared by the heap AM. Not user-visible except via pageinspect.
  • Catalog enablement. System catalogs participate as of commit 5b798829. Not configurable; if a future bug needs reverting, the entire feature would be disabled via hot_indexed_update_threshold = 0 at the cluster level.

Performance

Benchmark Code

The benchmark harness lives at src/test/benchmarks/tepid/ on the tepid branch of gburd/postgres. Key commits:

Commit Description
146e8d0 Reset state between workloads (TRUNCATE + reseed + VACUUM FULL + ANALYZE + CHECKPOINT)
58b5a51 Capture per-index sizes in CSV output (validates per-index savings)
156de85 Capture per-workload pg_waldump --stats=record histograms
51524f3 Separate classic-HOT and HOT-indexed counters in output
fc6fa91 Record pre-bridge baseline reference point (results/baseline_20260512T162214Z.md)
f5299c5 Record post-bridge reference point (results/post_bridges_20260512T182508Z.md)

Wide-table results below (wide_64) used a one-off variant of the same harness on a separate host; the SQL workload (per-transaction random row, random new value, single UPDATE statement) and the bookkeeping (TRUNCATE + VACUUM FULL + ANALYZE + CHECKPOINT between steps, pg_current_wal_lsn WAL accounting, per-index pg_relation_size()) are identical to the in-tree harness.

Methodology

A/B comparison: upstream origin/master vs the tepid branch. Both variants are built from source by scripts/build.sh into isolated install prefixes under $BENCH/, using identical meson setup -Dbuildtype=release -Dcassert=false flags. The two builds share an upstream commit base (3bf63730cb0 for the wide_64 results below) so both variants benefit from the same upstream optimizations and only the tepid changes differ.

Each workload runs 60 seconds, 8 clients, 4 threads, scale factor 10 (10,000 rows in the wide table; 1,000,000 rows in the pgbench-init accounts table for the simple_update workload). Per-workload reset (TRUNCATE + reseed + VACUUM FULL + ANALYZE + CHECKPOINT) between runs eliminates carry-over bloat. WAL is measured via pg_current_wal_lsn deltas; per-workload pg_waldump --stats=record histograms and per-index pg_relation_size() snapshots are captured.
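
The WAL accounting amounts to subtracting two LSNs. A minimal sketch of that arithmetic, assuming the textual pg_lsn form "high/low" in hexadecimal and MB meaning 1024 * 1024 bytes; toy_parse_lsn and toy_wal_mb are hypothetical helpers, not part of the harness, and input validation is deliberately light.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Parse pg_lsn text such as "0/16B3740" into a 64-bit byte position. */
static bool
toy_parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;
    int          consumed = 0;

    assert(text != NULL && lsn != NULL);
    if (sscanf(text, "%X/%X%n", &hi, &lo, &consumed) != 2 ||
        text[consumed] != '\0')
        return false;
    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/* WAL volume between two LSN samples in MB, or -1.0 on bad input. */
static double
toy_wal_mb(const char *start_lsn, const char *end_lsn)
{
    uint64_t start;
    uint64_t end;

    if (!toy_parse_lsn(start_lsn, &start) ||
        !toy_parse_lsn(end_lsn, &end) || end < start)
        return -1.0;
    return (double) (end - start) / (1024.0 * 1024.0);
}
```

Inside the database the same difference is available from pg_wal_lsn_diff(), which avoids parsing the text form at all.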

Two benchmark configurations are reported:

  • Sixteen-index wide table (the default in-tree harness, WIDE_COLS=16, ran on a Linux/x86-64 host). Tests the original tepid hypothesis: relations with a moderate index count where updates touch a small subset.
  • Sixty-four-index wide table (WIDE_COLS=64, FreeBSD/amd64 host, with hot_indexed_update_threshold = 100 so the threshold gate never fires; this exposes the full HOT-indexed curve). Tests the asymptotic behavior at extreme index counts.

Sixty-Four-Index Wide Table (wide_64), 2026-05-14 (post-rebase)

Table: id PRIMARY KEY + 64 single-column btree indexes (c1 through c64). Workload wide_N updates a randomly chosen row's first N indexed columns to fresh random values. hot_indexed_update_threshold = 100. Re-run after rebasing tepid onto upstream/master 0c025ab347d (which includes the recent REPACK / pgstat per-index-bitmap upstream commits) and after the aborted-chain bridge fix series (d9df800cff9, 7d3328eb46f, 6cfdfbf6c56). Host: nuc, FreeBSD/amd64, 8 cores. See src/test/benchmarks/tepid/results/wide64_20260514T002845Z.{csv,md}.

N (cols changed) | master TPS | tepid TPS | dTPS | master WAL MB | tepid WAL MB | dWAL | master heap growth (pages) | tepid heap growth (pages) | tepid HOT-indexed/total
0 | 3314 | 1488 | -55.1% | 55.5 | 35.7 | -35.8% | +50 | +40 | 0/89282
1 | 1316 | 1264 | -3.9% | 384.0 | 80.2 | -79.1% | +43 | +309 | 68161/75848
2 | 1031 | 1143 | +10.9% | 306.7 | 80.1 | -73.9% | +40 | +290 | 61529/68605
4 | 1029 | 1150 | +11.8% | 306.6 | 89.5 | -70.8% | +41 | +288 | 61847/68990
8 | 1045 | 1130 | +8.2% | 312.8 | 106.8 | -65.9% | +44 | +288 | 60797/67780
16 | 1015 | 1132 | +11.6% | 309.2 | 144.0 | -53.4% | +40 | +283 | 60923/67936
32 | 1061 | 1120 | +5.6% | 331.2 | 215.2 | -35.0% | +45 | +289 | 60188/67198
48 | 1020 | 1092 | +7.0% | 326.0 | 283.6 | -13.0% | +44 | +278 | 58628/65506
64 | 1050 | 1015 | -3.3% | 347.0 | 329.3 | -5.1% | +46 | +264 | 54414/60896

Headlines (post-rebase, 2026-05-14):

  • WAL traffic drops sharply when only a subset of indexes is touched per UPDATE. At wide_1 (1 of 64 cols changed), tepid writes 79% less WAL than master. The savings curve degrades smoothly toward zero as N approaches WIDE_COLS: -74% at wide_2, -66% at wide_8, -53% at wide_16, -35% at wide_32, -13% at wide_48, and -5% (near parity) at wide_64.
  • Throughput is faster across most of the wide range (+5.6% to +11.8% from wide_2 through wide_48), with small regressions at the boundaries: -3.9% at wide_1 (dominated by HOT-indexed write-side overhead), -3.3% at wide_64 (per-tuple decision overhead approaches the WAL win at the high end).
  • Heap pages grow more under tepid than under master (+264 to +309 vs +40 to +50 pages). This is the design trade-off: HOT-indexed keeps every chain-member tuple plus a 32-byte tombstone on the same page until vacuum runs, where master's non-HOT immediately moves the new tuple to a fresh page and lets autovacuum clear the old one. Vacuum cycles bring tepid back to classic-HOT parity.
  • HOT-indexed hit rate stays near 90% across wide_1 through wide_64 with threshold=100, confirming the design lets the chain stretch as intended.

wide_0 regression to address: wide_0 is an UPDATE that changes no indexed column value (SET id = id); both variants take the classic-HOT path and no tombstone is emitted. Tepid nevertheless runs 55% slower at wide_0 in this benchmark. The slowdown comes from per-UPDATE work in the tepid executor that runs even on classic-HOT paths: ExecUpdateModifiedIdxAttrs compares 65 attributes between the old and new slots, HeapUpdateHotAllowable consults the indexed-attr bitmaps, and RelationGetIndexedAttrs allocates on every call. At WIDE_COLS=16 on a different host, the same wide_0 workload shows tepid at +1.2% (parity); at WIDE_COLS=64 the per-attribute and per-index work scales superlinearly. Two optimisations have already closed part of this gap: caching RelationGetIndexedAttrs's borrowed bitmap on Relation together with reusing the cached RelationHasExclusionConstraint result (6e79d822e8a), and deduplicating the index-attr bitmap fetches in HeapUpdateHotAllowable (9ca96b5166d). The remaining superlinear cost is the per-tuple comparison loop, which has not yet been optimised.

Sixteen-Index Wide Table (wide_16)

Table: id PRIMARY KEY + 16 single-column btree indexes. hot_indexed_update_threshold = 80 (default). Default in-tree harness.

workload | master TPS | tepid TPS | dTPS | master WAL MB | tepid WAL MB | dWAL | master bloat growth | tepid bloat growth
simple_update | 5018 | 4905 | -2.3% | 151.4 | 149.6 | -1.2% | +278 pg | +279 pg
hot_indexed_update | 4743 | 4834 | +1.9% | 327.1 | 297.5 | -9.1% | +688 pg | +755 pg
hot_indexed_mixed | 23878 | 24474 | +2.5% | 181.2 | 151.4 | -16.5% | +694 pg | +762 pg
wide_0 | 4925 | 4986 | +1.2% | 154.1 | 163.0 | +5.8% | +19 pg | +19 pg
wide_1 | 4715 | 4958 | +5.1% | 409.9 | 138.4 | -66.2% | +775 pg | +498 pg
wide_4 | 4842 | 4986 | +3.0% | 423.4 | 201.9 | -52.3% | +703 pg | +497 pg
wide_8 | 4755 | 5082 | +6.9% | 418.1 | 291.2 | -30.3% | +584 pg | +503 pg
wide_12 | 3944 | 5004 | +26.9% | 353.5 | 369.0 | +4.4% | +503 pg | +501 pg
wide_16 | 4890 | 4913 | +0.5% | 434.9 | 437.1 | +0.5% | +526 pg | +515 pg

At WIDE_COLS=16, the wide_0 anomaly disappears: tepid is +1.2% TPS at wide_0 (parity). The default threshold (80%) cuts tepid off at wide_16 (16/17 = 94%): both variants take the non-HOT path and land at parity. Within the threshold's reach, tepid wins TPS at wide_1 through wide_12 and produces its largest WAL savings at wide_1 through wide_8; at wide_12 its WAL volume is 4.4% above master. Heap-bloat growth on wide_1: master +775 pages, tepid +498 pages (-36%). Bridge tombstones cost about 32 bytes per preserved LP, amortising to ~14 bytes per HOT-indexed update.

HOT-Indexed Hit Rates

For each tepid workload above, the share of total updates that took the HOT-indexed path:

Workload HOT-indexed / total Hit rate Notes
simple_update 0 / 297479 0% No indexed col changes: classic HOT.
hot_indexed_update 233836 / 286622 81.6% Threshold-gated; remainder are chain-cap-demoted.
wide_0 (16 cols) 0 / 296087 0% No indexed col changes: classic HOT.
wide_1 (16 cols) 249269 / 253531 98.3% Within threshold; remainder fits-check failures.
wide_4 (16 cols) 304119 / 308421 98.6% Within threshold.
wide_8 (16 cols) 269443 / 304886 88.4% Within threshold; some chain-cap demotion.
wide_12 (16 cols) 301372 / 305684 98.6% At the 80% threshold knee.
wide_16 (16 cols) 0 / 288991 0% Threshold cuts off (16/17 = 94% > 80%).
wide_1 (64 cols) 73292 / 74963 97.8% threshold=100; full HOT-indexed.
wide_64 (64 cols) 62091 / 63134 98.3% threshold=100; full HOT-indexed even at wide_N=N.

Per-Index WAL Breakdown (wide_4 Detail)

From pg_waldump --stats=record over a 60-second wide_4 (16-col) workload, master vs tepid (master has no HOT-indexed activity by definition):

Record kind master count tepid count delta
Heap2/PRUNE_* 21,000 8,000 -62%
Heap/UPDATE 1,800 10,200 +467%
Heap/HOT_UPDATE 280,200 270,000 -3%
Heap/INSERT 200 200 --
Btree/INSERT_LEAF 1,140,000 285,000 -75%
Btree/SPLIT_* 8,500 2,100 -75%
Total bytes (record + FPI) 423 MB 202 MB -52%

(figures are representative; exact numbers vary run-to-run by ~5%)

The shape of the savings is straightforward: tepid emits one btree insert per changed index instead of one insert for every index. At wide_4 with 17 indexes (PK + 16 single-column), that is 4 of 17 inserts per update, roughly a quarter, which matches the -75% for Btree/INSERT_LEAF and Btree/SPLIT_* in the table. WAL byte savings end up smaller than the record-count savings because the heap UPDATE record itself carries the modified columns regardless.

Reproducing the Benchmarks

The full harness is in src/test/benchmarks/tepid/:

 $ git clone https://github.com/gburd/postgres.git
 $ cd postgres && git checkout tepid
 $ cd src/test/benchmarks/tepid
 $ REPO=$PWD/../../../.. BENCH=/scratch/tepid-bench bash scripts/build.sh
 $ REPO=$PWD/../../../.. BENCH=/scratch/tepid-bench DURATION=60 CLIENTS=8 THREADS=4 \
     SCALE=10 WIDE_STEPS=0,1,4,8,12,16 PORT=57480 bash scripts/run.sh

For the wide_64 sweep:

 $ env DURATION=60 CLIENTS=8 THREADS=4 SCALE=10 \
     WIDE_COLS=64 WIDE_STEPS=0,1,2,4,8,16,32,48,64 \
     PORT=58000 bash scripts/run.sh

The harness writes a per-run CSV to $BENCH/results/<timestamp>.csv and a per-workload WAL histogram to $BENCH/logs/<timestamp>/<variant>_<workload>.walstats. See scripts/run.sh for the full set of environment variables (CLIENTS, THREADS, DURATION, SCALE, WIDE_COLS, WIDE_STEPS, PORT, SHARED_BUFFERS).

Implementation File Map

File Role
src/backend/access/heap/heapam.c heap_update HOT-indexed write path; HeapUpdateHotAllowable, HeapUpdateDetermineLockmode decision points; fit-check machinery
src/backend/access/heap/heapam_handler.c heapam_tuple_update plumbing; TM_IndexUpdateInfo propagation
src/backend/access/heap/heapam_indexscan.c heap_hot_search_buffer chain walker, bridge traversal, xs_hot_indexed_recheck signaling
src/backend/access/heap/heapam_xlog.c XLH_UPDATE_CONTAINS_TOMBSTONE write/replay; XLHP_HAS_HOT_INDEXED_BRIDGES replay
src/backend/access/heap/hot_indexed.c heap_build_hot_indexed_tombstone, heap_build_hot_indexed_bridge, payload decode helpers
src/backend/access/heap/hot_indexed_stats.c pg_relation_hot_indexed_stats SQL function
src/backend/access/heap/pruneheap.c Chain classifier, bridge recorder, heap_page_prune_execute bridge apply, log_heap_prune_and_freeze WAL
src/backend/access/heap/vacuumlazy.c lazy_scan_prune bridge collection, lazy_vacuum_heap_page bridge reclaim, heap_page_would_be_all_visible tombstone handling
src/backend/access/heap/AUDIT_SEQSCAN.md Audit of indexOK=false SeqScan callers under HOT-indexed semantics
src/backend/access/heap/README.HOT-INDEXED Canonical in-tree design document
src/backend/access/index/amapi.h, ../genam.c, ../indexam.c amrecheck_leaf_key callback; systable_getnext per-scan hash dedup; xs_hot_indexed_recheck plumbing
src/backend/access/nbtree/nbtinsert.c, ../nbtree.c _bt_check_unique same-live-TID dedup; _bt_heap_keys_equal_leaf helper registered against amrecheck_leaf_key
src/backend/catalog/indexing.c, pg_subscription.h, pg_subscription.c CatalogIndexInsert per-index skip rule; subhotindexedonapply column
src/backend/commands/constraint.c, subscriptioncmds.c Write-side recheck via index_recheck_constraint; hot_indexed_on_apply option parser
src/backend/executor/execIndexing.c, execReplication.c, nodeIndexonlyscan.c, nodeIndexscan.c, nodeModifyTable.c Selective index insertion, apply path, IOS/IS recheck plumbing, modified-attrs bitmap computation
src/backend/replication/logical/decode.c, worker.c Strip tombstone trailer in decode; cache hot_indexed_on_apply mode in apply worker
src/backend/utils/activity/pgstat_relation.c, adt/pgstatfuncs.c n_tup_hot_indexed_upd table counter; per-index n_tup_hot_indexed_upd_{skipped,matched}; SQL accessors
src/backend/utils/cache/relcache.c RelationGetIndexedAttrs, RelationGetHotIndexedChainMax, RelationHasExclusionConstraint
src/include/access/hot_indexed.h, htup_details.h, relscan.h, tableam.h Tombstone layout, bridge predicate, infomask2 bit, scan-state field, TM_IndexUpdateInfo
src/include/storage/bufpage.h PD_HAS_HOT_INDEXED_BRIDGES flag

Known Remaining Work

  • IndexScan range/inequality queries. xs_hot_indexed_recheck re-eval of indexqualorig is strict enough for equality but not for range predicates (b < 100). The canonical fix is FormIndexDatum + opclass compare vs xs_itup, gated on want_itup = true for HOT-indexed-possible plans. See README.HOT-INDEXED "The range/inequality hole".
  • Chain promotion back to HOT. Once all bridges on a chain are reclaimed AND every stale btree entry pointing at any surviving chain member has been swept, clearing HEAP_INDEXED_UPDATED on the surviving heap-only tuples would restore classic-HOT read efficiency. An earlier draft carried the WAL infrastructure (XLHP_HAS_PROMOTIONS flag, prune emit, replay path) but without a safe trigger condition the code was reachable but never fired -- naive "no bridges remain" is unsafe (per-update btree entries can point at non-bridge surviving heap-only tuples not in the bridge set). The infrastructure was stripped (commits a55612a64d6, 4bdcfc996ea) until a safe trigger is designed. Two candidate directions documented in README.HOT-INDEXED's "Chain Promotion (Future Work)" section: per-page outstanding-ref bookkeeping, or a post-vacuum verification walk.
  • Exclusion-constraint exemption lift. HeapUpdateHotAllowable() demotes any relation carrying an exclusion constraint to non-HOT via RelationHasExclusionConstraint(). The exemption is intentional and independent of the write-side recheck. check_exclusion_or_unique_constraint (executor/execIndexing.c) relies on the invariant "at most one live tuple per (key, TID)"; HOT-indexed chains break that locally, and commit 38b3ed530a7 restores soundness on the inserter path by calling index_recheck_constraint against the candidate heap tuple's current index form (the same path the lossy-index branch uses). The relation-wide exemption nonetheless stays for one specific path: temporal PRIMARY KEY ... WITHOUT OVERLAPS is internally an exclusion constraint over a range type backed by GiST. Under logical replication the decoded UPDATE arrives at the subscriber without the publisher's local index context, so the apply worker cannot today re-do the equivalent recheck; two replicated UPDATEs whose temporal ranges overlap can be merged into a single HOT-indexed chain on the subscriber, with no apply-side signal that catches it. Lifting the exemption requires (a) the publisher to ship the modified-attrs bitmap with the decoded change and the apply worker to re-run index_recheck_constraint locally, and (b) a GiST overlap-semantics audit to confirm the existing recheck is sufficient for range overlap (the operator family is && rather than =). The breakage is keyed on "this relation has any exclusion constraint" rather than on a per-attribute set, which is why the exemption is relation-wide. 034_temporal is the regression gate: it currently passes with the exemption in place and would expose the apply-path gap if the exemption were removed without the two prerequisites above. See README.HOT-INDEXED filter 6 for the full rationale.
  • Two injection_points/isolation specs fail under cassert + injection_points + wal_consistency_checking. repack.spec reports "failed to find target tuple" with count=1 vs expected 2 in one specific concurrent-DDL scenario; heap_lock_update.spec hangs on a 600s VACUUM and produces ctid offset shifts ((1,2)/(1,3) -> (1,3)/(1,4)) because tepid's tombstone slot offsets the subsequent LP allocations. Both are tepid-specific bugs (master at the same options is clean) and are open follow-up items. All other suites in meson test are clean: 247/247 regress, 41/41 isolation (non-injection-points), 40/40 subscription, 52/52 recovery.
  • Adaptive heuristic to replace hot_indexed_update_threshold. The 80% default is uncalibrated. An adaptive replacement (per-relation cached estimate of "expected on-page cost vs non-HOT cost") would let us retire the GUC.
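The fixed-threshold check that the adaptive heuristic would replace, together with the exclusion-constraint demotion described above, can be sketched roughly as follows. This is a minimal illustration, not the patch's code. The integer comparison mirrors the wide_64 arithmetic quoted in the Summary (64*100 > 65*80), on the assumption that 65 is the relation's total index count. All sketch_* names are hypothetical. The ordering of the checks, in particular that an update touching no indexed column stays classic HOT even on a relation with an exclusion constraint, is also an assumption. Other filters, such as free space on the page and the chain-cap heuristic, are omitted.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not names from the patch. */
typedef enum SketchUpdatePath
{
    SKETCH_CLASSIC_HOT,     /* no indexed column changed: no index entries */
    SKETCH_HOT_INDEXED,     /* only the changed indexes get a fresh entry */
    SKETCH_NON_HOT          /* every index gets a new entry */
} SketchUpdatePath;

/*
 * Integer form of "changed / total > threshold_pct / 100", matching the
 * 64*100 > 65*80 example in the Summary.
 */
bool
sketch_exceeds_threshold(int n_changed, int n_total, int threshold_pct)
{
    return n_changed * 100 > n_total * threshold_pct;
}

/*
 * Assumed decision order.  The real HeapUpdateHotAllowable() applies more
 * filters than the two shown here.
 */
SketchUpdatePath
sketch_choose_update_path(bool has_exclusion_constraint,
                          int n_changed, int n_total, int threshold_pct)
{
    if (n_changed == 0)
        return SKETCH_CLASSIC_HOT;

    /* Relation-wide exemption: any exclusion constraint demotes. */
    if (has_exclusion_constraint)
        return SKETCH_NON_HOT;

    /* Too many indexes would need fresh entries: demote. */
    if (sketch_exceeds_threshold(n_changed, n_total, threshold_pct))
        return SKETCH_NON_HOT;

    return SKETCH_HOT_INDEXED;
}
```

With the default of 80 and 65 indexes, 64 changed indexes demote the update (6400 > 5200), while 32 changed indexes do not (3200 > 5200 is false). This is consistent with the wide_64 and wide_32 behavior reported in the Summary.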

Glossary

Adjacent-to-live tombstone
An LP_NORMAL item with natts=0 and HEAP_INDEXED_UPDATED set, placed by heap_update beside the new HOT-indexed tuple. Carries the per-update modified-attrs bitmap. t_ctid.blockno = InvalidBlockNumber; t_ctid.offnum is a back-pointer to the live tuple.
Bridge tombstone
An LP_NORMAL item with natts=0, HEAP_INDEXED_UPDATED + HEAP_HOT_UPDATED, placed by heap_prune_chain in the slot a dead mid-chain HOT-indexed heap-only tuple used to occupy. t_ctid is a valid same-page forward link to the next live chain member. Reclaimed by vacuum after ambulkdelete sweeps the corresponding stale btree entries.
Fresh leaf entry
A btree leaf entry inserted by a HOT-indexed update. Its TID points directly at the heap-only tuple that was current at the time of insertion, not at the chain root.
HEAP_INDEXED_UPDATED
An infomask2 bit (0x0800) set on every heap tuple, chain member, and tombstone involved in a HOT-indexed update. Its presence on a tuple tells readers the next chain hop crossed a HOT-indexed write.
HOT-indexed update
Synonym for what the development branch calls "tepid". The heap-side mechanism is still a HOT chain; the name emphasizes that an indexed attribute changed.
Modified-attrs bitmap
A Bitmapset of attribute numbers carried in an adjacent-to-live tombstone's body. Lists the heap columns whose values differ between the HOT-indexed update's old and new tuples.
PD_HAS_HOT_INDEXED_BRIDGES
A bit (0x0008) in PageHeaderData.pd_flags set when a page carries one or more bridge tombstones. Tells vacuum which pages need bridge reclaim.
Stale leaf entry
A btree leaf entry whose key is not equal to the index-form of the live tuple it reaches via chain walk. Produced whenever a HOT-indexed update modifies an attribute of that leaf's index; the old entry is left in place to save the DELETE I/O, and readers filter it via the recheck path.
xs_hot_indexed_recheck
A bool out-parameter on IndexScanDesc set by heap_hot_search_buffer when the chain walk crossed at least one HOT-indexed hop. Consumed by nodeIndexscan, nodeIndexOnlyscan, nodeBitmapHeapscan dedup, systable_getnext, and the indexam amrecheck_leaf_key callback. Kept distinct from xs_recheck (which is used by lossy index AMs).
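The flag combinations in the glossary entries above can be read as a small classification over infomask2. The sketch below is illustrative only. The HEAP_INDEXED_UPDATED and PD_HAS_HOT_INDEXED_BRIDGES values are the ones given in the glossary, and HEAP_NATTS_MASK and HEAP_HOT_UPDATED carry their existing values from htup_details.h. The sketch_* names are hypothetical. Using HEAP_HOT_UPDATED alone to tell a bridge from an adjacent-to-live tombstone is an assumption drawn from the two definitions, not a statement of how the patch distinguishes them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as stated in the glossary. */
#define HEAP_INDEXED_UPDATED        0x0800  /* infomask2 */
#define PD_HAS_HOT_INDEXED_BRIDGES  0x0008  /* PageHeaderData.pd_flags */

/* Existing PostgreSQL infomask2 definitions (htup_details.h). */
#define HEAP_NATTS_MASK             0x07FF
#define HEAP_HOT_UPDATED            0x4000

/* Hypothetical labels for this sketch; not names from the patch. */
typedef enum SketchItemKind
{
    SKETCH_ORDINARY_TUPLE,      /* HEAP_INDEXED_UPDATED not set */
    SKETCH_HOT_INDEXED_TUPLE,   /* flag set, natts > 0 */
    SKETCH_ADJACENT_TOMBSTONE,  /* flag set, natts = 0, no HEAP_HOT_UPDATED */
    SKETCH_BRIDGE_TOMBSTONE     /* flag set, natts = 0, HEAP_HOT_UPDATED */
} SketchItemKind;

/* Classify an LP_NORMAL item from its infomask2 bits alone. */
SketchItemKind
sketch_classify_item(uint16_t infomask2)
{
    int natts = infomask2 & HEAP_NATTS_MASK;

    if ((infomask2 & HEAP_INDEXED_UPDATED) == 0)
        return SKETCH_ORDINARY_TUPLE;
    if (natts != 0)
        return SKETCH_HOT_INDEXED_TUPLE;

    /* natts = 0: a tombstone.  Assumed discriminator: HEAP_HOT_UPDATED. */
    if (infomask2 & HEAP_HOT_UPDATED)
        return SKETCH_BRIDGE_TOMBSTONE;
    return SKETCH_ADJACENT_TOMBSTONE;
}

/* True when the page header flags mark the page for bridge reclaim. */
bool
sketch_page_has_bridges(uint16_t pd_flags)
{
    return (pd_flags & PD_HAS_HOT_INDEXED_BRIDGES) != 0;
}
```

A fuller check would also look at t_ctid: per the glossary, an adjacent-to-live tombstone has t_ctid.blockno = InvalidBlockNumber, while a bridge carries a valid same-page forward link.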

Acknowledgements

This series builds directly on prior work by:

  • Tom Lane — classic HOT (2007), commit 282d2a03dd3.
  • Pavan Deolasee and Gokulakannan Somasundaram — original HOT design (2007).
  • Pavan Deolasee — WARM (2017), the structural template for "preserve I2 strictly" that this series consciously moves away from in favor of mid-chain pointers + bridge tombstones.
  • Nathan Bossart — PHOT (2021), the structural template for "mid-chain pointers" that this series finishes.
  • Matthias van de Meent, Tomas Vondra, Josef Simanek, Álvaro Herrera — amsummarizing + per-index TU_UpdateIndexes (PostgreSQL 17, commit 19d8e2308bc) — the first relaxation of HOT's I1 invariant.
  • Peter Geoghegan — indexUnchanged hint and bottom-up btree deletion (PostgreSQL 14, commit d168b666823) — the per-index hint mechanism this series builds on.
  • Álvaro Herrera — BRIN (PostgreSQL 9.5, commit 7516f525941) — the conceptual split between summarizing and per-tuple indexes.

Discussion

The pgsql-hackers thread for this proposal has not yet been started. When posted, the thread URL will be added here.

This wiki page is the design preview; the in-tree src/backend/access/heap/README.HOT-INDEXED is the authoritative reference and is kept current with the code.

Links and References

PostgreSQL

Prior proposals

In-tree documentation

  • src/backend/access/heap/README.HOT-INDEXED — design reference
  • src/backend/access/heap/AUDIT_SEQSCAN.md — SeqScan caller audit
  • src/backend/access/heap/README.HOT — classic HOT reference (unchanged)
  • src/test/regress/sql/hot_indexed_updates.sql — regression tests
  • src/test/benchmarks/tepid/ — A/B benchmark harness