Heap HOT Selective Index Updates

HOT with Selective Index Updates

Request for Comments [RFC] NOT MERGED

HOT is one of the most important performance optimizations in PostgreSQL history, but it has a significant limitation: it applies only when no indexed columns are modified. This blocks optimization on wide tables where only a few columns are indexed. Various efforts have tried to lift this limitation, including selective index update proposals, deferred index maintenance, and partial HOT chains. However, none have been committed to core PostgreSQL.

The flaws in those previous efforts fall into two categories:

Design complexity: Some proposals required undo-chain navigation (difficult without a full UNDO infrastructure). Others used background workers to defer index updates (but queries see stale indexes during maintenance windows). Still others tried to track per-column index eligibility at the tuple level (too much memory overhead).

Correctness gaps: Previous attempts didn't adequately handle concurrent prune/vacuum scenarios, leading to orphaned references or index corruption under load. They also struggled with replication safety—subscriber indexes can diverge from publisher indexes, and prior proposals didn't address this systematically.

This proposal, HOT-indexed updates, differs in three ways:

1. Attribute-bitmap staleness, not a value recheck. A HOT-indexed update stays on the HOT chain and maintains only the indexes whose attributes changed. Each new tuple records, inline in its tail, which indexed attributes changed at that hop; a reader unions the bitmaps of the hops it crosses and drops an entry whose index columns overlap that union. This is access-method agnostic and, crucially, correct under a value cycled away and back (ABA) — the case that sank WARM's value recheck.

2. Collapse to xid-free stubs, no "convert back" step. Prune rewrites a dead chain prefix into xid-free forwarding stubs that preserve each surviving hop's bitmap; once VACUUM's index cleanup has swept the stale entries, a later prune reclaims the stubs and re-points the root redirect, collapsing back to classic HOT. There is no persistent per-tuple state to reconcile.

3. Replication safety: a per-subscription option hot_indexed_on_apply (off / subset_only / always) gates the apply path, so a HOT-indexed update of a replica-identity attribute leaves a stale leaf only when the apply worker's RI lookup can tolerate it.

---

Measured performance (indicative)

A/B run of two release (cassert=off) builds — origin/master vs the SIU series — on a single Apple Silicon laptop (macOS), pgbench, scale 5 (siu_table = 500k rows with 3 secondary indexes; wide_table = 5k rows with 16 secondary indexes + PK), 8 clients / 4 threads, 20 s per cell. pgbench runs for a fixed time, so each variant completes a different number of updates; the write-amplification signal is therefore reported as WAL bytes per update.

Workload (indexed cols changed)	TPS master→tepid	WAL/update master→tepid
simple_update (control; HOT both)	32.6k → 32.1k (−2%)	265 → 265 B (0%)
hot_indexed_update (1 of 4)	58.2k → 68.6k (+18%)	636 → 487 B (−23%)
wide, 1 of 17 indexes	33.6k → 41.7k (+24%)	1466 → 598 B (−59%)
wide, 8 of 17 indexes	37.4k → 47.3k (+26%)	1498 → 1015 B (−32%)
wide, 16 of 17 indexes	36.3k → 37.3k (+3%)	1530 → 1490 B (−3%)
read_indexscan (read-only)	164.4k → 161.3k (−2%)	n/a (no writes)

Reading the table:

The win scales with how many indexes are skipped. Changing one indexed column on a 17-index table cuts WAL per update by ~59% and lifts throughput ~24%; changing all but one (16 of 17) leaves almost nothing to skip, so SIU converges to master (within noise).
simple_update changes no indexed column (HOT on both variants); it is a control, and the identical WAL/update and near-identical throughput (within run-to-run noise) confirm SIU adds no measurable overhead to the existing HOT path.
read_indexscan is read-only on a freshly reset table with no stale entries. master and tepid are at parity (−2%, within run-to-run noise): the crossed-attribute read path adds no per-scan key comparison and, because the read path no longer materializes the leaf IndexTuple, no per-scan itup cost either.

Caveats: a single short run on a laptop, so absolute TPS and post-run index sizes are noise- and autovacuum-sensitive. WAL-per-update and the read parity are the robust signals; the directional throughput gains are consistent with them. A multi-run sweep on dedicated hardware (the harness in src/test/benchmarks/siu supports it) remains future work.
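The percentage columns can be rechecked from the raw pairs. The sketch below is only illustrative arithmetic: the helper names wal_per_update and pct_change are hypothetical (they are not part of the benchmark harness), and the input pairs are copied from the WAL/update column of the table above.

```python
def wal_per_update(wal_bytes: int, updates: int) -> float:
    """WAL bytes written per completed update (normalizes fixed-time runs)."""
    return wal_bytes / updates


def pct_change(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (master, patched) WAL bytes per update, copied from the table above.
WAL_PER_UPDATE = {
    "hot_indexed_update (1 of 4)": (636, 487),
    "wide, 1 of 17 indexes": (1466, 598),
    "wide, 8 of 17 indexes": (1498, 1015),
    "wide, 16 of 17 indexes": (1530, 1490),
}

for name, (master, patched) in WAL_PER_UPDATE.items():
    print(f"{name}: {pct_change(master, patched):+d}%")
```

Normalizing by completed updates matters because a faster variant finishes more updates in the same 20 s and would otherwise appear to write more WAL in total.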

---

The Problem

Under the heap's MVCC model an UPDATE writes a new row version instead of overwriting the old one in place, so the resulting heap and index bloat pushes cleanup cost onto VACUUM (or pruning). Avoiding as much of that bloat as possible saves space and I/O.

PostgreSQL's heap-only tuple (HOT) optimization applies when the new version of a row fits on the same page as the old version. In that case index updates can be skipped entirely, as long as no indexed attribute was modified. Change one indexed attribute and every index must be updated, even those that don't reference any changed attribute. A small optimization was made for summarizing indexes, which always need updates: the update can remain HOT when no non-summarizing index is affected. In short, before this patch HOT avoids index updates only when no indexed column changes; once one does, every index is updated, whether or not it cares about the columns that changed.

Consider a table with fifty columns and twenty indexes (one on supplier_id, another on price, and then 18 more). An UPDATE that changes only price but not supplier_id (or other indexed columns) still updates 20 indexes, not one. For wide tables with many indexes, this creates unnecessary write amplification, plus index bloat that must be cleaned up by VACUUM and skipped over during index scans.

The solution is simple: if an index doesn't reference a column, then the index doesn't care whether that column changed. We should be able to skip updating such indexes... in theory.

---

How This Works

Step 1: Identify What Indexed Attributes Changed

When we UPDATE a row, the executor (ExecUpdateModifiedIdxAttrs, which replaces heapam's old HeapDetermineColumnsInfo) compares the old and new tuples over the relation's indexed-attribute set and builds a bitmap of the indexed attributes whose values actually changed (modified_idx_attrs):

changed an indexed attribute → set its bit
otherwise → leave it clear

HeapUpdateHotAllowable then classifies the update from that bitmap:

HEAP_UPDATE_ALL_INDEXES — not HOT (e.g. all indexed attributes changed, an expression-index input changed, a system catalog, or the per-subscription apply gate); a new tuple and an entry in every index.
HEAP_UPDATE_HOT — no indexed attribute changed; classic HOT.
HEAP_SELECTIVE_INDEX_UPDATE — some, but not all, indexed attributes changed; stay on the HOT chain and maintain only the changed indexes.

A simple table with two indexes:

Table: users(id, email, bio, status)
Indexes: idx_email(email), idx_status(status)

Then consider an UPDATE that changes one of the two indexed columns.

UPDATE users SET status = 'active' WHERE id = 1;

We note that a subset of indexes overlap with the changed attributes, so we can selectively update those.

  changed indexed attrs = {status}     (email unchanged)
  not empty, not "all indexed"         => HEAP_SELECTIVE_INDEX_UPDATE
  maintain idx_status (insert a fresh entry); skip idx_email
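The classification in Step 1 can be modelled in a few lines. This is a minimal sketch, not the patch's code: classify_update and UpdateKind are hypothetical names, Python's != stands in for the executor's datum comparison, and the system-catalog and per-subscription apply carve-outs are deliberately left out.

```python
from enum import Enum


class UpdateKind(Enum):
    HEAP_UPDATE_ALL_INDEXES = "all"
    HEAP_UPDATE_HOT = "hot"
    HEAP_SELECTIVE_INDEX_UPDATE = "selective"


def classify_update(old_row, new_row, indexes, expr_inputs=frozenset()):
    """indexes maps index name -> set of column names it references.

    Returns (kind, names of the indexes that need a fresh entry).
    """
    indexed = set().union(*indexes.values())
    # Model of modified_idx_attrs: indexed columns whose value changed.
    changed = {col for col in indexed if old_row[col] != new_row[col]}
    if not changed:
        return UpdateKind.HEAP_UPDATE_HOT, []
    if changed == indexed or changed & set(expr_inputs):
        return UpdateKind.HEAP_UPDATE_ALL_INDEXES, sorted(indexes)
    maintain = sorted(name for name, cols in indexes.items() if cols & changed)
    return UpdateKind.HEAP_SELECTIVE_INDEX_UPDATE, maintain


indexes = {"idx_email": {"email"}, "idx_status": {"status"}}
old = {"id": 1, "email": "a@x", "bio": "", "status": "paused"}
new = dict(old, status="active")
print(classify_update(old, new, indexes))
```

For the users example this selects only idx_status; changing bio alone stays classic HOT, and changing both email and status falls back to updating every index.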

Step 2: Mark the New Tuple and Record Which Attributes Changed

The new version is a heap-only tuple linked into the chain via t_ctid, exactly like classic HOT, plus the HEAP_INDEXED_UPDATED bit (t_infomask2 0x0800) and, appended after its attribute data, a fixed-size bitmap of the attributes that changed at this hop. The executor inserts a fresh index entry only into the changed indexes, and that fresh entry points at the new heap-only tuple's own TID, not at the chain root. Unchanged indexes are not touched: their existing entries still resolve through the chain.

There is no "tombstone" line pointer. The bitmap is inline in the data-bearing tuple; its length is ceil(natts/8) bytes, sized by the tuple's own attribute count at write time, so it survives ADD COLUMN (see "Bitmap sizing across DDL").

INSERT (1, 'a@x', 'active'):
  LP[1] = T1(email='a@x', status='active')   root; idx_email,idx_status -> LP[1]

UPDATE status='paused' (HOT-indexed, change {status}):
  LP[1] = T1   root, HEAP_HOT_UPDATED, t_ctid -> 2
  LP[2] = T2(email='a@x', status='paused')   heap-only, INDEXED_UPDATED{status}
  idx_status gains a fresh entry ('paused') -> LP[2]
  idx_email is NOT touched: its entry ('a@x') -> LP[1] still resolves the chain

Step 3: Reads Drop Stale Entries by the Crossed-Attribute Bitmap

A pre-update index entry can now be stale: it chain-leads to a live tuple whose current key differs. The read side detects this without any value comparison. heap_hot_search_buffer walks the chain to the live tuple and unions the per-hop bitmaps of every hop crossed after the arriving entry's own tuple (the entry's own producing hop does not count — a fresh entry is never stale for its own index). The index-access layer tests that union against the arriving index's key columns: overlap ⇒ stale, drop; disjoint ⇒ current, return. The row a stale entry would have surfaced is re-supplied by the fresh entry the same update planted.

Scan status='paused' via idx_status -> LP[2]:
  arrive AT LP[2] (own hop {status} not counted); no later hop crossed
  crossed = {}    => current => return T2.                              OK
Scan status='active' via idx_status -> LP[1] (stale):
  cross ->2 {status};  crossed = {status};  {status} & {status} = {status}
                  => stale => drop (T2 is supplied by the 'paused' entry).  OK

This is access-method agnostic: it never reconstructs or compares an index key, so btree, hash, GiST, GIN and SP-GiST all work, and a scan never has to materialize the leaf IndexTuple for staleness purposes.
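The read-side rule above can be sketched as a toy chain walk. This is an illustrative model only: HeapTuple and fetch_via_index are hypothetical names, the chain tail is assumed to be the live tuple, and snapshot visibility, stubs, and redirects are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeapTuple:
    values: dict             # column name -> current value
    changed: frozenset       # attributes changed on the hop INTO this tuple
    next_lp: Optional[int]   # same-page t_ctid successor; None at the chain tail


def fetch_via_index(page, entry_lp, index_cols):
    """Return the live row's values, or None when the arriving entry is stale."""
    crossed = set()
    lp = entry_lp
    # The entry's own tuple is not counted; only hops crossed after it are.
    while page[lp].next_lp is not None:
        lp = page[lp].next_lp
        crossed |= page[lp].changed
    if crossed & set(index_cols):
        return None          # overlap: stale, drop
    return page[lp].values   # disjoint: current, return the live tuple


# The Step 2 chain: T1 at LP[1], T2 at LP[2] with bitmap {status}.
page = {
    1: HeapTuple({"email": "a@x", "status": "active"}, frozenset(), 2),
    2: HeapTuple({"email": "a@x", "status": "paused"}, frozenset({"status"}), None),
}
print(fetch_via_index(page, 2, {"status"}))  # fresh idx_status entry
print(fetch_via_index(page, 1, {"status"}))  # stale idx_status entry
print(fetch_via_index(page, 1, {"email"}))   # untouched idx_email entry
```

Note that no index key is ever passed in: the verdict depends only on which columns the index covers and which attributes the crossed hops recorded.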

Index-only scans: the all-visible page problem and the per-entry staleness check

Index-only scans (IOS) need two things from SIU that an ordinary index scan does not, because an IOS deliberately tries to answer a query without touching the heap — it serves the result columns out of the index tuple (xs_itup) and skips the heap fetch whenever the target page is marked all-visible in the visibility map (VM). Both of those shortcuts are unsafe over a HOT-indexed chain unless we account for them.

(1) All-visible pages, or the lack of them over redirect/stub chains. A stale leaf points (via the chain) at a live tuple whose current key differs from the leaf's stored key. If the page holding that chain were marked all-visible, an IOS would take the VM fast path, skip the heap fetch, and return the leaf's stale key as the answer — wrong results, with no opportunity to detect the problem. SIU prevents this from the prune side: a page that carries anything a stale leaf can still resolve through — a preserved live HEAP_INDEXED_UPDATED member, an LP_REDIRECT that forwards into one, or a collapse-survivor stub — is deliberately kept out of the visibility map (prune forces set_all_visible = false for it; see heap_prune_record_redirect and the stub recorders, with the same guard re-applied in heap_page_would_be_all_visible). So a page that could surface a stale entry is never all-visible, and an IOS over it is forced down the heap-fetch path below, where staleness can be detected. Conversely, a page that genuinely is all-visible cannot hold a live SIU chain, so the VM fast path stays correct and needs no extra check.

(2) Re-checking the entry against the live tuple during the scan. Once forced to fetch the heap, the IOS still intends to return values from xs_itup, not from the heap tuple it just read. That is exactly where a stale leaf would do damage: xs_itup holds the old key, so returning it would surface a value the live row no longer has. The scan therefore re-checks, per entry, whether the leaf it arrived through is stale for this index before trusting xs_itup. For the read path that check is the crossed-attribute bitmap test, not a literal key comparison: the chain walk has accumulated the union of the modified-attrs bitmaps it crossed, and the index-access layer sets xs_hot_indexed_stale iff that union overlaps this index's columns. If stale, IOS drops the entry (ExecClearTuple + continue); the fresh entry the same update planted returns the row with correct values via its own path.

IOS for an indexed column, entry arrived via a stale leaf:
  VM_ALL_VISIBLE(page)? -> no (prune kept the SIU page out of the VM)
  index_fetch_heap()    -> walks the chain, accumulates crossed = {changed attrs}
  xs_hot_indexed_stale  -> crossed overlaps this index's columns => true
  => drop the entry (do NOT return xs_itup's stale key); the fresh entry serves it.

Why a bitmap test and not an actual key comparison here: the read path is access-method agnostic and must not depend on reconstructing or comparing keys (that would defeat the whole design and would not work uniformly across AMs). The one place SIU does compare keys is the unique-insert check (_bt_check_unique), where a missed conflict is corruption and the bitmap's lossy "something changed" verdict is not enough (see Appendix A); that is a write-side correctness gate, not the IOS read path.
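The two index-only-scan cases can be summarized as one decision function. This is a sketch under stated assumptions: index_only_scan_entry is a hypothetical name, heap_fetch is a stand-in callable that returns the union of crossed bitmaps, and tuple visibility under the snapshot is not modelled.

```python
def index_only_scan_entry(page_all_visible, leaf_key, heap_fetch, index_cols):
    """Return the key to emit for one index entry, or None to drop it.

    heap_fetch is only called when the heap has to be visited; it returns
    the set of attributes crossed on the walk to the live tuple.
    """
    if page_all_visible:
        # VM fast path. Safe only because prune keeps any page that a stale
        # leaf can resolve through out of the visibility map.
        return leaf_key
    crossed = heap_fetch()
    if crossed & set(index_cols):
        return None      # stale for this index: do not surface the old key
    return leaf_key      # current: the leaf's stored key can be trusted


# A stale leaf on a page that prune kept out of the VM is dropped.
print(index_only_scan_entry(False, "active", lambda: {"status"}, {"status"}))
```

The fast path never inspects a bitmap, which is why the correctness of index-only scans rests on the prune-side rule rather than on an extra read-side check.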

Step 4: Prune Collapses Dead Chains to Xid-Free Stubs

A dead mid-chain HOT-indexed tuple cannot be reclaimed to LP_UNUSED while a not-yet-swept stale entry can still arrive at it, and its bitmap is what later readers union. Prune collapses a dead prefix: each preserved dead key tuple is rewritten in place as an xid-free stub (LP_NORMAL, HEAP_INDEXED_UPDATED, natts == 0, frozen XMIN/XMAX_INVALID, t_ctid.offnum forwarding to the next survivor, carrying the same inline bitmap); a dead member whose attributes are wholly subsumed by later hops is reclaimed outright instead. The root becomes an LP_REDIRECT to the first survivor. Readers step through stubs transparently and still cross every surviving hop's bitmap.

The collapse rides the existing prune/freeze WAL; no new record type. Once VACUUM's index cleanup has swept the stale leaves and the whole chain is dead, a later prune reclaims the stubs and re-points the root redirect, collapsing back to classic HOT.

Reclamation is amortized onto reads via the existing opportunistic heap_page_prune_opt sites, which prune a heap page while it is pinned for a scan rather than deferring everything to VACUUM. The sequential, bitmap, and index-fetch scan paths already do this; this work adds the same to heap_fetch, the TID-addressed fetch used by TID scans, EvalPlanQual rechecks, and RI/trigger refetches. Those fetches land directly on the page holding a recently-updated row -- disproportionately likely to carry dead versions and collapse-survivor stubs -- yet previously pruned nothing. heap_page_prune_opt is self-gating, so the added call is nearly free when there is nothing to reclaim.

---

Why This Is Correct

The correctness argument rests on the crossed-attribute bitmap being the staleness authority, plus the placement of fresh entries:

1. Exactly one entry per index resolves the row. A fresh entry points at the heap-only tuple whose key it matched, so its walk to the live tuple crosses no later hop that changed its index's key — the union is disjoint and it is kept. A stale entry's walk does cross such a hop — the union overlaps and it is dropped. No duplicates, no lost rows.

2. ABA is handled. If an indexed value is cycled away and back (X → Y → X), the live tuple's key equals both the ancestor entry's key and the fresh entry's key. A value recheck would keep both and return the row twice; the bitmap drops the ancestor because the column changed after it, regardless of the coincident value (worked trace below).

3. The union is complete. Every crossed live hop and every collapse-survivor stub contributes its bitmap, and collapse reclaims a dead member only when its attributes are a subset of the surviving later hops — so a reader crossing the survivors still sees every collapsed hop's attributes. Disjointness therefore reliably means current.

4. Unique checks compare values with the opclass comparator. _bt_check_unique fetches the conflicting tuple under SnapshotDirty and, when the chain walk crossed a HOT-indexed hop, compares its live index key against the arriving leaf using the index's own ordering procedure (_bt_heap_keys_equal_leaf, BTORDER_PROC under each column's collation). Using the opclass comparator — not a bitwise comparison — recognizes a cycled key as the same logical row while still detecting a genuinely live duplicate that is opclass-equal but not bitwise-identical (numeric 1.0 vs 1.00, float -0.0 vs 0.0). The recheck is required for correctness, not just an optimization: in the in-flight window of a restoring (Y→X) update the fresh X entry is not yet inserted, so the stale ancestor leaf is the only witness of the conflict, and the recheck routes that hit into _bt_doinsert's xwait wait-and-recheck (a bitmap-only verdict would skip it and admit a duplicate). It reads plain key columns straight from the heap slot and evaluates no indexed expression: an UPDATE touching an expression-index attribute is disqualified from HOT-indexed (HeapUpdateHotAllowable), so an expression index is never the one receiving the fresh entry whose insert runs this check. This helper is internal to nbtree and is not on the read path.

5. A stub-bearing page is never PD_ALL_VISIBLE, so the all-visible seqscan and index-only-scan fast paths cannot surface stub bytes as a phantom row; amcheck enforces this as an on-disk invariant. WAL replay is verified byte-for-byte under wal_consistency_checking, including the collapse/stub records.
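Item 4 distinguishes opclass equality from bitwise identity. The snippet below is only an analogy in Python, not PostgreSQL code: Python's == plays the role of the opclass comparator, while the packed bytes and the Decimal digit tuple stand in for the on-disk representation. The helper names are hypothetical.

```python
import struct
from decimal import Decimal


def bitwise_equal_float(a: float, b: float) -> bool:
    """Compare the IEEE-754 byte images, as a bitwise key comparison would."""
    return struct.pack("<d", a) == struct.pack("<d", b)


def bitwise_equal_decimal(a: Decimal, b: Decimal) -> bool:
    """Compare sign, digits, and exponent, which differ for 1.0 vs 1.00."""
    return a.as_tuple() == b.as_tuple()


# Equal under the value comparator, yet not bitwise identical.
print(-0.0 == 0.0, bitwise_equal_float(-0.0, 0.0))
print(Decimal("1.0") == Decimal("1.00"),
      bitwise_equal_decimal(Decimal("1.0"), Decimal("1.00")))
```

A uniqueness check built on the bitwise helpers would treat these pairs as distinct keys and admit a duplicate, which is the failure the comparator-based recheck is meant to avoid.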

---

Worked Example: Selective Maintenance, ABA, and Chain Collapse

Notation follows the in-tree README and RFC: LP[n] is the line pointer at offset n; {a,b} is a modified-attrs bitmap; ->k is the same-page t_ctid successor; a tuple's bitmap = the attrs that changed on the hop into it.

Example 1: selective maintenance and a stale drop

Table t(id PK, a, b, c), indexes t_a(a), t_b(b), t_c(c), fillfactor 50. INSERT (1,10,20,30); UPDATE a=11; UPDATE b=21; UPDATE c=31.

Chain:
  LP[1] v1(a=10,b=20,c=30)  root, HEAP_HOT_UPDATED, ->2          dead
  LP[2] v2(a=11,b=20,c=30)  heap-only, INDEXED_UPDATED{a}, ->3   dead
  LP[3] v3(a=11,b=21,c=30)  heap-only, INDEXED_UPDATED{b}, ->4   dead
  LP[4] v4(a=11,b=21,c=31)  heap-only, INDEXED_UPDATED{c}        live

Index entries (fresh entries point mid-chain at the tuple they matched):
  t_a:  (10)->LP[1] stale     (11)->LP[2] fresh
  t_b:  (20)->LP[1] stale     (21)->LP[3] fresh
  t_c:  (30)->LP[1] stale     (31)->LP[4] fresh

Scan a=11 via t_a -> LP[2]:
  arrive AT LP[2] (own hop {a} not counted); cross ->3 {b}, ->4 {c}.
  crossed={b,c};  t_a keys={a};  {a} & {b,c} = {}  => fresh => return v4.  OK
Scan a=10 via t_a -> LP[1] (stale):
  cross ->2 {a}, ->3 {b}, ->4 {c};  crossed={a,b,c};  {a}&{a,b,c}={a} => drop.  OK
  (v4 is supplied once, by the fresh (11)->LP[2] entry.)

Example 2: ABA — the case a value recheck gets wrong

INSERT (1,a=10); UPDATE a=11; UPDATE a=10. (a cycles 10 → 11 → 10)

  LP[1] v1(a=10) root ->2 dead
  LP[2] v2(a=11) {a} ->3 dead
  LP[3] v3(a=10) {a}     live
  t_a:  (10)->LP[1] stale     (11)->LP[2] stale     (10)->LP[3] fresh

Scan a=10 finds TWO entries with key 10 (LP[1] and LP[3]):
  via LP[3]: zero hops crossed => fresh => return v3.                  OK
  via LP[1]: cross ->2 {a}, ->3 {a}; crossed={a}; {a}&{a}={a} => drop.  OK
Returned exactly once.  A value recheck would compare leaf key 10 against live
a=10 for BOTH entries and keep both -> duplicate.  The bitmap drops the
ancestor because a *changed* after LP[1], regardless of the coincident value.
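The difference between the two staleness rules in Example 2 can be reproduced with a small model. This is a sketch only: the chain and entries are hard-coded from the example, and both scan functions are hypothetical simplifications that ignore visibility.

```python
# Chain from Example 2: (value of a, attrs changed on the hop into the tuple).
CHAIN = [(10, set()), (11, {"a"}), (10, {"a"})]   # LP[1], LP[2], LP[3]
# t_a leaf entries: (leaf key, position in CHAIN of the tuple it points at).
ENTRIES = [(10, 0), (11, 1), (10, 2)]
LIVE_A = CHAIN[-1][0]                             # live tuple has a=10


def scan_with_value_recheck(key):
    """Keep an entry when its leaf key equals the live tuple's key."""
    return [LIVE_A for leaf_key, _ in ENTRIES
            if leaf_key == key and leaf_key == LIVE_A]


def scan_with_bitmap(key):
    """Keep an entry when no hop crossed after its tuple changed column a."""
    rows = []
    for leaf_key, pos in ENTRIES:
        if leaf_key != key:
            continue
        crossed = set().union(*(changed for _, changed in CHAIN[pos + 1:]))
        if "a" not in crossed:
            rows.append(LIVE_A)
    return rows


print(scan_with_value_recheck(10))  # the row comes back twice
print(scan_with_bitmap(10))         # the row comes back once
```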

Example 3: collapse to xid-free stubs (back toward classic HOT)

From Example 1, prune (on access or in VACUUM's first pass) finds LP[1..3] dead and LP[4] live. It walks the dead prefix from the live end, accumulating the union of later hops (laterattrs): a member is reclaimed if its bitmap is a subset of the later hops (its entries are already superseded); otherwise it is kept as a stub.

  seed laterattrs from the live remainder LP[4]: {c}.
  LP[3] {b}: {b} not-subset {c}   -> keep as stub forwarding ->4.  laterattrs={b,c}.
  LP[2] {a}: {a} not-subset {b,c} -> keep as stub forwarding ->3.  laterattrs={a,b,c}.
  LP[1] root -> LP_REDIRECT ->2 (first survivor).

Result:
  LP[1] redirect ->2
  LP[2] stub{a}  forward ->3      (xid-free, natts==0)
  LP[3] stub{b}  forward ->4      (xid-free, natts==0)
  LP[4] live v4

Scan a=11 via t_a (11)->LP[2]:
  arrive AT LP[2] stub (own segment {a} not counted); forward ->3 stub {b},
  ->4 {c};  crossed={b,c};  {a}&{b,c}={} => fresh => return v4.        OK
Scan a=10 via t_a (10)->LP[1] redirect ->2:
  follow redirect to LP[2] (now a crossed segment) {a}, ->3 {b}, ->4 {c};
  crossed={a,b,c};  {a}&{a,b,c}={a} => stale => drop.                 OK

There is no "redirect-with-data": the bitmap lives on the stub itself, not on the redirect. Once every entry into the chain is swept by ambulkdelete and the whole chain is dead, a later prune reclaims the stubs to LP_UNUSED and re-points the root redirect straight at the live tuple. The page is then back to classic HOT, with no metadata remaining.
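The keep-or-reclaim walk in Example 3 can be sketched as follows. This is a model of the rule as stated in the text, not the prune code: collapse_dead_prefix is a hypothetical name, bitmaps are Python sets, and the root is assumed to carry no bitmap and to become a plain redirect.

```python
def collapse_dead_prefix(dead_members, live_lp, live_attrs):
    """dead_members: [(lp, bitmap)] for dead heap-only tuples, in chain order.

    live_attrs seeds laterattrs from the live remainder of the chain.
    """
    later_attrs = set(live_attrs)
    stubs, reclaimed = [], []
    # Walk from the live end back toward the root.
    for lp, bitmap in reversed(dead_members):
        if bitmap <= later_attrs:
            reclaimed.append(lp)      # subsumed by later hops: reclaim
        else:
            stubs.append(lp)          # still needed by readers: keep as stub
            later_attrs |= bitmap
    stubs.reverse()
    survivors = stubs + [live_lp]
    # Each stub forwards to the next survivor; the root redirects to the first.
    forward = {lp: survivors[i + 1] for i, lp in enumerate(stubs)}
    return {"root_redirect": survivors[0],
            "stub_forward": forward,
            "reclaimed": sorted(reclaimed)}


# Example 1's chain: LP[2] {a}, LP[3] {b} dead; LP[4] {c} live.
print(collapse_dead_prefix([(2, {"a"}), (3, {"b"})], 4, {"c"}))
```

Applied to the ABA chain from Example 2, the same rule reclaims LP[2] outright, because its bitmap {a} is a subset of the live hop's {a}.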

Example 3a: prune and vacuum, step by step

A fuller trace of the same chain that separates what prune does (the collapse) from what VACUUM does (the index sweep and the final reclaim). Note there is no "redirect-with-data": the root becomes a plain LP_REDIRECT and the per-hop bitmaps live on the stubs, which the reader crosses one by one.

Table siu_collapse(id, a, b, c) with indexes siu_coll_a(a), siu_coll_b(b), siu_coll_c(c): INSERT (1,10,20,30); UPDATE a=11; UPDATE b=21; UPDATE c=31;

(0) Chain after the three HOT-indexed updates, before any prune. Each new version is a heap-only tuple carrying the bitmap of what changed at its hop; each changed index got a fresh entry at the new tuple's own TID, and the pre-update entries remain (now stale).

  LP[1] v1(a=10,b=20,c=30)  root, HEAP_HOT_UPDATED, ->2     dead
  LP[2] v2(a=11,b=20,c=30)  heap-only, {a}, ->3            dead
  LP[3] v3(a=11,b=21,c=30)  heap-only, {b}, ->4            dead
  LP[4] v4(a=11,b=21,c=31)  heap-only, {c}                 live

  siu_coll_a:  (10)->LP[1] stale     (11)->LP[2] fresh
  siu_coll_b:  (20)->LP[1] stale     (21)->LP[3] fresh
  siu_coll_c:  (30)->LP[1] stale     (31)->LP[4] fresh

(1) PRUNE collapses the dead prefix (on-access via heap_page_prune_opt, or in VACUUM's first pass). It finds LP[1..3] dead, LP[4] live, and walks from the live end accumulating laterattrs (the union of later hops):

  seed laterattrs = LP[4] {c}
  LP[3] {b}: {b} not subset of {c}     -> keep as stub ->4;  laterattrs={b,c}
  LP[2] {a}: {a} not subset of {b,c}   -> keep as stub ->3;  laterattrs={a,b,c}
  LP[1] root                           -> LP_REDIRECT ->2 (first survivor)

A dead member is reclaimed outright (LP_DEAD) instead of stubbed only when its bitmap is a subset of the later hops — then no live entry references it and a later survivor still carries its attributes. Here none qualify, so all three are kept. Result:

  LP[1] redirect ->2
  LP[2] stub {a}  forward ->3   (xid-free: XMIN/XMAX_INVALID, natts==0)
  LP[3] stub {b}  forward ->4   (xid-free)
  LP[4] live v4

The page is kept non-all-visible while a stub remains, so index-only scans heap-fetch through it. The stale and fresh leaves still point where they did; only the heap changed.

(2) Reads against the collapsed page:

Query a=11 via siu_coll_a, fresh entry (11)->LP[2]:
  arrive AT LP[2] stub (its own {a} is the entry's own hop, not counted);
  cross ->3 {b}, ->4 {c};  crossed={b,c};  key {a};  {a}&{b,c}={}
  => current => return v4.                                         OK
Query b=21 via siu_coll_b, fresh entry (21)->LP[3]:
  arrive AT LP[3] stub; cross ->4 {c};  crossed={c};  {b}&{c}={}
  => current => return v4.                                         OK
Query a=10 via siu_coll_a, STALE entry (10)->LP[1]:
  LP[1] is a plain redirect -> follow to LP[2]; now crossing the collapsed
  segment: LP[2] {a}, ->3 {b}, ->4 {c};  crossed={a,b,c};  {a}&{a,b,c}={a}
  => stale => drop (v4 is supplied once, by the fresh (11)->LP[2] entry).  OK

(3) VACUUM index cleanup (ambulkdelete) removes the now-removable stale leaves (10/20/30 -> LP[1]); kill_prior_tuple and bottom-up deletion also remove them opportunistically. VACUUM's heap second pass (lazy_vacuum_heap_page) does NOT collapse or re-point anything; it only turns LP_DEAD line pointers into LP_UNUSED.

(4) Final reclaim. Once every entry into the chain has been swept and the whole chain is dead, a later PRUNE reclaims the stubs to LP_UNUSED and re-points the root redirect straight at the live tuple:

  LP[1] redirect ->4      (or reclaimed if no entry references the root)
  LP[2] LP_UNUSED
  LP[3] LP_UNUSED
  LP[4] live v4

No SIU metadata remains on the page; it is indistinguishable from a classic-HOT chain that has been pruned.

Bitmap sizing across DDL

The inline bitmap is ceil(natts/8) bytes, sized by the tuple's own natts at write time, not the relation's current natts. ADD COLUMN raises the relation's natts without rewriting existing tuples, so a chain can hold hops sized for different natts; the sharp case is crossing an 8-attribute boundary, where the byte count grows. Every consumer locates a hop's bitmap from that hop's own write-time natts (HotIndexedTupleBitmapNatts: HeapTupleHeaderGetNatts for a live tuple, the stub's stashed natts otherwise — a stub keeps its write-time natts in the unused block half of t_ctid, since the offset half is the forward link).

t(c1 PK,...,c7, payload)  -- exactly 8 attrs; t_c2(c2), t_c7(c7)
  UPDATE c7=71; UPDATE c7=72;        chain bitmaps are 1 byte (natts=8)
  ALTER TABLE t ADD COLUMN c9 int;   relation natts 8 -> 9; ceil 1 -> 2
  UPDATE c7=73;                      this hop's bitmap is 2 bytes (natts=9)

  LP[1] v1(c7=70) root ->2 dead   (1-byte)
  LP[2] v2(c7=71) {c7} ->3 dead   (1-byte)
  LP[3] v3(c7=72) {c7} ->4 dead   (1-byte)
  LP[4] v4(c7=73) {c7}     live   (2-byte)

Scan an unchanged c2 via t_c2 -> LP[1] (not stale): cross ->2,->3,->4, each located
  by its own write-time natts; crossed={c7}; {c2}&{c7}={} => current => return v4.
  Sizing LP[2,3] with the relation's *current* natts=9 would misread a data
  byte as bitmap and could wrongly drop the current c2 entry; per-hop sizing
  avoids it.  OK

DROP COLUMN keeps the attnum slot (it never renumbers), so bit positions and natts are unchanged and existing bitmaps stay aligned. CREATE INDEX/REINDEX over a live chain indexes each live tuple under its own TID, so the new entries cross no later hop and are never stale.
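The per-hop sizing rule can be illustrated with a byte-level sketch. The bit layout here (attribute n stored at bit n-1, least significant bit first) is an assumption made for illustration, and bitmap_len, encode_bitmap, and read_bitmap are hypothetical helpers rather than the patch's functions.

```python
def bitmap_len(natts: int) -> int:
    """ceil(natts / 8) bytes."""
    return (natts + 7) // 8


def encode_bitmap(changed_attnums, natts: int) -> bytes:
    """Build the trailing bitmap for 1-based attribute numbers."""
    buf = bytearray(bitmap_len(natts))
    for attnum in changed_attnums:
        bit = attnum - 1
        buf[bit // 8] |= 1 << (bit % 8)
    return bytes(buf)


def read_bitmap(item: bytes, natts_at_write: int) -> set:
    """Decode the bitmap from the tail of an item, sized by write-time natts."""
    tail = item[-bitmap_len(natts_at_write):]
    return {i + 1 for i in range(natts_at_write)
            if tail[i // 8] >> (i % 8) & 1}


# A hop written when the table had 8 attributes and c7 changed; the byte
# before the bitmap is ordinary attribute data.
item = b"\x01\x02\xff" + encode_bitmap({7}, natts=8)
print(read_bitmap(item, 8))   # sized by the hop's own natts
print(read_bitmap(item, 9))   # wrongly sized by the relation's current natts
```

Reading with natts=9 pulls the preceding data byte into the bitmap, so attribute 2 appears changed and a current t_c2 entry would be dropped; sizing by the hop's own natts avoids that.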

---

Testing

The feature is covered by a regression suite, an adversarial isolation spec, a crash-recovery TAP test, replication and logical-decoding tests, a pg_upgrade test, and the amcheck/pg_surgery guards (all passing with cassert=true):

Regression (hot_indexed_updates.sql): eligibility and classification; selective maintenance across multiple/composite indexes; the crossed-attribute read path for equality and range scans; a key cycled away and back (ABA); TOASTed indexed columns; partial-index predicate flips; non-btree access methods (hash incl. ABA, GIN, GiST); CREATE INDEX/REINDEX over an existing chain; and DDL after a chain exists (CREATE/DROP INDEX, ADD COLUMN crossing a bitmap-size boundary, DROP COLUMN).

Isolation (hot_indexed_adversarial.spec): concurrent UPDATE/VACUUM/prune and index scans, key cycling, aborts, and reader consistency across a concurrent collapse.

Recovery (054_hot_indexed_recovery.pl): crash + WAL replay of chains and stub collapse, byte-identical under wal_consistency_checking = 'all'.

Replication (039_hot_indexed_apply.pl, 040_hot_indexed_replica_identity.pl, test_decoding): the per-subscription apply modes, replica identity FULL and USING INDEX (incl. a cycled USING INDEX key), and logical decoding over chains.

pg_upgrade (009_hot_indexed.pl): a relation with chains, an ABA-cycled column, a TOASTed indexed column, and VACUUM-collapsed stubs carried across a major-version upgrade and re-verified.

amcheck / pg_surgery: verify_heapam recognizes HOT-indexed tuples and stubs; pg_surgery refuses to freeze/kill a stub.

---

Scope

Supported and tested under any access method (the read path is access-method agnostic): selective maintenance across multiple indexes; partial indexes (including predicate flips); exclusion constraints, including a temporal PRIMARY KEY ... WITHOUT OVERLAPS (the exclusion recheck remains authoritative for conflict detection while the bitmap drops stale entries); summarizing indexes (BRIN, via the existing summarizing path); partitioned tables (a within-partition update is HOT-indexed on the leaf heap; a partition-key change is a cross-partition delete+insert and is never HOT); TOASTed indexed columns; and replica identity FULL / USING INDEX on the apply path.

Deliberately ineligible (carve-outs): system catalogs; an UPDATE that changes an expression-index input (the bitmap is attribute-granular and cannot tell whether the expression's value changed — expression-aware selective maintenance is deferred); an UPDATE that changes every indexed attribute (nothing to skip); and the logical-replication apply path unless permitted by hot_indexed_on_apply.

---

The Three Transitions: Where the Bitmap Lives

The modified-attrs bitmap is never persisted in WAL records or in a permanent tuple-header field. It is born inline on the heap-only tuple, is preserved (re-homed) on a stub when prune collapses the chain, and disappears entirely once the chain is fully dead. Here are the three transitions, in the same notation as the worked examples.

Transition 1: Bitmap inline on the new heap-only tuple (the UPDATE)

A HOT-indexed UPDATE links a heap-only tuple into the chain and appends, in the final ceil(natts/8) bytes of its item, the bitmap of attributes changed at this hop. A fresh index entry for each changed index points at this tuple's own TID.

Initial:
  LP[1] = T1(a=10,b=20,c=30)   root

After UPDATE c=31 (change {c}):
  LP[1] = T1   root, HEAP_HOT_UPDATED, ->2          (still live here)
  LP[2] = T2(a=10,b=20,c=31)   heap-only, INDEXED_UPDATED, inline bitmap {c}
  t_c gains a fresh entry (31) -> LP[2]; t_a, t_b untouched.

The bitmap is on the data-bearing tuple itself — there is no separate tombstone line pointer, and nothing about it is written to WAL beyond the ordinary heap-update record that already carries the new tuple's bytes.

Transition 2: Bitmap re-homed on an xid-free stub (prune/collapse)

When the dead prefix is collapsed, each preserved dead key tuple is rewritten in place as an xid-free stub that keeps its own inline bitmap and forwards to the next survivor; a member whose attributes are wholly subsumed by later hops is reclaimed instead. The root becomes a plain LP_REDIRECT.

After UPDATE c=31 {c}, UPDATE b=21 {b}, then T1,T2 die (T3 live):
  LP[1] = LP_REDIRECT ->2            (first survivor; no payload)
  LP[2] = stub, INDEXED_UPDATED, natts==0, bitmap {c}, forward ->3, frozen
  LP[3] = T3(a=10,b=21,c=31)   live, INDEXED_UPDATED, bitmap {b}

The bitmap did not move into the redirect (there is no "redirect-with-data"); it stays on the stub, which a reader crosses just like a live hop. The stub is XMIN/XMAX_INVALID, so it holds back nothing for freezing, and it preserves its write-time natts in the unused block half of t_ctid so the bitmap stays correctly sized after a later ADD COLUMN.

Transition 3: Bitmap gone (back to classic HOT)

Once ambulkdelete has swept every stale leaf and the whole chain is dead, a later prune reclaims the stubs to LP_UNUSED and re-points the root redirect straight at the live tuple.

  LP[1] = LP_REDIRECT ->3      (or reclaimed, depending on remaining refs)
  LP[2] = LP_UNUSED
  LP[3] = T3   live

No bitmap remains anywhere; the page is indistinguishable from a classic-HOT chain. The metadata was ephemeral throughout: inline on a tuple, then on a stub, then nothing — never encoded permanently in tuple headers or WAL, and never replicated as per-column metadata (a subscriber sees only ordinary UPDATE/prune records).

Eligibility

There is no cost heuristic and no GUC: any UPDATE that changes a non-summarizing indexed attribute is HEAP_SELECTIVE_INDEX_UPDATE unless a carve-out applies. HeapUpdateHotAllowable decides this from the modified_idx_attrs bitmap and the per-relation indexed-attribute set (RelationGetIndexedAttrs).

The carve-outs that force HEAP_UPDATE_ALL_INDEXES are:

Every indexed attribute changed. Nothing can be skipped, so a plain non-HOT update is cheaper (it avoids the chain walk and bitmap overhead). This is an exact test, not a percentage.

An expression-index input changed. The bitmap is attribute-granular and cannot tell whether the expression's value changed; expression-aware selective maintenance is deferred. (An earlier draft restricted partial indexes in the same way; this restriction is likewise expected to be liftable once tested.)

System catalogs. A catalog UPDATE that changes a non-summarizing indexed attribute follows the classic rules and never takes the HOT-indexed path: catalogs are reached through access paths (systable scans, SnapshotDirty unique checks) not yet proven safe.

The logical-replication apply path. This is gated per subscription by hot_indexed_on_apply (off / subset_only (default) / always): a HOT-indexed update of a replica-identity attribute leaves a stale leaf the apply worker's RI lookup must tolerate, which it does only when the indexed attributes are a subset of (or equal to, for off) the replica identity.

Everything else — multiple/partial/composite indexes, any access method, summarizing (BRIN) columns, exclusion constraints, partitioned tables, TOASTed columns — is eligible and tested (see Scope and Testing). An earlier value-recheck draft needed several extra restrictions (partial-index predicates, a chain-length cap); they were removed once the crossed-attribute bitmap became the staleness authority — the cap in particular never bounded anything (it measured from the chain tail) and growth is instead bounded naturally by the page and by prune/collapse.
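The decision described in this section can be sketched as a pure function over attribute bitmaps. This is a simplified model, not the body of HeapUpdateHotAllowable: ToyRelInfo, toy_hot_allowable, and the TOY_ result names are hypothetical, the apply-path carve-out is omitted, and the sketch assumes that expression-index inputs are included in the relation's indexed-attribute set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    TOY_CLASSIC_HOT,            /* no non-summarizing indexed attr changed */
    TOY_SELECTIVE_INDEX_UPDATE, /* HOT-indexed: maintain only changed indexes */
    TOY_UPDATE_ALL_INDEXES      /* a carve-out applies: plain non-HOT update */
} ToyUpdateKind;

typedef struct ToyRelInfo
{
    uint64_t    indexed_attrs;    /* non-summarizing indexed attributes */
    uint64_t    expr_index_attrs; /* attributes feeding expression indexes */
    bool        is_catalog;       /* system catalog relation */
} ToyRelInfo;

/* Hypothetical decision function; bit (attnum - 1) represents attribute attnum. */
static inline ToyUpdateKind
toy_hot_allowable(const ToyRelInfo *rel, uint64_t modified_idx_attrs)
{
    uint64_t    changed = modified_idx_attrs & rel->indexed_attrs;

    assert((rel->expr_index_attrs & ~rel->indexed_attrs) == 0);

    if (changed == 0)
        return TOY_CLASSIC_HOT;         /* nothing indexed changed */
    if (rel->is_catalog)
        return TOY_UPDATE_ALL_INDEXES;  /* catalogs never go HOT-indexed */
    if (changed & rel->expr_index_attrs)
        return TOY_UPDATE_ALL_INDEXES;  /* expression input changed */
    if (changed == rel->indexed_attrs)
        return TOY_UPDATE_ALL_INDEXES;  /* exact test: every indexed attr changed */
    return TOY_SELECTIVE_INDEX_UPDATE;
}
```

The checks are ordered so that each carve-out is an exact bitmap test; no cost estimate or threshold is involved, which matches the "no cost heuristic and no GUC" statement above.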

Why Previous Efforts Failed

WARM (Write Amplification Reduction Method)

The WARM proposal by the author of HOT had an "alternating update" model: tuples in a chain would alternate between being "hot updatable" (new versions can be added without index updates) and "cold" (fully indexed). The idea: on write-heavy workloads, a series of updates would alternate, some skipping indexes and some updating them. Indexes maintain references to the root LP, not mid-chain.

Why it failed:

Unpredictable benefit: Whether alternation helps depends on access patterns. A read-heavy query might hit a "cold" tuple and still need index scans. A write-heavy workload might be updating the same tuple repeatedly, making the alternating state machine ineffective.

Operator confusion: In production, having some index entries be fresh and others stale is hard to reason about. When does alternation trigger? When does it not? The state machine adds cognitive overhead.

Incomplete solution: WARM didn't address what happens when the chain is pruned or when readers encounter stale entries. Does alternation survive VACUUM? Can readers rely on it?

The community decided the complexity wasn't justified by the unpredictable payoff. WARM was never committed.

PHOT (Partial HOT)

PHOT ("partial heap only tuples", Nathan Bossart, pgsql-hackers, 2021-02-09) tracked per-tuple which indexed columns were modified. When a new version skipped updating indexes, PHOT would record which columns actually changed. Readers would consult this metadata during index scans to validate stale entries.

Why it was promising but ultimately failed:

Metadata persistence problem: PHOT needed to track "which columns were modified" in LP_DEAD space on the page (identified during pruning) as well as in WAL records. This couples index metadata (which columns are indexed?) with tuple state (which columns changed?). When replicating to a subscriber with different indexes, this becomes intractable.

Chain walking after prune: PHOT required readers to walk HOT chains consulting per-tuple metadata. But if VACUUM prunes a chain member before all readers process it, readers might miss tuples or hit dangling pointers. PHOT deferred this problem explicitly, acknowledging the correctness gap as work in progress.

Concurrent prune/vacuum races: PHOT had no robust answer for two questions. What if VACUUM removes a chain member while a reader is traversing it? What if prune modifies metadata while readers consult it? The per-tuple metadata made concurrency reasoning harder, not easier.

WAL encoding complexity: Encoding per-tuple column-modification info in every WAL record bloats the log and couples the replication stream to index structure. Subscriber index changes break assumptions made at publication time.

Replication safety: Under logical replication, subscriber indexes can diverge from publisher indexes (drop index, add index, change index columns). PHOT had no mechanism to detect divergence or prevent corruption.

PHOT was ambitious but required solving too many hard problems simultaneously. The proposal stalled after it was posted in 2021.

Why HOT-indexed Updates Is Different

This work takes a different approach:

Metadata is ephemeral, not persistent. Like PHOT, we track which indexed columns were modified, but only inline on the heap-only tuple (and, after collapse, on its xid-free stub), never replicated across the chain or encoded permanently in WAL. A reader consults the bitmaps it crosses during chain traversal; the ordinary heap-update and prune records already carry everything needed, so the replication stream is not coupled to index structure.

Chain walking is deterministic and the bitmap is positional per hop. A fresh entry points at the tuple whose key it matched, so it crosses no later key-changing hop; a stale entry does. Staleness is the crossed-attribute overlap, not a value recheck, so an ABA cycle cannot fool it.

Collapse needs no "convert back" bookkeeping. Prune rewrites the dead prefix to stubs that preserve their bitmaps and forward to the next survivor; once the chain is fully dead VACUUM reclaims the stubs and re-points the root redirect. There is no alternating state machine and no per-tuple flag to clear.

Replication safety is explicit. We provide a per-subscription option, hot_indexed_on_apply, to control whether HOT-indexed updates are used on the apply side. Administrators choose: safe (off), compatible (subset_only, the default: on only when index sets match), or risky (always: requires manual sync). No silent corruption.
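The ABA point can be illustrated with a toy comparison of the two staleness tests. This is a deliberately simplified model: AbaHop and the aba_ functions are hypothetical, the "value recheck" is reduced to comparing the entry's key with the live tuple's key and is not WARM's actual code, and the sketch assumes every version on the chain received its own index entry because every hop changed column a.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ATTR_A (1u << 0)

typedef struct AbaHop
{
    int         a;              /* value of indexed column a in this version */
    uint32_t    changed;        /* attrs changed on arriving at this version */
} AbaHop;

/* Simplified value recheck: accept if the live (last) version has the entry's key. */
static inline bool
aba_value_recheck(const AbaHop *chain, int nhops, int entry_key)
{
    assert(nhops > 0);
    return chain[nhops - 1].a == entry_key;
}

/* Bitmap test: accept only if no hop after the entry's target changed an index attr. */
static inline bool
aba_bitmap_check(const AbaHop *chain, int nhops, int entry_pos,
                 uint32_t index_attrs)
{
    uint32_t    crossed = 0;

    assert(entry_pos >= 0 && entry_pos < nhops);
    for (int i = entry_pos + 1; i < nhops; i++)
        crossed |= chain[i].changed;
    return (crossed & index_attrs) == 0;
}

/* Count how many index entries with key scan_key are accepted by a scan. */
static inline int
aba_scan_matches(const AbaHop *chain, int nhops, int scan_key, bool use_bitmap)
{
    int         matches = 0;

    for (int i = 0; i < nhops; i++)
    {
        if (chain[i].a != scan_key)
            continue;           /* this version's entry has a different key */
        if (use_bitmap ? aba_bitmap_check(chain, nhops, i, ATTR_A)
            : aba_value_recheck(chain, nhops, chain[i].a))
            matches++;
    }
    return matches;
}
```

With the chain a=10, a=11, a=10, a scan for 10 finds two entries with that key. In this model the value recheck accepts both, because the live tuple again holds 10, so the same row is returned twice. The bitmap test rejects the entry that crosses the two {a} hops and accepts only the entry that points at the live version.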

---

Acknowledgements

This series builds directly on prior work by:

Pavan Deolasee and Tom Lane — classic HOT (2007), commit 282d2a03dd3.
Pavan Deolasee and Gokulakannan Somasundaram — original HOT design (2007).
Pavan Deolasee — WARM (2017), the structural template this series consciously moves away from: it replaces WARM's value recheck with an attribute-bitmap staleness test on mid-chain pointers.
Nathan Bossart — PHOT (2021), the structural template for "mid-chain pointers" that this series finishes.
Matthias van de Meent, Tomas Vondra, Josef Simanek, Álvaro Herrera — amsummarizing and the per-index update-decision relaxation (PostgreSQL 16, commit 19d8e2308bc), the first relaxation of HOT's I1 invariant. (This series replaces the TU_UpdateIndexes result code with an executor-side modified-attrs bitmap.)
Peter Geoghegan — indexUnchanged hint and bottom-up btree deletion (PostgreSQL 14, commit d168b666823) — the per-index hint mechanism this series builds on.
Álvaro Herrera — BRIN (PostgreSQL 9.5, commit 7516f525941) — the conceptual split between summarizing and per-tuple indexes.

Discussion

The pgsql-hackers thread for this proposal has not yet been started. When posted, the thread URL will be added here.

This wiki page is the design preview; the in-tree src/backend/access/heap/README.HOT-INDEXED is the authoritative reference and is kept current with the code.

Links and References

PostgreSQL

HOT wiki page — https://wiki.postgresql.org/wiki/HOT
PostgreSQL documentation: Heap-Only Tuples (HOT)
Commit 282d2a03dd3: HOT updates (Tom Lane, 2007-09-20)
Commit 7516f525941: BRIN: Block Range Indexes (Álvaro Herrera, 2014-11-07)
Commit d168b666823: Enhance nbtree index tuple deletion (Peter Geoghegan, 2021-01-13)
Commit 19d8e2308bc: Ignore BRIN indexes when checking for HOT updates (Tomas Vondra, 2023-03-20)

Prior proposals

WARM proposal on pgsql-hackers (2017) — Pavan Deolasee, EnterpriseDB
"partial heap only tuples" (PHOT) on pgsql-hackers (2021) — Nathan Bossart, Amazon

In-tree documentation

src/backend/access/heap/README.HOT-INDEXED — design reference
src/backend/access/heap/README.HOT — classic HOT reference (unchanged)
src/test/regress/sql/hot_indexed_updates.sql — regression tests
src/test/benchmarks/siu/ — A/B benchmark harness