Freezing/skipping strategies patch: motivating examples

From PostgreSQL wiki

This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger tables. This patch is proposed for Postgres 16, and builds on related work added to Postgres 15 (see Postgres 15 commits 0b018fab, f3c15cbe, and 44fa8488).

See also: CF entry for patch series

It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.

Note that the patch completely removes aggressive mode VACUUM. Antiwraparound autovacuums will still exist, but become much rarer. Antiwraparound autovacuums should only be needed in true emergencies with this work in place.

Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age

At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether a tuple XID on a page that it actually scans has attained an age exceeding vacuum_freeze_min_age. This means that all-visible pages left behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM (barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).
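
The current rule can be sketched as follows (a simplified illustrative model in Python, not the actual vacuumlazy.c logic; the data layout and function names are invented for this sketch):

```python
# Simplified model of what a non-aggressive VACUUM on Postgres HEAD will
# freeze.  Illustrative only: the real logic lives in vacuumlazy.c and
# tracks far more state (xmax, multixacts, hint bits, and so on).

VACUUM_FREEZE_MIN_AGE = 50_000_000  # default vacuum_freeze_min_age


def xid_age(xid, next_xid):
    """Age of an XID relative to the next XID to be assigned."""
    return next_xid - xid


def xids_frozen_by_nonaggressive_vacuum(pages, next_xid):
    """pages: list of (all_visible_in_vm, [tuple xmins]) pairs."""
    frozen = []
    for all_visible, xmins in pages:
        if all_visible:
            # All-visible pages are skipped outright, so their tuples are
            # never even considered for freezing -- however old they are.
            continue
        frozen.extend(x for x in xmins
                      if xid_age(x, next_xid) >= VACUUM_FREEZE_MIN_AGE)
    return frozen
```

Note how an arbitrarily old xmin on an all-visible page is simply never examined; only an aggressive mode VACUUM revisits such pages.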

The introduction of the visibility map in Postgres 8.4 meant that the mechanism that chooses what to freeze no longer reliably freezes XIDs once they attain an age exceeding the vacuum_freeze_min_age setting. Sometimes it actually does work in the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the confounding influence of the visibility map.

The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness can still make sense. It happens to be wildly inappropriate in certain kinds of tables, where VACUUM should prefer a much more proactive freezing cadence. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer to freeze eagerly rather than lazily) is an issue of central importance for the project.

Examples

Most of these examples show cases where VACUUM behaves lazily when it clearly should behave eagerly. Even an expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these.

Simple append-only

The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.

Today, on Postgres HEAD

Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):

 automatic vacuum of table "regression.public.pgbench_history": index scans: 0
 pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)
 tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable
 removable cutoff: 101761411, which was 767958 XIDs old when operation ended
 frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
 index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
 I/O timings: read: 0.004 ms, write: 0.000 ms
 avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s
 buffer usage: 43958 hits, 1 misses, 1 dirtied
 WAL usage: 21461 records, 1 full page images, 1274525 bytes
 system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s

Note that no pages were frozen. No VACUUM will ever freeze any pages, unless and until the VACUUM is aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is spread out evenly over time and across multiple VACUUM operations.

Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain unchanged) the timeline for when we'd freeze would stay exactly the same.

By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a table like this. But table age doesn't merely influence when we freeze all these pgbench_history pages in extreme cases -- in practice it is the sole factor that determines when it happens, including when there is almost no practical risk of the system entering xidStopLimit (i.e. almost always).

It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the Routine Vacuuming/the Autovacuum Daemon section of the docs (starting at "it may be beneficial to lower the table's autovacuum_freeze_min_age..."), but this relies on the DBA actually having a direct understanding of the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)

Patch

The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:

 automatic vacuum of table "regression.public.pgbench_history": index scans: 0
 pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)
 tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable
 removable cutoff: 149764963, which was 1759214 XIDs old when operation ended
 frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
 index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
 I/O timings: read: 0.004 ms, write: 0.000 ms
 avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s
 buffer usage: 96544 hits, 1 misses, 1 dirtied
 WAL usage: 47945 records, 1 full page images, 2837081 bytes
 system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s

The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default is 4GB.
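
A minimal sketch of this size cutoff (the GUC name and 4GB default come from the patch as described here; the block size constant and helper function are assumptions of the sketch):

```python
# Sketch of the proposed per-VACUUM choice of freezing strategy, keyed
# to table size.  Purely illustrative.

BLCKSZ = 8192  # assuming the default block size
VACUUM_FREEZE_STRATEGY_THRESHOLD = 4 * 1024**3  # proposed 4GB default


def use_eager_freeze_strategy(rel_pages):
    """Tables at or above the threshold switch from lazy to eager
    freezing: pages that are about to be set all-visible are frozen
    (and set all-frozen) instead."""
    return rel_pages * BLCKSZ >= VACUUM_FREEZE_STRATEGY_THRESHOLD
```

This matches the two log excerpts: at 477149 pages (about 3.6GB) nothing is frozen yet, while at 526836 pages (just over 4GB) freezing begins.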

Here is where the patch starts to diverge from HEAD:

 automatic vacuum of table "regression.public.pgbench_history": index scans: 0
 pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)
 tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable
 removable cutoff: 157563960, which was 1846970 XIDs old when operation ended
 frozen: 49665 pages from table (9.43% of total) had 7797405 tuples frozen
 index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
 I/O timings: read: 0.013 ms, write: 0.000 ms
 avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s
 buffer usage: 100044 hits, 2 misses, 2 dirtied
 WAL usage: 99331 records, 2 full page images, 21720187 bytes
 system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s

Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more autovacuums will be triggered that each look similar to the above example.
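
The second stage of the transition can be sketched the same way (again an illustrative model; the doubling rule is as described above, everything else is an assumption of the sketch):

```python
BLCKSZ = 8192  # assuming the default block size
VACUUM_FREEZE_STRATEGY_THRESHOLD = 4 * 1024**3  # proposed 4GB default


def use_eager_skipping_strategy(rel_pages):
    """Once the table reaches twice vacuum_freeze_strategy_threshold,
    VACUUM also scans all-visible (but not all-frozen) pages, so that
    relfrozenxid can be advanced eagerly."""
    return rel_pages * BLCKSZ >= 2 * VACUUM_FREEZE_STRATEGY_THRESHOLD
```

In the next log excerpt, 1078444 pages is just over 8GB, which is why that particular autovacuum is the first to advance relfrozenxid eagerly.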

Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advances relfrozenxid eagerly:

 automatic vacuum of table "regression.public.pgbench_history": index scans: 0
 pages: 0 removed, 1078444 remain, 561143 scanned (52.03% of total)
 tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable
 removable cutoff: 244160328, which was 32167955 XIDs old when operation ended
 new relfrozenxid: 99467843, which is 24578353 XIDs ahead of previous value
 frozen: 560841 pages from table (52.00% of total) had 88051825 tuples frozen
 index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
 I/O timings: read: 384.224 ms, write: 1003.192 ms
 avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s
 buffer usage: 906525 hits, 216130 misses, 476907 dirtied
 WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes
 system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s

This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (Note to self: might be a good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much later VACUUM -- even this seems a bit too close to aggressive VACUUM).

Update: v10 of the patch series avoids the freeze spike that you see here (here we show v9 behavior). In v10 the same catch-up process will only happen during some much later insert-driven autovacuum, when the pgbench_history table has become far larger (though eager freezing would kick in at the same point as in v9). The relative cost of catch-up freezing won't be too great under this improved scheme; we expect to have no more than about an extra 5% of rel_pages to freeze when it happens (far less than the 52% of rel_pages shown here on a percentage basis, though not necessarily in absolute terms).

From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:

 automatic vacuum of table "regression.public.pgbench_history": index scans: 0
 pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)
 tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable
 removable cutoff: 310431281, which was 5275067 XIDs old when operation ended
 new relfrozenxid: 310426654, which is 23032061 XIDs ahead of previous value
 frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen
 index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
 I/O timings: read: 0.013 ms, write: 0.009 ms
 avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s
 buffer usage: 294142 hits, 4 misses, 4 dirtied
 WAL usage: 293397 records, 4 full page images, 64139188 bytes
 system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s

In summary, we get perfect performance stability with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).

This last autovacuum doesn't have exactly the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = 134 pages scanned but left unfrozen). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)

VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.
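
The page-level trigger can be sketched like this (a simplification that ignores xmax, multixacts, and so on; `oldest_xmin` stands in for the VACUUM's removable cutoff):

```python
def freeze_page_eagerly(tuple_xmins, oldest_xmin):
    """Under the eager strategy, a page is frozen proactively only when
    every xmin precedes the removable cutoff -- i.e. exactly when the
    page could be set all-visible, so that freezing it lets VACUUM set
    it all-frozen in the VM instead."""
    return all(xmin < oldest_xmin for xmin in tuple_xmins)
```

A page with even one xmin at or past the cutoff is left unfrozen, which is the source of the small residue of unfrozen scanned pages in the log output above.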

Continual inserts with updates

The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at least after the benchmark is run for a little while, which is a requirement).

The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.

See also: BenchmarkSQL Timed Driver docs, written by Jan Wieck.

The orders are initially inserted by the new order transaction, which inserts rows into both tables in the obvious way. Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the benchmark never updates or deletes the same rows again. This could be described as an adversarial workload, because there is a relatively high number of updates around the same key space/physical heap blocks at the same time, but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the project, because two trends pull in opposite directions -- we want to freeze eagerly in some pages, but lazily in others. It's relatively difficult for the patch to infer which approach will work best at the level of each heap page.

Today, on Postgres HEAD

Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs here, before the point that an aggressive mode VACUUM is required (there are quite a few autovacuums that look similar to this one):

 ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG:  automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
 pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)
 tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable
 removable cutoff: 194441762, which was 7896967 XIDs old when operation ended
 frozen: 152106 pages from table (0.91% of total) had 3757982 tuples frozen
 index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed
 index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable
 I/O timings: read: 471323.673 ms, write: 23125.683 ms
 avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s
 buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied
 WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes
 system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s

The burden of freezing is almost completely borne by aggressive mode VACUUMs:

 ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG:  automatic aggressive vacuum of table "regression.public.bmsql_order_line": index scans: 1
 pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)
 tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable
 removable cutoff: 213459729, which was 25528936 XIDs old when operation ended
 new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value
 frozen: 10940612 pages from table (59.79% of total) had 640281019 tuples frozen
 index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed
 index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable
 I/O timings: read: 558603.794 ms, write: 61424.318 ms
 avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s
 buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied
 WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes
 system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s

Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to 0, we'll freeze a lot of tuples that are bound to be updated before long anyway.

It's possible that we'd see more freezing by vacuuming less often (which we could arrange by raising autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse: we do less freezing as a consequence of doing more vacuuming!

Patch

As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze, right away, any page containing only all-visible tuples, so that such pages are marked all-frozen in the VM instead of merely all-visible.

Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):

 ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG:  automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
 pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)
 tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable
 removable cutoff: 208354487, which was 7638618 XIDs old when operation ended
 new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value
 frozen: 2355230 pages from table (13.16% of total) had 138233294 tuples frozen
 index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed
 index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable
 I/O timings: read: 431408.682 ms, write: 29564.972 ms
 avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s
 buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied
 WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes
 system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s

Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach to freezing (2981204 scanned - 2355230 frozen = 625974 pages scanned but left unfrozen). Again, this is because we don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first pgbench_history example, but it was hardly noticeable there. Here, by contrast, even eager freezing opts to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular pages.

We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's OldestXmin/removable cutoff. So in practice XID age has absolutely minimal impact on how or when we freeze, in this example table/workload. While relfrozenxid advanced significantly less than it did in the earlier pgbench_history example, it nevertheless advanced by a huge amount by any traditional measure (in particular, by much more than the vacuum_freeze_min_age-based cutoff requires).

All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even much earlier autovacuums that took place when the table was much smaller show approximately the same percentage of pages scanned and frozen. So even as the table continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all, there is never any need for an aggressive mode vacuum that does practically all freezing.

This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as the cost of performance stability. We at least avoid the worst of it, by making freezing conditional on a page being eligible to be set all-visible, and by avoiding eager freezing altogether in smaller tables.

Scanned pages, visibility map snapshot

Independent of the issue of freezing and freeze debt, this example also shows how VACUUM tends to scan significantly fewer pages with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.
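
The contrast can be sketched as follows (an illustrative model in which the VM is a dict mapping block numbers to "all_visible", "all_frozen", or None; SKIP_PAGES_THRESHOLD really is 32 on HEAD, everything else is an assumption of the sketch):

```python
# HEAD consults the live VM as it goes, and only skips consecutive runs
# of all-visible pages of at least SKIP_PAGES_THRESHOLD blocks; short
# "islands" of all-visible pages get scanned anyway.
SKIP_PAGES_THRESHOLD = 32


def head_pages_to_scan(live_vm, nblocks):
    scan = []
    blk = 0
    while blk < nblocks:
        if live_vm.get(blk) is None:
            scan.append(blk)
            blk += 1
        else:
            run = 0
            while blk + run < nblocks and live_vm.get(blk + run) is not None:
                run += 1
            if run < SKIP_PAGES_THRESHOLD:
                scan.extend(range(blk, blk + run))  # run too short to skip
            blk += run
    return scan


def patch_pages_to_scan(vm_snapshot, nblocks, eager):
    """With a VM snapshot taken once at the start of VACUUM, the set of
    pages to scan is locked in; concurrent clearing of VM bits cannot
    grow it.  Lazy skipping skips every all-visible page; eager skipping
    only skips all-frozen pages (so relfrozenxid can be advanced)."""
    skippable = {"all_frozen"} if eager else {"all_frozen", "all_visible"}
    return [blk for blk in range(nblocks)
            if vm_snapshot.get(blk) not in skippable]
```

Note that the snapshot-based version has no run-length heuristic at all: the patch removes SKIP_PAGES_THRESHOLD entirely.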

In fact, the patch makes the percentage of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.

Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):

 pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)
 pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)
 pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)
 pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)
 pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)
 pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)
 pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)

And the same, for Postgres HEAD/master:

 pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)
 pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)
 pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)
 pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)
 pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)
 pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)

Mixed inserts and deletes

Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite). Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.

Today, on Postgres HEAD

(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).

Patch

Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a typical example:

 ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG:  automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1
 pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)
 tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable
 removable cutoff: 226132760, which was 33124 XIDs old when operation ended
 frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen
 index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed
 index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable
 I/O timings: read: 398.127 ms, write: 4.778 ms
 avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s
 buffer usage: 289436 hits, 931 misses, 10659 dirtied
 WAL usage: 138965 records, 13760 full page images, 86768691 bytes
 system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s

The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do continue to have some influence. But even with a table like this one, that requires very little if any freezing, the costs also matter.

Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum for this table), but with one or two key differences:

 ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG:  automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1
 pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)
 tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable
 removable cutoff: 227707625, which was 33391 XIDs old when operation ended
 new relfrozenxid: 190536853, which is 81808797 XIDs ahead of previous value
 frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
 index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed
 index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable
 I/O timings: read: 693.757 ms, write: 5.070 ms
 avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s
 buffer usage: 308003 hits, 18304 misses, 49 dirtied
 WAL usage: 137127 records, 2587 full page images, 25969243 bytes
 system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s

Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that takes place when the table's age is about 60% - 65% of the way to the point at which an antiwraparound autovacuum would be launched.

Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for choosing a vmsnap skipping strategy determines that the cost of advancing relfrozenxid now is low enough to be worth paying. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now (though not exceptionally cheap), and 2.) the fact that table age is starting to become somewhat of a concern (though certainly not to the extent that VACUUM is forced to advance relfrozenxid).
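
A hypothetical sketch of the kind of cost/benefit weighing described above (the linear trade-off and every name here are invented for illustration; the patch's actual heuristics differ in detail):

```python
def advance_relfrozenxid_eagerly(extra_scan_fraction, table_age,
                                 table_age_limit):
    """extra_scan_fraction: fraction of rel_pages that are all-visible
    but not all-frozen -- the added cost of advancing relfrozenxid.
    table_age / table_age_limit: how close the table is to requiring an
    antiwraparound autovacuum.  The older the table gets, the more added
    scan cost we're willing to tolerate."""
    return extra_scan_fraction <= table_age / table_age_limit
```

In the example above, table age at roughly 60% of the limit made a moderate (though not very low) extra scan cost acceptable.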

Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though perhaps not Postgres 15, which has the "use oldest extant XID for relfrozenxid" mechanism added by commit 0b018fab).

Opportunistically advancing relfrozenxid with bursty, real-world workloads

Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load. It's likely that there is considerable variation in how each table needs to be vacuumed, based on application characteristics. For example, a once-off bulk deletion is quite possible. The heuristics in play here will tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on that one occasion (though perhaps not exceptionally cheap), provided table age is already starting to be a concern. In other words, VACUUM has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively. Costs are a big part of the picture here, and have mostly been missing before now.

Constantly updated tables (usually smaller tables)

This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits 0b018fab, f3c15cbe, and 44fa8488) that the new patch series builds on. It's included here because it shows the continued relevance of lazy strategy freezing, and because it's a good illustration of just how little freezing may be required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.

One key observation behind the patch is that relfrozenxid generally has only a loose relationship with freeze debt, which is hard to predict but tends to be fairly fixed for a given table/workload. Understanding and exploiting that difference comes up again and again. It works both ways; sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case.

Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (could have used pgbench_tellers table just as easily):

 automatic vacuum of table "regression.public.pgbench_branches": index scans: 1
 pages: 0 removed, 145 remain, 103 scanned (71.03% of total)
 tuples: 3464 removed, 129 remain, 0 are dead but not yet removable
 removable cutoff: 340103448, which was 0 XIDs old when operation ended
 new relfrozenxid: 340100945, which is 377040 XIDs ahead of previous value
 frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
 index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed
 index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
 I/O timings: read: 0.000 ms, write: 0.000 ms
 avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
 buffer usage: 304 hits, 0 misses, 0 dirtied
 WAL usage: 209 records, 0 full page images, 20627 bytes
 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages. In practice we'll always see this in any table with similar workload characteristics. Since every tuple is updated before long anyway, no XID ever gets old enough to need to be frozen; every VACUUM notices this automatically, and it is reflected in the final relfrozenxid.

Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload) that require only a very small amount of freezing to advance relfrozenxid by a great many XIDs -- perhaps just a tiny amount of freezing with negligible cost. To some degree these sorts of scenarios justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM cannot ever notice that one particular table has these favorable properties without trying to advance relfrozenxid by some amount, and then noticing that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)