= PostgreSQL 16 Open Items =<br />
<br />
== Open Issues ==<br />
<br />
'''NOTE''': Please place new open items at the end of the list.<br />
<br />
'''NOTE''': If known, please list the Owner of the open item.<br />
<br />
* Switch to ICU for 17?<br />
** Owner: Jeff Davis<br />
** {{messageLink|82c4c816-06f6-d3e3-ba02-fca4a5cef065@enterprisedb.com|I suggest waiting until next week to commit it and then see what happens}}<br />
** [https://commitfest.postgresql.org/42/4169/ CF Entry]<br />
* {{messageLink|e587e2ee-7de0-88a2-10f8-c7cf001bab8c%40postgrespro.ru|psql: Add role's membership options to the \du+ command}}<br />
** [https://commitfest.postgresql.org/43/4116/ CF Entry]<br />
** NOTE: This is not a committed feature for v16<br />
* {{messageLink|874jp9f5jo.fsf@news-spur.riddles.org.uk|The rules for choosing default ICU locale seem pretty unfriendly}}<br />
** Owner: Jeff Davis<br />
* {{messageLink|ZEZDj1H61ryrmY9o@msg.df7cb.de|could not extend file "base/5/3501" with FileFallocate(): Interrupted system call}}<br />
** Owner: Andres Freund<br />
** Original commit: {{PgCommitURL|4d330a61bb1}}<br />
* {{messageLink|DFBB2D25-DE97-49CA-A60E-07C881EA59A7@winand.at|Inconsistent nulling bitmap in nestloop parameters}}<br />
** Owner: Tom Lane<br />
<br />
== Decisions to Recheck Mid-Beta ==<br />
<br />
* [https://www.postgresql.org/message-id/268fd337-8bb7-92e6-0da2-416c022c11f3%40enterprisedb.com Reconsider a utility_query_id GUC to control if query jumbling of utilities can go through the past string-only mode and the new mode?]<br />
** Potential owner: Michael Paquier<br />
<br />
== Older bugs affecting stable branches ==<br />
<br />
=== Live issues ===<br />
<br />
* [https://www.postgresql.org/message-id/flat/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com RecoveryConflictInterrupt() is unsafe in a signal handler]<br />
** This seems to [https://www.postgresql.org/message-id/447238.1651082925%40sss.pgh.pa.us explain buildfarm failures in 031_recovery_conflict.pl]<br />
** Affects all stable branches.<br />
<br />
* [https://www.postgresql.org/message-id/CAH2-WzkjjCoq5Y4LeeHJcjYJVxGm3M3SAWZ0%3D6J8K1FPSC9K0w%40mail.gmail.com REINDEX on a system catalog can leave index with two index tuples whose heap TIDs match]<br />
** In other words, there is a rare case where the HOT invariant is violated: the same HOT chain is indexed twice, due to confusion about which precise heap tuple should be indexed.<br />
** Unclear what the user impact is.<br />
** Affects all stable branches.<br />
<br />
* [https://www.postgresql.org/message-id/20201001021609.GC8476%40telsasoft.com memory leak with JIT inlining]<br />
** [https://www.postgresql.org/message-id/flat/20210331040751.GU4431%40telsasoft.com#cc34872765add8e483e05009212d9d39 Another report of (same?) issue and reproducer] [https://www.postgresql.org/message-id/flat/9f73e655-14b8-feaf-bd66-c0f506224b9e%40stephans-server.de Another report] [https://www.postgresql.org/message-id/flat/16707-f5df308978a55bf8%40postgresql.org Another report] [https://www.postgresql.org/message-id/flat/CAPH-tTxLf44s3CvUUtQpkDr1D8Hxqc2NGDzGXS1ODsfiJ6WSqA%40mail.gmail.com Another report] [https://www.postgresql.org/message-id/flat/a53cacb0-8835-57d6-31e4-4c5ef196de1a@deepbluecap.com Another report]<br />
<br />
* [https://www.postgresql.org/message-id/flat/dc9dd229-ed30-6c62-4c41-d733ffff776b%40xs4all.nl TOAST fetches could perhaps occur after the needed data has been removed]<br />
** The symptom originally reported in the thread was fixed by {{PgCommitURL|9f4f0a0dad4c7422a97d94e4051c08ec6d181dd6}}, but nobody is very happy with the status quo in this area. Do we need to do more now?<br />
** Affects all stable branches.<br />
<br />
* [https://www.postgresql.org/message-id/ZArVOMifjzE7f8W7%40paquier.xyz Requiring recovery.signal or standby.signal when recovering with a backup_label]<br />
** This is a rather old behavior that affects all stable branches, but still not something that should be backpatched as-is.<br />
<br />
* {{messageLink|cfcca574-6967-c5ab-7dc3-2c82b6723b99@mail.ru|pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum}}<br />
** {{messageLink|1649062270.289865713@f403.i.mail.ru|Thread with patch}} [https://commitfest.postgresql.org/43/3739/ CF Entry]<br />
<br />
* {{messageLink|1516594.1681482708@sss.pgh.pa.us|We are not compatible with newly-released LLVM 16}}<br />
** {{messageLink|CA%2BhUKGKNX_%3Df%2B1C4r06WETKTq0G4Z_7q4L4Fxn5WWpMycDj9Fw%40mail.gmail.com|Patch}}<br />
** Owner: Thomas Munro (volunteer LLVM API change chaser)<br />
<br />
* {{messageLink|20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de|DROP DATABASE is interruptible}}<br />
** Additional discussion: {{messageLink|01020187577238cf-da8c0f4a-3ab9-445a-8c74-31ef51439f30-000000%40eu-west-1.amazonses.com|"PANIC: could not open critical system index 2662" - twice}}<br />
<br />
=== Fixed issues ===<br />
<br />
* [https://www.postgresql.org/message-id/CAEze2WgGiw%2BLZt%2BvHf8tWqB_6VxeLsMeoAuod0N%3Dij1q17n5pw%40mail.gmail.com Non-replayable WAL records through overflows and >MaxAllocSize lengths]<br />
** In other words, we can write xlog records that we can't read (plus potentially actual WAL corruption), making the instance unrecoverable and blocking any replication.<br />
** Exploitation seems limited to WAL records of 2PC and logical replication, and extension-generated WAL.<br />
** Affects all stable branches.<br />
** Fixed at: {{PgCommitURL|8fcb32db98eda1ad2a0c0b40b1cbb5d9a7aa68f0}} and {{PgCommitURL|ffd1b6bb6f8a2ffc929699772610c6925364dbb3}} for HEAD.<br />
<br />
* [https://www.postgresql.org/message-id/flat/CAC+AXB26a4EmxM2suXxPpJaGrqAdxracd7hskLg-zxtPB50h7A@mail.gmail.com Fix fseek() detection of unseekable files on WIN32]<br />
** Fixed at: {{PgCommitURL|a923e21631a29dc8b8781d7d02b5003d0df64ca3}} and {{PgCommitURL|765f5df726918bcdcfd16bcc5418e48663d1dd59}}, down to 14.<br />
<br />
* {{messageLink|CAAKRu_bETD%2BAri600h6fRjX2p8rJSeMAUp%3D_y88juqOZgouTSg%40mail.gmail.com|Can't disable autovacuum cost delay through storage parameter}}<br />
** Fixed at: {{PgCommitURL|bfac8f8bc4a44c67c9f35b5266676278e4ba1217}}, down to 11.<br />
<br />
* {{messageLink|CAJ7c6TMBTN3rcz4%3DAjYhLPD_w3FFT0Wq_C15jxCDn8U4tZnH1g@mail.gmail.com| EPQ misbehaves for inherited/partitioned tables}}<br />
** Fixed at: {{PgCommitURL|70b42f279}}, down to 14.<br />
<br />
== Non-bugs ==<br />
<br />
* {{messageLink|17862-1ab8f74b0f7b0611@postgresql.org|WindowAgg startup costs don't take into account partition bound. Can lead to incorrect use of cheap startup plans}}<br />
** {{messageLink|CAApHDvrB0S5BMv+0-wTTqWFE-BJ0noWqTnDu9QQfjZ2VSpLv_g@mail.gmail.com|Patch to fix and discussion}}<br />
<br />
== Resolved Issues ==<br />
<br />
=== resolved before 16beta2 ===<br />
* {{messageLink|CAH2-Wz%3D8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg%40mail.gmail.com|Cleaning up nbtree after logical decoding on standby work}}<br />
** Owner: Peter Geoghegan, Andres Freund<br />
** Original commit: {{PgCommitURL|61b313e4}}<br />
** Fixed at: {{PgCommitURL|d088ba5a}}<br />
* {{messageLink|CAMbWs4_tuVn9EwwMcggGiZJWWstdXX_ci8FeEU17vs+4nLgw3w@mail.gmail.com|Assert failure and wrong query results due to incorrectly removing PHV}}<br />
** Owner: Tom Lane<br />
** Fixed at: {{PgCommitURL|9a2dbc614e6e47da3c49daacec106da32eba9467}}<br />
* {{messageLink|CAMbWs4-_vwkBij4XOQ5ukxUvLgwTm0kS5_DO9CicUeKbEfKjUw%40mail.gmail.com|Assert failure of the cross-check for nullingrels}}<br />
** Owner: Tom Lane<br />
** Original commit: {{PgCommitURL|2489d76c4}}<br />
** [https://commitfest.postgresql.org/43/4250/ CF Entry]<br />
** Fixed at: {{PgCommitURL|991a3df22}}<br />
<br />
=== resolved before 16beta1 ===<br />
* {{messageLink|CAHewXNnu7u1aT%3D%3DWjnCRa%2BSzKb6s80hvwPP_9eMvvvtdyFdqjw%40mail.gmail.com|ERROR: wrong varnullingrels (b 5 7) (expected (b)) for Var 3/3}}<br />
** Fixed at: {{PgCommitURL|d0f952691}}<br />
* {{messageLink|d46f9265-ff3c-6743-2278-6772598233c2%40pgmasters.net|Possible regression setting GUCs on \connect}}<br />
** Owner: Alexander Korotkov<br />
** Discussion on reverting {{PgCommitURL|096dd80f3}}<br />
** Original commit: {{PgCommitURL|096dd80f3}}<br />
** Reverted at: {{PgCommitURL|b9a7a822723aebb16cbe7e5fb874e5124745b07e}}<br />
<br />
* Planner makes improper clause pushdown decisions due to outer-join-aware-Vars changes<br />
** {{messageLink|0b819232-4b50-f245-1c7d-c8c61bf41827@postgrespro.ru|Clause accidentally pushed down}}<br />
** {{messageLink|CAHewXNks3w_Vy9CWoVtHx1XSaeiFpsOzh-zy5eu0Khp1PtG1sA@mail.gmail.com|wrong results due to qual pushdown}}<br />
** Original commit: {{PgCommitURL|2489d76c4}}<br />
** Fixed at: {{PgCommitURL|9df8f903eb6758be5a19e66cdf77e922e9329c31}}<br />
<br />
* Revert {{PgCommitURL|ec386948948}}, per {{messageLink|20230330105325.y6uvpalspynf2frt@alvherre.pgsql|Re: "variable not found in subplan target list"}}<br />
** Reverted at {{PgCommitURL|5472743d9e8}}<br />
<br />
* [https://www.postgresql.org/message-id/CAEZATCWETioXs5kY8vT6BVguY41_wD962VDk%3Du_Nvd7S1UXzuQ%40mail.gmail.com ERROR: ORDER/GROUP BY expression not found in targetlist]<br />
** Fixed at: {{PgCommitURL|da5800d5fa636c6e10c9c98402d872c76aa1c8d0}}<br />
<br />
* [https://www.postgresql.org/message-id/20230212233711.GA1316@telsasoft.com various elogs hit by sqlsmith (ExecRTCheckPerms() and many prunable partitions)]<br />
** Fixed at: {{PgCommitURL|c7468c73f7b6e842a53c12eaee5578a76a8fa7a6}}<br />
<br />
* [https://www.postgresql.org/message-id/20230228235834.GC30529@telsasoft.com pg_dump: zlib compression fails for empty objects (LOs)]<br />
** Fixed at: {{PgCommitURL|00d9dcf5bebbb355152a60f0e2120cdf7f9e7ddd}}<br />
<br />
* [https://www.postgresql.org/message-id/20230227044910.GO1653@telsasoft.com pg_dump: lz4 compression uses no persistent state and writes a block header for every row]<br />
** Fixed at: {{PgCommitURL|0070b66fef21e909adb283f7faa7b1978836ad75}}<br />
<br />
* {{messageLink|3590249.1680971629@sss.pgh.pa.us|Assertion failure with parallel full hash join}}<br />
** Fixed at: {{PgCommitURL|b37d051b0e59e4324e346655a27509507813db79}}<br />
<br />
* {{messageLink|ZDDO6jaESKaBgej0@tamriel.snowman.net|De-revert "Add support for Kerberos credential delegation"}}<br />
** Owner: Stephen Frost<br />
** Original commit: {{PgCommitURL|3d4fa227bce4294ce1cc214b4a9d3b7caa3f0454}}<br />
** Revert: {{PgCommitURL|3d03b24c350ab060bb223623bdff38835bd7afd0}}<br />
** De-Revert: {{PgCommitURL|6633cfb21691840c33816a6dacaca0b504efb895}}<br />
** Resolved at: {{PgCommitURL|f7431bca8b0138bdbce7025871560d39119565a0}}<br />
<br />
* {{messageLink|c39be3c5-c1a5-1e33-1024-16f527e251a4@enterprisedb.com|SSL tests break on non-existing system CA pool}}<br />
** Fixed at: {{PgCommitURL|0b5d1fb36adda612bd3d5d032463a6eeb0729237}}<br />
<br />
* {{messageLink|CAD21AoBS7o6Ljt_vfqPQPf67AhzKu3fR0iqk8B%3DvVYczMugKMQ%40mail.gmail.com|VacuumUpdateCosts() logging condition incorrect for some initial values of vacuum_cost_delay}}<br />
** Fixed at: {{PgCommitURL|a9781ae11ba2fdb44a3a72c9a7ebb727140b25c5}}<br />
<br />
* {{messageLink|CA%2BhUKGJ-ZPJwKHVLbqye92-ZXeLoCHu5wJL6L6HhNP7FkJ%3DmeA%40mail.gmail.com|check_strxfrm_bug()}}<br />
** Owner: Thomas Munro<br />
** Fixed at: {{PgCommitURL|7d3d72b55edd1b7552a9a358991555994efab0e9}}<br />
<br />
* {{messageLink|20230317230930.nhsgk3qfk7f4axls%40awork3.anarazel.de|Should we remove vacuum_defer_cleanup_age?}}<br />
** Owner: Andres Freund<br />
** Fixed at: {{PgCommitURL|1118cd37eb61e6a2428f457a8b2026a7bb3f801a}}<br />
<br />
* {{messageLink|2fefa454-5a70-2174-ddbf-4a0e41537139@gmail.com|Add two missing tests in 035_standby_logical_decoding.pl}}<br />
** Fixed at: {{PgCommitURL|376dc820531bafcbf105fff74c5b14c23d9950af}}<br />
** Fixed at: {{PgCommitURL|a6e04b1d20c2e9cece9b64bb5b36ebfdc3a9031b}}<br />
<br />
* {{messageLink|b32bed1b-0746-9b20-1472-4bdc9ca66d52@gmail.com|Performance regression due to SQLValueFunction removal}}<br />
** Fixed at: {{PgCommitURL|d8c3106bb60e4f87be595f241e173ba3c2b7aa2c}}<br />
<br />
* {{messageLink|20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de|pg_stat_io not tracking smgrwriteback() is confusing}}<br />
** Owner: Andres Freund<br />
** Fixed at: {{PgCommitURL|093e5c57d506783a95dd8feddd9a3f2651e1aeba}}<br />
<br />
* {{messageLink|ZFhCyn4Gm2eu60rB@paquier.xyz|Table data compression is broken with pg_dump --compress lz4}}<br />
** Owner: Tomas Vondra<br />
** Fixed at: {{PgCommitURL|1a05c1d252993b0a59c58a6daf91a2df9333044f}}<br />
<br />
* {{messageLink|94ae9bca-5ebb-1e68-bb7b-4f32e89fefbe@gmail.com|Valgrind unhappy with LZ4F code in pg_dump}}<br />
** Owner: Tomas Vondra<br />
** Fixed at: {{PgCommitURL|3c18d90f8907e53c3021fca13ad046133c480e4d}}<br />
<br />
* {{messageLink|20230509190247.3rrplhdgem6su6cg@awork3.anarazel.de|walsender performance regression due to logical decoding on standby changes}}<br />
** Owner: Andres Freund<br />
** Original commit: {{PgCommitURL|e101dfac}}<br />
** Fixed at: {{PgCommitURL|bc971f4025c378ce500d86597c34b0ef996d4d8c}}<br />
<br />
== Won't Fix ==<br />
<br />
* Is it OK that WL_SOCKET_ACCEPT is less fair on Windows than on Unix (and than the coding before 16) when there are multiple server sockets configured?<br />
** {{messageLink|CA%2BhUKG%2BA2dk29hr5zRP3HVJQ-_PncNJM6HVQ7aaYLXLRBZU-xw%40mail.gmail.com|WL_SOCKET_ACCEPT fairness on Windows}} has a (blind) patch to fix that, but would need a Windows hacker to test<br />
** Owner: Thomas Munro<br />
** Original commit: {{PgCommitURL|7389aad6}}<br />
** Issue reclassified as a non-critical improvement to be [https://commitfest.postgresql.org/43/4263/ considered for 17]<br />
<br />
== Important Dates ==<br />
<br />
Current schedule:<br />
<br />
* Beta 2: TBD<br />
* Beta 1: May 25, 2023<br />
* Feature Freeze: April 8, 2023 0:00 AoE ('''Last Day to Commit Features''')<br />
<br />
== See also ==<br />
<br />
* [[Release Management Team]]<br />
* [[PostgreSQL 15 Open Items]]<br />
<br />
[[Category:Open_Items]]<br />
<br />
= Committing checklist =<br />
This document attempts to list common checks that PostgreSQL project [[Committers]] may want to adopt as part of a pre-push checklist. There are certain classic mistakes that even experienced committers have been known to make occasionally. In the real world, many mistakes happen when a step is skipped over during a routine process, perhaps triggered by a seemingly insignificant last-minute change. It's important to learn from these mistakes.<br />
<br />
This checklist isn't intended as something that committers will adopt wholesale. Rather, it is intended as a starting point for creating your own semi-customized checklist. Since your final checklist is supposed to be used more or less mechanically, it shouldn't ever be too long, and should be organized into sections to make it easier to skip items where irrelevant. In short, if it's worth adopting something as a standard practice that you return to again and again, it's probably also worth writing that down, to formalize it. Use discretion when deciding what makes sense for you.<br />
<br />
= Basic checks =<br />
<br />
* Double-check release build compiler warnings.<br />
<br />
* make check-world.<br />
** You may want to speed this up by using the following recipe:<br />
make -j16 -s install; make -Otarget -j10 -s check-world && echo "quick make-check world success" || echo "quick make-check world failure"<br />
<br />
* Consider the need for a catversion bump.<br />
<br />
* Don't assume that the doc build still works after even a trivial doc change.<br />
** Removing a GUC can break instances in the release notes where they're referenced. <br />
** Even grep can miss this, since references to the GUC will have dashes rather than underscores, plus possibly other variations.<br />
<br />
* Validate err*() calls against https://www.postgresql.org/docs/devel/static/error-style-guide.html<br />
<br />
* Validate *printf calls for trailing newlines.<br />
<br />
* Spellcheck the patch.<br />
<br />
* Verify that long lines are not better broken into several shorter lines:<br />
git diff origin/master | grep -E '^(\+|diff)' | sed 's/^+//' | expand -t4 | awk "length > 78 || /^diff/"<br />
<br />
* Run pgindent, pgperltidy, and reformat-dat-files on changed files; keep the changes minimal.<br />
<br />
* Run pgperlcritic on modified Perl files.<br />
<br />
* Update version numbers, if needed:<br />
CATALOG_VERSION_NO, PG_CONTROL_VERSION, XLOG_PAGE_MAGIC, PGSTAT_FILE_FORMAT_ID<br />
<br />
* Update function/other OIDs, if needed.<br />
<br />
= Regression test checks =<br />
<br />
* When adding core regression test files, make sure that they're added to both serial and parallel schedules.<br />
(But release 14 and later have only the parallel schedule.)<br />
<br />
* Look for alternative output files for any regression test you're updating the output of.<br />
** Some tests have alternative output files to work around portability issues.<br />
** Most of the time it works to just apply the delta you're observing in the output file relevant to your own platform to the other variants as well.<br />
** Occasionally you may have to just see what the buildfarm says.<br />
<br />
= Git checks =<br />
<br />
== Basic ==<br />
<br />
* Do a dry run before really pushing by using --dry-run.<br />
<br />
* Look at "git status"; anything missing?<br />
<br />
* Author and committer timestamps should match.<br />
<br />
This can be an issue if you're in the habit of rebasing, or of applying patches with "git am". Make sure that your setup displays both timestamps in "git log", by specifying "--pretty=fuller" or by changing the git format config. The easiest way to make both timestamps match is to amend the commit like so:<br />
<br />
git commit --amend --reset-author<br />
<br />
If you have "autosetuprebase = always" in your git config, then a last minute "git pull" could cause a rebase, which could cause author and committer timestamps to diverge a bit. In practice, small differences between author and committer timestamp are not considered to be a problem.<br />
<br />
* Write log message (consider creating a [https://www.git-scm.com/docs/git-commit/2.38.0#Documentation/git-commit.txt--tltfilegt .gitmessage commit.template template file] to make this easier):<br />
Discussion: https://postgr.es/m/XXXXXXXXXXX<br />
Back-patch depth?<br />
What should the release notes say?<br />
Credit any reviewer.<br />
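<br />
One possible setup (the file path and skeleton shown here are just illustrations; commit.template itself is a standard git setting):<br />
<br />
$ cat ~/.gitmessage<br />
<br />
Reviewed-by: <br />
Discussion: https://postgr.es/m/<br />
Backpatch-through: <br />
$ git config commit.template ~/.gitmessage<br />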
<br />
* When making references to other commits, it's a good idea to use the first 9 chars of the commit SHA. Fewer than 9 means there will be no hyperlink in the HTTP interface. More than 9 is not required.<br />
<br />
* Note compatibility issues in commit message, so that they'll get picked up later, when release notes are written.<br />
<br />
* Check merge with master (not applicable to commits).<br />
<br />
* If you're using a dedicated ssh key with a passphrase, you may find it useful to deliberately disable it when you're done pushing:<br />
<br />
$ ssh-add -d ~/.ssh/id_rsa_postgres<br />
<br />
== Backpatching and git ==<br />
<br />
Commit messages for multiple branches should be identical when back-patching, in order to have tooling recognize the redundancy for purposes of compiling release notes, and other things of that nature.<br />
<br />
* Easiest way to get commit metadata consistent is to not worry about commit messages outside of the master branch at first. Commit message on backbranches could initially be something like "pending 9.6".<br />
<br />
* Perform the following procedure on each back branch when you're done, by checking out each individual branch in your local clone of gitmaster and running this against the master-branch commit that has the good commit message:<br />
<br />
git commit --amend --reset-author -C <commit><br />
<br />
You now have the same commit message on each branch. This means that the <code>src/tools/git_changelog</code> utility script will present the commits from each affected local branch together, as one logical change. (This script is used as a starting point when writing back branch release notes. Note that the concept of "one logical change" is not a standard git concept.)<br />
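<br />
For example, run from the top of a local clone:<br />
<br />
src/tools/git_changelog | less<br />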
<br />
* Use <code>git push origin : --dry-run</code> to dry-run pushing all branches at once. Once satisfied, remove --dry-run to actually push. --dry-run is doubly important if you push each branch individually.<br />
<br />
= Maintaining ABI compatibility while backpatching =<br />
<br />
Avoid breaking ABI compatibility. It's unacceptable for extensions built against an earlier point release to break in a more recent point release.<br />
<br />
* You can only really change the signature of a function with local linkage, perhaps with a few rare exceptions.<br />
* You cannot modify any struct definition in src/include/*. If any new members must be added to a struct, put them at the end in backbranches (see the sketch after this list). It's okay to have a different struct layout in master. Even then, extensions that allocate the struct can break via a dependency on its size.<br />
* Move new enum values to the end.<br />
<br />
See [https://postgr.es/m/1315116.1603900649@sss.pgh.pa.us this message] for more considerations on ABI preservation.<br />
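<br />
A contrived sketch of the "new members go at the end" rule (the struct and member names here are hypothetical):<br />
<br />
<pre><br />
/* Hypothetical struct, as shipped in an earlier point release */<br />
typedef struct ExampleState<br />
{<br />
	int			nitems;<br />
	bool		done;<br />
<br />
	/*<br />
	 * Back branch fix: append any new member here, at the end, so that the<br />
	 * offsets of existing members (and usually the struct size) seen by<br />
	 * already-built extensions are unchanged.  On master the new member can<br />
	 * go wherever it logically belongs.<br />
	 */<br />
	int			new_counter;	/* hypothetical new member */<br />
} ExampleState;<br />
</pre><br />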
<br />
= GUC checks =<br />
<br />
* When adding a new GUC, postgresql.conf.sample needs to be updated, too.<br />
<br />
* Is the GUC group the right one?<br />
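<br />
For instance, a hypothetical boolean GUC would get a commented-out postgresql.conf.sample line showing its default, placed under the section that matches its GUC group (the name here is made up):<br />
<br />
#enable_example_feature = off		# enable the hypothetical example feature<br />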
<br />
= Advanced smoke tests =<br />
<br />
* Valgrind memcheck + "make installcheck".<br />
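** One possible invocation (a sketch; building with CPPFLAGS=-DUSE_VALGRIND gives much better results, and the suppressions file ships in the source tree):<br />
valgrind --quiet --trace-children=yes --suppressions=src/tools/valgrind.supp postgres -D $PGDATA<br />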
<br />
* CLOBBER_CACHE_ALWAYS.<br />
<br />
* When doing anything that touches WAL-logging, consider creating a replica, and making sure that wal_consistency_checking=all passes on the replica while the master runs "make installcheck". WAL_DEBUG makes any bug that this throws up easier to isolate.<br />
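** For example, in the replica's postgresql.conf (wal_debug only exists in builds with WAL_DEBUG defined):<br />
wal_consistency_checking = 'all'<br />
wal_debug = on<br />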
<br />
* "#define COPY_PARSE_PLAN_TREES" and "#define WRITE_READ_PARSE_PLAN_TREES" can catch omissions or other mistakes when "src/backend/nodes/*" were changed.<br />
<br />
* Various tests that are only run on certain platforms, enabled [https://www.postgresql.org/docs/devel/regress-run.html using PG_TEST_EXTRA or EXTRA_TESTS environment variables]. For example, PG_TEST_EXTRA='ssl' and EXTRA_TESTS='collate.linux.utf8' tests.<br />
<br />
* Check for unaligned access (e.g. misuse of the alignment macros from c.h) with -fsanitize=alignment.<br />
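** For example (both gcc and clang support these options):<br />
CFLAGS="-fsanitize=alignment -fno-sanitize-recover=all" ./configure --enable-debug --enable-cassert<br />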
<br />
* sqlsmith (for grammar changes, and ??)<br />
<br />
= Meson =<br />
<br />
== PostgreSQL devel documentation ==<br />
<br />
See [https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section] of PostgreSQL devel docs.<br />
<br />
== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| <code>./configure [<i>options</i>]</code><br />
|| <code>meson setup [<i>options</i>] [<i>builddir</i>] <i>sourcedir</i></code><br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| <code>perl src/tools/msvc/mkvcbuild.pl</code><br />
|| <code>meson setup --backend vs [<i>options</i>] [<i>builddir</i>] <i>sourcedir</i></code><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| <code>./configure --help</code><br />
|| <code>meson configure</code><br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| <code>./configure --prefix=<i>DIR</i>, --$somedir=<i>DIR</i>, --with-$option, --enable-$feature</code><br />
|| <code>meson setup|configure -D$option=$value</code><br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| <code>--enable-cassert</code><br />
|| <code>-Dcassert=true</code><br />
||<br />
|-<br />
|| enable debug symbols<br />
|| <code>./configure --enable-debug</code><br />
|| <code>meson configure|setup -Ddebug=true</code><br />
||<br />
|-<br />
|| specify compiler<br />
|| <code>CC=<i>compiler</i> ./configure</code><br />
|| <code>CC=<i>compiler</i> meson setup</code><br />
|| <code>CC</code> is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| <code>CFLAGS=<i>options</i> ./configure</code><br />
|| <code>meson configure|setup -Dc_args=<i>options</i></code><br />
|| <code>CFLAGS</code> is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| <code>make -s</code><br />
|| <code>ninja</code><br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| <code>make</code><br />
|| <code>ninja -v</code><br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| <code>make install</code><br />
|| <code>ninja install</code><br />
|| use <code>meson install --quiet</code> for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| <code>meson install --only-changed</code><br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| <code>make clean</code><br />
|| <code>ninja clean</code><br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| <code>cd doc/ && make html && make man</code><br />
|| <code>ninja docs</code><br />
|| Builds html documentation and man pages<br />
|}<br />
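<br />
Putting a few of these together, a typical debug build from a source checkout might look like this (the directory and prefix names are just examples):<br />
<br />
<pre><br />
meson setup build --prefix=$HOME/pg-meson-install -Ddebug=true -Dcassert=true<br />
ninja -C build<br />
meson install -C build --quiet<br />
</pre><br />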
<br />
==== Build directory ====<br />
<br />
ninja needs to run from the root of the build directory. If you are not in the build directory, you can use the <code>-C</code> flag to have ninja "change directory" and run from there, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| <code>meson test --list</code><br />
|| Only shows tests from "tmp_install" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup], since it is the default (<code>--setup tmp_install</code> is implied here)<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| <code>meson test --setup running --list</code><br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| <code>make check-world</code><br />
|| <code>meson test -v</code><br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| <code>make installcheck-world</code><br />
|| <code>meson test -v --setup running</code><br />
|| Currently [https://postgr.es/m/CAH2-Wz=X7=5jU-+XXJaqQRZja_fseEtrd_dGJa0Wpb74OpsgEA@mail.gmail.com makes brittle assumptions] about test libraries being installed<br />
|-<br />
|| run main regression tests<br />
|| <code>make check</code><br />
|| <code>meson test -v --suite setup --suite regress</code><br />
|| <code>--suite setup</code> required to get a <code>tmp_install</code> directory; see below<br />
|-<br />
|| run specific contrib test suite<br />
|| <code>make -C contrib/amcheck check</code><br />
|| <code>meson test -v --suite setup --suite amcheck</code><br />
|| <code>--suite setup</code> required to get a <code>tmp_install</code> directory; see below<br />
|-<br />
|| run main regression tests against existing server<br />
|| <code>make installcheck</code><br />
|| <code>meson test -v --setup running --suite regress-running</code><br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| <code>make -C contrib/amcheck installcheck</code><br />
|| <code>meson test -v --setup running --suite amcheck-running</code><br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
When running a specific test suite against a temporary throwaway installation, <code>--suite setup</code> should generally be specified. Otherwise the tests could end up running against a stale <code>tmp_install</code> directory, causing general confusion. This [https://postgr.es/m/20230209205605.zo5gfhli22g2kdm2@awork3.anarazel.de workaround] is not required when running tests against an existing server (via the <code>running</code> test setup and variant test suites), since the installation directory being tested is then whatever directory the external server installation uses.<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later), which we assume here.<br />
<br />
You can list all of the tests from a given suite as follows:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck<br />
ninja: no work to do.<br />
postgresql:amcheck / amcheck/regress<br />
postgresql:amcheck / amcheck/001_verify_heapam<br />
postgresql:amcheck / amcheck/002_cic<br />
postgresql:amcheck / amcheck/003_cic_2pc<br />
</pre><br />
<br />
Note that there are distinct <code>running</code>/installcheck suites for most of the standard setup suites, though not all of the tests actually carry over to the <code>running</code> variant suites, as shown here:<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck-running<br />
ninja: no work to do.<br />
postgresql:amcheck-running / amcheck-running/regress<br />
</pre><br />
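<br />
Individual tests can also be named directly, e.g. (assuming a <code>tmp_install</code> already exists from an earlier run):<br />
<br />
<pre><br />
meson test -v amcheck/002_cic<br />
</pre><br />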
<br />
==== Running individual regression test scripts via an installcheck-tests style workflow ====<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
As documented on the [https://mesonbuild.com/Getting-meson.html meson website], a MSI installer is also available.<br />
<br />
Using the most recent version of ActivePerl may be a bit challenging, as there is no direct access to a "perl" command unless you enable a project registered on the ActivePerl website, with a command like this:<br />
<pre><br />
state activate --default<br />
</pre><br />
<br />
An easy way to set up things is to install Chocolatey, and rely on StrawberryPerl. Here are the main packages to worry about:<br />
<pre><br />
choco install winflexbison<br />
choco install sed<br />
choco install gzip<br />
choco install strawberryperl<br />
choco install diffutils<br />
</pre><br />
<br />
The compiler detected will depend on the type of Command Prompt used. For MSVC, use the command prompt installed with Visual Studio. A native Command Prompt or PowerShell may end up linking to Chocolatey's gcc, which may be OK; still, be careful about what meson setup reports.<br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has a lot of hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, based<br />
to a good degree on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We're likely to keep at least<br />
the parts required for PGXS working until all supported versions build with<br />
meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful for visualizing builds.
<hr />
<div>== PostgreSQL devel documentation ==<br />
<br />
See [https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section] of PostgreSQL devel docs.<br />
<br />
== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| <code>./configure [<i>options</i>]</code><br />
|| <code>meson setup [<i>options</i>] [<i>builddir</i>] <i>sourcedir</i></code><br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| <code>perl src/tools/msvc/mkvcbuild.pl</code><br />
|| <code>meson setup --backend vs [<i>options</i>] [<i>builddir</i>] <i>sourcedir</i></code><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| <code>./configure --help</code><br />
|| <code>meson configure</code><br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| <code>./configure --prefix=<i>DIR</i>, --$somedir=<i>DIR</i>, --with-$option, --enable-$feature</code><br />
|| <code>meson setup|configure -D$option=$value</code><br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| <code>--enable-cassert</code><br />
|| <code>-Dcassert=true</code><br />
||<br />
|-<br />
|| enable debug symbols<br />
|| <code>./configure --enable-debug</code><br />
|| <code>meson configure|setup -Ddebug=true</code><br />
||<br />
|-<br />
|| specify compiler<br />
|| <code>CC=<i>compiler</i> ./configure</code><br />
|| <code>CC=<i>compiler</i> meson setup</code><br />
|| <code>CC</code> is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| <code>CFLAGS=<i>options</i> ./configure</code><br />
|| <code>meson configure|setup -Dc_args=<i>options</i></code><br />
|| <code>CFLAGS</code> is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| <code>make -s</code><br />
|| <code>ninja</code><br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| <code>make</code><br />
|| <code>ninja -v</code><br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| <code>make install</code><br />
|| <code>ninja install</code><br />
|| use <code>meson install --quiet</code> for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| <code>meson install --only-changed</code><br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| <code>make clean</code><br />
|| <code>ninja clean</code><br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| <code>cd doc/ && make html && make man</code><br />
|| <code>ninja docs</code><br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
==== Build directory ====<br />
<br />
ninja tries to run from the root of the build directory. If you are not in the build directory, you can use the <code>-C</code> flag to have ninja "change directory" and run from there, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| <code>meson test --list</code><br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| <code>meson test --setup running --list</code><br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| <code>make check-world</code><br />
|| <code>meson test -v</code><br />
|| runs all test, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| <code>make installcheck-world</code><br />
|| <code>meson test -v --setup running</code><br />
|| Currently [https://postgr.es/m/CAH2-Wz=X7=5jU-+XXJaqQRZja_fseEtrd_dGJa0Wpb74OpsgEA@mail.gmail.com makes brittle assumptions] about test libraries being installed<br />
|-<br />
|| run main regression tests<br />
|| <code>make check</code><br />
|| <code>meson test -v --suite setup --suite regress</code><br />
|| <code>--suite setup</code> required to get a <code>tmp_install</code> directory; see below<br />
|-<br />
|| run specific contrib test suite<br />
|| <code>make -C contrib/amcheck check</code><br />
|| <code>meson test -v --suite setup --suite amcheck</code><br />
|| <code>--suite setup</code> required to get a <code>tmp_install</code> directory; see below<br />
|-<br />
|| run main regression tests against existing server<br />
|| <code>make installcheck</code><br />
|| <code>meson test -v --setup running --suite regress-running</code><br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| <code>make -C contrib/amcheck installcheck</code><br />
|| <code>meson test -v --setup running --suite amcheck-running</code><br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
When running a specific test suite against a temporary throw away installation, <code>--suite setup</code> should generally be specified. Otherwise the tests could end up running against a stale <code>tmp_install</code> directory, causing general confusion. This [https://postgr.es/m/20230209205605.zo5gfhli22g2kdm2@awork3.anarazel.de workaround] is not required when running tests against an existing server (via the <code>running</code> test setup and variant test suites), since of course the installation directory being tested is whatever directory the external server installation uses.<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later), which we assume here.<br />
<br />
You can list all of the tests from a given suite as follows:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck<br />
ninja: no work to do.<br />
postgresql:amcheck / amcheck/regress<br />
postgresql:amcheck / amcheck/001_verify_heapam<br />
postgresql:amcheck / amcheck/002_cic<br />
postgresql:amcheck / amcheck/003_cic_2pc<br />
</pre><br />
<br />
Note that there are distinct <code>running</code>/installcheck suites for most of the standard setup suites, though not all of the tests actually carry over to the <code>running</code> variant suites, as shown here:<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck-running<br />
ninja: no work to do.<br />
postgresql:amcheck-running / amcheck-running/regress<br />
</pre><br />
<br />
==== Running individual regression test scripts via an installcheck-tests style workflow ====<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
As documented on the [https://mesonbuild.com/Getting-meson.html meson website], a MSI installer is also available.<br />
<br />
Using the most recent version of ActivePerl can be a bit challenging: there is no direct access to a "perl" command unless you enable a project registered on the ActivePerl website, with a command like this:<br />
<pre><br />
state activate --default<br />
</pre><br />
<br />
An easy way to set things up is to install Chocolatey and rely on Strawberry Perl. Here are the main packages to worry about:<br />
<pre><br />
choco install winflexbison<br />
choco install sed<br />
choco install gzip<br />
choco install strawberryperl<br />
choco install diffutils<br />
</pre><br />
<br />
The compiler detected will depend on the type of command prompt used. For MSVC, use the developer command prompt installed with Visual Studio. A plain Command Prompt or PowerShell session may end up linking to Chocolatey's gcc, which may be OK; still, be careful to check what meson setup reports.<br />
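<br />
For example, a minimal sketch of an MSVC build run from a Visual Studio developer command prompt, assuming the default ninja backend (directory names are placeholders):<br />
<pre><br />
meson setup builddir sourcedir<br />
ninja -C builddir<br />
</pre><br />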
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has a lot of hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We're likely to keep at least<br />
the parts required for PGXS working until all supported versions build with<br />
meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful for visualizing builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=PgCon_2023_Developer_Meeting&diff=37760PgCon 2023 Developer Meeting2023-04-15T23:38:15Z<p>Pgeoghegan: /* RSVPs */</p>
<hr />
<div>A meeting of interested PostgreSQL developers is being planned for Tuesday 30 May, 2023 at the University of Ottawa, prior to pgCon 2023. In order to keep the numbers manageable, this meeting is by '''invitation only'''.<br />
Any questions regarding the invitations to this event should be directed to the team of individuals tasked with coming up with the list of people to invite:<br />
<br />
* Andres Freund<br />
* Stephen Frost<br />
* Dave Page<br />
<br />
An Unconference will be held on Friday for in-depth discussion of technical topics.<br />
<br />
This is a PostgreSQL Community event.<br />
<br />
== Meeting Goals ==<br />
<br />
* Define the schedule for the upcoming releases<br />
* Address any proposed timing, policy, or procedure issues<br />
* Receive updates from project sub-teams on their activities and discuss any resulting issues or concerns.<br />
* Address any proposed [http://en.wikipedia.org/wiki/Wicked_problem Wicked problems]<br />
<br />
== Time & Location ==<br />
<br />
The meeting will (probably) be:<br />
<br />
* 9:00AM to 12PM<br />
* DMS 3105 - Desmarais Hall, 55 Laurier Avenue East<br />
* University of Ottawa.<br />
<br />
Lunch will be served during the meeting.<br />
<br />
== COVID-19 ==<br />
<br />
The University of Ottawa's COVID-19 guidance can be found at https://www.uottawa.ca/en/covid-19. Wearing of masks at the Developer Meeting will be optional; however, we do ask that people not attend if they have COVID symptoms or have tested positive.<br />
<br />
== RSVPs ==<br />
<br />
The following people have RSVPed to the meeting (in alphabetical order, by surname). Note that we can accommodate a '''maximum of 30'''!<br />
<br />
# Joe Conway<br />
# Jeff Davis<br />
# Peter Eisentraut<br />
# Andres Freund<br />
# Stephen Frost<br />
# Etsuro Fujita<br />
# Peter Geoghegan<br />
# Magnus Hagander<br />
# Jonathan Katz<br />
# Alexander Korotkov<br />
# Tom Lane<br />
# Heikki Linnakangas<br />
# Noah Misch<br />
# Thomas Munro<br />
# Dave Page<br />
# Michael Paquier<br />
# Melanie Plageman<br />
# David Rowley<br />
# Tomas Vondra<br />
<br />
The following people will not be in Ottawa, and do not plan to attend:<br />
<br />
# Masao Fujii<br />
# Daniel Gustafsson<br />
# Tatsuo Ishii<br />
# Dean Rasheed<br />
<br />
== Agenda Items ==<br />
<br />
* 16.0 release and commitfest schedule (Dave)<br />
* Improvements to table AM API (Alexander)<br />
* ''Please add suggestions for agenda items here. (with your name)''<br />
<br />
==Agenda==<br />
<br />
{| border="1" cellpadding="4" cellspacing="0"<br />
!Time<br />
!Item<br />
!Presenter<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|09:00 - 09:10<br />
|Welcome and introductions<br />
|Dave Page<br />
<br />
|- <br />
|09:10 - 09:20<br />
|Release and commitfest schedules<br />
|Dave Page<br />
<br />
|- <br />
|??:?? - ??:??<br />
|TBD<br />
|TBD<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|10:30 - 11:00<br />
|Coffee break<br />
|All<br />
<br />
|- <br />
|??:?? - ??:??<br />
|TBD<br />
|TBD<br />
<br />
|- <br />
|11:50 - 12:00<br />
|Any other business<br />
|Dave Page<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|12:00<br />
|Lunch<br />
|<br />
<br />
|}<br />
<br />
Note: This timetable is a rough guide only. Items will start as soon as the previous discussion is complete (breaks will not move materially however). Any remaining time before lunch may be used for Commitfest item triage or other activities.<br />
<br />
[[Category:Developer Meeting]]</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD&diff=37582Getting a stack trace of a running PostgreSQL backend on Linux/BSD2023-02-19T00:44:09Z<p>Pgeoghegan: /* Jumping back and forth through a recording use GDB commands */</p>
<hr />
<div>[[Generating a stack trace of a PostgreSQL backend|Up to parent]]<br />
<br />
== Linux and BSD ==<br />
<br />
Linux and BSD systems generally use the [http://gcc.gnu.org/ GNU compiler collection] and the [http://www.gnu.org/software/gdb/ GNU Debugger] ("gdb"). It's pretty trivial to get a stack trace of a process.<br />
<br />
(If you want more than just a stack trace, take a look at the [[Developer FAQ]] which covers interactive debugging).<br />
<br />
=== Installing External symbols ===<br />
<br />
(BSD users who installed from ports can skip this)<br />
<br />
On many Linux systems, debugging info is separated out from program binaries and stored separately. It's often not installed when you install a package, so if you want to debug the program (say, get a stack trace) you will need to install debug info packages. Unfortunately, the names of these packages vary depending on your distro, as does the procedure for installing them.<br />
<br />
Some generic instructions (unrelated to PostgreSQL) are maintained on the GNOME Wiki [http://live.gnome.org/GettingTraces/DistroSpecificInstructions here].<br />
<br />
==== On Debian ====<br />
<br />
http://wiki.debian.org/HowToGetABacktrace<br />
<br />
Debian Squeeze (6.x) users will also need to install gdb 7.3 from backports, as the gdb shipped in Squeeze doesn't understand the PIE executables used in newer PostgreSQL builds.<br />
<br />
==== On Ubuntu ====<br />
<br />
First, follow the instructions on the Ubuntu wiki entry [https://wiki.edubuntu.org/DebuggingProgramCrash DebuggingProgramCrash]. <br />
<br />
Once you've finished enabling the use of debug info packages as described, you will need to use the <code>list-dbgsym-packages.sh</code> script linked to on that wiki article to get a list of debug packages you need. Installing the debug package for postgresql alone is <i>not</i> sufficient. <br />
<br />
After following the instructions on the Ubuntu wiki, download the script to your desktop, open a terminal, and run:<br/><br />
<pre><br />
$ sudo apt-get install $(sudo bash Desktop/list-dbgsym-packages.sh -t -p $(pidof -s postgres))<br />
</pre><br />
<br />
==== On Fedora ====<br />
<br />
All Fedora versions: [https://fedoraproject.org/wiki/StackTraces#debuginfo FedoraProject.org - StackTraces]<br />
<br />
==== Other distros ====<br />
<br />
In general, you need to install at least the debug symbol packages for the PostgreSQL server and client as well as any common package that may exist, and the debug symbol package for libc. It's a good idea to add debug symbols for the other libraries PostgreSQL uses in case the problem you're having arises in or touches on one of those libraries.<br />
<br />
=== Collecting a stack trace ===<br />
<br />
==== How to tell if a stack trace is any good ====<br />
<br />
Read this section and keep it in mind as you collect information using the instructions below. Making sure the information you collect is actually useful will save you, and everybody else, time and hassle.<br />
<br />
It is vitally important to have debugging symbols available to get a useful stack trace. If you do not have the required symbols installed, backtraces will contain lots of entries like this:<br />
<br />
<pre><br />
#1 0x00686a3d in ?? ()<br />
#2 0x00d3d406 in ?? ()<br />
#3 0x00bf0ba4 in ?? ()<br />
#4 0x00d3663b in ?? ()<br />
#5 0x00d39782 in ?? ()<br />
</pre><br />
<br />
... which are completely useless for debugging without access to your system (and almost useless with access). If you see results like the above, you need to install debugging symbol packages, or even re-build PostgreSQL with debugging enabled. <b>Do not bother collecting such backtraces; they are not useful.</b><br />
<br />
Sometimes you'll get backtraces that contain just the function name and the executable it's within, not source code file names and line numbers or parameters. Such output will have lines like this:<br />
<br />
<pre><br />
#11 0x00d3afbe in PostmasterMain () from /usr/lib/postgresql/8.4/bin/postgres<br />
</pre><br />
<br />
This isn't ideal, but is a lot better than nothing. Installing debug information packages should give an even more detailed stack trace with line number and argument information, like this:<br />
<br />
<pre><br />
#9 0xb758d97e in PostmasterMain (argc=5, argv=0xb813a0e8) at postmaster.c:1040<br />
</pre><br />
<br />
... which is the most useful for tracking down your problem. Note the reference to a source file and line number instead of just an executable name.<br />
<br />
==== Identifying the backend to connect to ====<br />
<br />
You need to know the process ID of the postgresql backend to connect to. If you're interested in a backend that's using lots of CPU it might show up in <code>top</code>. If you have a current connection to the backend you're interested in, use <code>select pg_backend_pid()</code> to get its process ID. Otherwise, the <code>pg_catalog.pg_stat_activity</code> and/or <code>pg_catalog.pg_locks</code> views may be useful in identifying the backend of interest; see the "pid" column in those views (named "procpid" in pg_stat_activity before PostgreSQL 9.2).<br />
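<br />
For example, a quick way to list candidate backends from a shell (assuming PostgreSQL 9.2 or later, where the column is named "pid"):<br />
<pre><br />
$ psql -c "SELECT pid, state, query FROM pg_stat_activity WHERE pid <> pg_backend_pid();"<br />
</pre><br />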
<br />
==== Attaching gdb to the backend ====<br />
<br />
Once you know the process ID to connect to, run:<br />
<br />
<pre><br />
sudo gdb -p pid<br />
</pre><br />
<br />
where "pid" is the process ID of the backend. GDB will pause the execution of the process you specified and drop you into interactive mode (the <code>(gdb)</code> prompt) after showing the call the backend is currently running, eg:<br />
<br />
<pre><br />
0xb7c73424 in __kernel_vsyscall ()<br />
(gdb) <br />
</pre><br />
<br />
You'll want to tell gdb to save a log of the session to a file, so at the gdb prompt enter:<br />
<br />
<pre><br />
(gdb) set pagination off<br />
(gdb) set logging file debuglog.txt<br />
(gdb) set logging on<br />
</pre><br />
<br />
gdb is now saving all input and output to a file, <code>debuglog.txt</code>, in the directory in which you started gdb.<br />
<br />
At this point execution of the backend is still paused. It can even hold up other backends, so I recommend that you tell it to resume executing normally with the "cont" command:<br />
<br />
<pre><br />
(gdb) cont<br />
Continuing.<br />
</pre><br />
<br />
The backend is now running normally, as if gdb wasn't connected to it.<br />
<br />
==== Getting the trace ====<br />
<br />
OK, with gdb connected you're ready to get a useful stack trace.<br />
<br />
In addition to the instructions below, you can find some useful tips about using gdb with postgresql backends on the [[Developer_FAQ#What_debugging_features_are_available.3F|Developer FAQ]].<br />
<br />
==== Getting representative traces from a running backend ====<br />
<br />
If you're looking into a backend that's taking way too long to execute a query, using too much CPU, or apparently stuck in an infinite loop, you'll want to <i>repeatedly</i> interrupt its execution, get a stack trace, and let it resume executing. Having a collection of several stack traces helps provide a better idea of where it's spending its time.<br />
<br />
You interrupt the backend and get back to the gdb command line with ^C (control-C). Once at the gdb command line, you use the "bt" command to get a backtrace, then the "cont" command to resume normal backend execution.<br />
<br />
Once you've collected a few backtraces, detach then exit gdb at the gdb interactive prompt:<br />
<br />
<pre><br />
(gdb) detach<br />
Detaching from program: /usr/lib/postgresql/8.3/bin/postgres, process 12912<br />
(gdb) quit<br />
user@host:~$<br />
</pre><br />
<br />
An alternative approach is to use the <code>gcore</code> program to save a series of core dumps of the running program without disrupting its execution. Those core dumps may then be examined at your leisure, giving you time to get more than just a backtrace because you're not holding up the backend's execution while you think and type.<br />
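<br />
For example, a minimal sketch of taking several core snapshots of backend $PID without killing it (the file name prefix and interval are arbitrary choices):<br />
<pre><br />
$ for i in 1 2 3; do sudo gcore -o /tmp/pgsnap.$i $PID; sleep 5; done<br />
</pre><br />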
<br />
==== Getting a trace from the point of an error report ====<br />
<br />
If you are trying to find out the cause of an unexpected error, the most useful thing to do is to set a breakpoint at '''errfinish''' before you let the backend continue:<br />
<br />
<pre><br />
(gdb) b errfinish<br />
Breakpoint 1 at 0x80ced0: file elog.c, line 414.<br />
(gdb) cont<br />
Continuing.<br />
</pre><br />
<br />
Now, in your connected psql session, run whatever query is needed to provoke the error. When it happens, the backend will stop execution at '''errfinish'''.<br />
Collect your backtrace with '''bt''', then '''quit''' (or, possibly, '''cont''' if you want to do it again).<br />
<br />
A breakpoint at '''errfinish''' will capture generation of not only ERROR reports, but also NOTICE, LOG, and any other message that isn't suppressed by '''client_min_messages'''<br />
or '''log_min_messages'''. You may want to adjust those settings to avoid having to continue through a bunch of unrelated messages.<br />
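<br />
For example, assuming you're connected as a superuser in the session used to reproduce the problem, you can suppress everything below ERROR before triggering it:<br />
<pre><br />
postgres=# SET client_min_messages = error;<br />
SET<br />
postgres=# SET log_min_messages = error;<br />
SET<br />
</pre><br />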
<br />
==== Getting a trace from a reproducibly crashing backend ====<br />
<br />
GDB will automatically interrupt the execution of a program if it detects a crash. So, once you've attached gdb to the backend you expect to crash, you just let it continue execution as normal and do whatever you need to to make the backend crash.<br />
<br />
gdb will drop you into interactive mode as the backend crashes. At the <code>gdb</code> prompt you can enter the <code>bt</code> command to get a stack trace of the crash, then <code>cont</code> to continue execution. When gdb reports the process has exited, use the <code>quit</code> command.<br />
<br />
Alternately, you can collect a core file as explained below, but it's probably more hassle than it's worth if you know which backend to attach gdb to before it crashes.<br />
<br />
==== Getting a trace from a randomly crashing backend ====<br />
<br />
It's a lot harder to get a stack trace from a crashing backend when you don't know what causes the crash, or which backends will crash when. For this, you generally need to enable the generation of core files, which are debuggable dumps of a program's state that are generated by the operating system when the program crashes.<br />
<br />
===== Enabling core dumps =====<br />
<br />
[http://www.cyberciti.biz/tips/linux-core-dumps.html This article provides a useful primer on core dumps on Linux].<br />
<br />
On a Linux system you can check to see if core file generation is enabled for a process by examining /proc/$pid/limits, where $pid is the process ID of interest. "Max core file size" should be non-zero.<br />
<br />
Generally, adding "ulimit -c unlimited" to the top of the PostgreSQL startup script and restarting postgresql is sufficient to enable core dump collection. Make sure you have plenty of free space in your PostgreSQL data directory, because that's where the core dumps will be written and they can be fairly big due to Pg's use of shared memory. It may be useful to <b>temporarily reduce the size of shared_buffers</b> within postgresql.conf. This avoids core dumps that make the system unresponsive for minutes at a time, which can happen when shared_buffers is more than a few gigabytes. Reducing shared_buffers significantly will usually not make the server intolerably slow, since PostgreSQL will make increased use of the filesystem cache.<br />
<br />
On a Linux system it's also worth changing the file name format used for core dumps so that core dumps don't overwrite each other. The <code>/proc/sys/kernel/core_pattern</code> file controls this. I suggest <code>core.%p.sig%s.%ts</code>, which will record the process's PID, the signal that killed it, and the timestamp at which the core was generated. See <code>man 5 core</code>. To apply the settings change just run <code>echo core.%p.sig%s.%ts | sudo tee -a /proc/sys/kernel/core_pattern</code>.<br />
<br />
You can test whether core dumps are enabled by starting a psql session, finding the backend pid for it using the instructions given above, then killing it with "kill -ABRT pidofbackend" (where pidofbackend is the PID of the postgres backend, NOT the pid of psql). You should see a core file appear in your postgresql data directory.<br />
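<br />
Note that on systemd-based distros, a ulimit set in a startup script may not apply to a service-managed postmaster. A minimal sketch of the usual fix, assuming the service is named postgresql.service (the unit name varies by distro/packaging), is a drop-in file followed by "systemctl daemon-reload" and a service restart:<br />
<pre><br />
# /etc/systemd/system/postgresql.service.d/coredump.conf<br />
[Service]<br />
LimitCORE=infinity<br />
</pre><br />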
<br />
===== Debugging the core dump =====<br />
<br />
Once you've enabled core dumps, you need to wait until you see a backend crash. A core dump will be generated by the operating system, and you'll be able to attach gdb to it to collect a stack trace or other information. <br />
<br />
You need to tell gdb what executable file generated the core if you want to get useful backtraces and other debugging information. To do this, just specify the postgres executable path then the core file path when invoking gdb, as shown below. If you do not know the location of the postgres executable, you can get it by examining /proc/$pid/exe for a running postgres instance. For example:<br />
<br />
<pre><br />
$ for f in `pgrep postgres`; do ls -l /proc/$f/exe; done<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:30 /proc/10621/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11052/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11053/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11054/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11055/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
</pre><br />
<br />
... we can see from the above that the postgres executable on my (Ubuntu) system is <code>/usr/lib/postgresql/8.4/bin/postgres</code>.<br />
<br />
Once you know the executable path and the core file location, just run gdb with those as arguments, i.e. <code>gdb -q /path/to/postgres /path/to/core</code>. Now you can debug it as if it were a normal running postgres, as discussed in the sections above.<br />
<br />
===== Debugging the core dump - example =====<br />
<br />
For example, having just forced a postgres backend to crash with <code>kill -ABRT</code>, I have a core file named <code>core.10780.sig6.1271644870s</code> in <code>/var/lib/postgresql/8.4/main</code>, which is the data directory on my Ubuntu system. I've used /proc to find out that the executable for postgres on my system is <code>/usr/lib/postgresql/8.4/bin/postgres</code>.<br />
<br />
It's now easy to run GDB against it and request a backtrace:<br />
<br />
<pre><br />
$ sudo -u postgres gdb -q -c /var/lib/postgresql/8.4/main/core.10780.sig6.1271644870s /usr/lib/postgresql/8.4/bin/postgres<br />
Core was generated by `postgres: wal writer process '.<br />
Program terminated with signal 6, Aborted.<br />
#0 0x00a65422 in __kernel_vsyscall ()<br />
(gdb) bt<br />
#0 0x00a65422 in __kernel_vsyscall ()<br />
#1 0x00686a3d in ___newselect_nocancel () from /lib/tls/i686/cmov/libc.so.6<br />
#2 0x00e68d25 in pg_usleep () from /usr/lib/postgresql/8.4/bin/postgres<br />
#3 0x00d3d406 in WalWriterMain () from /usr/lib/postgresql/8.4/bin/postgres<br />
#4 0x00bf0ba4 in AuxiliaryProcessMain () from /usr/lib/postgresql/8.4/bin/postgres<br />
#5 0x00d3663b in ?? () from /usr/lib/postgresql/8.4/bin/postgres<br />
#6 0x00d39782 in ?? () from /usr/lib/postgresql/8.4/bin/postgres<br />
#7 <signal handler called><br />
#8 0x00a65422 in __kernel_vsyscall ()<br />
#9 0x00686a3d in ___newselect_nocancel () from /lib/tls/i686/cmov/libc.so.6<br />
#10 0x00d37bee in ?? () from /usr/lib/postgresql/8.4/bin/postgres<br />
#11 0x00d3afbe in PostmasterMain () from /usr/lib/postgresql/8.4/bin/postgres<br />
#12 0x00cdc0dc in main () from /usr/lib/postgresql/8.4/bin/postgres<br />
</pre><br />
<br />
This example shows a stack trace that does not include function arguments. There may or may not be function arguments on your system, depending on obscure details largely outside your control, like whether or not Postgres was originally built to omit frame pointers, DWARF version, etc. In general, the situation with getting backtraces on mainstream Linux platforms has improved significantly since this example backtrace was originally added. These days, it is often <b>better to use "bt full" instead of "bt"</b>, since this can provide even more information (the values of local/stack variables during the crash). In general, the more information that you can provide for debugging, the better.<br />
<br />
If you don't have proper symbols installed, specify the wrong executable to gdb or fail to specify an executable at all, you'll see a <b>useless</b> backtrace like this following one:<br />
<br />
<pre><br />
$ sudo -u postgres gdb -q -c /var/lib/postgresql/8.4/main/core.10780.sig6.1271644870s <br />
Core was generated by `postgres: wal writer process '.<br />
Program terminated with signal 6, Aborted.<br />
#0 0x00a65422 in __kernel_vsyscall ()<br />
(gdb) bt<br />
#0 0x00a65422 in __kernel_vsyscall ()<br />
#1 0x00686a3d in ?? ()<br />
#2 0x00d3d406 in ?? ()<br />
#3 0x00bf0ba4 in ?? ()<br />
#4 0x00d3663b in ?? ()<br />
#5 0x00d39782 in ?? ()<br />
#6 <signal handler called><br />
#7 0x00a65422 in __kernel_vsyscall ()<br />
#8 0x00686a3d in ?? ()<br />
#9 0x00d3afbe in ?? ()<br />
#10 0x00cdc0dc in ?? ()<br />
#11 0x005d7b56 in ?? ()<br />
#12 0x00b8fad1 in ?? ()<br />
<br />
</pre><br />
<br />
If you get something like that, don't bother sending it in. If you didn't just get the executable path wrong, you'll probably need to install debugging symbols for PostgreSQL (or even re-build PostgreSQL with debugging enabled) and try again.<br />
<br />
=== Tracing problems when creating a cluster ===<br />
<br />
If you're running into a crash while trying to create a database cluster using ''initdb'', that may leave behind a core dump that you can analyze with gdb as described above. This should be the case if there's an assertion failure for example. You will probably need to give the ''--no-clean'' option to ''initdb'' to keep it from deleting the new data directory and the core file along with it.<br />
<br />
Another technique for finding bootstrap-time bugs is to manually feed the bootstrapping commands into bootstrap mode or single-user mode, with a data directory left over from ''initdb --no-clean''. This can help if there has been no PANIC that leaves a core dump, but just a FATAL or ERROR, for example. It's easy to attach GDB to such a backend.<br />
<br />
Also, try creating the data directory with initdb from unpatched master, then triggering the crash with the patched backend, rather than during initdb.<br />
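<br />
A minimal sketch of entering single-user mode against a data directory left behind by ''initdb --no-clean'' (the path is a placeholder; template1 is created early during initdb, so it is typically available even in a partially-initialized cluster):<br />
<pre><br />
$ postgres --single -D /path/to/failed_datadir template1<br />
</pre><br />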
<br />
== Dumping a page image from within GDB ==<br />
<br />
It is sometimes useful to post a file containing a [https://www.postgresql.org/docs/current/storage-page-layout.html raw page image] when reporting a problem on a community mailing list. Both tables and indexes consist of 8KiB-sized blocks/pages, which can be thought of as the fundamental unit of data storage. This is particularly likely to be helpful when the integrity of the data is suspect, such as when an assertion fails due to a bug that corrupts data. GDB makes it easy to do this from either an interactive session or a core dump (though core dumps may have [https://www.postgresql.org/message-id/20200210195659.vx6slnxmoymp5yyo%40alap3.anarazel.de issues with dumping shared memory]).<br />
<br />
Example:<br />
<br />
<pre><br />
Breakpoint 1, _bt_split (rel=0x7f555b6f3460, itup_key=0x55d03a745d40, buf=232, cbuf=0, firstright=366, newitemoff=216, newitemsz=16, newitem=0x55d03a745d18, newitemonleft=true) at nbtinsert.c:1205<br />
1205 {<br />
(gdb) n<br />
1215 Buffer sbuf = InvalidBuffer;<br />
(gdb)<br />
1216 Page spage = NULL;<br />
(gdb)<br />
1217 BTPageOpaque sopaque = NULL;<br />
(gdb)<br />
1227 int indnatts = IndexRelationGetNumberOfAttributes(rel);<br />
(gdb)<br />
1228 int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);<br />
(gdb)<br />
1231 rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);<br />
(gdb)<br />
1244 origpage = BufferGetPage(buf);<br />
(gdb)<br />
1245 leftpage = PageGetTempPage(origpage);<br />
(gdb)<br />
1246 rightpage = BufferGetPage(rbuf);<br />
(gdb)<br />
1248 origpagenumber = BufferGetBlockNumber(buf);<br />
(gdb)<br />
1249 rightpagenumber = BufferGetBlockNumber(rbuf);<br />
(gdb) dump binary memory /tmp/dump_block.page origpage (origpage + 8192)<br />
</pre><br />
<br />
The contents of the page "origpage" are now dumped to the file "/tmp/dump_block.page", which will be precisely 8192 bytes in size. This works wherever the "Page" C type appears (a typedef defined in bufpage.h -- an unadorned "Page" is actually a char pointer). A "Page" variable is a raw pointer to a page image, typically the authoritative/current page stored in shared_buffers.<br />
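<br />
Once written out, the raw page can be sanity-checked with any hex viewer before posting it; for example:<br />
<pre><br />
$ xxd /tmp/dump_block.page | head -n 4<br />
$ stat -c %s /tmp/dump_block.page    # size should be exactly 8192<br />
</pre><br />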
<br />
=== pg_hexedit ===<br />
<br />
Note also that the Postgres hex editor tool [https://github.com/petergeoghegan/pg_hexedit pg_hexedit] can quickly [https://github.com/petergeoghegan/pg_hexedit#using-pg_hexedit-while-debugging-postgres-with-gdb visualize page images within GDB] with intuitive tags and annotations. It might be easier to use pg_hexedit when it isn't initially clear which page images are of interest, or when multiple images of the same block need to be captured over time, as a test case runs.<br />
<br />
=== contrib/pageinspect page dump ===<br />
<br />
When it isn't convenient to use GDB, and when it isn't necessary to get a page image that is exactly current at the time of a crash, it is possible to dump an arbitrary page to a file in a more lightweight fashion using [https://www.postgresql.org/docs/current/pageinspect.html contrib/pageinspect]. For example, the following interactive shell session dumps the current page image in block 42 for the index 'pgbench_pkey':<br />
<br />
<pre><br />
$ psql -c "create extension pageinspect"<br />
CREATE EXTENSION<br />
$ psql -XAtc "SELECT encode(get_raw_page('pgbench_pkey', 42),'base64')" | base64 -d > dump_block_42.page<br />
</pre><br />
<br />
This assumes that it is possible to connect as a superuser using psql, and that the base64 program is in the user's $PATH. The GNU coreutils package generally includes base64, so it will already be available on most Linux installations. Note that it may be necessary to install an operating system package named "postgresql-contrib" or similar before the pageinspect extension will be available to install.<br />
<br />
Typically, the easiest way of following this procedure is to become the postgres operating system user first (e.g., through "su postgres").<br />
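<br />
The same approach extends to dumping a range of blocks in one go; a small sketch (the relation name and block numbers are just examples):<br />
<pre><br />
$ for blk in 0 1 2; do<br />
psql -XAtc "SELECT encode(get_raw_page('pgbench_pkey', $blk),'base64')" | base64 -d > dump_block_$blk.page<br />
done<br />
</pre><br />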
<br />
== Starting Postgres under GDB ==<br />
<br />
Debugging multi-process applications like PostgreSQL has historically been very painful with GDB. Thankfully with recent 7.x releases, this has been improved greatly by "inferiors" (GDB's term for multiple debugged processes).<br />
<br />
NB! This is still quite fragile, so don't expect to be able to do this in production.<br />
<br />
<source lang="bash"><br />
# Stop server<br />
pg_ctl -D /path/to/data stop -m fast<br />
# Launch postgres via gdb<br />
gdb --args postgres -D /path/to/data<br />
</source><br />
<br />
Now, in the GDB shell, use these commands to set up an environment:<br />
<br />
<source lang="bash"><br />
# We have scroll bars in the year 2012!<br />
set pagination off<br />
# Attach to both parent and child on fork<br />
set detach-on-fork off<br />
# Stop/resume all processes<br />
set schedule-multiple on<br />
<br />
# Usually don't care about these signals<br />
handle SIGUSR1 noprint nostop<br />
handle SIGUSR2 noprint nostop<br />
<br />
# Make GDB's expression evaluation work with most common Postgres Macros (works with Linux).<br />
# Per https://www.postgresql.org/message-id/20130731021434.GE19053@alap2.anarazel.de,<br />
# have many Postgres macros work if these are defined (useful for TOAST stuff,<br />
# varlena stuff, etc):<br />
macro define __builtin_offsetof(T, F) ((int) &(((T *) 0)->F))<br />
macro define __extension__<br />
<br />
# Ugly hack so we don't break on process exit<br />
python gdb.events.exited.connect(lambda x: [gdb.execute('inferior 1'), gdb.post_event(lambda: gdb.execute('continue'))])<br />
<br />
# Phew! Run it.<br />
run<br />
</source><br />
<br />
To get a list of processes, run <code>info inferiors</code>. To switch to another process, run <code>inferior <i>NUM</i></code>.<br />
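<br />
A typical interactive sequence, once a child process of interest has forked, might look like this (inferior numbers depend on fork order):<br />
<pre><br />
(gdb) info inferiors<br />
(gdb) inferior 2<br />
(gdb) bt<br />
</pre><br />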
<br />
== Recording Postgres using rr Record and Replay Framework ==<br />
<br />
PostgreSQL 13 can be debugged using [https://rr-project.org the rr debugging recorder]. This section describes some useful workflows for using rr to debug Postgres. It is primarily written for Postgres hackers, though rr could also be used when reporting a bug.<br />
<br />
=== Version compatibility ===<br />
<br />
Commit {{PgCommitURL|fc3f4453a2bc95549682e23600b22e658cb2d6d7}} resolved an issue that made it hard to use rr with earlier Postgres versions, so there might be problems on those versions. Also, earlier versions of rr distributed with older/LTS Linux OS versions might not have support for syscalls that are used by Postgres, such as <code>sync_file_range()</code>. All of these issues probably have fairly straightforward workarounds (e.g. you could start Postgres with <code>--wal_writer_flush_after=0 --backend_flush_after=0 --bgwriter_flush_after=0 --checkpoint_flush_after=0</code>).<br />
<br />
=== Postgres settings ===<br />
<br />
A script that records a postgres session using rr might consist of the following example snippet:<br />
<br />
<source lang="bash"><br />
rr record -M /code/postgresql/$BRANCH/install/bin/postgres \<br />
-D /code/postgresql/$BRANCH/data \<br />
--log_line_prefix="%m %p " \<br />
--effective_cache_size=1GB \<br />
--random_page_cost=4.0 \<br />
--work_mem=4MB \<br />
--maintenance_work_mem=64MB \<br />
--fsync=off \<br />
--log_statement=all \<br />
--log_min_messages=DEBUG5 \<br />
--max_connections=50 \<br />
--shared_buffers=32MB<br />
</source><br />
<br />
Most of the details here are somewhat arbitrary. The general idea is to make log output as verbose as possible, and to keep the amount of memory used by the server low.<br />
<br />
It is quite practical to run "make installcheck" against the server when Postgres is run with "rr record", recording the entire execution. This is not much slower than just running the tests against a regular debug build of Postgres. It's still much faster than Valgrind, for example. Replaying the recording seems to be where having a high-end machine helps a lot.<br />
<br />
=== Event numbers in the log ===<br />
<br />
Once the tests are done, stop Postgres in the usual way (e.g. Ctrl + C). The recording is saved to the <code>$HOME/.local/share/rr/</code> directory on most Linux distros. rr creates a directory for each distinct recording in this parent directory. rr also maintains a symlink (<code>latest-trace</code>) that points to the latest recording directory, which is often used when replaying a recording. Be careful to avoid accidentally leaving too many recordings around. They can be rather large.<br />
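<br />
A couple of housekeeping commands for keeping an eye on disk usage (the path shown is the default mentioned above):<br />
<source lang="bash"><br />
# Each recording gets its own directory; latest-trace is a symlink to the newest<br />
ls -l ~/.local/share/rr/<br />
du -sh ~/.local/share/rr/*<br />
</source><br />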
<br />
The record/Postgres terminal has output that looks like this (when the example "rr record" recipe is used):<br />
<br />
<pre><br />
[rr 1786705 1241867]2020-04-04 21:55:05.018 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63992/1/2<br />
[rr 1786705 1241898]2020-04-04 21:55:05.019 PDT 1786705 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0<br />
[rr 1786705 1241902]2020-04-04 21:55:05.019 PDT 1786705 LOG: statement: CREATE TYPE test_type_empty AS ();<br />
[rr 1786705 1241906]2020-04-04 21:55:05.020 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63993/1/1<br />
[rr 1786705 1241936]2020-04-04 21:55:05.020 PDT 1786705 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0<br />
[rr 1786705 1241940]2020-04-04 21:55:05.020 PDT 1786705 LOG: statement: DROP TYPE test_type_empty;<br />
[rr 1786705 1241944]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: drop auto-cascades to composite type test_type_empty<br />
[rr 1786705 1241948]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: drop auto-cascades to type test_type_empty[]<br />
[rr 1786705 1241952]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: MultiXact: setting OldestMember[2] = 9<br />
[rr 1786705 1241956]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63994/1/3<br />
</pre><br />
<br />
The part of each log line in square brackets comes from rr (since we used <code>-M</code> when recording) -- the first number is a PID, the second an event number. You probably won't care about the PIDs, though, since the event number alone unambiguously identifies a particular "event" in a particular backend (rr recordings are single threaded, even when there are multiple threads or processes). Suppose you want to get to the <code>CREATE TYPE test_type_empty AS ()</code> query -- you can get to the end of the query by replaying the recording with this option:<br />
<br />
<source lang="bash"><br />
$ rr replay -M -g 1241902<br />
</source><br />
<br />
Replaying the recording like this will take you to the point where the Postgres backend prints the log message at the end of executing the example query -- you will get a gdb debug server (rr implements a gdb backend) and an interactive gdb session. This isn't precisely the point of execution that will be of interest to you, but it's close enough. You can easily set a breakpoint at the precise function you happen to be interested in, and then [https://sourceware.org/gdb/current/onlinedocs/gdb/Reverse-Execution.html <code>reverse-continue</code>] to get there by going backwards.<br />
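<br />
A sketch of that workflow, with output abbreviated (the function is just a hypothetical stand-in for whatever you're actually interested in):<br />
<br />
<pre><br />
(rr) break heap_insert<br />
Breakpoint 2 at ... file heapam.c ...<br />
(rr) reverse-continue<br />
</pre><br />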
<br />
You can also find the point where a particular backend starts by using the fork option instead. So for the PID 1786705, that would look like:<br />
<br />
<source lang="bash"><br />
$ rr replay -M -f 1786705<br />
</source><br />
<br />
(Don't try to use the similar <code>-p</code> option, since that starts a debug server when the pid has been <code>exec</code>'d.)<br />
<br />
Note that saving the output of a recording using standard tools like "tee" seems to have some issues [https://github.com/mozilla/rr/issues/91]. It may be helpful to get log output (complete with these event numbers) by doing an "autopilot" replay, like this:<br />
<br />
<source lang="bash"><br />
$ rr replay -M -a &> rr.log<br />
</source><br />
<br />
You now have a log file that can be searched for a good event number, as a starting point. This may be a practical necessity when running "make installcheck" or a custom test suite, since there might be megabytes of log output. You usually don't need to bother generating logs in this way, though. An autopilot replay might take a few minutes, since rr must replay everything that was recorded, and replay does not run at full speed.<br />
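<br />
For example, to locate a candidate event number for the <code>CREATE TYPE</code> statement shown earlier, a plain grep of the log is enough:<br />
<br />
<source lang="bash"><br />
$ grep -n 'CREATE TYPE test_type_empty' rr.log<br />
</source><br />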
<br />
=== Jumping back and forth through a recording using GDB commands ===<br />
<br />
Once you have a rough idea of where and when a bug manifests itself in your rr recording, you'll need to actually debug the issue using gdb. Often the natural approach is to jump back and forth through the recording to track the issue down in whatever backend is known to be misbehaving.<br />
<br />
You can check the current event number once connected to gdb using gdb's "when" command, which can be useful when determining which point of execution you've reached relative to the high level output from "make check" (assuming the <code>-M</code> option was used to get event numbers there):<br />
<br />
<pre><br />
(rr) when<br />
Current event: 379377<br />
</pre><br />
<br />
Since event numbers are shared across processes/threads, which are always executed serially during recording, event numbers are a generic way of reasoning about how far along the recording is, <b>within and across processes</b>. We are not limited to attaching our debugger to processes that happen to be Postgres backends.<br />
<br />
rr also supports gdb's <code>checkpoint</code>, <code>restart</code> and <code>delete</code> checkpoint commands; see [https://sourceware.org/gdb/onlinedocs/gdb/Checkpoint_002fRestart.html#Checkpoint_002fRestart the relevant section of the GDB docs]. These are useful because they allow gdb to track interesting points in execution directly, at a finer granularity than "event number"; a new event number is created when there is a syscall, which might be far too coarse a granularity to be useful when actually zeroing in on a problem in one particular backend/process.<br />
<br />
=== Watchpoints and reverse execution ===<br />
<br />
Because rr supports reverse debugging, watchpoints become much more useful than they are in a conventional forward-only debugging session. Note that you should generally use <code>watch -l expr</code> rather than just <code>watch expr</code>. Without -l, reverse execution is often very slow or apparently buggy, because gdb will try to reevaluate the expression as the program executes through different scopes.<br />
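<br />
A minimal sketch of the idiom (the expression is a hypothetical example -- any lvalue with a stable address works):<br />
<br />
<pre><br />
(rr) watch -l page_header->pd_lower<br />
Hardware watchpoint 2: -location page_header->pd_lower<br />
(rr) reverse-continue<br />
</pre><br />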
<br />
=== Debugging tap tests ===<br />
<br />
rr really shines when debugging things like tap tests, where there is complex scaffolding that may run multiple Postgres servers. You can run an entire "rr record make check", without having to worry about how that scaffolding works. Once you have useful PIDs (or event numbers) to work off of, it won't take too long to get an interactive debugging session in the backend of interest. You could get a PID for a backend of interest from the logs that appear in the <code>./tmp_check/log</code> directory once you're done with recording "make check" execution. From there, you can start "rr replay" by passing the relevant PID as the <code>-f</code> argument.<br />
<br />
Example replay of a "make check" session:<br />
<br />
<pre><br />
$ rr replay -M -f 2247718<br />
[rr 2246854 304]make -C ../../../src/backend generated-headers<br />
[rr 2246855 629]make[1]: Entering directory '/code/postgresql/patch/build/src/backend'<br />
[rr 2246855 631]make -C catalog distprep generated-header-symlinks<br />
[rr 2246856 984]make[2]: Entering directory '/code/postgresql/patch/build/src/backend/catalog'<br />
<br />
*** SNIP -- Remaining "make check" output omitted for brevity ***<br />
<br />
--------------------------------------------------<br />
---> Reached target process 2247718 at event 379377.<br />
--------------------------------------------------<br />
Reading symbols from /usr/bin/../lib/rr/librrpreload.so...<br />
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...<br />
Reading symbols from /usr/lib/debug/.build-id/0b/4031a3ab06ec61be1546960b4d1dad979d15ce.debug...<br />
<br />
*** SNIP ***<br />
<br />
(No debugging symbols found in /usr/lib/x86_64-linux-gnu/libicudata.so.66)<br />
Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...<br />
Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_files-2.31.so...<br />
0x0000000070000002 in ?? ()<br />
(rr) bt<br />
#0 0x0000000070000002 in ?? ()<br />
#1 0x00007f0d2c25c3b6 in _raw_syscall () at raw_syscall.S:120<br />
#2 0x00007f0d2c2582ff in traced_raw_syscall (call=call@entry=0x681fffa0) at syscallbuf.c:229<br />
#3 0x00007f0d2c259978 in sys_fcntl (call=<optimized out>) at syscallbuf.c:1291<br />
#4 syscall_hook_internal (call=0x681fffa0) at syscallbuf.c:2855<br />
#5 syscall_hook (call=0x681fffa0) at syscallbuf.c:2987<br />
#6 0x00007f0d2c2581da in _syscall_hook_trampoline () at syscall_hook.S:282<br />
#7 0x00007f0d2c25820a in __morestack () at syscall_hook.S:417<br />
#8 0x00007f0d2c258225 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at syscall_hook.S:428<br />
#9 0x00007f0d2b5a9f15 in arch_fork (ctid=0x7f0d297bee50) at arch-fork.h:49<br />
#10 __libc_fork () at fork.c:76<br />
#11 0x00005620ae898e53 in fork_process () at fork_process.c:62<br />
#12 0x00005620ae8aab39 in BackendStartup (port=0x5620b0c1f600) at postmaster.c:4187<br />
#13 0x00005620ae8a6d29 in ServerLoop () at postmaster.c:1727<br />
#14 0x00005620ae8a64c2 in PostmasterMain (argc=4, argv=0x5620b0bf19e0) at postmaster.c:1400<br />
#15 0x00005620ae7a8247 in main (argc=4, argv=0x5620b0bf19e0) at main.c:210<br />
</pre><br />
<br />
=== Debugging race conditions ===<br />
<br />
rr can be used to [https://postgr.es/m/CAH2-WznTb6-0fjW4WPzNQh4mFvBH86J7bqZpNqteVUzo8p=6Hg@mail.gmail.com isolate hard to reproduce race condition bugs]. The single threaded nature of rr recording/execution seems to make it harder to reproduce bugs involving concurrent execution. However, using rr's [https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html chaos mode] option (by using the <code>-h</code> argument with rr record) seems to increase the odds of successfully reproducing a problem. It might still take a few attempts, but you only have to get lucky once.<br />
<br />
=== Packing a recording ===<br />
<br />
rr pack can be used to save a recording in a fairly stable format -- it copies the needed files into the trace:<br />
<br />
<source lang="bash"><br />
$ rr pack<br />
</source><br />
<br />
This could be useful if you wanted to save a recording for more than a day or two. Because every single detail of the recording (e.g. pointers, PIDs) is stable, you can treat a recording as a totally self-contained artifact.<br />
<br />
=== rr resources ===<br />
<br />
[https://github.com/mozilla/rr/wiki/Usage Usage - rr wiki]<br />
<br />
[https://github.com/mozilla/rr/wiki/Debugging-protips Debugging protips - rr wiki] <br />
<br />
<br />
[[Category:Operating system]]</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD&diff=37581Getting a stack trace of a running PostgreSQL backend on Linux/BSD2023-02-19T00:42:44Z<p>Pgeoghegan: Recommend the use of GDB checkpoints</p>
<hr />
<div>[[Generating a stack trace of a PostgreSQL backend|Up to parent]]<br />
<br />
== Linux and BSD ==<br />
<br />
Linux and BSD systems generally use the [http://gcc.gnu.org/ GNU compiler collection] and the [http://www.gnu.org/software/gdb/ GNU Debugger] ("gdb"). It's pretty trivial to get a stack trace of a process.<br />
<br />
(If you want more than just a stack trace, take a look at the [[Developer FAQ]] which covers interactive debugging).<br />
<br />
=== Installing External symbols ===<br />
<br />
(BSD users who installed from ports can skip this)<br />
<br />
On many Linux systems, debugging info is separated out from program binaries and stored separately. It's often not installed when you install a package, so if you want to debug the program (say, get a stack trace) you will need to install debug info packages. Unfortunately, the names of these packages vary depending on your distro, as does the procedure for installing them.<br />
<br />
Some generic instructions (unrelated to PostgreSQL) are maintained on the GNOME Wiki [http://live.gnome.org/GettingTraces/DistroSpecificInstructions here].<br />
<br />
==== On Debian ====<br />
<br />
http://wiki.debian.org/HowToGetABacktrace<br />
<br />
Debian Squeeze (6.x) users will also need to install gdb 7.3 from backports, as the gdb shipped in Squeeze doesn't understand the PIE executables used in newer PostgreSQL builds.<br />
<br />
==== On Ubuntu ====<br />
<br />
First, follow the instructions on the Ubuntu wiki entry [https://wiki.edubuntu.org/DebuggingProgramCrash DebuggingProgramCrash]. <br />
<br />
Once you've finished enabling the use of debug info packages as described, you will need to use the <code>list-dbgsym-packages.sh</code> script linked to on that wiki article to get a list of debug packages you need. Installing the debug package for postgresql alone is <i>not</i> sufficient. <br />
<br />
After following the instructions on the Ubuntu wiki, download the script to your desktop, open a terminal, and run:<br/><br />
<pre><br />
$ sudo apt-get install $(sudo bash Desktop/list-dbgsym-packages.sh -t -p $(pidof -s postgres))<br />
</pre><br />
<br />
==== On Fedora ====<br />
<br />
All Fedora versions: [https://fedoraproject.org/wiki/StackTraces#debuginfo FedoraProject.org - StackTraces]<br />
<br />
==== Other distros ====<br />
<br />
In general, you need to install at least the debug symbol packages for the PostgreSQL server and client as well as any common package that may exist, and the debug symbol package for libc. It's a good idea to add debug symbols for the other libraries PostgreSQL uses in case the problem you're having arises in or touches on one of those libraries.<br />
<br />
=== Collecting a stack trace ===<br />
<br />
==== How to tell if a stack trace is any good ====<br />
<br />
Read this section and keep it in mind as you collect information using the instructions below. Making sure the information you collect is actually useful will save you, and everybody else, time and hassle.<br />
<br />
It is vitally important to have debugging symbols available to get a useful stack trace. If you do not have the required symbols installed, backtraces will contain lots of entries like this:<br />
<br />
<pre><br />
#1 0x00686a3d in ?? ()<br />
#2 0x00d3d406 in ?? ()<br />
#3 0x00bf0ba4 in ?? ()<br />
#4 0x00d3663b in ?? ()<br />
#5 0x00d39782 in ?? ()<br />
</pre><br />
<br />
... which are completely useless for debugging without access to your system (and almost useless with access). If you see results like the above, you need to install debugging symbol packages, or even re-build postgresql with debugging enabled. <b>Do not bother collecting such backtraces, they are not useful.</b><br />
<br />
Sometimes you'll get backtraces that contain just the function name and the executable it's within, not source code file names and line numbers or parameters. Such output will have lines like this:<br />
<br />
<pre><br />
#11 0x00d3afbe in PostmasterMain () from /usr/lib/postgresql/8.4/bin/postgres<br />
</pre><br />
<br />
This isn't ideal, but is a lot better than nothing. Installing debug information packages should give an even more detailed stack trace with line number and argument information, like this:<br />
<br />
<pre><br />
#9 0xb758d97e in PostmasterMain (argc=5, argv=0xb813a0e8) at postmaster.c:1040<br />
</pre><br />
<br />
... which is the most useful for tracking down your problem. Note the reference to a source file and line number instead of just an executable name.<br />
<br />
==== Identifying the backend to connect to ====<br />
<br />
You need to know the process ID of the postgresql backend to connect to. If you're interested in a backend that's using lots of CPU it might show up in <code>top</code>. If you have a current connection to the backend you're interested in, use <code>select pg_backend_pid()</code> to get its process ID. Otherwise, the <code>pg_catalog.pg_stat_activity</code> and/or <code>pg_catalog.pg_locks</code> views may be useful in identifying the backend of interest; see the "procpid" column in those views (renamed to just "pid" in PostgreSQL 9.2 and later).<br />
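<br />
For instance, from a psql session whose backend you want to trace (the PID shown is just an example):<br />
<br />
<pre><br />
postgres=# SELECT pg_backend_pid();<br />
 pg_backend_pid<br />
----------------<br />
          12912<br />
(1 row)<br />
</pre><br />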
<br />
==== Attaching gdb to the backend ====<br />
<br />
Once you know the process ID to connect to, run:<br />
<br />
<pre><br />
sudo gdb -p pid<br />
</pre><br />
<br />
where "pid" is the process ID of the backend. GDB will pause the execution of the process you specified and drop you into interactive mode (the <code>(gdb)</code> prompt) after showing the call the backend is currently running, eg:<br />
<br />
<pre><br />
0xb7c73424 in __kernel_vsyscall ()<br />
(gdb) <br />
</pre><br />
<br />
You'll want to tell gdb to save a log of the session to a file, so at the gdb prompt enter:<br />
<br />
<pre><br />
(gdb) set pagination off<br />
(gdb) set logging file debuglog.txt<br />
(gdb) set logging on<br />
</pre><br />
<br />
gdb is now saving all input and output to a file, <code>debuglog.txt</code>, in the directory in which you started gdb.<br />
<br />
At this point execution of the backend is still paused. It can even hold up other backends, so I recommend that you tell it to resume executing normally with the "cont" command:<br />
<br />
<pre><br />
(gdb) cont<br />
Continuing.<br />
</pre><br />
<br />
The backend is now running normally, as if gdb wasn't connected to it.<br />
<br />
==== Getting the trace ====<br />
<br />
OK, with gdb connected you're ready to get a useful stack trace.<br />
<br />
In addition to the instructions below, you can find some useful tips about using gdb with postgresql backends on the [[Developer_FAQ#What_debugging_features_are_available.3F|Developer FAQ]].<br />
<br />
==== Getting representative traces from a running backend ====<br />
<br />
If you're investigating a backend that's taking far too long to execute a query, using too much CPU, or apparently stuck in an infinite loop, you'll want to <i>repeatedly</i> interrupt its execution, get a stack trace, and let it resume executing. A collection of several stack traces gives a much better idea of where it's spending its time than a single one.<br />
<br />
You interrupt the backend and get back to the gdb command line with ^C (control-C). Once at the gdb command line, you use the "bt" command to get a backtrace, then the "cont" command to resume normal backend execution.<br />
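<br />
An abbreviated transcript of one such cycle might look like this:<br />
<br />
<pre><br />
(gdb) cont<br />
Continuing.<br />
^C<br />
Program received signal SIGINT, Interrupt.<br />
0xb7c73424 in __kernel_vsyscall ()<br />
(gdb) bt<br />
*** SNIP -- backtrace output ***<br />
(gdb) cont<br />
Continuing.<br />
</pre><br />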
<br />
Once you've collected a few backtraces, detach then exit gdb at the gdb interactive prompt:<br />
<br />
<pre><br />
(gdb) detach<br />
Detaching from program: /usr/lib/postgresql/8.3/bin/postgres, process 12912<br />
(gdb) quit<br />
user@host:~$<br />
</pre><br />
<br />
An alternative approach is to use the <code>gcore</code> program to save a series of core dumps of the running program without disrupting its execution. Those core dumps may then be examined at your leisure, giving you time to get more than just a backtrace because you're not holding up the backend's execution while you think and type.<br />
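<br />
For example (assuming gcore is installed, which usually comes with the gdb package; the PID and path are illustrative):<br />
<br />
<pre><br />
$ sudo gcore -o /tmp/pgcore 12912<br />
Saved corefile /tmp/pgcore.12912<br />
</pre><br />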
<br />
==== Getting a trace from the point of an error report ====<br />
<br />
If you are trying to find out the cause of an unexpected error, the most useful thing to do is to set a breakpoint at '''errfinish''' before you let the backend continue:<br />
<br />
<pre><br />
(gdb) b errfinish<br />
Breakpoint 1 at 0x80ced0: file elog.c, line 414.<br />
(gdb) cont<br />
Continuing.<br />
</pre><br />
<br />
Now, in your connected psql session, run whatever query is needed to provoke the error. When it happens, the backend will stop execution at '''errfinish'''.<br />
Collect your backtrace with '''bt''', then '''quit''' (or, possibly, '''cont''' if you want to do it again).<br />
<br />
A breakpoint at '''errfinish''' will capture generation of not only ERROR reports, but also NOTICE, LOG, and any other message that isn't suppressed by '''client_min_messages'''<br />
or '''log_min_messages'''. You may want to adjust those settings to avoid having to continue through a bunch of unrelated messages.<br />
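<br />
For example, in the psql session used to provoke the error:<br />
<br />
<pre><br />
postgres=# SET client_min_messages = error;<br />
SET<br />
</pre><br />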
<br />
==== Getting a trace from a reproducibly crashing backend ====<br />
<br />
GDB will automatically interrupt the execution of a program if it detects a crash. So, once you've attached gdb to the backend you expect to crash, just let it continue execution as normal and do whatever is needed to make the backend crash.<br />
<br />
gdb will drop you into interactive mode as the backend crashes. At the <code>gdb</code> prompt you can enter the <code>bt</code> command to get a stack trace of the crash, then <code>cont</code> to continue execution. When gdb reports the process has exited, use the <code>quit</code> command.<br />
<br />
Alternately, you can collect a core file as explained below, but it's probably more hassle than it's worth if you know which backend to attach gdb to before it crashes.<br />
<br />
==== Getting a trace from a randomly crashing backend ====<br />
<br />
It's a lot harder to get a stack trace from a backend that's crashing when you don't know what causes the crash, or which backend will crash when. For this, you generally need to enable the generation of core files: debuggable dumps of a program's state that the operating system generates when the program crashes.<br />
<br />
===== Enabling core dumps =====<br />
<br />
[http://www.cyberciti.biz/tips/linux-core-dumps.html This article provides a useful primer on core dumps on Linux].<br />
<br />
On a Linux system you can check to see if core file generation is enabled for a process by examining /proc/$pid/limits, where $pid is the process ID of interest. "Max core file size" should be non-zero.<br />
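<br />
For example (assuming pidof is available; the output shown is illustrative):<br />
<br />
<pre><br />
$ grep 'core file size' /proc/$(pidof -s postgres)/limits<br />
Max core file size        unlimited            unlimited            bytes<br />
</pre><br />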
<br />
Generally, adding "ulimit -c unlimited" to the top of the PostgreSQL startup script and restarting postgresql is sufficient to enable core dump collection. Make sure you have plenty of free space in your PostgreSQL data directory, because that's where the core dumps will be written and they can be fairly big due to Pg's use of shared memory. It may be useful to <b>temporarily reduce the size of shared_buffers</b> within postgresql.conf. This avoids core dumps that make the system unresponsive for minutes at a time, which can happen when shared_buffers is more than a few gigabytes. Reducing shared_buffers significantly will usually not make the server intolerably slow, since PostgreSQL will make increased use of the filesystem cache.<br />
<br />
On a Linux system it's also worth changing the file name format used for core dumps so that core dumps don't overwrite each other. The <code>/proc/sys/kernel/core_pattern</code> file controls this. I suggest <code>core.%p.sig%s.%ts</code>, which will record the process's PID, the signal that killed it, and the timestamp at which the core was generated. See <code>man 5 core</code>. To apply the setting immediately, run <code>echo core.%p.sig%s.%ts | sudo tee /proc/sys/kernel/core_pattern</code> (note that this change does not persist across reboots).<br />
<br />
You can test whether core dumps are enabled by starting a `psql' session, finding the backend pid for it using the instructions given above, then killing it with "kill -ABRT pidofbackend" (where pidofbackend is the PID of the postgres backend, NOT the pid of psql). You should see a core file appear in your postgresql data directory.<br />
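<br />
For example, after finding a disposable backend's PID as described above (the PID and data directory here match the Ubuntu example used later on):<br />
<br />
<pre><br />
$ kill -ABRT 10780<br />
$ ls /var/lib/postgresql/8.4/main/core.*<br />
/var/lib/postgresql/8.4/main/core.10780.sig6.1271644870s<br />
</pre><br />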
<br />
===== Debugging the core dump =====<br />
<br />
Once you've enabled core dumps, you need to wait until you see a backend crash. A core dump will be generated by the operating system, and you'll be able to attach gdb to it to collect a stack trace or other information. <br />
<br />
You need to tell gdb what executable file generated the core if you want to get useful backtraces and other debugging information. To do this, just specify the postgres executable path then the core file path when invoking gdb, as shown below. If you do not know the location of the postgres executable, you can get it by examining /proc/$pid/exe for a running postgres instance. For example:<br />
<br />
<pre><br />
$ for f in `pgrep postgres`; do ls -l /proc/$f/exe; done<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:30 /proc/10621/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11052/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11053/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11054/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11055/exe -> /usr/lib/postgresql/8.4/bin/postgres<br />
</pre><br />
<br />
... we can see from the above that the postgres executable on my (Ubuntu) system is <code>/usr/lib/postgresql/8.4/bin/postgres</code>.<br />
<br />
Once you know the executable path and the core file location, just run gdb with those as arguments, ie <code>gdb -q /path/to/postgres /path/to/core</code>. Now you can debug it as if it was a normal running postgres, as discussed in the sections above.<br />
<br />
===== Debugging the core dump - example =====<br />
<br />
For example, having just forced a postgres backend to crash with <code>kill -ABRT</code>, I have a core file named <code>core.10780.sig6.1271644870s</code> in <code>/var/lib/postgresql/8.4/main</code>, which is the data directory on my Ubuntu system. I've used /proc to find out that the executable for postgres on my system is <code>/usr/lib/postgresql/8.4/bin/postgres</code>.<br />
<br />
It's now easy to run GDB against it and request a backtrace:<br />
<br />
<pre><br />
$ sudo -u postgres gdb -q -c /var/lib/postgresql/8.4/main/core.10780.sig6.1271644870s /usr/lib/postgresql/8.4/bin/postgres<br />
Core was generated by `postgres: wal writer process '.<br />
Program terminated with signal 6, Aborted.<br />
#0 0x00a65422 in __kernel_vsyscall ()<br />
(gdb) bt<br />
#0 0x00a65422 in __kernel_vsyscall ()<br />
#1 0x00686a3d in ___newselect_nocancel () from /lib/tls/i686/cmov/libc.so.6<br />
#2 0x00e68d25 in pg_usleep () from /usr/lib/postgresql/8.4/bin/postgres<br />
#3 0x00d3d406 in WalWriterMain () from /usr/lib/postgresql/8.4/bin/postgres<br />
#4 0x00bf0ba4 in AuxiliaryProcessMain () from /usr/lib/postgresql/8.4/bin/postgres<br />
#5 0x00d3663b in ?? () from /usr/lib/postgresql/8.4/bin/postgres<br />
#6 0x00d39782 in ?? () from /usr/lib/postgresql/8.4/bin/postgres<br />
#7 <signal handler called><br />
#8 0x00a65422 in __kernel_vsyscall ()<br />
#9 0x00686a3d in ___newselect_nocancel () from /lib/tls/i686/cmov/libc.so.6<br />
#10 0x00d37bee in ?? () from /usr/lib/postgresql/8.4/bin/postgres<br />
#11 0x00d3afbe in PostmasterMain () from /usr/lib/postgresql/8.4/bin/postgres<br />
#12 0x00cdc0dc in main () from /usr/lib/postgresql/8.4/bin/postgres<br />
</pre><br />
<br />
This example shows a stack trace that does not include function arguments. There may or may not be function arguments on your system, depending on obscure details largely outside your control, like whether or not Postgres was originally built to omit frame pointers, the DWARF version in use, etc. In general, the situation with getting backtraces on mainstream Linux platforms has improved significantly since this example backtrace was originally added. These days, it is often <b>better to use "bt full" instead of "bt"</b>, since this can provide even more information (the values of local/stack variables during the crash). In general, the more information that you can provide for debugging, the better.<br />
<br />
If you don't have proper symbols installed, specify the wrong executable to gdb, or fail to specify an executable at all, you'll see a <b>useless</b> backtrace like the following one:<br />
<br />
<pre><br />
$ sudo -u postgres gdb -q -c /var/lib/postgresql/8.4/main/core.10780.sig6.1271644870s <br />
Core was generated by `postgres: wal writer process '.<br />
Program terminated with signal 6, Aborted.<br />
#0 0x00a65422 in __kernel_vsyscall ()<br />
(gdb) bt<br />
#0 0x00a65422 in __kernel_vsyscall ()<br />
#1 0x00686a3d in ?? ()<br />
#2 0x00d3d406 in ?? ()<br />
#3 0x00bf0ba4 in ?? ()<br />
#4 0x00d3663b in ?? ()<br />
#5 0x00d39782 in ?? ()<br />
#6 <signal handler called><br />
#7 0x00a65422 in __kernel_vsyscall ()<br />
#8 0x00686a3d in ?? ()<br />
#9 0x00d3afbe in ?? ()<br />
#10 0x00cdc0dc in ?? ()<br />
#11 0x005d7b56 in ?? ()<br />
#12 0x00b8fad1 in ?? ()<br />
<br />
</pre><br />
<br />
If you get something like that, don't bother sending it in. If you didn't just get the executable path wrong, you'll probably need to install debugging symbols for PostgreSQL (or even re-build PostgreSQL with debugging enabled) and try again.<br />
<br />
=== Tracing problems when creating a cluster ===<br />
<br />
If you're running into a crash while trying to create a database cluster using ''initdb'', that may leave behind a core dump that you can analyze with gdb as described above. This should be the case if there's an assertion failure for example. You will probably need to give the ''--no-clean'' option to ''initdb'' to keep it from deleting the new data directory and the core file along with it.<br />
<br />
Another technique for finding bootstrap-time bugs is to manually feed the bootstrapping commands into bootstrap mode or single-user mode, with a data directory left over from ''initdb --no-clean''. This can help if there has been no PANIC that leaves a core dump, but just a FATAL or ERROR, for example. It's easy to attach GDB to such a backend.<br />
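<br />
A sketch of that approach, assuming a hypothetical /tmp/datadir location (single-user mode takes a database name as its final argument):<br />
<br />
<pre><br />
$ initdb --no-clean -D /tmp/datadir        # fails, but leaves /tmp/datadir behind<br />
$ postgres --single -D /tmp/datadir postgres<br />
</pre><br />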
<br />
Also, try creating the data directory using initdb from unpatched master, then triggering the crash with the patched backend, rather than during initdb.<br />
<br />
== Dumping a page image from within GDB ==<br />
<br />
It is sometimes useful to post a file containing a [https://www.postgresql.org/docs/current/storage-page-layout.html raw page image] when reporting a problem on a community mailing list. Both tables and indexes consist of 8KiB-sized blocks/pages, which can be thought of as the fundamental unit of data storage. This is particularly likely to be helpful when the integrity of the data is suspect, such as when an assertion fails due to a bug that corrupts data. GDB makes it easy to do this from either an interactive session or a core dump (though core dumps may have [https://www.postgresql.org/message-id/20200210195659.vx6slnxmoymp5yyo%40alap3.anarazel.de issues with dumping shared memory]).<br />
<br />
Example:<br />
<br />
<pre><br />
Breakpoint 1, _bt_split (rel=0x7f555b6f3460, itup_key=0x55d03a745d40, buf=232, cbuf=0, firstright=366, newitemoff=216, newitemsz=16, newitem=0x55d03a745d18, newitemonleft=true) at nbtinsert.c:1205<br />
1205 {<br />
(gdb) n<br />
1215 Buffer sbuf = InvalidBuffer;<br />
(gdb)<br />
1216 Page spage = NULL;<br />
(gdb)<br />
1217 BTPageOpaque sopaque = NULL;<br />
(gdb)<br />
1227 int indnatts = IndexRelationGetNumberOfAttributes(rel);<br />
(gdb)<br />
1228 int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);<br />
(gdb)<br />
1231 rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);<br />
(gdb)<br />
1244 origpage = BufferGetPage(buf);<br />
(gdb)<br />
1245 leftpage = PageGetTempPage(origpage);<br />
(gdb)<br />
1246 rightpage = BufferGetPage(rbuf);<br />
(gdb)<br />
1248 origpagenumber = BufferGetBlockNumber(buf);<br />
(gdb)<br />
1249 rightpagenumber = BufferGetBlockNumber(rbuf);<br />
(gdb) dump binary memory /tmp/dump_block.page origpage (origpage + 8192)<br />
</pre><br />
<br />
The contents of the page "origpage" are now dumped to the file "/tmp/dump_block.page", which will be precisely 8192 bytes in size. This works wherever the "Page" C type appears (a typedef defined in bufpage.h -- an unadorned "Page" is actually a char pointer). A "Page" variable is a raw pointer to a page image, typically the authoritative/current page stored in shared_buffers.<br />
<br />
=== pg_hexedit ===<br />
<br />
Note also that the Postgres hex editor tool [https://github.com/petergeoghegan/pg_hexedit pg_hexedit] can quickly [https://github.com/petergeoghegan/pg_hexedit#using-pg_hexedit-while-debugging-postgres-with-gdb visualize page images within GDB] with intuitive tags and annotations. It might be easier to use pg_hexedit when it isn't initially clear which page images are of interest, or when multiple images of the same block need to be captured over time as a test case runs.<br />
<br />
=== contrib/pageinspect page dump ===<br />
<br />
When it isn't convenient to use GDB, and when it isn't necessary to get a page image that is exactly current at the time of a crash, it is possible to dump an arbitrary page to a file in a more lightweight fashion using [https://www.postgresql.org/docs/current/pageinspect.html contrib/pageinspect]. For example, the following interactive shell session dumps the current page image in block 42 for the index 'pgbench_pkey':<br />
<br />
<pre><br />
$ psql -c "create extension pageinspect"<br />
CREATE EXTENSION<br />
$ psql -XAtc "SELECT encode(get_raw_page('pgbench_pkey', 42),'base64')" | base64 -d > dump_block_42.page<br />
</pre><br />
<br />
This assumes that it is possible to connect as a superuser using psql, and that the base64 program is in the user's $PATH. The GNU coreutils package generally includes base64, so it will already be available on most Linux installations. Note that it may be necessary to install an operating system package named "postgresql-contrib" or similar before the pageinspect extension will be available to install.<br />
<br />
Typically, the easiest way of following this procedure is to become the postgres operating system user first (e.g., through "su postgres").<br />
<br />
== Starting Postgres under GDB ==<br />
<br />
Debugging multi-process applications like PostgreSQL has historically been very painful with GDB. Fortunately, this improved greatly with GDB 7.x and later releases, which introduced "inferiors" (GDB's term for multiple debugged processes).<br />
<br />
NB! This is still quite fragile, so don't expect to be able to do this in production.<br />
<br />
<source lang="bash"><br />
# Stop server<br />
pg_ctl -D /path/to/data stop -m fast<br />
# Launch postgres via gdb<br />
gdb --args postgres -D /path/to/data<br />
</source><br />
<br />
Now, in the GDB shell, use these commands to set up an environment:<br />
<br />
<source lang="bash"><br />
# We have scroll bars in the year 2012!<br />
set pagination off<br />
# Attach to both parent and child on fork<br />
set detach-on-fork off<br />
# Stop/resume all processes<br />
set schedule-multiple on<br />
<br />
# Usually don't care about these signals<br />
handle SIGUSR1 noprint nostop<br />
handle SIGUSR2 noprint nostop<br />
<br />
# Make GDB's expression evaluation work with most common Postgres Macros (works with Linux).<br />
# Per https://www.postgresql.org/message-id/20130731021434.GE19053@alap2.anarazel.de,<br />
# have many Postgres macros work if these are defined (useful for TOAST stuff,<br />
# varlena stuff, etc):<br />
macro define __builtin_offsetof(T, F) ((int) &(((T *) 0)->F))<br />
macro define __extension__<br />
<br />
# Ugly hack so we don't break on process exit<br />
python gdb.events.exited.connect(lambda x: [gdb.execute('inferior 1'), gdb.post_event(lambda: gdb.execute('continue'))])<br />
<br />
# Phew! Run it.<br />
run<br />
</source><br />
<br />
To get a list of processes, run <code>info inferior</code>. To switch to another process, run <code>inferior <i>NUM</i></code>.<br />
<br />
== Recording Postgres using rr Record and Replay Framework ==<br />
<br />
PostgreSQL 13 can be debugged using [https://rr-project.org the rr debugging recorder]. This section describes some useful workflows for using rr to debug Postgres. It is primarily written for Postgres hackers, though rr could also be used when reporting a bug.<br />
<br />
=== Version compatibility ===<br />
<br />
Commit {{PgCommitURL|fc3f4453a2bc95549682e23600b22e658cb2d6d7}} resolved an issue that made it hard to use rr with earlier Postgres versions, so there might be problems on those versions. Also, earlier versions of rr distributed with older/LTS Linux OS versions might not have support for syscalls that are used by Postgres, such as <code>sync_file_range()</code>. All of these issues probably have fairly straightforward workarounds (e.g. you could start Postgres with <code>--wal_writer_flush_after=0 --backend_flush_after=0 --bgwriter_flush_after=0 --checkpoint_flush_after=0</code>).<br />
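<br />
A sketch of what such a workaround might look like when recording (installation paths are hypothetical):<br />
<br />
<source lang="bash"><br />
rr record -M /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data \<br />
  --wal_writer_flush_after=0 --backend_flush_after=0 \<br />
  --bgwriter_flush_after=0 --checkpoint_flush_after=0<br />
</source><br />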
<br />
=== Postgres settings ===<br />
<br />
A script that records a postgres session using rr might consist of a snippet like the following:<br />
<br />
<source lang="bash"><br />
rr record -M /code/postgresql/$BRANCH/install/bin/postgres \<br />
-D /code/postgresql/$BRANCH/data \<br />
--log_line_prefix="%m %p " \<br />
--effective_cache_size=1GB \<br />
--random_page_cost=4.0 \<br />
--work_mem=4MB \<br />
--maintenance_work_mem=64MB \<br />
--fsync=off \<br />
--log_statement=all \<br />
--log_min_messages=DEBUG5 \<br />
--max_connections=50 \<br />
--shared_buffers=32MB<br />
</source><br />
<br />
Most of the details here are somewhat arbitrary. The general idea is to make log output as verbose as possible, and to keep the amount of memory used by the server low.<br />
<br />
It is quite practical to run "make installcheck" against the server when Postgres is run with "rr record", recording the entire execution. This is not much slower than just running the tests against a regular debug build of Postgres. It's still much faster than Valgrind, for example. Replaying the recording seems to be where having a high-end machine helps a lot.<br />
<br />
=== Event numbers in the log ===<br />
<br />
Once the tests are done, stop Postgres in the usual way (e.g. Ctrl + C). The recording is saved to the <code>$HOME/.local/share/rr/</code> directory on most Linux distros. rr creates a directory for each distinct recording in this parent directory. rr also maintains a symlink (<code>latest-trace</code>) that points to the latest recording directory, which is often used when replaying a recording. Be careful to avoid accidentally leaving too many recordings around. They can be rather large.<br />
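<br />
Standard shell tools are enough to keep an eye on disk usage here, e.g.:<br />
<br />
<source lang="bash"><br />
$ du -sh ~/.local/share/rr/*             # how much space each recording uses<br />
$ ls -l ~/.local/share/rr/latest-trace   # which recording is the most recent<br />
</source><br />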
<br />
The record/Postgres terminal has output that looks like this (when the example "rr record" recipe is used):<br />
<br />
<pre><br />
[rr 1786705 1241867]2020-04-04 21:55:05.018 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63992/1/2<br />
[rr 1786705 1241898]2020-04-04 21:55:05.019 PDT 1786705 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0<br />
[rr 1786705 1241902]2020-04-04 21:55:05.019 PDT 1786705 LOG: statement: CREATE TYPE test_type_empty AS ();<br />
[rr 1786705 1241906]2020-04-04 21:55:05.020 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63993/1/1<br />
[rr 1786705 1241936]2020-04-04 21:55:05.020 PDT 1786705 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0<br />
[rr 1786705 1241940]2020-04-04 21:55:05.020 PDT 1786705 LOG: statement: DROP TYPE test_type_empty;<br />
[rr 1786705 1241944]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: drop auto-cascades to composite type test_type_empty<br />
[rr 1786705 1241948]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: drop auto-cascades to type test_type_empty[]<br />
[rr 1786705 1241952]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: MultiXact: setting OldestMember[2] = 9<br />
[rr 1786705 1241956]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63994/1/3<br />
</pre><br />
<br />
The part of each log line in square brackets comes from rr (since we used <code>-M</code> when recording) -- the first number is a PID, the second an event number. You probably won't care about the PIDs, though, since the event number alone unambiguously identifies a particular "event" in a particular backend (rr recordings are single threaded, even when there are multiple threads or processes). Suppose you want to get to the <code>CREATE TYPE test_type_empty AS ()</code> query -- you can get to the end of the query by replaying the recording with this option:<br />
<br />
<source lang="bash"><br />
$ rr replay -M -g 1241902<br />
</source><br />
<br />
Replaying the recording like this will take you to the point where the Postgres backend prints the log message at the end of executing the example query -- you will get a gdb debug server (rr implements a gdb backend) and an interactive gdb session. This isn't precisely the point of execution that will be of interest to you, but it's close enough. You can easily set a breakpoint at the precise function you happen to be interested in, and then [https://sourceware.org/gdb/current/onlinedocs/gdb/Reverse-Execution.html <code>reverse-continue</code>] to get there by going backwards.<br />
<br />
You can also find the point where a particular backend starts by using the fork option instead. So for the PID 1786705, that would look like:<br />
<br />
<source lang="bash"><br />
$ rr replay -M -f 1786705<br />
</source><br />
<br />
(Don't try to use the similar <code>-p</code> option, since that starts a debug server when the pid has been <code>exec</code>'d.)<br />
<br />
Note that saving the output of a recording using standard tools like "tee" seems to have some issues [https://github.com/mozilla/rr/issues/91]. It may be helpful to get log output (complete with these event numbers) by doing an "autopilot" replay, like this:<br />
<br />
<source lang="bash"><br />
$ rr replay -M -a &> rr.log<br />
</source><br />
<br />
You now have a log file that can be searched for a good event number, as a starting point. This may be a practical necessity when running "make installcheck" or a custom test suite, since there might be megabytes of log output. You usually don't need to bother generating logs in this way, though. An autopilot replay might take a few minutes, since rr must replay everything that was recorded, and replay does not run at full speed.<br />
<br />
=== Jumping back and forth through a recording using GDB commands ===<br />
<br />
Once you have a rough idea of where and when a bug manifests itself in your rr recording, you'll need to actually debug the issue using gdb. Often the natural approach is to jump back and forth through the recording to track the issue down in whatever backend is known to be misbehaving.<br />
<br />
You can check the current event number once connected to gdb using gdb's "when" command, which can be useful when determining which point of execution you've reached relative to the high level output from "make check" (assuming the <code>-M</code> option was used to get event numbers there):<br />
<br />
<pre><br />
(rr) when<br />
Current event: 379377<br />
</pre><br />
<br />
Since event numbers are shared across processes/threads, which are always executed serially during recording, event numbers are a generic way of reasoning about how far along the recording is, <b>within and across processes</b>. We are not limited to attaching our debugger to processes that happen to be Postgres backends.<br />
<br />
rr also supports gdb's <code>checkpoint</code>, <code>restart</code> and <code>delete</code> checkpoint commands; see [https://sourceware.org/gdb/onlinedocs/gdb/Checkpoint_002fRestart.html#Checkpoint_002fRestart the relevant section of the GDB docs]. These are useful because they allow gdb to track interesting points in execution directly, at a finer granularity than "event number"; a new event number is created when there is a syscall, which might be far too coarse a granularity to be useful when actually zeroing in on a problem in one particular backend/process.<br />
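<br />
A minimal sketch of the idiom (output abbreviated):<br />
<br />
<pre><br />
(rr) checkpoint<br />
Checkpoint 1 at ...<br />
(rr) reverse-continue<br />
*** SNIP ***<br />
(rr) restart 1<br />
</pre><br />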
<br />
=== Watchpoints and reverse execution ===<br />
<br />
Because rr supports reverse debugging, watchpoints become much more useful than they are in a conventional forward-only debugging session. Note that you should generally use <code>watch -l expr</code> rather than just <code>watch expr</code>. Without -l, reverse execution is often very slow or apparently buggy, because gdb will try to reevaluate the expression as the program executes through different scopes.<br />
<br />
=== Debugging tap tests ===<br />
<br />
rr really shines when debugging things like tap tests, where there is complex scaffolding that may run multiple Postgres servers. You can run an entire "rr record make check", without having to worry about how that scaffolding works. Once you have useful PIDs (or event numbers) to work off of, it won't take too long to get an interactive debugging session in the backend of interest. You could get a PID for a backend of interest from the logs that appear in the <code>./tmp_check/log</code> directory once you're done with recording "make check" execution. From there, you can start "rr replay" by passing the relevant PID as the <code>-f</code> argument.<br />
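<br />
For example, to fish candidate PIDs out of the server logs after recording (the grep pattern is just an illustration -- adjust it to whatever failure you're chasing):<br />
<br />
<source lang="bash"><br />
$ grep -rn 'TRAP\|PANIC' ./tmp_check/log/<br />
</source><br />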
<br />
Example replay of a "make check" session:<br />
<br />
<pre><br />
$ rr replay -M -f 2247718<br />
[rr 2246854 304]make -C ../../../src/backend generated-headers<br />
[rr 2246855 629]make[1]: Entering directory '/code/postgresql/patch/build/src/backend'<br />
[rr 2246855 631]make -C catalog distprep generated-header-symlinks<br />
[rr 2246856 984]make[2]: Entering directory '/code/postgresql/patch/build/src/backend/catalog'<br />
<br />
*** SNIP -- Remaining "make check" output omitted for brevity ***<br />
<br />
--------------------------------------------------<br />
---> Reached target process 2247718 at event 379377.<br />
--------------------------------------------------<br />
Reading symbols from /usr/bin/../lib/rr/librrpreload.so...<br />
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...<br />
Reading symbols from /usr/lib/debug/.build-id/0b/4031a3ab06ec61be1546960b4d1dad979d15ce.debug...<br />
<br />
*** SNIP ***<br />
<br />
(No debugging symbols found in /usr/lib/x86_64-linux-gnu/libicudata.so.66)<br />
Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...<br />
Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_files-2.31.so...<br />
0x0000000070000002 in ?? ()<br />
(rr) bt<br />
#0 0x0000000070000002 in ?? ()<br />
#1 0x00007f0d2c25c3b6 in _raw_syscall () at raw_syscall.S:120<br />
#2 0x00007f0d2c2582ff in traced_raw_syscall (call=call@entry=0x681fffa0) at syscallbuf.c:229<br />
#3 0x00007f0d2c259978 in sys_fcntl (call=<optimized out>) at syscallbuf.c:1291<br />
#4 syscall_hook_internal (call=0x681fffa0) at syscallbuf.c:2855<br />
#5 syscall_hook (call=0x681fffa0) at syscallbuf.c:2987<br />
#6 0x00007f0d2c2581da in _syscall_hook_trampoline () at syscall_hook.S:282<br />
#7 0x00007f0d2c25820a in __morestack () at syscall_hook.S:417<br />
#8 0x00007f0d2c258225 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at syscall_hook.S:428<br />
#9 0x00007f0d2b5a9f15 in arch_fork (ctid=0x7f0d297bee50) at arch-fork.h:49<br />
#10 __libc_fork () at fork.c:76<br />
#11 0x00005620ae898e53 in fork_process () at fork_process.c:62<br />
#12 0x00005620ae8aab39 in BackendStartup (port=0x5620b0c1f600) at postmaster.c:4187<br />
#13 0x00005620ae8a6d29 in ServerLoop () at postmaster.c:1727<br />
#14 0x00005620ae8a64c2 in PostmasterMain (argc=4, argv=0x5620b0bf19e0) at postmaster.c:1400<br />
#15 0x00005620ae7a8247 in main (argc=4, argv=0x5620b0bf19e0) at main.c:210<br />
</pre><br />
<br />
=== Debugging race conditions ===<br />
<br />
rr can be used to [https://postgr.es/m/CAH2-WznTb6-0fjW4WPzNQh4mFvBH86J7bqZpNqteVUzo8p=6Hg@mail.gmail.com isolate hard to reproduce race condition bugs]. The single threaded nature of rr recording/execution seems to make it harder to reproduce bugs involving concurrent execution. However, using rr's [https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html chaos mode] option (by using the <code>-h</code> argument with rr record) seems to increase the odds of successfully reproducing a problem. It might still take a few attempts, but you only have to get lucky once.<br />
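<br />
A sketch of a retry loop under chaos mode (the test script is a hypothetical placeholder, and this assumes its exit status reflects whether the bug reproduced):<br />
<br />
<source lang="bash"><br />
for i in $(seq 1 50); do<br />
  if ! rr record -h ./run_flaky_test.sh; then<br />
    echo "reproduced on attempt $i; recording saved"<br />
    break<br />
  fi<br />
done<br />
</source><br />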
<br />
=== Packing a recording ===<br />
<br />
rr pack can be used to save a recording in a fairly stable format -- it copies the needed files into the trace:<br />
<br />
<source lang="bash"><br />
$ rr pack<br />
</source><br />
<br />
This could be useful if you wanted to save a recording for more than a day or two. Because every single detail of the recording (e.g. pointers, PIDs) is stable, you can treat a recording as a totally self-contained artifact.<br />
<br />
=== rr resources ===<br />
<br />
[https://github.com/mozilla/rr/wiki/Usage Usage - rr wiki]<br />
<br />
[https://github.com/mozilla/rr/wiki/Debugging-protips Debugging protips - rr wiki] <br />
<br />
<br />
[[Category:Operating system]]</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37552Meson2023-02-10T00:01:43Z<p>Pgeoghegan: /* Test structure */</p>
<hr />
<div>== PostgreSQL devel documentation ==<br />
<br />
See [https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section] of PostgreSQL devel docs.<br />
<br />
== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| ./configure --enable-cassert<br />
|| meson configure|setup -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
==== Build directory ====<br />
<br />
ninja expects to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to tell ninja to change to that directory first, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all test, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
|| "--suite setup" required to get a tmp_install directory; see below<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite setup --suite amcheck<br />
|| "--suite setup" required to get a tmp_install directory; see below<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
When running a specific test suite against a temporary throw away installation, <code>--suite setup</code> should generally be specified. Otherwise the tests could end up running against a stale <code>tmp_install</code> directory, causing general confusion. This [https://postgr.es/m/20230209205605.zo5gfhli22g2kdm2@awork3.anarazel.de workaround] is not required when running tests against an existing server (via the <code>running</code> test setup and variant test suites), since of course the installation directory being tested is whatever directory the external server installation uses.<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later), which we assume here.<br />
<br />
You can list all of the tests from a given suite as follows:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck<br />
ninja: no work to do.<br />
postgresql:amcheck / amcheck/regress<br />
postgresql:amcheck / amcheck/001_verify_heapam<br />
postgresql:amcheck / amcheck/002_cic<br />
postgresql:amcheck / amcheck/003_cic_2pc<br />
</pre><br />
<br />
Note that there are distinct <code>running</code>/installcheck suites for most of the standard setup suites, though not all of the tests actually carry over to the <code>running</code> variant suites, as shown here:<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck-running<br />
ninja: no work to do.<br />
postgresql:amcheck-running / amcheck-running/regress<br />
</pre><br />
<br />
==== Running individual regression test scripts via an installcheck-tests style workflow ====<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson' and that the Postgres source code directory is 'source', with both located side by side in the same parent directory (a layout often used with VPATH builds).<br />
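<br />
The assumed layout is therefore something like the following (purely illustrative; the directory names are just this page's conventions):<br />
<pre><br />
/path/to/postgresql/source           # Postgres source checkout<br />
/path/to/postgresql/build_meson      # meson build directory<br />
/path/to/postgresql/build_autoconf   # autoconf build directory<br />
</pre><br />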
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
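<br />
Afterwards, you can confirm that both tools are on your <code>PATH</code> (these version checks work the same way on any platform):<br />
<pre><br />
meson --version<br />
ninja --version<br />
</pre><br />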
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with<br />
meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37551Meson2023-02-09T23:53:24Z<p>Pgeoghegan: /* Test structure */</p>
<hr />
<div>== PostgreSQL devel documentation ==<br />
<br />
See [https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section] of PostgreSQL devel docs.<br />
<br />
== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
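<br />
Putting a few of these together, a typical assert-enabled debug build might look like the following (an illustrative sketch reusing this page's <code>$builddir</code>/<code>$sourcedir</code> placeholders; any compiler known to work with Postgres can stand in for clang):<br />
<pre><br />
CC=clang meson setup -Dcassert=true -Ddebug=true $builddir $sourcedir<br />
ninja -C $builddir<br />
</pre><br />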
<br />
==== Build directory ====<br />
<br />
ninja needs to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
|| "--suite setup" required to get a tmp_install directory; see below<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite setup --suite amcheck<br />
|| "--suite setup" required to get a tmp_install directory; see below<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
When running a specific test suite against a temporary throwaway installation, <code>--suite setup</code> should generally be specified as well. Otherwise the tests could end up running against a stale <code>tmp_install</code> directory, causing confusing failures. This [https://postgr.es/m/20230209205605.zo5gfhli22g2kdm2@awork3.anarazel.de workaround] is not required when running tests against an existing server (via the <code>running</code> test setup and variant test suites), since the installation being tested is then simply whatever the external server uses.<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later), which we assume here.<br />
<br />
You can list all of the tests from a given suite as follows:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck<br />
ninja: no work to do.<br />
postgresql:amcheck / amcheck/regress<br />
postgresql:amcheck / amcheck/001_verify_heapam<br />
postgresql:amcheck / amcheck/002_cic<br />
postgresql:amcheck / amcheck/003_cic_2pc<br />
</pre><br />
<br />
Note that there are distinct <code>running</code>/installcheck suites for most of the standard setup suites, though not all of the tests actually carry over to the <code>running</code> variant suites, as shown here:<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck-running<br />
ninja: no work to do.<br />
postgresql:amcheck-running / amcheck-running/regress<br />
</pre><br />
<br />
==== Running individual regression test scripts via an installcheck-tests style workflow ====<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson' and that the Postgres source code directory is 'source', with both located side by side in the same parent directory (a layout often used with VPATH builds).<br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with<br />
meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37550Meson2023-02-09T23:51:07Z<p>Pgeoghegan: Explain "--suite setup" issue</p>
<hr />
<div>== PostgreSQL devel documentation ==<br />
<br />
See [https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section] of PostgreSQL devel docs.<br />
<br />
== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
==== Build directory ====<br />
<br />
ninja needs to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
|| "--suite setup" required to get a tmp_install directory; see below<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite setup --suite amcheck<br />
|| "--suite setup" required to get a tmp_install directory; see below<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
When running a specific test suite against a temporary throwaway installation, <code>--suite setup</code> should generally be specified as well. Otherwise the tests could end up running against a stale <code>tmp_install</code> directory, causing confusing failures. This workaround is not required when running tests against an existing server (via the <code>running</code> test setup and variant test suites), since the installation being tested is then simply whatever the external server uses.<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later), which we assume here.<br />
<br />
You can list all of the tests from a given suite as follows:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck<br />
ninja: no work to do.<br />
postgresql:amcheck / amcheck/regress<br />
postgresql:amcheck / amcheck/001_verify_heapam<br />
postgresql:amcheck / amcheck/002_cic<br />
postgresql:amcheck / amcheck/003_cic_2pc<br />
</pre><br />
<br />
Note that there are distinct <code>running</code>/installcheck suites for most of the standard setup suites, though not all of the tests actually carry over to the <code>running</code> variant suites, as shown here:<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck-running<br />
ninja: no work to do.<br />
postgresql:amcheck-running / amcheck-running/regress<br />
</pre><br />
<br />
==== Running individual regression test scripts via an installcheck-tests style workflow ====<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson' and that the Postgres source code directory is 'source', with both located side by side in the same parent directory (a layout often used with VPATH builds).<br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with<br />
meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37549Meson2023-02-09T23:24:44Z<p>Pgeoghegan: Add missing "--suite setup" to test mappings, don't bother spelling out "postgresql:" top-level project name</p>
<hr />
<div>== PostgreSQL devel documentation ==<br />
<br />
See [https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section] of PostgreSQL devel docs.<br />
<br />
== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
==== Build directory ====<br />
<br />
ninja needs to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite setup --suite amcheck<br />
||<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later), which we assume here.<br />
<br />
You can list all of the tests from a given suite as follows:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck<br />
ninja: no work to do.<br />
postgresql:amcheck / amcheck/regress<br />
postgresql:amcheck / amcheck/001_verify_heapam<br />
postgresql:amcheck / amcheck/002_cic<br />
postgresql:amcheck / amcheck/003_cic_2pc<br />
</pre><br />
<br />
Note that there are distinct <code>running</code>/installcheck suites for most of the standard setup suites, though not all of the tests actually carry over to the <code>running</code> variant suites, as shown here:<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck-running<br />
ninja: no work to do.<br />
postgresql:amcheck-running / amcheck-running/regress<br />
</pre><br />
<br />
==== Running individual regression test scripts via an installcheck-tests style workflow ====<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson' and that the Postgres source code directory is 'source', with both located side by side in the same parent directory (a layout often used with VPATH builds).<br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with<br />
meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37548Meson2023-02-09T20:19:27Z<p>Pgeoghegan: /* Test structure */</p>
<hr />
<div>== PostgreSQL devel documentation ==<br />
<br />
See [https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section] of PostgreSQL devel docs.<br />
<br />
== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
==== Build directory ====<br />
<br />
ninja needs to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite postgresql:regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite postgresql:amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later). For example, a simple <code>meson test -v --suite amcheck</code> will work (the <code>postgresql:</code> prefix is not strictly required).<br />
<br />
You can list all of the tests from a given suite as follows:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck<br />
ninja: no work to do.<br />
postgresql:amcheck / amcheck/regress<br />
postgresql:amcheck / amcheck/001_verify_heapam<br />
postgresql:amcheck / amcheck/002_cic<br />
postgresql:amcheck / amcheck/003_cic_2pc<br />
</pre><br />
<br />
Note that there are distinct <code>running</code>/installcheck suites for most of the standard setup suites, though not all of the tests actually carry over to the <code>running</code> variant suites, as shown here:<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck-running<br />
ninja: no work to do.<br />
postgresql:amcheck-running / amcheck-running/regress<br />
</pre><br />
<br />
==== Running individual regression test scripts via an installcheck-tests style workflow ====<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson' and that the Postgres source code directory is 'source', with both located side by side in the same parent directory (a layout often used with VPATH builds).<br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with<br />
meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37547Meson2023-02-09T20:11:58Z<p>Pgeoghegan: Reorder command translation sections a little, so we talk about basic build/setup options first, then talk about testing related options</p>
<hr />
<div>== PostgreSQL devel documentation ==<br />
<br />
See [https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section] of PostgreSQL devel docs.<br />
<br />
== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
==== Build directory ====<br />
<br />
ninja needs to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite postgresql:regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite postgresql:amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later). For example, a simple <code>meson test -v --suite amcheck</code> will work (the <code>postgresql:</code> prefix is not truly required).<br />
<br />
You can list all of the tests from a given suite as follows:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ meson test --list --suite amcheck<br />
ninja: no work to do.<br />
postgresql:amcheck / amcheck/regress<br />
postgresql:amcheck / amcheck/001_verify_heapam<br />
postgresql:amcheck / amcheck/002_cic<br />
postgresql:amcheck / amcheck/003_cic_2pc<br />
</pre><br />
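<br />
Any individual test from the listing can be run on its own, using the same form as the "run specific regression test" recipe above. For example, to run just the TAP test 001_verify_heapam shown in the listing:<br />
<br />
<pre><br />
meson test -v postgresql:amcheck / amcheck/001_verify_heapam<br />
</pre><br />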
<br />
==== Running individual regression test scripts via an installcheck-tests style workflow ====<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
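<br />
In other words, the examples assume a layout along these lines ('source', 'build_meson', and 'build_autoconf' are just this page's conventions, not required names):<br />
<br />
<pre><br />
/path/to/postgresql/source          # git checkout<br />
/path/to/postgresql/build_meson     # created by: meson setup build_meson source<br />
/path/to/postgresql/build_autoconf  # created by: ../source/configure ...<br />
</pre><br />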
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
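<br />
With meson and ninja installed, a Visual Studio build tree can then be set up as shown in the table above, for example (directory names are illustrative; meson compile drives whichever backend was configured):<br />
<br />
<pre><br />
meson setup --backend vs build_meson source<br />
meson compile -C build_meson<br />
</pre><br />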
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep at least<br />
the parts required to keep PGXS working until all supported versions build<br />
with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37546Meson2023-02-09T20:01:34Z<p>Pgeoghegan: /* Test related commands */</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite postgresql:regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite postgresql:amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
==== Test structure ====<br />
<br />
Note that the top-level/default project name is <code>postgresql</code>, which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later). For example, a simple <code>meson test -v --suite amcheck</code> will work (the <code>postgresql:</code> prefix is not truly required).<br />
<br />
=== Running individual regression test scripts via an installcheck-tests style workflow ===<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja expects to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep at least<br />
the parts required to keep PGXS working until all supported versions build<br />
with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37545Meson2023-02-09T19:51:23Z<p>Pgeoghegan: /* Test related commands */</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite postgresql:regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite postgresql:amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
Note that the top-level/default project name is "postgresql", which is the only one we use in practice. The project name [https://mesonbuild.com/Unit-tests.html#run-subsets-of-tests can be omitted] when using a reasonably recent meson version (meson 0.46 or later). For example, a simple "meson test -v --suite amcheck" will work (the "postgresql:" prefix is not truly required).<br />
<br />
=== Running individual regression test scripts via an installcheck-tests style workflow ===<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja expects to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep at least<br />
the parts required to keep PGXS working until all supported versions build<br />
with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37544Meson2023-02-09T18:50:29Z<p>Pgeoghegan: Prefer using --suite in meson test examples, use -v consistently to match autoconf's more verbose default behavior</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test -v<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test -v --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test -v --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific contrib test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running --suite postgresql:regress-running<br />
||<br />
|-<br />
|| run specific contrib test suite against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running --suite postgresql:amcheck-running<br />
|| "running" amcheck suite variant doesn't include TAP tests<br />
|}<br />
<br />
=== Running individual regression test scripts via an installcheck-tests style workflow ===<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja expects to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep at least<br />
the parts required to keep PGXS working until all supported versions build<br />
with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37519Meson2023-02-07T20:25:50Z<p>Pgeoghegan: /* Test related commands */</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| "running" [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup test setup] is used to run tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
||<br />
|}<br />
<br />
=== Running individual regression test scripts via an installcheck-tests style workflow ===<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja expects to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep at least<br />
the parts required to keep PGXS working until all supported versions build<br />
with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37518Meson2023-02-07T03:33:26Z<p>Pgeoghegan: Add entry for meson install --only-changed</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| install files that changed only<br />
||<br />
|| meson install --only-changed<br />
|| Routinely shaves a few hundred milliseconds off install time<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup Setup] used when running specific tests against an existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
||<br />
|}<br />
<br />
=== Running individual regression test scripts via an installcheck-tests style workflow ===<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
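<br />
If the existing server is not listening on the default socket and port, pg_regress accepts connection options such as --host and --port; a sketch (host and port values illustrative):<br />
<br />
<pre><br />
src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ --host=localhost --port=5433 test_setup create_index<br />
</pre>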
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja expects to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja change to that directory before running, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
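<br />
The "-C" flag composes with any ninja target; for instance, to build and install without changing directory (assuming $builddir is an existing meson build tree):<br />
<br />
<pre><br />
ninja -C $builddir install<br />
</pre>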
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/usr/local/lib -Dextra_include_dirs=/usr/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make has many hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
because of its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep at least the<br />
parts required for PGXS working until all supported versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.<br />
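<br />
ninja writes this log at the top of the build tree; assuming the build directory is $builddir, the file to upload is:<br />
<br />
<pre><br />
$builddir/.ninja_log<br />
</pre>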
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup Setup] used when running specific tests against existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all test, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test -v --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
||<br />
|}<br />
<br />
=== Running individual regression test scripts via an installcheck-tests style workflow ===<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja tries to run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, fewer and fewer contributors know how to wrangle<br />
it. Recursive make has a lot of hard to resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using windows and runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the msvc specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. Likely we're going to keep at<br />
least the parts required for PGXS to keep working around until all supported<br />
versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37516Meson2023-02-07T02:32:15Z<p>Pgeoghegan: /* Running individual regression test scripts via an installcheck-tests style workflow */</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup Setup] used when running specific tests against existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all test, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
||<br />
|}<br />
<br />
=== Running individual regression test scripts via an installcheck-tests style workflow ===<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example (a directory layout often used with VPATH builds).<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja tries to run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, fewer and fewer contributors know how to wrangle<br />
it. Recursive make has a lot of hard to resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using windows and runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the msvc specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. Likely we're going to keep at<br />
least the parts required for PGXS to keep working around until all supported<br />
versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37515Meson2023-02-07T02:26:42Z<p>Pgeoghegan: Add info about working around the lack of a installcheck-tests target with meson</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup Setup] used when running specific tests against existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all test, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
||<br />
|}<br />
<br />
=== Running individual regression test scripts via an installcheck-tests style workflow ===<br />
<br />
The Postgres autoconf build system supports running a subset of regression test scripts against an existing server using the installcheck-tests target, as shown here:<br />
<br />
<pre><br />
/path/to/postgresql/build_autoconf $ make installcheck-tests TESTS="test_setup create_index"<br />
*** SNIP ***<br />
============== dropping database "regression" ==============<br />
SET<br />
DROP DATABASE<br />
============== creating database "regression" ==============<br />
CREATE DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
ALTER DATABASE<br />
============== running regression test queries ==============<br />
test test_setup ... ok 300 ms<br />
test create_index ... ok 1775 ms<br />
<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
You can work around the current lack of an equivalent meson facility by invoking pg_regress directly:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/regress/pg_regress --inputdir ../source/src/test/regress/ --dlpath=src/test/regress/ test_setup create_index<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
The same approach will also work for isolation tests:<br />
<br />
<pre><br />
/path/to/postgresql/build_meson $ src/test/isolation/pg_isolation_regress --inputdir ../source/src/test/isolation freeze-the-dead vacuum-no-cleanup-lock<br />
*** SNIP ***<br />
=====================<br />
All 2 tests passed.<br />
=====================<br />
</pre><br />
<br />
Note that this assumes that the meson build directory is 'build_meson', and that the Postgres source code directory is 'source'. The 'source' directory is located in the same directory as 'build_meson' in this example, (a directory layout often used with VPATH builds).<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja tries to run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, fewer and fewer contributors know how to wrangle<br />
it. Recursive make has a lot of hard to resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using windows and runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the msvc specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. Likely we're going to keep at<br />
least the parts required for PGXS to keep working around until all supported<br />
versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37514Meson2023-02-07T01:41:25Z<p>Pgeoghegan: /* Test related commands */</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| [https://mesonbuild.com/Reference-manual_functions.html#add_test_setup Setup] used when running specific tests against existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all test, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
||<br />
|}<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja tries to run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, fewer and fewer contributors know how to wrangle<br />
it. Recursive make has a lot of hard to resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using windows and runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the msvc specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. Likely we're going to keep at<br />
least the parts required for PGXS to keep working around until all supported<br />
versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37513Meson2023-02-07T01:13:48Z<p>Pgeoghegan: /* Test related commands */</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default, launch from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages<br />
|}<br />
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| Used when running specific tests against existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all test, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
||<br />
|}<br />
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja tries to run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, fewer and fewer contributors know how to wrangle<br />
it. Recursive make has a lot of hard to resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using windows and runs tests serially. While these and other<br />
issues could individually be addressed with incremental improvements, together<br />
they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose to use meson, to a<br />
good degree based on the adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the msvc specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. Likely we're going to keep at<br />
least the parts required for PGXS to keep working around until all supported<br />
versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37512Meson2023-02-07T01:02:23Z<p>Pgeoghegan: Use specific suite (not test) in "make check" command mapping example</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds HTML documentation and man pages<br />
|}<br />
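<br />
For example, a minimal out-of-tree debug build might look like this (a sketch only; "build" is an arbitrary build directory name, run from a source checkout):<br />
<pre><br />
# set up, build, and install from an out-of-tree build directory named "build"<br />
meson setup build . -Dcassert=true -Ddebug=true<br />
ninja -C build<br />
ninja -C build install<br />
</pre><br />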
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| Used when running specific tests against existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific regression test suite<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v --suite postgresql:amcheck<br />
||<br />
|-<br />
|| run specific regression test<br />
||<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| Doesn't run TAP tests, unlike the suite-level recipe<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
||<br />
|}<br />
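<br />
For example, to find and then run a single test (a sketch; amcheck/regress is the test name used in the table above):<br />
<pre><br />
# list available test names, then run one of them verbosely<br />
meson test --list<br />
meson test -v amcheck/regress<br />
</pre><br />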
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja must be run from the root of the build directory, where the generated build.ninja file lives. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
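<br />
The meson subcommands take a "-C" flag as well, so the same applies there (a sketch, assuming a reasonably recent meson that has the compile subcommand):<br />
<pre><br />
meson compile -C $builddir<br />
meson test -C $builddir<br />
</pre><br />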
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, and fewer and fewer contributors know how to<br />
wrangle it. Recursive make has many hard-to-resolve dependency issues and<br />
slow incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain<br />
for developers not using Windows, and it runs tests serially. While these and<br />
other issues could individually be addressed with incremental improvements,<br />
together they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in-tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37511Meson2023-02-07T00:41:23Z<p>Pgeoghegan: Reorder testing command translation items for clarity</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL-specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds HTML documentation and man pages<br />
|}<br />
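<br />
For example, options can be changed on an existing build tree and are picked up by the next build (a sketch; "build" is a placeholder for the build directory):<br />
<pre><br />
meson configure build -Dcassert=true -Ddebug=true<br />
ninja -C build<br />
</pre><br />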
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| list running/installcheck test variants<br />
||<br />
|| meson test --setup running --list<br />
|| Used when running specific tests against existing server<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
||<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific regression test<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| meson won't run TAP tests here (there are distinct non-regress tests for those)<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
|| meson won't run TAP tests here<br />
|}<br />
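<br />
As a worked example of the "running" setup, first list the variants, then run amcheck's regression test against the existing server (a sketch based on the rows above):<br />
<pre><br />
meson test --setup running --list<br />
meson test -v --setup running amcheck-running/regress<br />
</pre><br />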
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja must be run from the root of the build directory, where the generated build.ninja file lives. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, and fewer and fewer contributors know how to<br />
wrangle it. Recursive make has many hard-to-resolve dependency issues and<br />
slow incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain<br />
for developers not using Windows, and it runs tests serially. While these and<br />
other issues could individually be addressed with incremental improvements,<br />
together they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in-tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37510Meson2023-02-07T00:20:57Z<p>Pgeoghegan: /* Test related commands translation */</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL-specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds HTML documentation and man pages<br />
|}<br />
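<br />
For example, compiler and flags are both fixed at setup time (a sketch; clang and the flags are arbitrary choices, "build" is a placeholder):<br />
<pre><br />
CC=clang meson setup build . -Dc_args='-O0 -ggdb'<br />
</pre><br />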
<br />
=== Test related commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
|| Test names cannot be directly used for '--setup running' (AKA 'make installcheck') testing<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
|| Limited to tests that support running against existing server<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific regression test<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| meson won't run TAP tests here (there are distinct non-regress tests for those)<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
|| Use 'meson test -v --setup running' to get the spelling of any other "running" test<br />
|}<br />
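<br />
Because plain test names cannot be used directly with '--setup running', list the "running" variants first to get their names (a sketch):<br />
<pre><br />
meson test --setup running --list<br />
</pre><br />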
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja must be run from the root of the build directory, where the generated build.ninja file lives. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, and fewer and fewer contributors know how to<br />
wrangle it. Recursive make has many hard-to-resolve dependency issues and<br />
slow incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain<br />
for developers not using Windows, and it runs tests serially. While these and<br />
other issues could individually be addressed with incremental improvements,<br />
together they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in-tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37509Meson2023-02-07T00:19:27Z<p>Pgeoghegan: Split autoconf:meson translation table into two tables to improve readability</p>
<hr />
<div>== Autoconf:meson command translations ==<br />
<br />
=== Setup and build commands ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL-specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds HTML documentation and man pages<br />
|}<br />
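<br />
For example, an install with less output can be driven through meson directly (a sketch; "build" is a placeholder for the build directory):<br />
<pre><br />
meson install -C build --quiet<br />
</pre><br />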
<br />
=== Test related commands translation ===<br />
<br />
{|class="wikitable" style="margin:auto"<br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
|| Test names cannot be directly used for '--setup running' (AKA 'make installcheck') testing<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
|| Limited to tests that support running against existing server<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific regression test<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| meson won't run TAP tests here (there are distinct non-regress tests for those)<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
|| Use 'meson test -v --setup running' to get the spelling of any other "running" test<br />
|}<br />
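<br />
When a test fails, its log can be printed inline rather than dug out of the testlog files (a sketch using meson test's --print-errorlogs flag):<br />
<pre><br />
meson test --print-errorlogs --suite regress<br />
</pre><br />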
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja must be run from the root of the build directory, where the generated build.ninja file lives. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, and fewer and fewer contributors know how to<br />
wrangle it. Recursive make has many hard-to-resolve dependency issues and<br />
slow incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain<br />
for developers not using Windows, and it runs tests serially. While these and<br />
other issues could individually be addressed with incremental improvements,<br />
together they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in-tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37508Meson2023-02-07T00:04:34Z<p>Pgeoghegan: Reorder a couple of items from table for clarity</p>
<hr />
<div>== Quickstart ==<br />
<br />
{|class="wikitable" style="margin:auto"<br />
|+ translation of configure, make, etc commands <br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL-specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds HTML documentation and man pages<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
|| Limited to tests that support running against existing server<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific regression test<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| meson won't run TAP tests here (there are distinct non-regress tests for those)<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
|| Use 'meson test -v --setup running' to get the spelling of any other "running" test<br />
|}<br />
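<br />
meson test is parallel by default; the degree of parallelism can be capped explicitly (a sketch; 4 is an arbitrary value):<br />
<pre><br />
meson test --num-processes 4<br />
</pre><br />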
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja must be run from the root of the build directory, where the generated build.ninja file lives. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, and fewer and fewer contributors know how to<br />
wrangle it. Recursive make has many hard-to-resolve dependency issues and<br />
slow incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain<br />
for developers not using Windows, and it runs tests serially. While these and<br />
other issues could individually be addressed with incremental improvements,<br />
together they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in-tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37507Meson2023-02-06T23:56:36Z<p>Pgeoghegan: Add "run specific regression test" example</p>
<hr />
<div>== Quickstart ==<br />
<br />
{|class="wikitable" style="margin:auto"<br />
|+ translation of configure, make, etc commands <br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL-specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
|| Limited to tests that support running against existing server<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds HTML documentation and man pages<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run specific regression test<br />
|| cd contrib/amcheck; make check;<br />
|| meson test -v postgresql:amcheck / amcheck/regress<br />
|| meson won't run TAP tests here (there are distinct non-regress tests for those)<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib test against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
|| Use 'meson test -v --setup running' to get the spelling of any other "running" test<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|}<br />
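<br />
Combined with the "-C" flag described under Other Notes below, the documentation can be built without changing into the build tree (a sketch; "build" is a placeholder for the build directory):<br />
<pre><br />
ninja -C build docs<br />
</pre><br />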
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja must be run from the root of the build directory, where the generated build.ninja file lives. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, and fewer and fewer contributors know how to<br />
wrangle it. Recursive make has many hard-to-resolve dependency issues and<br />
slow incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain<br />
for developers not using Windows, and it runs tests serially. While these and<br />
other issues could individually be addressed with incremental improvements,<br />
together they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in-tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37506Meson2023-02-06T23:41:36Z<p>Pgeoghegan: Add "run specific contrib tests against existing server" example</p>
<hr />
<div>== Quickstart ==<br />
<br />
{|class="wikitable" style="margin:auto"<br />
|+ translation of configure, make, etc commands <br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL-specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
|| Limited to tests that support running against existing server<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds HTML documentation and man pages<br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| run specific contrib tests against existing server<br />
|| cd contrib/amcheck; make installcheck;<br />
|| meson test -v --setup running postgresql:amcheck-running / amcheck-running/regress<br />
|| Use 'meson test -v --setup running' to get the spelling of any other "running" test<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
||<br />
|}<br />
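<br />
On slow machines the per-test timeouts can be scaled up with meson test's timeout multiplier (a sketch; the factor is arbitrary):<br />
<pre><br />
meson test -t 4<br />
</pre><br />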
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja must be run from the root of the build directory, where the generated build.ninja file lives. If you are not in the build directory, you can use the "-C" flag to have ninja change into it and run from there, e.g.:<br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== FreeBSD ===<br />
<br />
<pre><br />
pkg install meson ninja<br />
</pre><br />
<br />
Arguments to meson setup/configure to find ports libraries:<br />
<pre><br />
meson setup -Dextra_lib_dirs=/opt/local/lib -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age, and fewer and fewer contributors know how to<br />
wrangle it. Recursive make has many hard-to-resolve dependency issues and<br />
slow incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain<br />
for developers not using Windows, and it runs tests serially. While these and<br />
other issues could individually be addressed with incremental improvements,<br />
together they seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
based on its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in-tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future. We will likely keep around at<br />
least the parts required for PGXS until all supported versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful to visualize builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Sample_Databases&diff=37503Sample Databases2023-02-04T22:46:31Z<p>Pgeoghegan: Add new link to conveniently available IMDB database custom format pg_dump</p>
<hr />
<div>Many database systems provide sample databases with the product. A good intro to popular ones that includes discussion of samples available for other databases is [http://www.barik.net/archive/2006/03/28/195425/ Sample Databases for PostgreSQL and More] (2006).<br />
<br />
One trivial sample that PostgreSQL ships with is [[Pgbench]]. It has the advantage of being built in, and it supports a scalable data generator.<br />
<br />
* MySQL has a popular sample database named [https://dev.mysql.com/doc/sakila/en/ Sakila]. Sakila [https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/Sakila has been ported to many databases] including Postgres.<br />
<br />
* [https://github.com/devrimgunduz/pagila Pagila] is a more idiomatic Postgres port of Sakila.<br />
<br />
* PgFoundry had [https://www.postgresql.org/ftp/projects/pgFoundry/dbsamples/ a collection of Postgres-compatible sample databases] but it has not been updated since 2008.<br />
* [https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/2QYZBT IMDB Data for JOB Workload], as used in the paper [https://doi.org/10.14778/2850583.2850594 "How Good are Query Optimizers, Really?"]. This data was generated using a [https://github.com/RyanMarcus/imdb_pg_dataset tool that is freely available on Github]. It is conveniently available to download as a <code>pg_dump -Fc</code> dump.<br />
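A custom format dump like this one can be restored with pg_restore, for example (the dump file name shown here is illustrative; substitute the name of the downloaded file):<br />
<pre><br />
createdb imdb<br />
pg_restore --no-owner --dbname=imdb imdb.dump<br />
</pre><br />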
* [http://www.imdb.com/interfaces#plain IMDB] - the original IMDB source data.<br />
* [https://theodi.org/blog/the-status-of-csvs-on-datagovuk The land registry file] from http://data.gov.uk has details of land sales in the UK, going back several decades, and is 3.5GB as of August 2016 (this applies only to the "complete" file, "pp-complete.csv"). No registration required.<br />
<pre><br />
-- Download file "pp-complete.csv", which has all records.<br />
-- If schema changes/field added, consult: https://www.gov.uk/guidance/about-the-price-paid-data<br />
<br />
-- Create table:<br />
CREATE TABLE land_registry_price_paid_uk(<br />
transaction uuid,<br />
price numeric,<br />
transfer_date date,<br />
postcode text,<br />
property_type char(1),<br />
newly_built boolean,<br />
duration char(1),<br />
paon text,<br />
saon text,<br />
street text,<br />
locality text,<br />
city text,<br />
district text,<br />
county text,<br />
ppd_category_type char(1),<br />
record_status char(1));<br />
<br />
-- Copy CSV data, with appropriate munging:<br />
COPY land_registry_price_paid_uk FROM '/path/to/pp-complete.csv' with (format csv, encoding 'win1252', header false, null '', quote '"', force_null (postcode, saon, paon, street, locality, city, district));<br />
</pre><br />
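After loading, a quick sanity check of the data might look like this (illustrative query):<br />
<pre><br />
-- Number of sales and average price per year:<br />
SELECT extract(year FROM transfer_date) AS year,<br />
       count(*) AS sales,<br />
       round(avg(price), 2) AS avg_price<br />
FROM land_registry_price_paid_uk<br />
GROUP BY 1<br />
ORDER BY 1;<br />
</pre><br />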
* [https://github.com/lorint/AdventureWorks-for-Postgres AdventureWorks 2014 for Postgres] - Scripts to set up the OLTP part of the go-to database used in training classes and for sample apps on the Microsoft stack. The result is 68 tables containing HR, sales, product, and purchasing data organized across 5 schemas. It represents a fictitious bicycle parts wholesaler with a hierarchy of nearly 300 employees, 500 products, 20000 customers, and 31000 sales each having an average of 4 line items. So it's big enough to be interesting, but not unwieldy. In addition to being a well-rounded OLTP sample, it is also a good choice to demonstrate ETL into a data warehouse. The code in some of the views demonstrates effective techniques for querying XML data.<br />
* [http://www.informatics.jax.org/downloads/database_backups/ Mouse Genome sample data set]. See [http://www.informatics.jax.org/software.shtml instructions]. Custom format dump, 1.9GB compressed, but the restored database is tens of GB in size. MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI uses PostgreSQL in production [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245042/], providing direct protocol access to researchers, so the custom format dump is not an afterthought. Apparently updated frequently.<br />
* Benchmarking databases such as [[DBT-2]] or [[TPC-H]] can be used as samples.<br />
* [http://www.freebase.com/docs/data_dumps Freebase] - Various wiki style data on places/people/things - ~600MB compressed<br />
* [https://www.omdb.org/content/About OMDB] - Open Media database, ~30MB compressed, 300MB when loaded - https://github.com/credativ/omdb-postgresql<br />
* [http://www.data.gov/ Data.gov] - US federal government data collection, see also [http://www.sunlightlabs.com/ Sunlight Labs]<br />
* [http://wiki.dbpedia.org/Downloads DBpedia] - Wikipedia data export project<br />
* [http://www.eoddata.com/ eoddata] - historic stock market data (requires registration - licence?)<br />
* [http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time RITA] - Airline On-Time Performance Data<br />
* [http://wiki.openstreetmap.org/wiki/Planet.osm Openstreetmap] - Openstreetmap source data<br />
* [https://ftp.ncbi.nih.gov/gene/DATA/ NCBI] - biological annotation from NCBI's ENTREZ system (updated daily)<br />
* [https://postgrespro.com/education/demodb Airlines Demo Database] - Airlines Demo Database provides database schema with several tables and meaningful content, which can be used for learning SQL and writing applications<br />
* [https://archive.org/details/stackexchange Stack Exchange Data Dump] - Anonymized dump of all user-contributed content on the Stack Exchange network (Stack Overflow, Server Fault...) under the cc-by-sa 3.0 license. This tool can be used to import the XML dumps into PostgreSQL: https://github.com/Networks-Learning/stackexchange-dump-to-postgres<br />
* [https://github.com/MuseumofModernArt/collection The Museum of Modern Art (MoMA) collection data] - This research dataset contains more than 130,000 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in its database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the Museum. The data is available in CSV and JSON format, encoded in UTF-8.<br />
<br />
[[Category:Howto]]</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Meson&diff=37421Meson2022-12-27T23:54:03Z<p>Pgeoghegan: Add "meson test --setup running" entry to quickstart table</p>
<hr />
<div>== Quickstart ==<br />
<br />
{|class="wikitable" style="margin:auto"<br />
|+ translation of configure, make, etc commands <br />
!description<br />
!old command<br />
!new command<br />
!comment<br />
|-<br />
|| set up build tree<br />
|| ./configure [<options>]<br />
|| meson setup [<options>] [<build dir>] <source-dir> <br />
|| meson only supports building out of tree<br />
|-<br />
|| set up build tree for visual studio<br />
|| perl src/tools/msvc/mkvcbuild.pl<br />
|| meson setup --backend vs [<options>] [<build dir>] <source-dir><br />
|| configures build tree for one build type (debug or release or ...)<br />
|-<br />
|| show configure options<br />
|| ./configure --help<br />
|| meson configure<br />
|| shows options built into meson and PostgreSQL specific options<br />
|-<br />
|| set configure options<br />
|| ./configure --prefix=DIR, --$somedir=DIR, --with-$option, --enable-$feature<br />
|| meson setup|configure -D$option=$value<br />
|| options can be set when setting up build tree (setup) and in existing build tree (configure)<br />
|- <br />
|| enable cassert<br />
|| --enable-cassert<br />
|| -Dcassert=true<br />
|-<br />
|| enable debug symbols<br />
|| ./configure --enable-debug<br />
|| meson configure|setup -Ddebug=true<br />
|-<br />
|| specify compiler<br />
|| CC=compiler ./configure<br />
|| CC=compiler meson setup <br />
|| CC is only checked during meson setup, not with meson configure<br />
|-<br />
|| set CFLAGS<br />
|| CFLAGS=options ./configure<br />
|| meson configure|setup -Dc_args=options<br />
|| CFLAGS is also checked, but only for meson setup<br />
|-<br />
|| build<br />
|| make -s<br />
|| ninja<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| build, showing compiler commands<br />
|| make<br />
|| ninja -v<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| install all the binaries and libraries<br />
|| make install<br />
|| ninja install<br />
|| use meson install --quiet for a less verbose experience<br />
|-<br />
|| clean build<br />
|| make clean<br />
|| ninja clean<br />
|| ninja uses parallelism by default; launch it from the root of the build tree.<br />
|-<br />
|| run all tests<br />
|| make check-world<br />
|| meson test<br />
|| runs all tests, using parallelism by default<br />
|-<br />
|| run all tests against existing server<br />
|| make installcheck-world<br />
|| meson test --setup running<br />
|| Limited to tests that support running against existing server<br />
|-<br />
|| build documentation<br />
|| cd doc/ && make html && make man<br />
|| ninja docs<br />
|| Builds html documentation and man pages <br />
|-<br />
|| run main regression tests<br />
|| make check<br />
|| meson test --suite setup --suite regress<br />
||<br />
|-<br />
|| run main regression tests against existing server<br />
|| make installcheck<br />
|| meson test --setup running regress-running/regress<br />
||<br />
|-<br />
|| list tests<br />
||<br />
|| meson test --list<br />
|}<br />
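<br />
Putting a few of these together: a typical debug-build workflow under the new buildsystem might look like the following (the build directory and install prefix names here are illustrative):<br />
<pre><br />
# out-of-tree setup with assertions and debug symbols enabled<br />
meson setup -Dcassert=true -Ddebug=true --prefix=$HOME/pg-install build .<br />
cd build<br />
ninja          # build, with parallelism by default<br />
meson test     # run all tests<br />
ninja install  # install into the configured prefix<br />
</pre><br />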
<br />
=== PostgreSQL devel documentation ===<br />
<br />
[https://www.postgresql.org/docs/devel/install-meson.html "Building and Installation with Meson" section]<br />
<br />
=== Other Notes ===<br />
<br />
ninja expects to be run from the root of the build directory. If you are not in the build directory, you can use the "-C" flag to have ninja "change directory" and run from there, e.g.: <br />
<br />
<pre><br />
ninja -C $builddir<br />
</pre><br />
<br />
== Installing Meson ==<br />
<br />
=== Linux ===<br />
<br />
Debian / Ubuntu:<br />
<pre><br />
apt-get update && apt-get install -y meson ninja-build<br />
</pre><br />
<br />
Fedora:<br />
<pre><br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 8:<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled powertools<br />
dnf -y install meson ninja-build<br />
</pre><br />
<br />
RHEL 9 (tested on Rocky Linux 9):<br />
<pre><br />
dnf -y install dnf-plugins-core<br />
dnf config-manager --set-enabled crb<br />
dnf -y install meson<br />
</pre><br />
<br />
=== macOS ===<br />
<br />
With MacPorts:<br />
<br />
<pre><br />
sudo port install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find MacPorts libraries:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/local/lib/pkgconfig -Dextra_lib_dirs=/opt/local/lib/ -Dextra_include_dirs=/opt/local/include $builddir $sourcedir<br />
</pre><br />
<br />
With Homebrew:<br />
<pre><br />
brew install meson<br />
</pre><br />
<br />
Arguments to meson setup/configure to find Homebrew libraries:<br />
<br />
On arm64:<br />
<pre><br />
meson setup -Dpkg_config_path=/opt/homebrew/lib/pkgconfig -Dextra_include_dirs=/opt/homebrew/include -Dextra_lib_dirs=/opt/homebrew/lib $builddir $sourcedir<br />
</pre><br />
<br />
On x86-64:<br />
<pre><br />
meson setup -Dpkg_config_path=/usr/local/lib/pkgconfig -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib $builddir $sourcedir<br />
</pre><br />
<br />
=== Windows ===<br />
<br />
Assuming Python is installed, the easiest way to get meson and ninja is:<br />
<pre><br />
pip install meson ninja<br />
</pre><br />
<br />
== Why and What ==<br />
<br />
Autoconf is showing its age; fewer and fewer contributors know how to wrangle<br />
it. Recursive make suffers from hard-to-resolve dependency issues and slow<br />
incremental rebuilds. Our home-grown MSVC buildsystem is hard to maintain for<br />
developers not using Windows, and it runs tests serially. While these and other<br />
issues could each be addressed with incremental improvements, together they<br />
seem best addressed by moving to a more modern buildsystem.<br />
<br />
After evaluating different buildsystem choices, we chose meson, in good part<br />
because of its adoption by other open source projects.<br />
<br />
We decided that it's more realistic to commit a relatively early version of<br />
the new buildsystem and mature it in tree.<br />
<br />
The plan is to remove the MSVC-specific buildsystem in src/tools/msvc soon<br />
after reaching feature parity. However, we're not planning to remove the<br />
autoconf/make buildsystem in the near future; we will likely keep around at<br />
least the parts required for PGXS to keep working until all supported<br />
versions build with meson.<br />
<br />
<br />
== Meson documentation ==<br />
<br />
* [https://mesonbuild.com/Commands.html meson commandline commands]<br />
* [https://mesonbuild.com/Syntax.html meson syntax]<br />
* [https://mesonbuild.com/Reference-manual_functions.html meson functions]<br />
<br />
<br />
== Development tree, other resources ==<br />
<br />
* https://github.com/anarazel/postgres/tree/meson<br />
* https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal<br />
<br />
== Visualizing builds ==<br />
<br />
When building with ninja, the generated .ninja_log can be uploaded to [https://ui.perfetto.dev/ ui.perfetto.dev], which is very helpful for visualizing builds.</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37417Freezing/skipping strategies patch: motivating examples2022-12-18T23:59:58Z<p>Pgeoghegan: /* Patch */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related work added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
Note that the patch <b>completely removes</b> aggressive mode VACUUM. Antiwraparound autovacuums will still exist, but become much rarer. Antiwraparound autovacuums should only be needed in true emergencies with this work in place. <br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 meant that this mechanism no longer reliably freezes XIDs once<br />
they attain an age exceeding the vacuum_freeze_min_age setting. Sometimes it still works in the way that the very<br />
earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the confounding influence<br />
of the visibility map.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily, when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these ones.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that there are no pages frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But we don't just allow table age to determine when we freeze all these pgbench_history pages in<br />
extreme cases -- it is the sole factor that determines when it happens, in practice, including when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
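<br />
In concrete terms, the tuning that the docs describe amounts to a per-table storage parameter setting like the following (shown here for pgbench_history):<br />
<pre><br />
-- Today's workaround: have autovacuum freeze this table's tuples right away,<br />
-- by setting the table's autovacuum_freeze_min_age storage parameter to 0<br />
ALTER TABLE pgbench_history SET (autovacuum_freeze_min_age = 0);<br />
</pre><br />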
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<strike><b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b></strike>).<br />
<br />
<b>Update:</b> v10 of the patch series [https://www.postgresql.org/message-id/CAH2-WzmjHQJ7pbdO4BtWVJ6CLG-Mp9CNe914WUJdiScOTNRKRw@mail.gmail.com avoids the freeze spike] that you see here (here we show v9 behavior). So in v10 the same catch-up process will only happen during some much later insert-driven autovacuum, when the pgbench_history table has become far larger (while eager freezing would kick in at the same point as in v9). The <i>relative</i> cost of catch-up freezing won't be too great under this improved scheme; we expect to have no more than about an extra 5% of rel_pages to freeze when it happens now (far less than the 52% of rel_pages shown here on a percentage basis, though not in absolute terms). <br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
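<br />
(The resulting split between all-visible and all-frozen pages can be observed directly, if desired, using the pg_visibility contrib extension:)<br />
<pre><br />
CREATE EXTENSION IF NOT EXISTS pg_visibility;<br />
<br />
-- Count the table's all-visible and all-frozen pages, according to the VM<br />
SELECT all_visible, all_frozen<br />
FROM pg_visibility_map_summary('pgbench_history');<br />
</pre><br />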
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. The data is somewhat synthetic (a consequence of the requirement that the benchmark be easy to scale up and down), but its design is nevertheless grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24-hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two opposite trends that pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but we also want to freeze lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
As in the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs here, before the point<br />
that an aggressive mode VACUUM is required (there were quite a few autovacuums over the course of the benchmark that<br />
look similar to this one):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by raising<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze, right away, all of the table's<br />
pages that contain only all-visible tuples, so that they're marked all-frozen in the VM instead of merely all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable at all there. Whereas here we see that even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While relfrozenxid advanced significantly less than it did in the earlier pgbench_history example, it nevertheless advanced by a huge amount by any traditional measure (in particular, by much more than the vacuum_freeze_min_age-based cutoff requires).<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst of it, by making freezing conditional on<br />
all-visible-ness, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, that requires very little if any freezing, the<br />
costs also matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place when the table's age is about 60% - 65% of the way to the point where autovacuum would be forced to launch<br />
an antiwraparound autovacuum.<br />
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that the cost of advancing relfrozenxid now is sufficiently low that it<br />
makes sense to do so. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now<br />
(though not exceptionally cheap), and 2.) the fact that table age is starting to become somewhat of a concern (though<br />
certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has the "use oldest extant XID for relfrozenxid" mechanism added by commit {{PgCommitURL|0b018fab}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a once-off bulk deletion is quite possible. Note that the heuristics in play here will<br />
tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (though not too cheap), provided table age is already starting to be a concern. In other words VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here, which has mostly been missing before now.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing. And because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch, that recurs again and again, is that <b>relfrozenxid generally has only a loose relationship with freeze debt</b>, which is hard to predict but tends to be fairly fixed for a given table/workload. Understanding and exploiting that difference comes up again and again. It <b>works both ways</b>; sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case. <br />
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (could have used<br />
pgbench_tellers table just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
In practice we'll always see this in any table with similar workload characteristics. Since every tuple is updated before long anyway, no XID ever gets old enough to need freezing; every VACUUM notices this automatically, and that is reflected in the final relfrozenxid. This seems to happen reliably with tables like this one.<br />
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload)<br />
that require only a tiny amount of freezing, at negligible cost, to advance relfrozenxid by a great many XIDs. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM can never<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37416Freezing/skipping strategies patch: motivating examples2022-12-18T23:54:22Z<p>Pgeoghegan: /* Patch */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by the need to manage the number of unfrozen heap pages that accumulate in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related work added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally, a table's characteristics can change over time. VACUUM is supposed to keep up with, and even anticipate, the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
Note that the patch <b>completely removes</b> aggressive mode VACUUM. Antiwraparound autovacuums will still exist, but become much rarer. Antiwraparound autovacuums should only be needed in true emergencies with this work in place. <br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 meant that this mechanism stopped reliably<br />
freezing XIDs whose age exceeds the vacuum_freeze_min_age setting. Sometimes freezing still works in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, depending<br />
largely on which pages the visibility map allows VACUUM to skip.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
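<br />
Throughout this page, "table age" refers to the XID age of a table's relfrozenxid. As a point of reference, it can be inspected with a standard catalog query like the following (ordinary SQL, not something added by the patch):<br />
<br />
 SELECT oid::regclass AS table_name, age(relfrozenxid) AS xid_age<br />
 FROM pg_class<br />
 WHERE relkind IN ('r', 'm', 't')<br />
 ORDER BY age(relfrozenxid) DESC<br />
 LIMIT 10;<br />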
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these.<br />
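<br />
(A note on the log excerpts shown throughout: output in this format comes from autovacuum logging, which presumably was enabled with something like the following setting; the exact value used is an assumption here.)<br />
<br />
 log_autovacuum_min_duration = 0   # log every autovacuum action, with full details<br />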
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
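<br />
For reference, the documented trigger condition for these insert-driven autovacuums can be sketched as follows (shown as a SQL-style comment; the names are the real GUCs, with their default values):<br />
<br />
 -- An insert-driven autovacuum is launched when, approximately:<br />
 --   tuples_inserted_since_last_vacuum ><br />
 --     autovacuum_vacuum_insert_threshold                        (default 1000)<br />
 --     + autovacuum_vacuum_insert_scale_factor * reltuples       (default 0.2)<br />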
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that no pages were frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until a VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But table age doesn't just determine when we freeze all these pgbench_history pages in<br />
extreme cases -- in practice it is the sole factor that determines when it happens, even when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
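<br />
Concretely, the XID-age-based tuning that the docs describe amounts to something like this (a minimal sketch, using pgbench_history as a stand-in for any strictly append-only table):<br />
<br />
 -- Freeze tuples at the first opportunity, instead of waiting for them to age:<br />
 ALTER TABLE pgbench_history SET (autovacuum_freeze_min_age = 0);<br />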
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different from Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
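<br />
In postgresql.conf terms, the proposed setting would presumably look something like the following (the GUC exists only with the patch applied, and the accepted units are an assumption here):<br />
<br />
 vacuum_freeze_strategy_threshold = 4GB   # larger tables get the eager freezing strategy<br />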
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<strike><b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b></strike>).<br />
<br />
<b>Update:</b> v10 of the patch series [https://www.postgresql.org/message-id/CAH2-WzmjHQJ7pbdO4BtWVJ6CLG-Mp9CNe914WUJdiScOTNRKRw@mail.gmail.com avoids the freeze spike] that you see here (here we show v9 behavior). So in v10 the same catch-up process will only happen during some much later insert-driven autovacuum, when the pgbench_history table has become far larger (though eager freezing would kick in at the same point as in v9). The <i>relative</i> cost of catch-up freezing won't be too great under this improved scheme; we expect to have no more than about an extra 5% of rel_pages to freeze when it happens (far less than the 50% of rel_pages shown here on a percentage basis, though not in absolute terms). <br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
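<br />
The all-visible/all-frozen distinction that this policy hinges on can be observed directly via the contrib pg_visibility extension; the following is shown only as a way of inspecting the outcome, and is not part of the patch:<br />
<br />
 CREATE EXTENSION IF NOT EXISTS pg_visibility;<br />
 -- Pages that are all-visible but not yet all-frozen (unfrozen "debt"),<br />
 -- versus pages that are already all-frozen:<br />
 SELECT count(*) FILTER (WHERE all_visible AND NOT all_frozen) AS all_visible_only_pages,<br />
        count(*) FILTER (WHERE all_frozen) AS all_frozen_pages<br />
 FROM pg_visibility_map('pgbench_history');<br />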
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because two trends pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs here, before the point<br />
that an aggressive mode VACUUM is required (there were quite a few earlier autovacuums that look similar to this<br />
one):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which would happen if one raised<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze pages containing<br />
only all-visible tuples right away, so that they're marked all-frozen in the VM instead of merely all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable there. Here, by contrast, even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While relfrozenxid advanced significantly less than it did in the earlier pgbench_history example, it nevertheless advanced by a huge amount by any traditional measure (in particular, by much more than the vacuum_freeze_min_age-based cutoff requires).<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst of it, by making freezing conditional on<br />
pages being all-visible, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it never needed to scan in the first place, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long-running VACUUM operations such as these. VACUUM is more insulated from the fact that the table continues to change while it runs, since the set of pages VACUUM must scan is "locked in" at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, which requires very little if any freezing, the<br />
costs still matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place when the table's age is about 60% - 65% of the way to forcing<br />
an antiwraparound autovacuum.<br />
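<br />
(That "60% - 65%" figure can be computed with a catalog query along these lines; the sketch assumes the table is governed by the global autovacuum_freeze_max_age, i.e. that no per-table reloption overrides it:)<br />
<br />
 SELECT round(100.0 * age(relfrozenxid)<br />
              / current_setting('autovacuum_freeze_max_age')::numeric, 1)<br />
          AS pct_of_antiwraparound_threshold<br />
 FROM pg_class<br />
 WHERE oid = 'bmsql_new_order'::regclass;<br />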
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that the cost is nevertheless low enough that advancing relfrozenxid now<br />
makes sense. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now<br />
(though not exceptionally cheap), and 2.) table age starting to become somewhat of a concern (though<br />
certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has the "use oldest extant XID for relfrozenxid" mechanism added by commit {{PgCommitURL|0b018fab}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a one-off bulk deletion is quite possible. The heuristics in play here will tend to<br />
notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (cheap, though not exceptionally so), provided table age is already starting to be a concern. In other words, VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here, a consideration that has mostly been missing until now.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing, and because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch is that <b>relfrozenxid generally has only a loose relationship with freeze debt</b>; freeze debt is hard to predict, but tends to be fairly fixed for a given table/workload. Understanding and exploiting that looseness comes up again and again. It <b>works both ways</b>: sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case. <br />
<br />
Consider the following entirely typical autovacuum output from pgbench's pgbench_branches table (the<br />
pgbench_tellers table could have been used just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
Since every tuple is updated before long anyway, no XID ever gets old enough to need freezing; every VACUUM notices this automatically, and it is reflected in the final relfrozenxid. In practice this happens reliably in any table with similar workload characteristics.<br />
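<br />
This is easy to verify for a small, constantly-churning table like this one. Since nothing in the table ever gets frozen, the oldest raw xmin tells the whole story, and a full scan is trivial at this size (a diagnostic sketch, not part of the patch):<br />
<br />
 -- Age of the oldest unfrozen xmin anywhere in the table:<br />
 SELECT max(age(xmin)) AS oldest_xmin_age FROM pgbench_branches;<br />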
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload)<br />
that require only a tiny amount of freezing, at negligible cost, to advance relfrozenxid by a great many XIDs. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM can never<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeoghegan
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related work added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
Note that the patch <b>completely removes</b> aggressive mode VACUUM. Antiwraparound autovacuums will still exist, but become much rarer. Antiwraparound autovacuums should only be needed in true emergencies with this work in place. <br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 made the mechanism that chooses how to freeze stop reliably<br />
freezing XIDs that attain an age that exceeds the vacuum_freeze_min_age settings. Sometimes it actually does work in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the<br />
confounding influence of the visibility map.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily, when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these ones.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that there are no pages frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But we don't just allow table age to determine when we freeze all these pgbench_history pages in<br />
extreme cases -- it is the sole factor that determines when it happens, in practice, including when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
<b>Update:</b> v10 of the patch series [https://www.postgresql.org/message-id/CAH2-WzmjHQJ7pbdO4BtWVJ6CLG-Mp9CNe914WUJdiScOTNRKRw@mail.gmail.com avoids the freeze spike] that you see here (here we show v9 behavior). So in v10 the same-catch up process will only happen during some much later insert-driven autovacuum, when the pgbench_history table has become far larger (while eager freezing would kick in at the same point as in v9). The <i>relative</i> cost of catch-up freezing won't be too great under this improved scheme; we expect to have no more than about an extra 5% of rel_pages to freeze when it happens now (far less than the 50% of rel_pages shown here on a percentage basis, though not in absolute terms). <br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two opposite trends that pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but we also want to freeze lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs for this, before the point<br />
that an aggressive mode VACUUM is required (there are quite a few autovacuums that look like similar to this one from<br />
earlier):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by raising<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
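<br />
For reference, both sides of this trade-off map onto existing per-table reloptions. The following is only a sketch of the two levers just described, not a recommendation (the threshold value is made up for illustration):<br />
<br />
-- Freeze at the first opportunity, at the cost of uselessly freezing<br />
-- many tuples that the delivery transaction will soon update anyway:<br />
ALTER TABLE bmsql_order_line SET (autovacuum_freeze_min_age = 0);<br />
<br />
-- Or vacuum less often, so that pages are older (and so freezable) by<br />
-- the time they're first scanned -- the perverse lever described above:<br />
ALTER TABLE bmsql_order_line SET (autovacuum_vacuum_insert_threshold = 10000000);<br />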
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze any page containing only<br />
all-visible tuples right away, so that the page is marked all-frozen in the VM instead of merely all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable at all there. Here, by contrast, even eager freezing opts to hold<br />
off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While relfrozenxid advanced significantly less than it did in the earlier pgbench_history example, it nevertheless advanced by a huge amount by any traditional measure (in particular, by much more than the vacuum_freeze_min_age-based cutoff requires).<br />
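<br />
This is easy to check from the catalogs after any given VACUUM finishes. Note that age() is measured against the current next XID, so it slightly overstates the distance from the removable cutoff:<br />
<br />
-- With the patch, expect this to stay comfortably below<br />
-- vacuum_freeze_min_age (50 million by default):<br />
SELECT relname, relfrozenxid, age(relfrozenxid)<br />
FROM pg_class WHERE relname = 'bmsql_order_line';<br />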
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst impact of it, by making freezing conditional on<br />
all-visible-ness, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
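<br />
The scanned-page figures summarized below can also be watched while a long running VACUUM is still in flight, via the standard progress view:<br />
<br />
-- One row per backend currently running VACUUM:<br />
SELECT pid, relid::regclass, phase, heap_blks_total, heap_blks_scanned<br />
FROM pg_stat_progress_vacuum;<br />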
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The<br />
high-watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, which requires very little (if any) freezing, the<br />
costs still matter.<br />
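<br />
The table-age inputs to these heuristics come from settings that already exist today. Their values can be checked like so (the defaults noted in the comment are the stock ones):<br />
<br />
-- Defaults: vacuum_freeze_table_age = 150 million,<br />
-- autovacuum_freeze_max_age = 200 million.<br />
SELECT name, setting FROM pg_settings<br />
WHERE name IN ('vacuum_freeze_table_age', 'autovacuum_freeze_max_age');<br />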
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place when the table's age is about 60% - 65% of the way to autovacuum_freeze_max_age -- the point at which<br />
autovacuum is forced to launch an antiwraparound autovacuum.<br />
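<br />
That percentage can be estimated from the catalogs. A rough sketch, which treats autovacuum_freeze_max_age as the antiwraparound trigger point (it ignores per-table reloptions and multixact ages):<br />
<br />
-- Fraction of the way to a forced antiwraparound autovacuum<br />
-- (1.0 means one is due right now):<br />
SELECT relname,<br />
       round(age(relfrozenxid)::numeric<br />
             / current_setting('autovacuum_freeze_max_age')::numeric, 2) AS fraction<br />
FROM pg_class WHERE relname = 'bmsql_new_order';<br />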
<br />
Here the logic for choosing a vmsnap skipping strategy notices that the added cost of advancing relfrozenxid is<br />
moderate -- low enough that it makes sense to pay it now. So we advance relfrozenxid here because of a combination of<br />
1.) it being cheap to do so now (though not exceptionally cheap), and 2.) the fact that table age is starting to become<br />
somewhat of a concern (though certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has the "use oldest extant XID for relfrozenxid" mechanism added by commit {{PgCommitURL|0b018fab}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real-world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a once-off bulk deletion is quite possible. The heuristics in play here will tend to<br />
notice when that happens, and advance relfrozenxid simply because it happens to be cheap on that one occasion (even if<br />
not exceptionally cheap), provided table age is already starting to be a concern. In other words, VACUUM has a decent<br />
chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here -- a part that has mostly been missing until now.<br />
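<br />
One way to watch for these windows in practice is to track relfrozenxid age alongside recent autovacuum activity. A sketch joining two standard catalogs:<br />
<br />
-- Tables with the oldest relfrozenxid, plus when autovacuum last ran:<br />
SELECT c.relname, age(c.relfrozenxid) AS xid_age, s.last_autovacuum<br />
FROM pg_class c JOIN pg_stat_user_tables s ON s.relid = c.oid<br />
ORDER BY xid_age DESC LIMIT 10;<br />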
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already gets right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing, and because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch, one that recurs throughout these examples, is that <b>relfrozenxid generally has only a loose relationship with freeze debt</b>; the relationship is hard to predict, but tends to be fairly fixed for a given table/workload. It <b>works both ways</b>: sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case.<br />
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (could have used<br />
pgbench_tellers table just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
In practice we'll reliably see this in any table with similar workload characteristics. Since every tuple is updated before long anyway, no XID ever gets old enough to need freezing -- something every VACUUM notices automatically, and reflects in the final relfrozenxid it sets.<br />
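<br />
This is easy to spot-check against a live pgbench database. Because every row is re-updated before long, the oldest extant xmin stays very recent:<br />
<br />
-- No inserting XID ever gets old here, so the result stays far<br />
-- below vacuum_freeze_min_age:<br />
SELECT max(age(xmin)) FROM pgbench_branches;<br />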
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous bmsql_new_order table/workload)<br />
that require only a trivial amount of freezing -- with negligible cost -- to advance relfrozenxid by a great many XIDs. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM cannot ever<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeoghegan
https://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37409Freezing/skipping strategies patch: motivating examples2022-12-15T17:53:32Z<p>Pgeoghegan: </p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related work added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
Note that the patch <b>completely removes</b> aggressive mode VACUUM. Antiwraparound autovacuums will still exist, but become much rarer. Antiwraparound autovacuums should only be needed in true emergencies with this work in place. <br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 meant that the mechanism that chooses what to freeze no longer<br />
reliably freezes XIDs whose age exceeds the vacuum_freeze_min_age setting. Sometimes it actually does work in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the<br />
confounding influence of the visibility map.<br />
<br />
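For reference, the inputs to this XID-age-based decision are easy to inspect from SQL; here is a minimal monitoring sketch (standard catalogs, nothing patch-specific):<br />
<br />
 SHOW vacuum_freeze_min_age;            -- 50 million XIDs by default<br />
 SELECT relname, age(relfrozenxid) AS xid_age<br />
 FROM pg_class<br />
 WHERE relkind = 'r'<br />
 ORDER BY age(relfrozenxid) DESC<br />
 LIMIT 10;<br />
<br />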
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily, when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these ones.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
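Roughly speaking, an insert-driven autovacuum is launched once the number of inserts since the last VACUUM exceeds autovacuum_vacuum_insert_threshold + autovacuum_vacuum_insert_scale_factor * reltuples. A quick way to watch that counter (standard statistics views, nothing patch-specific):<br />
<br />
 SHOW autovacuum_vacuum_insert_threshold;     -- 1000 by default<br />
 SHOW autovacuum_vacuum_insert_scale_factor;  -- 0.2 by default<br />
 SELECT n_ins_since_vacuum, last_autovacuum<br />
 FROM pg_stat_user_tables<br />
 WHERE relname = 'pgbench_history';<br />
<br />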
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that no pages were frozen here. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But table age doesn't just determine when we freeze all these pgbench_history pages in<br />
extreme cases -- in practice it is the sole factor that determines when it happens, including when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
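For context, proximity to xidStopLimit is governed by the oldest datfrozenxid in the cluster; a standard way to keep an eye on it (trouble only starts as the age approaches ~2 billion XIDs):<br />
<br />
 SELECT datname, age(datfrozenxid) AS xid_age<br />
 FROM pg_database<br />
 ORDER BY age(datfrozenxid) DESC;<br />
<br />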
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
<br />
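For concreteness, the per-table tuning alluded to above is just a storage parameter; a sketch (0 means "freeze every eligible tuple on every page that gets scanned"):<br />
<br />
 ALTER TABLE pgbench_history SET (autovacuum_freeze_min_age = 0);<br />
 -- or, for manual VACUUMs in the current session:<br />
 -- SET vacuum_freeze_min_age = 0;<br />
<br />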
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
<br />
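(Note that vacuum_freeze_strategy_threshold exists only with the patch series applied; it isn't in any released version. Assuming the patch, the crossover can be sketched roughly as a size comparison:)<br />
<br />
 -- rough sketch only; assumes the patch's proposed GUC is present<br />
 SELECT pg_table_size('pgbench_history') >=<br />
        pg_size_bytes(current_setting('vacuum_freeze_strategy_threshold'))<br />
        AS eager_freezing_expected;<br />
<br />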
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
<br />
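The pages left unfrozen this way are exactly the pages that ended up all-visible but not all-frozen, which can be counted directly with the pg_visibility extension (not patch-specific):<br />
<br />
 CREATE EXTENSION IF NOT EXISTS pg_visibility;<br />
 SELECT count(*) FILTER (WHERE all_visible AND NOT all_frozen) AS all_visible_only,<br />
        count(*) FILTER (WHERE all_frozen) AS all_frozen<br />
 FROM pg_visibility('pgbench_history');<br />
<br />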
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two trends that pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but we also want to freeze lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs for this, before the point<br />
that an aggressive mode VACUUM is required (there are quite a few autovacuums that look similar to this one from<br />
earlier):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by raising<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze, right away, all of the table's pages that contain<br />
only all-visible tuples, so that they're marked all-frozen in the VM instead of merely all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable at all there. Here, by contrast, even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While relfrozenxid advanced significantly less than it did in the earlier pgbench_history example, it nevertheless advanced by a huge amount by any traditional measure (in particular, by much more than the vacuum_freeze_min_age-based cutoff requires).<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst impact of it, by making freezing conditional on a page<br />
being eligible to be set all-visible, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
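The VM snapshot itself is internal to VACUUM, but the visibility map state that such a snapshot captures at a given moment can be summarized with the pg_visibility extension (not patch-specific):<br />
<br />
 -- counts of all-visible and all-frozen pages, according to the VM<br />
 SELECT * FROM pg_visibility_map_summary('bmsql_order_line');<br />
<br />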
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, which requires very little if any freezing, the<br />
costs also matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place at a point where the table's age is about 60-65% of the way to the point where an antiwraparound<br />
autovacuum would have to be launched.<br />
<br />
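(That fraction is easy to compute directly; a rough illustration, assuming the default autovacuum_freeze_max_age of 200 million:)<br />
<br />
 SELECT age(relfrozenxid)::numeric /<br />
        current_setting('autovacuum_freeze_max_age')::numeric AS frac_to_antiwraparound<br />
 FROM pg_class<br />
 WHERE relname = 'bmsql_new_order';<br />
<br />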
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that the cost of advancing relfrozenxid now is sufficiently low that it<br />
makes sense to do so. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now<br />
(though not exceptionally cheap), and 2.) the fact that table age is starting to become somewhat of a concern (though<br />
certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has the "use oldest extant XID for relfrozenxid" mechanism added by commit {{PgCommitURL|0b018fab}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a once-off bulk deletion is quite possible. Note that the heuristics in play here will<br />
tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (though not too cheap), provided table age is already starting to be a concern. In other words VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here, a consideration that has mostly been missing until now.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing, and because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch is that <b>relfrozenxid generally has only a loose relationship with freeze debt</b>, which is hard to predict but tends to be fairly fixed for a given table/workload. Understanding and exploiting that looseness comes up again and again. It <b>works both ways</b>; sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case.<br />
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (the<br />
pgbench_tellers table could have been used just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
In practice we'll see this consistently in any table with similar workload characteristics. Since every tuple is updated before long anyway, no XID ever gets old enough to need freezing; every VACUUM notices this automatically, and the final relfrozenxid reflects it.<br />
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the earlier bmsql_new_order table/workload)<br />
that require only a tiny amount of freezing, at negligible cost, to advance relfrozenxid by a great many XIDs. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping in the new patch series. VACUUM can never<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37408Freezing/skipping strategies patch: motivating examples2022-12-15T17:39:58Z<p>Pgeoghegan: /* Constantly updated tables (usually smaller tables) */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 made the mechanism that chooses how to freeze stop reliably<br />
freezing XIDs that attain an age that exceeds the vacuum_freeze_min_age settings. Sometimes it actually does work in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the<br />
confounding influence of the visibility map.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily, when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these ones.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that there are no pages frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But we don't just allow table age to determine when we freeze all these pgbench_history pages in<br />
extreme cases -- it is the sole factor that determines when it happens, in practice, including when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two opposite trends that pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but we also want to freeze lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs for this, before the point<br />
that an aggressive mode VACUUM is required (there are quite a few autovacuums that look like similar to this one from<br />
earlier):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by lowering<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze all of the pages from the table containing<br />
only all-visible tuples right away, so that they're marked all-frozen in the VM instead of all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable at all there. Whereas here we see that even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While relfrozenxid advanced significantly less than it did in the earlier pgbench_history example, it nevertheless advanced by a huge amount by any traditional measure (in particular, by much more than the vacuum_freeze_min_age-based cutoff requires).<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will, too). In fact, even<br />
<i>much</i> earlier autovacuums, which took place when the table was much smaller, show approximately the same<br />
<b>percentage</b> of pages scanned and frozen. So even as the table keeps growing, the details remain approximately the<br />
same in that sense. Most importantly of all, there is never any need for an aggressive mode vacuum that does<br />
practically all of the freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst of it by making freezing conditional on<br />
all-visible-ness, and by avoiding eager freezing altogether in smaller tables.<br />
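<br />
The gap between all-visible and all-frozen pages can be observed directly with the pg_visibility contrib extension<br />
(standard PostgreSQL, unrelated to the patch):<br />
<br />
 CREATE EXTENSION IF NOT EXISTS pg_visibility;<br />
 SELECT all_visible, all_frozen,<br />
        all_visible - all_frozen AS visible_but_unfrozen<br />
 FROM pg_visibility_map_summary('bmsql_order_line');<br />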
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly<br />
fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's<br />
SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need<br />
to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Concurrently unset<br />
visibility map bits are potentially an important factor with long-running VACUUM operations, such as these. VACUUM is<br />
more insulated from the fact that the table continues to change while it runs, since the set of pages VACUUM must scan<br />
is "locked in" at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, which requires very little if any freezing, the<br />
costs also matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place when the table's age is about 60% - 65% of the way to the point where autovacuum would be forced to launch<br />
an antiwraparound autovacuum.<br />
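<br />
That fraction can be computed with an ordinary monitoring query (again, nothing patch-specific;<br />
autovacuum_freeze_max_age is the setting that forces an antiwraparound autovacuum):<br />
<br />
 SELECT c.relname,<br />
        age(c.relfrozenxid) AS xid_age,<br />
        round(100.0 * age(c.relfrozenxid) /<br />
              current_setting('autovacuum_freeze_max_age')::numeric, 1)<br />
          AS pct_towards_antiwraparound<br />
 FROM pg_class c<br />
 WHERE c.relname = 'bmsql_new_order';<br />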
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate -- not very low, but low enough. The<br />
logic for choosing a vmsnap skipping strategy determines that advancing relfrozenxid now is sufficiently cheap to be<br />
worth doing. So we advance relfrozenxid here because of a combination of 1) it being cheap to do so now (though not<br />
exceptionally cheap), and 2) table age starting to become somewhat of a concern (though certainly not to the extent<br />
that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has the "use oldest extant XID for relfrozenxid" mechanism added by commit {{PgCommitURL|0b018fab}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real-world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed, based on application<br />
characteristics. For example, a one-off bulk deletion is quite possible. The heuristics in play here will tend to<br />
notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on that one<br />
occasion (even if not exceptionally cheap), provided table age is already starting to be a concern. In other words,<br />
VACUUM has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance<br />
relfrozenxid inexpensively. Costs are a big part of the picture here -- a part that has mostly been missing until now.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits<br />
{{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on.<br />
It's included here because it shows the continued relevance of lazy strategy freezing, and because it's a good<br />
illustration of just how little freezing may be required to advance relfrozenxid by a great many XIDs, due to workload<br />
characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch is that <b>relfrozenxid generally has only a loose relationship with freeze<br />
debt</b>, which is hard to predict but tends to be fairly fixed for a given table/workload. Understanding and exploiting<br />
that difference comes up again and again. It <b>works both ways</b>: sometimes we need to freeze a lot to advance<br />
relfrozenxid by only a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a<br />
huge number of XIDs. Here we show an example of the latter case.<br />
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (could have used<br />
pgbench_tellers table just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
We'll always see this in any table with similar workload characteristics: since every tuple is updated before long<br />
anyway, no XID ever gets old enough to need freezing, which every VACUUM notices automatically, and which is reflected<br />
in the final relfrozenxid. In practice this happens reliably with tables like this one.<br />
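<br />
This is easy to verify directly: the oldest extant xmin in such a table stays very recent, because every tuple is<br />
replaced long before its xmin can grow old. For example (a plain system-column query, unrelated to the patch):<br />
<br />
 SELECT max(age(xmin)) AS oldest_xmin_age FROM pgbench_branches;<br />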
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar<br />
tables/workloads (such as the previous bmsql_new_order example) that require only a tiny amount of freezing, with<br />
negligible cost, to advance relfrozenxid by a great many XIDs. To some degree these sorts of scenarios justify the<br />
opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM can never know that one<br />
particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount, and then<br />
<b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37407Freezing/skipping strategies patch: motivating examples2022-12-15T17:30:22Z<p>Pgeoghegan: /* Patch */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related work added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally these characteristics can change over time, for whatever reason. VACUUM is supposed to keep up with, and even anticipate, the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 meant that this mechanism no longer reliably freezes XIDs whose<br />
age exceeds the vacuum_freeze_min_age setting. Sometimes it actually does work in the way that the very earliest design<br />
for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the confounding influence of the visibility<br />
map.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
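<br />
For reference, the freeze-related settings discussed throughout this page can be inspected as follows:<br />
<br />
 SELECT name, setting FROM pg_settings<br />
 WHERE name IN ('vacuum_freeze_min_age', 'vacuum_freeze_table_age',<br />
                'autovacuum_freeze_max_age', 'autovacuum_vacuum_insert_threshold');<br />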
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily, when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these ones.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
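<br />
For reference, pgbench_history is a narrow heap table with no indexes; pgbench's standard initialization (pgbench -i)<br />
creates it along these lines:<br />
<br />
 CREATE TABLE pgbench_history (<br />
     tid    integer,<br />
     bid    integer,<br />
     aid    integer,<br />
     delta  integer,<br />
     mtime  timestamp,<br />
     filler character(22)<br />
 );<br />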
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that there are no pages frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But we don't merely allow table age to determine when we freeze all these pgbench_history pages in<br />
extreme cases -- in practice, table age is the sole factor that determines when it happens, including when there is<br />
almost no practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
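<br />
Concretely, the per-table tuning that the docs describe amounts to something like this (a standard reloption, shown<br />
here to illustrate the status quo rather than anything added by the patch):<br />
<br />
 ALTER TABLE pgbench_history SET (autovacuum_freeze_min_age = 0);<br />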
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum happens to be the first one after pgbench_history has crossed 4GB in size. This threshold is<br />
controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default is 4GB.<br />
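<br />
Only the table's physical size matters here. The decision that the patch makes internally can be approximated by hand<br />
(illustration only; vacuum_freeze_strategy_threshold itself is a GUC proposed by the patch, and is not part of stock<br />
PostgreSQL):<br />
<br />
 SELECT pg_table_size('pgbench_history') >= pg_size_bytes('4GB')<br />
          AS use_eager_freezing_strategy;<br />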
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
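<br />
Visibility map state can also be inspected at the level of individual pages with pg_visibility (standard contrib,<br />
unrelated to the patch), e.g. to list the pages that are all-visible but not yet all-frozen:<br />
<br />
 SELECT blkno FROM pg_visibility_map('pgbench_history')<br />
 WHERE all_visible AND NOT all_frozen;<br />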
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real-world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily<br />
scaled up and down), but its design is nevertheless grounded in physical reality. There can only be so many orders per<br />
hour per warehouse. An individual customer can only order so many things per day, because individual human beings can<br />
only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that<br />
is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two trends that pull in opposite directions -- we want to freeze eagerly in some pages, but<br />
lazily in others. It's relatively difficult for the patch to infer which approach will work best at the level of each<br />
heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs for this table before the<br />
point that an aggressive mode VACUUM is required (there are quite a few autovacuums that look similar to this one):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by lowering<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze all of the pages from the table containing<br />
only all-visible tuples right away, so that they're marked all-frozen in the VM instead of all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable at all there. Whereas here we see that even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While relfrozenxid advanced significantly less than it did in the earlier pgbench_history example, it nevertheless advanced by a huge amount by any traditional measure (in particular, by much more than the vacuum_freeze_min_age-based cutoff requires).<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst impact of it, by conditioning triggering freezing on<br />
all-visible-ness, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, that requires very little if any freezing, the<br />
costs also matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place at a point where the table's age is about 60% - 65% of the way to the point that autovacuum needs to launch<br />
an antiwraparound autovacuum.<br />
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that the cost of advancing relfrozenxid now is sufficiently low that it<br />
makes sense to do so. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now<br />
(though not exceptionally cheap), and 2.) the fact that table age is starting to become somewhat of a concern (though<br />
certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has the "use oldest extant XID for relfrozenxid" mechanism added by commit {{PgCommitURL|0b018fab}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a once-off bulk deletion is quite possible. Note that the heuristics in play here will<br />
tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (though not too cheap), provided table age is already starting to be a concern. In other words VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here, which has mostly been missing before now.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing. And because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch, that recurs again and again, is that <b>relfrozenxid generally has a very loose relationship to freeze debt itself</b>. Understanding and exploiting that difference comes up again and again. It <b>works both ways</b>; sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case. <br />
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (could have used<br />
pgbench_tellers table just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
In practice we'll always see this in any table with similar workload characteristics. Since every tuple is updated before long anyway, no old XID will ever get old enough to need to be frozen, which every VACUUM will notice automatically, which will be reflected in the final relfrozenxid. In practice this seems to happen reliably with tables like this one.<br />
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload)<br />
that require only a very small amount of freezing to advance relfrozenxid by a great many XIDs -- perhaps just a tiny amount of freezing with negligible cost. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM cannot ever<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37406Freezing/skipping strategies patch: motivating examples2022-12-15T17:24:40Z<p>Pgeoghegan: /* Patch */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 made the mechanism that chooses how to freeze stop reliably<br />
freezing XIDs that attain an age that exceeds the vacuum_freeze_min_age settings. Sometimes it actually does work in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the<br />
confounding influence of the visibility map.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily, when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these ones.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that there are no pages frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But we don't just allow table age to determine when we freeze all these pgbench_history pages in<br />
extreme cases -- it is the sole factor that determines when it happens, in practice, including when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
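<br />
(vacuum_freeze_strategy_threshold exists only in the patch; as a sketch, and assuming the GUC accepts size units like other size-based settings, raising it would presumably look like this:)<br />
<br />
ALTER SYSTEM SET vacuum_freeze_strategy_threshold = '8GB';<br />
SELECT pg_reload_conf();<br />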
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for exactly the same reason -- "maximally aggressive" vacuuming would have been able to advance it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two trends that pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs for this, before the point<br />
that an aggressive mode VACUUM is required (there are quite a few autovacuums that look similar to this one, from<br />
earlier in the run):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by raising<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze, right away, all of the pages from the table that contain<br />
only all-visible tuples, so that they're marked all-frozen in the VM instead of merely all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable there, whereas here even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While the final relfrozenxid values we set here are significantly further behind the removable<br />
cutoff/OldestXmin than in the pgbench_history example, relfrozenxid is nevertheless very close<br />
to the removable cutoff/OldestXmin by any traditional measure.<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst of the impact, by making freezing conditional on a page<br />
being eligible to be set all-visible, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, which requires very little if any freezing, the<br />
costs also matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place when the table's age is about 60%-65% of the way to the point where autovacuum must launch<br />
an antiwraparound autovacuum.<br />
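<br />
(As a rough sketch, a DBA can compute this same "percentage of the way toward antiwraparound autovacuum" figure as follows; this ignores any per-table autovacuum_freeze_max_age reloption:)<br />
<br />
SELECT c.oid::regclass AS table_name,<br />
       age(c.relfrozenxid) AS xid_age,<br />
       round(100.0 * age(c.relfrozenxid)<br />
             / current_setting('autovacuum_freeze_max_age')::bigint, 1) AS pct_toward_antiwraparound<br />
FROM pg_class c<br />
WHERE c.relname = 'bmsql_new_order';<br />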
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that the cost of advancing relfrozenxid now is sufficiently low that it<br />
makes sense to do so. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now<br />
(though not exceptionally cheap), and 2.) the fact that table age is starting to become somewhat of a concern (though<br />
certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has the "use oldest extant XID for relfrozenxid" mechanism added by commit {{PgCommitURL|0b018fab}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a one-off bulk deletion is quite possible. Note that the heuristics in play here will<br />
tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (cheap, though perhaps not exceptionally so), provided table age is already starting to be a concern. In other words, VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here -- a consideration that has mostly been missing until now.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing, and because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch is that <b>relfrozenxid generally has a very loose relationship to freeze debt itself</b>. Understanding and exploiting that looseness comes up again and again. It <b>works both ways</b>; sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case.<br />
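<br />
(Freeze debt itself is directly observable with the contrib/pg_visibility extension -- the all-visible pages that are not yet all-frozen are exactly the debt. A minimal sketch, using the pgbench_branches table discussed next:)<br />
<br />
CREATE EXTENSION IF NOT EXISTS pg_visibility;<br />
SELECT count(*) AS allvisible_but_unfrozen<br />
FROM pg_visibility_map('pgbench_branches')<br />
WHERE all_visible AND NOT all_frozen;<br />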
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (could have used<br />
pgbench_tellers table just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
Since every tuple is updated before long anyway, no XID ever gets old enough to need freezing; every VACUUM notices this automatically, and it is reflected in the final relfrozenxid. In practice this happens reliably in any table with similar workload characteristics.<br />
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload)<br />
that require only a tiny amount of freezing, at negligible cost, to advance relfrozenxid by a great many XIDs. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM cannot ever<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>discovering</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37405Freezing/skipping strategies patch: motivating examples2022-12-14T22:33:57Z<p>Pgeoghegan: /* Patch */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related work added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 meant that the mechanism that chooses how to freeze stopped reliably<br />
freezing XIDs once they attain an age that exceeds the vacuum_freeze_min_age setting. Sometimes it still works in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the<br />
confounding influence of the visibility map.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
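<br />
(One way to observe this dynamic directly is the contrib/pg_visibility extension, which exposes the VM bits; pages that are all-visible but not all-frozen are precisely the ones that lazy freezing leaves behind for some later aggressive VACUUM. A minimal sketch, using pgbench_history from the first example below:)<br />
<br />
CREATE EXTENSION IF NOT EXISTS pg_visibility;<br />
SELECT count(*) FILTER (WHERE all_visible AND NOT all_frozen) AS unfrozen_allvisible_pages,<br />
       count(*) FILTER (WHERE all_frozen) AS frozen_pages<br />
FROM pg_visibility_map('pgbench_history');<br />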
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that no pages are frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until a VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units (pages and I/O costs, not XID ages) to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But table age doesn't merely influence when we freeze all these pgbench_history pages in<br />
extreme cases -- in practice it is the sole factor that determines when it happens, including when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing in the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that keep growing, but have a mix of inserts and updates -- see later examples for more.)<br />
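<br />
(For reference, the per-table version of that workaround is a plain reloption; the value 0 shown here is deliberately extreme, purely for illustration:)<br />
<br />
ALTER TABLE pgbench_history SET (autovacuum_freeze_min_age = 0);<br />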
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different from Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
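<br />
(vacuum_freeze_strategy_threshold exists only in the patch; as a sketch, and assuming the GUC accepts size units like other size-based settings, raising it would presumably look like this:)<br />
<br />
ALTER SYSTEM SET vacuum_freeze_strategy_threshold = '8GB';<br />
SELECT pg_reload_conf();<br />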
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for exactly the same reason -- "maximally aggressive" vacuuming would have been able to advance it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two trends that pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs for this, before the point<br />
that an aggressive mode VACUUM is required (there are quite a few autovacuums that look similar to this one, from<br />
earlier in the run):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by raising<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze, right away, all of the pages from the table that contain<br />
only all-visible tuples, so that they're marked all-frozen in the VM instead of merely all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable there, whereas here even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While the final relfrozenxid values we set here are significantly further behind the removable<br />
cutoff/OldestXmin than in the pgbench_history example, relfrozenxid is nevertheless very close<br />
to the removable cutoff/OldestXmin by any traditional measure.<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst of the impact, by making freezing conditional on a page<br />
being eligible to be set all-visible, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, which requires very little (if any) freezing, the<br />
costs also matter. (The age-related inputs to this decision can be inspected directly; a query sketch follows.)<br />
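<br />
A sketch of that inspection, using only standard catalog columns and GUCs (the table name comes from this benchmark's schema):<br />
<br />
 -- How old is the table, relative to the age-based settings?<br />
 SELECT c.relname,<br />
        age(c.relfrozenxid) AS xid_age,<br />
        current_setting('vacuum_freeze_table_age')::int AS freeze_table_age,<br />
        current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age<br />
 FROM pg_class c<br />
 WHERE c.relname = 'bmsql_new_order';<br />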
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place when the table's age is about 60% - 65% of the way to autovacuum_freeze_max_age -- the point at which<br />
autovacuum is forced to launch an antiwraparound autovacuum. (A rough check of that figure, from the log output above, appears below.)<br />
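<br />
That figure can be roughly reconstructed from the log output above, assuming this run used the default autovacuum_freeze_max_age of 200 million (an assumption -- the actual setting isn't shown in the log):<br />
<br />
 -- previous relfrozenxid = new value minus the reported advance:<br />
 SELECT 190536853 - 81808797                AS previous_relfrozenxid,  -- 108728056<br />
        227707625 - (190536853 - 81808797)  AS table_age_in_xids,      -- ~119 million<br />
        round(100.0 * (227707625 - (190536853 - 81808797))<br />
              / 200000000, 1)               AS pct_of_freeze_max_age;  -- ~59.5<br />
<br />
That lands right around the bottom of the 60% - 65% range given above.<br />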
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that this cost is nevertheless low enough that advancing relfrozenxid now<br />
makes sense. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now<br />
(though not exceptionally cheap), and 2.) the fact that table age is starting to become somewhat of a concern (though<br />
certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has related work added by commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real-world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
There is likely to be considerable variation in how each table needs to be vacuumed, based on application<br />
characteristics. For example, a one-off bulk deletion is quite possible (a sketch of that case follows). The heuristics in play here will<br />
tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (though perhaps not exceptionally cheap), provided table age is already starting to be a concern. In other words, VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here -- a part that has mostly been missing from VACUUM's decision making until now.<br />
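<br />
As a concrete (hypothetical) illustration of such a window -- the table and column names here are invented:<br />
<br />
 -- A rarely-run retention job deletes months of data at once:<br />
 DELETE FROM app_events WHERE created_at < now() - interval '90 days';<br />
 -- The next (auto)vacuum must scan the affected pages anyway, to remove<br />
 -- the dead tuples, making this an unusually cheap moment to also advance<br />
 -- relfrozenxid -- exactly the kind of window the heuristics try to exploit.<br />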
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already gets right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing, and because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch, which recurs again and again, is that <b>relfrozenxid generally has a very loose relationship to freeze debt itself</b>. It <b>works both ways</b>: sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case.<br />
<br />
Consider the following fairly typical autovacuum output from pgbench's pgbench_branches table (the<br />
pgbench_tellers table could have been used just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
Since every tuple is updated again before long anyway, no XID ever gets old enough to need freezing -- something that every VACUUM notices automatically, and that is then reflected in the final relfrozenxid it sets. In practice this happens reliably in any table with similar workload characteristics.<br />
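<br />
One easy way to watch this happen on a live pgbench database (with log_autovacuum_min_duration = 0, so that autovacuum reports like the one above get logged) is to poll the catalog between autovacuums:<br />
<br />
 -- age(relfrozenxid) should repeatedly fall back to a small value,<br />
 -- even though each autovacuum report shows zero pages frozen:<br />
 SELECT relname, relfrozenxid, age(relfrozenxid) AS xid_age<br />
 FROM pg_class<br />
 WHERE relname IN ('pgbench_branches', 'pgbench_tellers');<br />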
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload)<br />
that require only a tiny amount of freezing, at negligible cost, to advance relfrozenxid by a great many XIDs. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM cannot ever<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37404Freezing/skipping strategies patch: motivating examples2022-12-14T22:13:12Z<p>Pgeoghegan: /* Mixed inserts and deletes */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 made the mechanism that chooses how to freeze stop reliably<br />
freezing XIDs that attain an age that exceeds the vacuum_freeze_min_age settings. Sometimes it actually does work in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the<br />
confounding influence of the visibility map.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily, when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these ones.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that there are no pages frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But we don't just allow table age to determine when we freeze all these pgbench_history pages in<br />
extreme cases -- it is the sole factor that determines when it happens, in practice, including when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two opposite trends that pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but we also want to freeze lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs for this, before the point<br />
that an aggressive mode VACUUM is required (there are quite a few autovacuums that look like similar to this one from<br />
earlier):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by lowering<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze all of the pages from the table containing<br />
only all-visible tuples right away, so that they're marked all-frozen in the VM instead of all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable at all there. Whereas here we see that even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While the final relfrozenxid values we set here is significantly older than the removable<br />
cutoff/OldestXmin if we compare this example to the pgbench_history example, the relfrozenxid is nevertheless very close<br />
to removable cutoff/OldestXmin by any traditional measure.<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst impact of it, by conditioning triggering freezing on<br />
all-visible-ness, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.<br />
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is fixed (for a given scale factor/number of warehouses), a little like a FIFO queue (though not quite).<br />
Autovacuum tends to need to remove quite a lot of concentrated bloat from the new orders table, to deal with its constant deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, that requires very little if any freezing, the<br />
costs also matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place at a point where the table's age is about 55% - 60% of the way to the point that autovacuum needs to launch<br />
an antiwraparound autovacuum.<br />
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that the cost of advancing relfrozenxid now is sufficiently low that it<br />
makes sense to do so. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now<br />
(though not exceptionally cheap), and 2.) the fact that table age is starting to become somewhat of a concern (though<br />
certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has related work added by commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a once-off bulk deletion is quite possible. Note that the heuristics in play here will<br />
tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (though not too cheap), provided table age is already starting to be a concern. In other words VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here, which has mostly been missing before now.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing. And because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch, that recurs again and again, is that <b>relfrozenxid generally has a very loose relationship to freeze debt itself</b>. Understanding and exploiting that difference comes up again and again. It <b>works both ways</b>; sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case. <br />
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (could have used<br />
pgbench_tellers table just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
In practice we'll always see this in any table with similar workload characteristics. Since every tuple is updated before long anyway, no old XID will ever get old enough to need to be frozen, which every VACUUM will notice automatically, which will be reflected in the final relfrozenxid. In practice this seems to happen reliably with tables like this one.<br />
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload)<br />
that require only a very small amount of freezing to advance relfrozenxid by a great many XIDs -- perhaps just a tiny amount of freezing with negligible cost. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM cannot ever<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37403Freezing/skipping strategies patch: motivating examples2022-12-14T22:10:28Z<p>Pgeoghegan: /* Patch */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table, with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics. Certain approaches are really only interesting with certain kinds of tables. Naturally this can change over time, for whatever reason. VACUUM is supposed to keep up with and even anticipate the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 made the mechanism that chooses how to freeze stop reliably<br />
freezing XIDs that attain an age that exceeds the vacuum_freeze_min_age settings. Sometimes it actually does work in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the<br />
confounding influence of the visibility map.<br />
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily, when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these ones.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that there are no pages frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But we don't just allow table age to determine when we freeze all these pgbench_history pages in<br />
extreme cases -- it is the sole factor that determines when it happens, in practice, including when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing by the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow, but have a mix of inserts and updates -- see later examples for more.)<br />
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different to Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum (shown here) is representative of every other future autovacuum against the same pgbench_history table, at least on a "percentage of pages scanned/frozen from table" basis:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of adjusting to using an eager approach to freezing in each VACUUM. This totally obviates the need for a distinct aggressive mode of operation for VACUUM (and so the patch fully removes the concept of aggressive mode VACUUM, while retaining the concept of antiwraparound autovacuum).<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
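<br />
To make that policy concrete, here is a minimal C sketch of the per-page decision just described. All of the names below are hypothetical illustrations rather than the patch's actual identifiers; the point is only that the eager strategy changes the freezing policy, not the per-page eligibility test:<br />
<br />
 #include <stdbool.h><br />
 <br />
 /* Hypothetical page summary -- not the patch's actual data structure */<br />
 typedef struct PageFreezeInfo<br />
 {<br />
     bool all_visible;           /* page could be set all-visible in the VM */<br />
     bool all_frozen_if_frozen;  /* freezing now would leave it all-frozen */<br />
 } PageFreezeInfo;<br />
 <br />
 static bool<br />
 should_freeze_page(const PageFreezeInfo *page, bool eager_strategy)<br />
 {<br />
     /*<br />
      * Eager freezing is a policy about when to freeze: the page must<br />
      * still be eligible to become all-frozen (not merely all-visible),<br />
      * which is why a handful of scanned pages are left unfrozen.<br />
      */<br />
     return eager_strategy && page->all_visible && page->all_frozen_if_frozen;<br />
 }<br />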
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). These two tables are by far the largest of the whole set (at<br />
least once the benchmark has been running for a while, which a valid run requires).<br />
<br />
The benchmark is designed to at least loosely reflect real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which inserts rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark never updates or deletes those rows again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at any given time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because two trends pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but we also want to freeze lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs here, before the point<br />
that an aggressive mode VACUUM is required (there are quite a few autovacuums that look similar to this one):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (for example by raising<br />
autovacuum_vacuum_insert_threshold), since then we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
<b>do less freezing as a consequence of doing more vacuuming!</b><br />
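<br />
A minimal sketch of the mechanism behind that effect, assuming the current rule that a tuple is frozen only if its XID already exceeds vacuum_freeze_min_age in age at the moment its page happens to be scanned (the function name and the wraparound-naive age arithmetic are illustrative only):<br />
<br />
 #include <stdbool.h><br />
 #include <stdint.h><br />
 <br />
 typedef uint32_t TransactionId;<br />
 <br />
 /* Illustrative, wraparound-naive version of the vacuum_freeze_min_age test */<br />
 static bool<br />
 would_freeze(TransactionId xmin, TransactionId next_xid,<br />
              uint32_t freeze_min_age)<br />
 {<br />
     return (next_xid - xmin) >= freeze_min_age;<br />
 }<br />
<br />
If insert-triggered autovacuums arrive every couple of million XIDs (an arbitrary figure, for illustration), each new heap page is first scanned while its tuples are far younger than the default vacuum_freeze_min_age of 50 million, so the test fails; once the page is marked all-visible, later non-aggressive VACUUMs skip it entirely, and its tuples get no further chance to be frozen until the next aggressive VACUUM.<br />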
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch has autovacuum/VACUUM consistently freeze, right away, all pages of the table that contain<br />
only all-visible tuples, so that they're marked all-frozen in the VM instead of merely all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable there. Here, by contrast, even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While the final relfrozenxid values set here are significantly older than the removable<br />
cutoff/OldestXmin when compared to the pgbench_history example, relfrozenxid is nevertheless very close<br />
to the removable cutoff/OldestXmin by any traditional measure.<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
keeps growing, those proportions stay approximately the same over time. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that performs practically all of the freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst of it, by making freezing conditional on<br />
all-visible-ness, and by avoiding eager freezing altogether in smaller tables.<br />
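<br />
The small-table carve-out can be pictured as a simple size gate based on the patch's vacuum_freeze_strategy_threshold GUC (proposed default: 4GB). This is just a hedged sketch of the idea; the function name and inputs are invented for illustration:<br />
<br />
 #include <stdbool.h><br />
 #include <stdint.h><br />
 <br />
 /* Illustrative: eager freezing only applies above the size threshold */<br />
 static bool<br />
 use_eager_freeze_strategy(uint64_t table_bytes, uint64_t threshold_bytes)<br />
 {<br />
     /* threshold_bytes stands in for vacuum_freeze_strategy_threshold */<br />
     return table_bytes >= threshold_bytes;<br />
 }<br />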
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long-running VACUUM operations such as these. VACUUM is more insulated from the fact that the table continues to change while it runs, since the pages VACUUM must scan are "locked in" at the start of each VACUUM.<br />
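<br />
As a rough illustration of that "lock in" idea, suppose a VM snapshot is simply an immutable copy of the all-visible bits, taken once when VACUUM starts (all of the following names are hypothetical, not the patch's actual code):<br />
<br />
 #include <stdbool.h><br />
 #include <stdint.h><br />
 <br />
 typedef uint32_t BlockNumber;<br />
 <br />
 /* Hypothetical VM snapshot: all-visible bits copied at VACUUM start */<br />
 typedef struct VMSnapshot<br />
 {<br />
     BlockNumber nblocks;<br />
     const bool *all_visible;<br />
 } VMSnapshot;<br />
 <br />
 static BlockNumber<br />
 next_block_to_scan(const VMSnapshot *snap, BlockNumber next)<br />
 {<br />
     /*<br />
      * The snapshot drives the scan, so VM bits that are concurrently<br />
      * unset after the snapshot was taken cannot add pages to it.<br />
      */<br />
     while (next < snap->nblocks && snap->all_visible[next])<br />
         next++;<br />
     return next;    /* equals nblocks once the scan is complete */<br />
 }<br />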
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time (for this workload) both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is bounded (for a given scale factor/number of warehouses), and autovacuum tends to<br />
need to remove quite a lot of concentrated bloat due to deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, which requires very little if any freezing, the<br />
costs also matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place when the table's age is about 55%-60% of the way to the threshold at which autovacuum must launch<br />
an antiwraparound autovacuum.<br />
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that the cost is now low enough to be worth paying. So we advance<br />
relfrozenxid here because of a combination of 1.) it being cheap to do so now (though not exceptionally cheap), and<br />
2.) the fact that table age is starting to become somewhat of a concern (though certainly not to the extent that<br />
VACUUM is forced to advance relfrozenxid).<br />
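<br />
The shape of that trade-off can be sketched as a scan-cost budget that loosens as table age climbs toward the antiwraparound threshold. The numbers and names below are invented purely for illustration; the patch's real heuristic and inputs differ:<br />
<br />
 #include <stdbool.h><br />
 <br />
 /*<br />
  * Hypothetical cost/benefit test for eager vmsnap skipping: advance<br />
  * relfrozenxid when the extra fraction of the table we'd have to scan<br />
  * fits within a budget that grows with table age (0.0 = brand new<br />
  * table, 1.0 = antiwraparound cutoff reached).<br />
  */<br />
 static bool<br />
 advance_relfrozenxid_now(double extra_scan_fraction,<br />
                          double table_age_fraction)<br />
 {<br />
     double budget = 0.05 + 0.90 * table_age_fraction;  /* made-up curve */<br />
 <br />
     return extra_scan_fraction <= budget;<br />
 }<br />
<br />
Under this made-up curve the situation above -- roughly 20 extra percentage points of the table scanned, at 55%-60% of the way to antiwraparound -- comfortably passes, which matches the behavior shown in the log output.<br />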
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has related work added by commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a once-off bulk deletion is quite possible. The heuristics in play here will<br />
tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (even if not exceptionally cheap), provided table age is already starting to be a concern. In other words, VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here -- a consideration that has mostly been missing until now.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing, and because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch is that <b>relfrozenxid generally has a very loose relationship to freeze debt itself</b>. Understanding and exploiting that difference comes up again and again. It <b>works both ways</b>; sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case.<br />
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (we could have used the<br />
pgbench_tellers table just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
In practice we'll reliably see this in any table with similar workload characteristics. Since every tuple is updated before long anyway, no XID ever gets old enough to need freezing; each VACUUM notices this automatically, and it is reflected in the final relfrozenxid it sets.<br />
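<br />
This dovetails with how relfrozenxid is computed on recent versions (per the Postgres 15 commits cited above): VACUUM tracks the oldest XID that remains in the table as it scans, and that running minimum becomes the new relfrozenxid. A simplified sketch that ignores XID wraparound and MultiXactIds (the function name is illustrative):<br />
<br />
 #include <stdint.h><br />
 <br />
 typedef uint32_t TransactionId;<br />
 <br />
 /*<br />
  * Start the candidate relfrozenxid at the removable cutoff, then ratchet<br />
  * it down for every unfrozen XID encountered.  When every tuple was<br />
  * updated recently, nothing ratchets it down by much, so relfrozenxid<br />
  * lands very close to the cutoff -- with zero freezing.<br />
  */<br />
 static TransactionId<br />
 observe_unfrozen_xid(TransactionId new_relfrozenxid, TransactionId xid)<br />
 {<br />
     return (xid < new_relfrozenxid) ? xid : new_relfrozenxid;<br />
 }<br />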
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload)<br />
that require only a tiny amount of freezing, at negligible cost, to advance relfrozenxid by a great many XIDs. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM can never<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>noticing</b> that it can be advanced by a great deal quite easily. (The other reason to be eager about advancing<br />
relfrozenxid is to avoid advancing relfrozenxid for many different tables around the same time.)</div>Pgeogheganhttps://wiki.postgresql.org/index.php?title=Freezing/skipping_strategies_patch:_motivating_examples&diff=37401Freezing/skipping strategies patch: motivating examples2022-12-14T17:08:58Z<p>Pgeoghegan: /* Patch */</p>
<hr />
<div>This page documents problem cases addressed by the patch to remove aggressive mode VACUUM by making VACUUM freeze on a<br />
proactive timeline, driven by concerns about managing the number of unfrozen heap pages that have accumulated in larger<br />
tables. This patch is proposed for Postgres 16, and builds on related work added to Postgres 15 (see Postgres 15 commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
See also: [https://commitfest.postgresql.org/40/3843/ CF entry for patch series]<br />
<br />
It makes sense to discuss the patch series by focusing on various motivating examples, each of which involves one particular table with its own more or less fixed set of performance characteristics. VACUUM must decide certain details around freezing and advancing relfrozenxid based on these kinds of per-table characteristics; certain approaches are really only interesting with certain kinds of tables. Naturally, a table's characteristics can change over time. VACUUM is supposed to keep up with, and even anticipate, the needs of the table, over time and across multiple successive VACUUM operations.<br />
<br />
= Background: How the visibility map influences the interpretation/application of vacuum_freeze_min_age =<br />
<br />
At a high level, VACUUM currently chooses when and how to freeze tuples based solely on whether or not a given XID is<br />
older than vacuum_freeze_min_age for a tuple on a page that it actually scans. This means that all-visible pages left<br />
behind by a previous VACUUM operation won't even be considered for freezing until the next aggressive mode VACUUM<br />
(barring tuples on heap pages that happen to be modified some time before aggressive VACUUM finally kicks in).<br />
<br />
The introduction of the visibility map in Postgres 8.4 made the mechanism that chooses how to freeze stop reliably<br />
freezing XIDs whose age exceeds the vacuum_freeze_min_age setting. Sometimes it actually does work in<br />
the way that the very earliest design for vacuum_freeze_min_age intended, and sometimes it doesn't, mostly due to the<br />
confounding influence of the visibility map.<br />
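<br />
To make that mechanism concrete, here is a minimal standalone C model of the per-tuple decision on HEAD, as just described. It is purely illustrative -- the names are invented, and the age arithmetic ignores XID wraparound -- not actual PostgreSQL source:<br />
<br />
 #include <stdbool.h><br />
 #include <stdint.h><br />
 <br />
 typedef uint32_t TransactionId;<br />
 <br />
 static bool<br />
 would_freeze_tuple(bool page_is_all_visible,    /* VM bit set by a prior VACUUM */<br />
                    TransactionId xmin,<br />
                    TransactionId oldest_xmin,<br />
                    uint32_t vacuum_freeze_min_age)<br />
 {<br />
     /*<br />
      * A non-aggressive VACUUM never scans pages already marked all-visible,<br />
      * so their tuples are never even considered, however old they get.<br />
      */<br />
     if (page_is_all_visible)<br />
         return false;<br />
 <br />
     /* Otherwise freeze only once the XID's age crosses the GUC threshold */<br />
     return (oldest_xmin - xmin) >= vacuum_freeze_min_age;<br />
 }<br />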
<br />
The pre-visibility-map design was based on the idea that lazy processing could avoid needlessly freezing tuples that<br />
would inevitably be modified before too long anyway. That in itself wasn't a bad idea, and still isn't now; laziness<br />
can still make sense. It happens to be <b>wildly inappropriate</b> in certain kinds of tables, where VACUUM should prefer a<br />
much more <b>proactive freezing cadence</b>. Knowing the difference (recognizing the kinds of tables that VACUUM should prefer<br />
to freeze eagerly rather than lazily) is an issue of central importance for the project.<br />
<br />
= Examples =<br />
<br />
Most of these examples show cases where VACUUM behaves lazily when it clearly should behave eagerly. Even an<br />
expert DBA will currently have a hard time tuning the system to do the right thing with tables/workloads like these.<br />
<br />
== Simple append-only ==<br />
<br />
The best example is also the simplest: a strict append-only table, such as pgbench_history. Naturally, this table will<br />
get a lot of insert-driven (triggered by autovacuum_vacuum_insert_threshold) autovacuums.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Currently, we see autovacuum behavior like this (triggered by autovacuum_vacuum_insert_threshold):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 171638 remain, 21942 scanned (12.78% of total)<br />
tuples: 0 removed, 26947118 remain, 0 are dead but not yet removable<br />
removable cutoff: 101761411, which was 767958 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 43958 hits, 1 misses, 1 dirtied<br />
WAL usage: 21461 records, 1 full page images, 1274525 bytes<br />
system usage: CPU: user: 1.35 s, system: 0.34 s, elapsed: 2.39 s<br />
<br />
Note that there are no pages frozen. There will <b>never</b> be <i>any</i> pages frozen by any VACUUM, unless and until the VACUUM is<br />
aggressive, at which point we'll rewrite most of the table to freeze most of its tuples. Clearly this doesn't make any<br />
sense; we really should be eagerly freezing the table from a fairly early stage, so that the burden of freezing is<br />
evenly spread out over time and across multiple VACUUM operations.<br />
<br />
Perhaps there is some limited argument to be made for laziness here, at least earlier on, but why should we solely rely<br />
on table age (aggressive mode VACUUM) to take care of freezing? We need physical units to make a sensible choice in<br />
favor of lazy freezing. After all, table age tells us precisely nothing about the eventual cost of freezing. If the<br />
pgbench_history table happened to have tuples that were twice as wide, that would mean that the eventual cost of<br />
freezing during an aggressive mode VACUUM would also approximately double. But (assuming that all other things remain<br />
unchanged) the timeline for when we'd freeze would stay exactly the same.<br />
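<br />
A toy calculation makes this concrete (a purely illustrative model, with invented names): the eventual cost of freezing is a function of physical units, while on HEAD the timing is a function of table age alone:<br />
<br />
 #include <stdbool.h><br />
 #include <stdint.h><br />
 <br />
 #define BLCKSZ 8192     /* default PostgreSQL block size */<br />
 <br />
 /*<br />
  * The eventual freezing cost is physical: pages that must be rewritten.<br />
  * Twice-as-wide tuples mean roughly twice the unfrozen pages, and so<br />
  * roughly twice the cost.<br />
  */<br />
 static uint64_t<br />
 eventual_freeze_cost_bytes(uint64_t unfrozen_pages)<br />
 {<br />
     return unfrozen_pages * BLCKSZ;<br />
 }<br />
 <br />
 /* But on HEAD, *when* that cost is paid depends only on XID consumption */<br />
 static bool<br />
 aggressive_vacuum_due(uint32_t table_age, uint32_t freeze_table_age)<br />
 {<br />
     return table_age >= freeze_table_age;   /* cost plays no part at all */<br />
 }<br />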
<br />
By the same token, the timeline for freezing all of the pages from the pgbench_history table will effectively be<br />
accelerated by transactions that don't even modify the table itself. This isn't fundamentally unreasonable in extreme<br />
cases, where the risk of the system entering xidStopLimit mode really does need to influence the timeline for freezing a<br />
table like this. But table age doesn't just determine when we freeze all these pgbench_history pages in<br />
extreme cases -- in practice it is the sole factor that determines when it happens, even when there is almost no<br />
practical risk of the system entering xidStopLimit (i.e. almost always).<br />
<br />
It's possible to tune vacuum_freeze_min_age in order to make sure that such a table is frozen eagerly, as mentioned in passing in the<br />
[https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM Routine Vacuuming/the Autovacuum Daemon] section of the docs (starting at <code>it may be beneficial to lower the table's autovacuum_freeze_min_age...</code>), but this relies on the DBA actually having a direct understanding of<br />
the problem. It also requires that the DBA use a setting based on XID age (vacuum_freeze_min_age, or the autovacuum_freeze_min_age reloption). While that is at<br />
least feasible for a pure append-only table like this one, it still isn't a particularly natural way for the DBA to<br />
control the problem. (More importantly, tuning vacuum_freeze_min_age with this goal in mind runs into problems with<br />
tables that grow and grow but have a mix of inserts and updates -- see later examples for more.)<br />
<br />
=== Patch ===<br />
<br />
The patch will have autovacuum/VACUUM consistently freeze all of the pages from the table on an eager timeline (autovacuum is once again triggered by autovacuum_vacuum_insert_threshold here, of course). Initially the patch behaves in a way that isn't visibly different from Postgres HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 477149 remain, 48188 scanned (10.10% of total)<br />
tuples: 0 removed, 74912386 remain, 0 are dead but not yet removable<br />
removable cutoff: 149764963, which was 1759214 XIDs old when operation ended<br />
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.004 ms, write: 0.000 ms<br />
avg read rate: 0.001 MB/s, avg write rate: 0.001 MB/s<br />
buffer usage: 96544 hits, 1 misses, 1 dirtied<br />
WAL usage: 47945 records, 1 full page images, 2837081 bytes<br />
system usage: CPU: user: 3.00 s, system: 0.91 s, elapsed: 5.47 s<br />
<br />
The next autovacuum that takes place happens to be the first autovacuum after pgbench_history has<br />
crossed 4GB in size. This threshold is controlled by the GUC vacuum_freeze_strategy_threshold, and its proposed default<br />
is 4GB.<br />
<br />
Here is where the patch starts to diverge from HEAD:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 526836 remain, 49931 scanned (9.48% of total)<br />
tuples: 0 removed, 82713253 remain, 0 are dead but not yet removable<br />
removable cutoff: 157563960, which was 1846970 XIDs old when operation ended<br />
frozen: <b>49665 pages</b> from table (9.43% of total) had 7797405 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.000 ms<br />
avg read rate: 0.003 MB/s, avg write rate: 0.003 MB/s<br />
buffer usage: 100044 hits, 2 misses, 2 dirtied<br />
WAL usage: 99331 records, 2 full page images, 21720187 bytes<br />
system usage: CPU: user: 3.24 s, system: 0.87 s, elapsed: 5.72 s<br />
<br />
Note that this isn't such a huge shift -- not at first. We do freeze all of the pages we've scanned in this VACUUM, but<br />
we don't advance relfrozenxid proactively. A transition is now underway, which finishes when the size of the<br />
pgbench_history table reaches 8GB (since that's twice the vacuum_freeze_strategy_threshold setting). Several more<br />
autovacuums will be triggered that each look similar to the above example.<br />
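<br />
The two size thresholds can be sketched as follows. This is a hedged model of the transition described above, with invented names and structure -- not the patch's actual code:<br />
<br />
 #include <stdbool.h><br />
 #include <stdint.h><br />
 <br />
 #define BLCKSZ 8192     /* default PostgreSQL block size */<br />
 <br />
 typedef struct StrategyChoice<br />
 {<br />
     bool eager_freezing;        /* freeze scanned pages proactively? */<br />
     bool eager_relfrozenxid;    /* scan all unfrozen pages to advance it? */<br />
 } StrategyChoice;<br />
 <br />
 static StrategyChoice<br />
 choose_vacuum_strategies(uint64_t rel_pages, uint64_t freeze_strategy_threshold)<br />
 {<br />
     StrategyChoice choice;<br />
     uint64_t rel_bytes = rel_pages * BLCKSZ;<br />
 <br />
     /* Eager freezing begins at the threshold itself (4GB by default)... */<br />
     choice.eager_freezing = (rel_bytes >= freeze_strategy_threshold);<br />
     /* ...and eager relfrozenxid advancement at twice the threshold (8GB) */<br />
     choice.eager_relfrozenxid = (rel_bytes >= 2 * freeze_strategy_threshold);<br />
     return choice;<br />
 }<br />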
<br />
Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1078444 remain, <b>561143 scanned</b> (52.03% of total)<br />
tuples: 0 removed, 169315499 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>244160328</b>, which was 32167955 XIDs old when operation ended<br />
new relfrozenxid: <b>99467843</b>, which is 24578353 XIDs ahead of previous value<br />
frozen: <b>560841 pages</b> from table (52.00% of total) had 88051825 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 384.224 ms, write: 1003.192 ms<br />
avg read rate: 16.728 MB/s, avg write rate: 36.910 MB/s<br />
buffer usage: 906525 hits, 216130 misses, 476907 dirtied<br />
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes<br />
system usage: CPU: user: 23.78 s, system: 8.52 s, elapsed: 100.94 s<br />
<br />
This is a little like an aggressive VACUUM, because we have to freeze about half the pages in the table (<b>Note to self: might be a<br />
good idea for the patch to adjust the heuristic so that we advance relfrozenxid eagerly for the first time in a much<br />
later VACUUM -- even this seems a bit too close to aggressive VACUUM</b>).<br />
<br />
From this point onwards every autovacuum of pgbench_history will scan the same percentage of the table's pages, and freeze almost all of those same pages (setting them all-frozen for good). The overhead of the very next autovacuum is representative of every autovacuum that will ever happen on the same table going forward (on a "percentage of pages scanned/frozen from table" basis):<br />
<br />
automatic vacuum of table "regression.public.pgbench_history": index scans: 0<br />
pages: 0 removed, 1500396 remain, 146832 scanned (9.79% of total)<br />
tuples: 0 removed, 235561936 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>310431281</b>, which was 5275067 XIDs old when operation ended<br />
new relfrozenxid: <b>310426654</b>, which is 23032061 XIDs ahead of previous value<br />
frozen: 146698 pages from table (9.78% of total) had 23031179 tuples frozen<br />
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed<br />
I/O timings: read: 0.013 ms, write: 0.009 ms<br />
avg read rate: 0.002 MB/s, avg write rate: 0.002 MB/s<br />
buffer usage: 294142 hits, 4 misses, 4 dirtied<br />
WAL usage: 293397 records, 4 full page images, 64139188 bytes<br />
system usage: CPU: user: 9.42 s, system: 2.49 s, elapsed: 16.46 s<br />
<br />
In summary, we get <b>perfect performance stability</b> with the patch, after an initial period of transitioning to an eager approach to freezing in each VACUUM.<br />
<br />
This last autovacuum doesn't have <i>exactly</i> the same details as an autovacuum that simply had vacuum_freeze_min_age set to 0. Note that<br />
there are a tiny number of unfrozen heap pages that still got scanned (146832 scanned - 146698 frozen = <b>134 pages scanned but left unfrozen</b>). We'd probably see no remaining unfrozen scanned pages whatsoever, were we to try the same thing with vacuum_freeze_min_age/autovacuum_freeze_min_age set to 0 -- so what we see here isn't "maximally aggressive" freezing. (Note also that "new relfrozenxid" is not quite the same as "removable cutoff", for the same exact reason -- "maximally aggressive" vacuuming would have been able to get it right up to the "removable cutoff" value shown.)<br />
<br />
VACUUM leaves behind a tiny number of unfrozen pages like this because the patch only triggers page-level freezing<br />
proactively when it sees that the whole heap page will thereby become all-frozen instead of all-visible. So eager<br />
freezing is only a policy about when and how we freeze pages -- it still requires individual heap pages to look a certain way before we'll actually go ahead with freezing them. This aspect of the design barely matters in this example, but will be much more important with the next example.<br />
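<br />
A minimal sketch of that page-level condition (a hypothetical helper, simplified from the behavior just described):<br />
<br />
 #include <stdbool.h><br />
 <br />
 /*<br />
  * Under the eager strategy, trigger freezing for a heap page only when<br />
  * doing so will leave it all-frozen in the VM, not merely all-visible.<br />
  */<br />
 static bool<br />
 should_trigger_page_freeze(bool page_will_become_all_visible,<br />
                            bool all_remaining_tuples_freezable)<br />
 {<br />
     /*<br />
      * Pages that can't yet be set all-visible (e.g. some dead tuples<br />
      * aren't removable yet) are left alone; freezing them now would<br />
      * likely be wasted work, since they'll be revisited anyway.<br />
      */<br />
     if (!page_will_become_all_visible)<br />
         return false;<br />
 <br />
     return all_remaining_tuples_freezable;<br />
 }<br />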
<br />
== Continual inserts with updates ==<br />
<br />
The TPC-C benchmark has two tables that continue to grow for as long as the benchmark is run: the order table, and the<br />
order lines table. The order lines table is the bigger of the two, by far (each order has about 10 order lines on<br />
average, plus the rows are naturally wider). And these tables are by far the largest tables out of the whole set (at<br />
least after the benchmark is run for a little while, which is a requirement).<br />
<br />
The benchmark is designed to work in a way that is at least loosely based on real world conditions for a network of<br />
wholesalers/distributors. Orders come in from customers, and are delivered some time later; more than 10 hours later<br />
with spec-compliant settings. It's somewhat synthetic data (due to the requirement that the benchmark can be easily scaled up and down), but its design is nevertheless somewhat grounded in physical reality. There can only be so many orders per hour per warehouse. An individual customer can only order so many things per day, because individual human beings can only engage in so many transactions in a given 24 hour period. In short, the benchmark shouldn't have a workload that is wildly unrealistic for the business process that it seeks to simulate.<br />
<br />
See also: [https://github.com/wieck/benchmarksql/blob/master/docs/TimedDriver.md BenchmarkSQL Timed Driver docs], written by Jan Wieck.<br />
<br />
The orders are initially inserted by the order transaction, which will insert rows into both tables in the obvious way.<br />
Later on, all of the rows inserted (into both tables) are updated by the delivery transaction. After that, the<br />
benchmark will never update or delete the same rows, ever again. This could be described as an adversarial workload,<br />
because there is a relatively high number of updates around the same key space/physical heap blocks at the same time,<br />
but the hot-spot continually changes as time marches forward. This adversarial mix is particularly relevant to the<br />
project, because there are two opposite trends that pull in opposite directions -- we want to freeze eagerly in some<br />
pages, but we also want to freeze lazily in other pages. It's relatively difficult for the patch to infer which<br />
approach will work best at the level of each heap page.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
Like the pgbench_history example, we see practically no freezing in non-aggressive VACUUMs here, before the point<br />
that an aggressive mode VACUUM is required (there are quite a few autovacuums that look similar to this<br />
one):<br />
<br />
ts: 2022-12-06 03:18:18 PST x: 0 v: 4/8515 p: 554727 LOG: <b>automatic vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 16784562 remain, <b>4547881 scanned (27.10% of total)</b><br />
tuples: 10112936 removed, 1023712482 remain, 5311068 are dead but not yet removable<br />
removable cutoff: 194441762, which was 7896967 XIDs old when operation ended<br />
frozen: 152106 pages from table <b>(0.91% of total)</b> had 3757982 tuples frozen<br />
index scan needed: 547657 pages from table (3.26% of total) had 1828207 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4448447 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 471323.673 ms, write: 23125.683 ms<br />
avg read rate: 35.226 MB/s, avg write rate: 14.049 MB/s<br />
buffer usage: 4666574 hits, 9431082 misses, 3761396 dirtied<br />
WAL usage: 4587410 records, 2030544 full page images, 13789178294 bytes<br />
system usage: CPU: user: 56.99 s, system: 74.71 s, elapsed: 2091.63 s<br />
<br />
The burden of freezing is almost completely borne by aggressive mode VACUUMs:<br />
<br />
ts: 2022-12-06 06:09:06 PST x: 0 v: 45/8157 p: 556501 LOG: <b>automatic aggressive vacuum</b> of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 18298517 remain, <b>16577256 scanned (90.59% of total)</b><br />
tuples: 12190981 removed, 1127721257 remain, 17548258 are dead but not yet removable<br />
removable cutoff: 213459729, which was 25528936 XIDs old when operation ended<br />
new relfrozenxid: 163469053, which is 148467571 XIDs ahead of previous value<br />
frozen: 10940612 pages from table <b>(59.79% of total)</b> had 640281019 tuples frozen<br />
index scan needed: 420621 pages from table (2.30% of total) had 1345098 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 5219724 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 558603.794 ms, write: 61424.318 ms<br />
avg read rate: 23.087 MB/s, avg write rate: 16.189 MB/s<br />
buffer usage: 16666051 hits, 22135595 misses, 15521445 dirtied<br />
WAL usage: 25282446 records, 12943048 full page images, 89680003763 bytes<br />
system usage: CPU: user: 216.91 s, system: 298.49 s, elapsed: 7490.56 s<br />
<br />
Workload characteristics make it particularly hard to tune VACUUM for such a table. By setting vacuum_freeze_min_age to<br />
0, we'll freeze a lot of tuples that are bound to be updated before long anyway. <br />
<br />
It's possible that we'd see more freezing by vacuuming less often (which is possible by raising<br />
autovacuum_vacuum_insert_threshold), since that would mean that we'd tend to only reach new heap pages some time after<br />
their tuples have already attained an age exceeding vacuum_freeze_min_age. This effect is just perverse; we'll<br />
do less freezing as a consequence of doing more vacuuming!<br />
<br />
=== Patch ===<br />
<br />
As with the earlier example, the patch will have autovacuum/VACUUM consistently freeze, right away, all of the table's pages that contain<br />
only all-visible tuples, so that they're marked all-frozen in the VM instead of all-visible.<br />
<br />
Here we show an autovacuum of the same table, at approximately the same time into the benchmark as the example for<br />
Postgres HEAD (note that the "removable cutoff" XID is close-ish, and that the table is around the same size as it was when we looked at the Postgres HEAD aggressive mode VACUUM):<br />
<br />
ts: 2022-12-05 20:05:18 PST x: 0 v: 43/13665 p: 544950 LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1<br />
pages: 0 removed, 17894854 remain, <b>2981204 scanned (16.66% of total)</b><br />
tuples: 10365170 removed, 1088378159 remain, 3299160 are dead but not yet removable<br />
removable cutoff: 208354487, which was 7638618 XIDs old when operation ended<br />
new relfrozenxid: 183132069, which is 15812161 XIDs ahead of previous value<br />
frozen: 2355230 pages from table <b>(13.16% of total)</b> had 138233294 tuples frozen<br />
index scan needed: 571461 pages from table (3.19% of total) had 1927325 dead item identifiers removed<br />
index "bmsql_order_line_pkey": pages: 4730173 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 431408.682 ms, write: 29564.972 ms<br />
avg read rate: 30.057 MB/s, avg write rate: 12.806 MB/s<br />
buffer usage: 3048463 hits, 8221521 misses, 3502946 dirtied<br />
WAL usage: 6888355 records, 3056776 full page images, 20240523796 bytes<br />
system usage: CPU: user: 71.84 s, system: 92.74 s, elapsed: 2136.97 s<br />
<br />
Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach<br />
to freezing (2981204 scanned - 2355230 frozen = <b>625974 pages scanned but left unfrozen</b>). Again, this is because we<br />
don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first<br />
pgbench_history example, but it was hardly noticeable at all there. Whereas here we see that even eager freezing opts<br />
to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap<br />
pages.<br />
<br />
We're likely to be freezing XIDs in an order that only approximately matches XID age order. Despite all this, we still<br />
consistently see final relfrozenxid values that are comfortably within vacuum_freeze_min_age XIDs of the VACUUM's<br />
OldestXmin/removable cutoff. So in practice <b>XID age has absolutely minimal impact</b> on how or when we freeze, in this<br />
example table/workload. While the final relfrozenxid values we set here are significantly older, relative to the removable<br />
cutoff/OldestXmin, than in the pgbench_history example, relfrozenxid is nevertheless very close<br />
to the removable cutoff/OldestXmin by any traditional measure.<br />
<br />
All earlier autovacuum operations look similar to this one (and all later autovacuums will also look similar). In fact, even <i>much</i> earlier autovacuums that took place when<br />
the table was much smaller show approximately the same <b>percentage</b> of pages scanned and frozen. So even as the table<br />
continues to grow and grow, the details over time remain approximately the same in that sense. Most importantly of all,<br />
there is never any need for an aggressive mode vacuum that does practically all freezing.<br />
<br />
This isn't perfect; some of the work of freezing still goes to waste, despite efforts to avoid it. This can be seen as<br />
the cost of performance stability. We at least avoid the worst of the waste, by making freezing conditional on<br />
all-visible-ness, and by avoiding eager freezing altogether in smaller tables.<br />
<br />
==== Scanned pages, visibility map snapshot ====<br />
<br />
Independent of the issue of freezing and freeze debt, this example also shows how VACUUM <b>tends to scan significantly fewer pages</b> with the patch, compared to Postgres HEAD/master. This is due to the patch replacing vacuumlazy.c's SKIP_PAGES_THRESHOLD mechanism with visibility map snapshots. VACUUM thereby avoids scanning pages that it doesn't need to scan from the start, and also avoids scanning pages whose VM bit was concurrently unset. Unset visibility map bits are potentially an important factor with long running VACUUM operations, such as these. VACUUM is more insulated from the fact that the table continues to change while VACUUM runs, since we "lock in" the pages VACUUM must scan, at the start of each VACUUM.<br />
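<br />
The vmsnap idea can be modeled in miniature like so (a simplified standalone sketch; the patch's real visibility map snapshot infrastructure is more involved, and error handling is omitted here):<br />
<br />
 #include <stdbool.h><br />
 #include <stdint.h><br />
 #include <stdlib.h><br />
 #include <string.h><br />
 <br />
 typedef struct VMSnapshot<br />
 {<br />
     uint64_t npages;<br />
     bool    *all_visible;   /* one flag per heap page, fixed at start */<br />
 } VMSnapshot;<br />
 <br />
 /* Take the snapshot once, when VACUUM begins ("locking in" the scan set) */<br />
 static VMSnapshot *<br />
 vmsnap_acquire(const bool *live_vm_bits, uint64_t npages)<br />
 {<br />
     VMSnapshot *snap = malloc(sizeof(VMSnapshot));<br />
 <br />
     snap->npages = npages;<br />
     snap->all_visible = malloc(npages * sizeof(bool));<br />
     memcpy(snap->all_visible, live_vm_bits, npages * sizeof(bool));<br />
     return snap;<br />
 }<br />
 <br />
 /* Later, the scan consults the snapshot, never the live (changing) VM */<br />
 static bool<br />
 vmsnap_should_scan(const VMSnapshot *snap, uint64_t blkno)<br />
 {<br />
     return !snap->all_visible[blkno];<br />
 }<br />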
<br />
In fact, the patch makes the <b>percentage</b> of scanned pages shown each time both lower and very consistent over time, across successive VACUUM operations, even as the table continues to grow indefinitely (at least for this workload, likely for many others besides). This is another example of how the patch series tends to promote performance stability. VM snapshots make very little difference in small tables, but can help quite a lot in large tables.<br />
<br />
Here we show details of all nearby VACUUM operations against the same table, for the same run (these are over an hour apart):<br />
<br />
pages: 0 removed, 13210198 remain, 2031762 scanned (15.38% of total)<br />
pages: 0 removed, 14270140 remain, 2478471 scanned (17.37% of total)<br />
pages: 0 removed, 15359855 remain, 2654325 scanned (17.28% of total)<br />
pages: 0 removed, 16682431 remain, 3022064 scanned (18.12% of total)<br />
pages: 0 removed, 17894854 remain, 2981204 scanned (16.66% of total)<br />
pages: 0 removed, 19442899 remain, 3519116 scanned (18.10% of total)<br />
pages: 0 removed, 20852526 remain, 3452426 scanned (16.56% of total)<br />
<br />
And the same, for Postgres HEAD/master:<br />
<br />
pages: 0 removed, 12563595 remain, 3116086 scanned (24.80% of total)<br />
pages: 0 removed, 13360079 remain, 3329987 scanned (24.92% of total)<br />
pages: 0 removed, 14328923 remain, 3667558 scanned (25.60% of total)<br />
pages: 0 removed, 15567937 remain, 4127052 scanned (26.51% of total)<br />
pages: 0 removed, 16784562 remain, 4547881 scanned (27.10% of total)<br />
pages: 0 removed, 18298517 remain, 16577256 scanned (90.59% of total)<br />
<br />
== Mixed inserts and deletes ==<br />
<br />
Consider a table like TPC-C's new orders table, which is characterized by continual inserts and deletes. The high<br />
watermark number of rows in the table is bounded (for a given scale factor/number of warehouses), and autovacuum tends to<br />
need to remove quite a lot of concentrated bloat due to deletes.<br />
<br />
=== Today, on Postgres HEAD ===<br />
<br />
(Not showing Postgres HEAD, since most individual VACUUM operations look just the same as they do with the patch series).<br />
<br />
=== Patch ===<br />
<br />
Most of the time, the behavior with the patch is very similar to Postgres HEAD, since eager freezing is mostly<br />
inappropriate here (in fact we need very little freezing at all, owing to the specifics of the workload). Here is a<br />
typical example:<br />
<br />
ts: 2022-12-05 20:48:45 PST x: 0 v: 43/18087 p: 546852 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83277 remain, 66541 scanned (79.90% of total)<br />
tuples: 2893 removed, 14836119 remain, 2414 are dead but not yet removable<br />
removable cutoff: 226132760, which was 33124 XIDs old when operation ended<br />
frozen: 161 pages from table (0.19% of total) had 27639 tuples frozen<br />
index scan needed: 64030 pages from table (76.89% of total) had 702527 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 65683 in total, 2668 newly deleted, 3565 currently deleted, 2022 reusable<br />
I/O timings: read: 398.127 ms, write: 4.778 ms<br />
avg read rate: 1.089 MB/s, avg write rate: 12.469 MB/s<br />
buffer usage: 289436 hits, 931 misses, 10659 dirtied<br />
WAL usage: 138965 records, 13760 full page images, 86768691 bytes<br />
system usage: CPU: user: 3.59 s, system: 0.02 s, elapsed: 6.67 s<br />
<br />
The system must nevertheless advance relfrozenxid at some point. The patch has heuristics that weigh both costs and<br />
benefits when deciding when to do this. Table age is one consideration -- settings like vacuum_freeze_table_age do<br />
continue to have some influence. But even with a table like this one, that requires very little if any freezing, the<br />
costs also matter.<br />
<br />
Here we see another autovacuum that is triggered to clean up bloat from those deletes (just like every other autovacuum<br />
for this table), but with one or two key differences:<br />
<br />
ts: 2022-12-05 20:54:57 PST x: 0 v: 43/18625 p: 546998 LOG: automatic vacuum of table "regression.public.bmsql_new_order": index scans: 1<br />
pages: 0 removed, 83716 remain, 83680 scanned (99.96% of total)<br />
tuples: 2183 removed, 14989942 remain, 2228 are dead but not yet removable<br />
removable cutoff: <b>227707625</b>, which was 33391 XIDs old when operation ended<br />
new relfrozenxid: <b>190536853</b>, which is 81808797 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 64206 pages from table (76.70% of total) had 676496 dead item identifiers removed<br />
index "bmsql_new_order_pkey": pages: 66537 in total, 2572 newly deleted, 4115 currently deleted, 3516 reusable<br />
I/O timings: read: 693.757 ms, write: 5.070 ms<br />
avg read rate: 21.499 MB/s, avg write rate: 0.058 MB/s<br />
buffer usage: 308003 hits, 18304 misses, 49 dirtied<br />
WAL usage: 137127 records, 2587 full page images, 25969243 bytes<br />
system usage: CPU: user: 3.97 s, system: 0.14 s, elapsed: 6.65 s<br />
<br />
Notice how relfrozenxid advances, and how we scan more pages than last time. This happens during an autovacuum that<br />
takes place when the table's age is about 55% - 60% of the way to the point where autovacuum would be forced to launch<br />
an antiwraparound autovacuum.<br />
<br />
Here the patch notices that the added cost of advancing relfrozenxid is moderate, though not very low. The logic for<br />
choosing a vmsnap skipping strategy determines that the cost of advancing relfrozenxid now is sufficiently low that it<br />
makes sense to do so. So we advance relfrozenxid here because of a combination of 1.) it being cheap to do so now<br />
(though not exceptionally cheap), and 2.) the fact that table age is starting to become somewhat of a concern (though<br />
certainly not to the extent that VACUUM is forced to advance relfrozenxid).<br />
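<br />
One hypothetical way to model that weighing of added cost against table age is sketched below. The patch's actual heuristic differs in detail; in particular, the tolerance curve here is invented purely for illustration:<br />
<br />
 #include <stdbool.h><br />
 #include <stdint.h><br />
 <br />
 static bool<br />
 should_advance_relfrozenxid(uint64_t lazy_scan_pages,   /* minimal scan */<br />
                             uint64_t eager_scan_pages,  /* incl. all-visible pages */<br />
                             uint32_t table_age,<br />
                             uint32_t freeze_max_age)<br />
 {<br />
     double age_frac;<br />
     double added_cost;<br />
 <br />
     if (table_age >= freeze_max_age)<br />
         return true;                /* table age forces our hand */<br />
 <br />
     /* Fraction of the way to a forced antiwraparound autovacuum */<br />
     age_frac = (double) table_age / freeze_max_age;<br />
     /* Relative cost of also scanning every page that isn't all-frozen */<br />
     added_cost = (double) eager_scan_pages /<br />
                  (double) (lazy_scan_pages > 0 ? lazy_scan_pages : 1);<br />
 <br />
     /* Tolerate more added scanning as table age becomes more of a concern */<br />
     return added_cost <= 1.0 + 2.0 * age_frac;<br />
 }<br />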
<br />
Notice that we don't have to freeze any tuples whatsoever here, and yet we still manage to advance relfrozenxid by a<br />
great deal -- it can actually be advanced further than what we'd see in an aggressive VACUUM in Postgres 14 (though<br />
perhaps not Postgres 15, which has related work added by commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}).<br />
<br />
=== Opportunistically advancing relfrozenxid with bursty, real-world workloads ===<br />
<br />
Real world workloads are bursty, whereas benchmarks like TPC-C are designed to produce an unrealistically steady load.<br />
It's likely that there is considerable variation in how each table needs to be vacuumed based on application<br />
characteristics. For example, a once-off bulk deletion is quite possible. Note that the heuristics in play here will<br />
tend to notice when that happens, and will then tend to advance relfrozenxid simply because it happens to be cheap on<br />
that one occasion (though not too cheap), provided table age is already starting to be a concern. In other words, VACUUM<br />
has a decent chance of noticing a "naturally occurring" though narrow window of opportunity to advance relfrozenxid inexpensively.<br />
Costs are a big part of the picture here, and until now they have mostly been missing from it.<br />
<br />
== Constantly updated tables (usually smaller tables) ==<br />
<br />
This example shows something that HEAD (and Postgres 15) already get right, following earlier related work (see commits {{PgCommitURL|0b018fab}}, {{PgCommitURL|f3c15cbe}}, and {{PgCommitURL|44fa8488}}) that the new patch series builds on. It's included here because it shows the<br />
continued relevance of lazy strategy freezing, and because it's a good illustration of just how little freezing may be<br />
required to advance relfrozenxid by a great many XIDs, due to workload characteristics naturally present in some types of tables.<br />
<br />
One key observation behind the patch is that <b>relfrozenxid generally has a very loose relationship to freeze debt itself</b>. Understanding and exploiting that difference comes up again and again. It <b>works both ways</b>; sometimes we need to freeze a lot to advance relfrozenxid by a tiny amount, and other times we need to do no freezing whatsoever to advance relfrozenxid by a huge number of XIDs. Here we show an example of the latter case.<br />
<br />
Consider the following totally generic autovacuum output from pgbench's pgbench_branches table (the<br />
pgbench_tellers table could have been used just as easily):<br />
<br />
automatic vacuum of table "regression.public.pgbench_branches": index scans: 1<br />
pages: 0 removed, 145 remain, 103 scanned (71.03% of total)<br />
tuples: 3464 removed, 129 remain, 0 are dead but not yet removable<br />
removable cutoff: <b>340103448</b>, which was 0 XIDs old when operation ended<br />
new relfrozenxid: <b>340100945</b>, which is 377040 XIDs ahead of previous value<br />
frozen: <b>0 pages</b> from table (0.00% of total) had 0 tuples frozen<br />
index scan needed: 72 pages from table (49.66% of total) had 395 dead item identifiers removed<br />
index "pgbench_branches_pkey": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable<br />
I/O timings: read: 0.000 ms, write: 0.000 ms<br />
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s<br />
buffer usage: 304 hits, 0 misses, 0 dirtied<br />
WAL usage: 209 records, 0 full page images, 20627 bytes<br />
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<br />
<br />
Here we see that autovacuum manages to set relfrozenxid to a very recent XID, despite freezing precisely zero pages.<br />
Since every tuple is updated before long anyway, no XID ever gets old enough to need freezing; every VACUUM notices this automatically, and it is reflected in the final relfrozenxid. In practice this happens reliably in any table with similar workload characteristics.<br />
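<br />
The underlying mechanism (added to Postgres 15 by the commits cited earlier) can be modeled simply: VACUUM tracks the oldest XID that will remain in the table after it finishes, and that value becomes the new relfrozenxid. The sketch below is a simplified standalone model, not the real vacuumlazy.c code:<br />
<br />
 #include <stdint.h><br />
 <br />
 typedef uint32_t TransactionId;<br />
 <br />
 typedef struct VacState<br />
 {<br />
     TransactionId oldest_xmin;       /* this VACUUM's removable cutoff */<br />
     TransactionId new_relfrozenxid;  /* ratchets back as the scan proceeds */<br />
 } VacState;<br />
 <br />
 static void<br />
 vac_begin(VacState *vac, TransactionId oldest_xmin)<br />
 {<br />
     vac->oldest_xmin = oldest_xmin;<br />
     /* Start optimistic: if nothing old remains, relfrozenxid ends up here */<br />
     vac->new_relfrozenxid = oldest_xmin;<br />
 }<br />
 <br />
 static void<br />
 observe_remaining_xid(VacState *vac, TransactionId xid)<br />
 {<br />
     /*<br />
      * Any XID left unfrozen in the table holds relfrozenxid back.  When<br />
      * every tuple is short-lived, nothing old remains, so the final value<br />
      * stays close to oldest_xmin.  (Real code uses wraparound-aware XID<br />
      * comparisons; plain '<' is a simplification.)<br />
      */<br />
     if (xid < vac->new_relfrozenxid)<br />
         vac->new_relfrozenxid = xid;<br />
 }<br />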
<br />
Although this example involves zero freezing in every VACUUM, and so represents one extreme, there are similar tables/workloads (such as the previous example of the bmsql_new_order table/workload)<br />
that require only a very small amount of freezing to advance relfrozenxid by a great many XIDs -- perhaps just a tiny amount of freezing with negligible cost. To some degree these sorts of scenarios<br />
justify the opportunistic nature of eager strategy vmsnap skipping from the new patch series. VACUUM cannot ever<br />
notice that one particular table has these favorable properties without <i>trying</i> to advance relfrozenxid by some amount,<br />
and then <b>discovering</b> that it can be advanced by a great deal quite easily. (The other reason to be eager here is to avoid<br />
having to advance relfrozenxid for many different tables around the same time.)</div>Pgeoghegan