PGConf.dev 2026 Developer Unconference

From PostgreSQL wiki
Jump to navigationJump to search

The unconference at PGConf.dev 2026 will be organized by Andres Freund and Nathan Bossart. Please add notes about individual sessions here! You can either include them in the body of the page, or put them on another page and link from here.

If you feel that your session cannot be productive without a particular conference attendee present, please ask their permission to include them as a "required attendee" for your proposal. The organizers will use this information to minimize scheduling conflicts.


Postgres and Other Communities

So many ideas and suggestions on the notes page! [1]



Global Index

We want global indexes, because:

  • Allows unique constraints on non-partition keys.
  • Scan on one index is cheaper than N scans on N indexes

3 problems:

1. VACUUM: 1000s of partitions each must run cleanup on the same index relation; which is sized to 1000s of partitions' data. 2. DDL: Adding & removing partitions requires maintenance on the global index. 
How to decide whether rebuild from scratch or to apply changes with only the new/removed tuples 3. Locking: Scanning the global index still requires locks on all leaf tables that contribute(d) to that index.
Scanning a table requires having a lock on the relation. Scanning the index itself doesn't require locking, but you'll still have to lock the leaf partitions to access their data.

   1. There are schemes to make this work, using multi-pass operations

Room: 4. Sizing: Individual relations are limited to 32TB, and that doesn't scale with ; is this something we want/need to solve, too?


Q: What's the main blocker? Is this a good idea?

A: Primarily: VACUUM.


Q1: Is the solution to VACUUM conveyor belt (mentioned yesterday)

A: No, instead of vacuuming that partition, all partitions are VACUUMed at once, gathering the TIDs that the global index will clean up.

Q2: What's the status of the conveyor belt?

A: Hit a snag on planning VACUUM operations and [...] Response: Serializing VACUUM of a partition hierarchy is likely going to cause issues; VACUUM interrupts become incredibly much more expensive.


Comment: DDL is likely the most pressing issue. Conveyor belt may fix VACUUM, but DDL isn't trivial either.

DDL Attach: 2-phase attach with first xact adding the partition as 'invalid', then load tuples into index, then second xact marking the partition valid.


Q: Does sub-partitioning affect partition ID usage?

A: No, only leaf partitions (data partitions) are included in the IDs


Comment: Maybe we should disallow certain operations when global indexes are involved.

Response: Yes, that seems like a reasonable limitation

Q: Which operations would be disallowed?

A: On-the-fly locking of partitions accessed through a global index scan would break with leaf table insertion, so in this case we should block leaf partition -level DML.

Comment: The main customer for Global Indexes are people coming in from Oracle, which doesn't allow direct leaf access


Comment: Even with 10s of 1000s of attached partitions you still win, because it'll have only few key bytes, not full rows.


Q: What other avenues were explored to achieve global index's features, without this current approach [single relfile index, ed.]

A: There was a Global Unique Index approach, which had unique indexes per partition, and checked other partitions' unique indexes for conflicting values.


Comment: We need to scope it down to make it through.

Q: How does DROP PARTITION work with Global Index?

A: The partition ID is marked invalid, and all tuples that reference this ID will be dropped during maintenance operations.


Q1: What needs to change in indexAM?


A: minimal changes to indexes themselves.

Q2: And REPACK/VACUUM FULL?

A: Just rebuild the index on REPACK/VACUUM FULL.


Q: Performance. What's the overhead for locking in Index Scans vs Global Index Scans?

A: IS needs to scan each index, vs GIS only the global index. The number of locks is unchanged (unless runtime on-access locking is applied)

Commit Sequence Numbers

(Incomplete notes by mmeent)

There are several implementations. Jeff (organizer) wants to talk about changes in visibility semantics that we need to discuss and agree on.

First, PG has different visibility semantics on Primary vs replica (causes "Long Fork" anomaly):

  • Replica visibility is determined by LSN/replay order.
  • Primary visibility order is determined by persistence window, and prock lock ordering.

With CNS we have an opportunity to allow readers to determine the visibility interpretation of the system: record created vs record flushed vs record replicated.

Comment: May cause problems with hint bits

Q: Are we considering CSN for Long Fork, or are we considering CSN for faster snapshots? They don't strictly need to be combined.

A: [...]


Comment: Combining both is likely to make the patch unnecessarily complicated.


Q: Please clarify; do you expect CSNs to be in lockstep with LSNs?

A: From an implementation perspective this could be done, but isn't probable.

A: If we define visibility semantics to be

Comment: If the visibility order doesn't match WAL order we'll have to get another system.

Andres: "We could also just decide we'll continue to suck." ... The best approach would be to have quorum flush, but that'd require proper node registration, and that has too many dependencies.


The current system has many levels of durability and that causes differences.

Q: When should sync_commit=off vs =on release for visibility?

A: Thought is: =off returns to the client once inserted. To maintain semantics the read-path would have "visible=inserted in wal". If you'd wait for persistence you'd lose performance.


Comment: We can't have sessions' sync_commit settings affect visibility semantics. "That way lies madness."

Comment: The covered behaviour is that sync returns after flush, async returns before WAL flush. If CSN visibility is determined by XLOG flush location, that'll cause issues with async commits; as it would either have to wait for a sync, or might not be able to read its own writes.


Suggestion: If you (a session in sync commit mode) want to see async commits, wait for the async commit is synced (or issue the sync yourself).


Comment: We should agree on anomalies like "async sessions sees something == sync session sees that (or later) data"

Comment: The problem is that durability is decoupled from visibility; LSN order <> visibility order with mixed sync_commit=off|local|remote_write|remote_apply.


Q: Do we want sync commits to see async commits?


A: Sync commits wait for async-but-visible commits to be synced, then read their data.

Q: Do we want async commits to see sync commits?

A: [...]


There were some questions about how CSNs in a logical replication context would relate to MySQL's GTIDs. CSNs themselves (as currently proposed, with a 64-bit counter) wouldn't provide an equivalent to GTID. A global transaction order would need to be managed by an external program.

Scaling pg_stat_statements

Scalability of pg_stat_statements


Feedback-based Query Optimization

pgconfdev2026 Feedback-based Query Optimization

Postgres CI

Bilal's Slides Mailing list thread

  • Cirrus CI will shut down June 1st
  • We need to find another CI provider
  • Bilal has done some tests using GitHub Actions as a CI runner
  • Jelte independently has done tests on GitHub Actions as well

Problems we want to solve:

  • Runtime
  • Compute limitations
  • Public logs (Cirrus CI had public logs, everyone can read them)
  • Connecting to a test environment
  • Supporting different operating systems

CFBot will be broken in any case

Looked at a couple of alternatives CI Providers

  • GitHub Actions is good because unlimited for public repositories
  • 20 concurrent jobs for GitHub Actions (might be able to get that lifted)
  • Could also host our own runners if we need to

What is a runner in this context?

  • Its a VM in some environment that starts the test
  • We did that with Cirrus already, but it has some level of isolation between runs
  • The basic infrastructure in GitHub has no isolation, which is a problem
  • Would have to preprovision VMs to make that work, would need more compute

Peter: Ran into a problem with using up credits in Cirrus CI

  • This is much better with GitHub Actions (credits don't run out)

Thomas: Jelte's test got all operating systems going, did that use nested virtualization?

  • Yes, that was running nested containers, that's why OpenBSD takes 34 minutes
  • If that's what was done, could we use that on our local machines?
  • That'd be the dream if we could do that
  • Andres: The problem is that you can't just download them every time, its a lot of network I/O
  • With Cirrus CI it just downloaded the part of the image it needed, that's why that worked
  • To do that efficiently we'd need our own CI infrastructure

Peter: There are a bunch of other projects that use CirrusCI that are in trouble (e.g. pgbouncer)

  • Let's keep in mind what the other projects could do

It would be interesting to check how well the current disk image compresses

Andres: We should consider skipping BSDs initially for expediency (regarding the deadline of June 1st)

Peter: We have one week - should we at least provisionally go with GitHub Actions?

  • Andres: Yes
  • Then come back to looking at alternatives
  • Peter G: I already thought that's what we're going to do - I don't see any reason not do that

Most important things are Linux, Windows, MacOS, Compiler Warnings/ASAN

  • Thomas: If macOS compiles, the BSDs are going to compile and run

Michael: Windows has been super helpful

Peter G: Don't we use FreeBSD for some other checks, debug parallel query, etc?

  • The official FreeBSD website has 600MB VM images, might be small enough?
  • Andres: Unclear how much the dependencies add?

With Cirrus you can use your own images, with GitHub you can't

Its very recent that you can use QEMU on GitHub (nested virtualization wasn't allowed previously, it is now)

David: Sent a link in Discord re: action using BSD

Peter G: So we have a patch right?

  • Yes
  • Runtimes are a bit longer compared to Cirrus CI, but not too much
  • Windows 20-25 minutes on GitHub Actions, Cirrus CI was faster mostly (15-25 minutes)
  • Andres: I want to commit this early next week, so we have some days of overlap
  • Its also going to make it easier to move CFbot over

Can you show how logs / file uploads look like?

  • Similar, but for artifacts you need to download a zip file
  • Need a GitHub account to view the logs, will accept that for now

David: Do we care about ARM64/etc?

  • Andres: We care about 32-bit builds, that will catch some bugs
  • Bilal: That's done (32-bit)

Should we have some helper scripts in addition to GitHub actions?

  • Might be better to have in one place?
  • Andres: Motivation for helper scripts would be that it'd make it easier to share code between platforms
  • Peter E: Could run it locally to test things

Andres: I'd want to backpatch this, wanted to check

  • Thomas: Agreed
  • Peter E: Down to 14?
  • Andres: 15, we don't have 14 at the moment in Cirrus
  • Windows would be painful to do on 14
  • Michael: Seems the cost is too high given the fact we're only 6 months out from PG14 being EOL

What coverage are we going for (just check world?)

  • Same as Cirrus basically, just won't be able to do the BSDs initially

Thomas: To feed the results from GH actions into CFbot, not clear how to do it

  • Bilal: Was already planning to look into it
  • Thomas: Will probably turn off the stuff that examines the logs
  • Andres: Might have to call back into GitHub and use a token to not get rate limited

Peter E: There is logic in the commitfest app when CI is failing, need to check how that interacts

  • Thomas: The way it works is that Cirrus sends a status change to a web hook, and CFbot feeds that into the CF app
  • Only the CFbot needs to be changed
  • If you make this work in regular GitHub accounts, it will also work for CFbot

Debian vs Ubuntu

  • Debian requires containers, Ubuntu does not
  • Peter E: Should be fine to do Ubuntu
  • Andres: Should check if we can use our containers without too much overhead
  • Peter E: Where do you host the containers?
  • Andres: Currently in Google, should probably move them to make it cheaper (somewhere to GitHub)

Peter E: What compiler is the Windows job using?

  • Bilal: MSVC 2022
  • Peter E: How can we get 2019 tested?
  • Tristan: You can specify it using the matrix (and its part of the image)

Andres: Lets just remove MSVC 2019 support

  • Thomas: I don't know why we do it, its already unsupported officially by Microsoft
  • Peter E: I was cautious because not sure how Windows users use their system, but maybe with Windows the coupling is not as tight
  • Andres: They actually removed installers for specific MSVC compiler versions after 2019 (its one link, the next time you rebuild the image it gets a newer one)

For Windows, the GitHub Actions images have more things installed

  • Bilal: We might not have to build Windows images anymore (and not hardcode OpenSSL version)
  • Andres: One problem with Chocolatey is that they have broken binaries, also sometimes a lot slower
  • Pavlo: WinGet?
  • Andres: Doesn't have several packages we need

Connecting to test environment

  • Not currently out of the box in GitHub Actions, possibly via a third-party SSH action?
  • Is that solved by GitHub Workspace?
  • Someone could try that out

How to do HA with Physical Replication

HA Orchestrators (physical repl)

Patroni, Stolon, pg_auto_failover, repmgr, CloudNativePG, Consul?

Promotions/Transitions and Correctness

  • Cluster config changes: sync_standby_names, sync_commit
  • SyncRep waits in commit (canceled xact issue, etc.)
  • Knowing write/flush/replay LSNs
  • Old primary: pg_rewind and accepting workload after conversion to replica
  • Any issues around timeline management?
    • Latest timeline — issues reconstructing from history file

Discussion Notes

Andrey: Should we have cluster orchestration like Patroni in core?

  • Laurenze: PG is supported on many OS & platforms — needs to work everywhere
  • HA is an enterprise requirement — table stakes
  • Philippe: Want to control tradeoffs around availability vs. durability; have worked systems with debates around how tradeoffs should be made. Correctness is a high priority; maybe some people want to relax this
  • Mohamed: It's possible to make tradeoffs configurable — maximizing durability, performance, availability (example from Oracle Data Guard)
  • Laurenze: Maybe we have too much configuration? Can set synchronous_commit at transaction level

Is the discussion around automated failover almost a discussion around consensus?

The way you do fencing can vary depending on what hardware or infrastructure you're running on:

  • Can you fence with network? Do you have access to hardware?

Are there ways protocol changes could make these problems easier to solve?

Jeremy: Does synchronous_commit=remote_write mean you can lose data? (Probably not?)

  • Do we KNOW if it happened or not?

We need to be able to explain tradeoffs to people.

Oleksii: Would be nice to see something like pg_ctl demote:

  • Aware of replicas — say to primary "demote" — primary waits for replicas
  • Part of switchover: block incoming connections, wait for replication; goes hand-in-hand (before) pg_promote
  • Sentinel record in WAL stream saying primary is definitely down

More Discussion Notes

for switchover ubicloud relies on https://github.com/ubicloud/ubicloud/blob/main/rhizome/postgres/lib/postgres_lockout.rb

on aws EIP based failover was fast enough that after lockout it was sufficient to just go with usual failover path on switchover on azure, where I got to maintain port of Citus Cloud, failover was slow (poweroff could be hanging because vm host down, in which case next step was destroy vm which could take 10 minutes), so for switchover I implemented "soft failover" which would trash the system with chmod -x while killing postgres processes, allowing us to act as if it was fenced

.

promoting a node when it's time to promote a node, we have nothing to do this in core. patroni in core?

would be nice, but portability. maintenance burden? what is HA? HA vs durability

oracle makes this adjustable (maximize availability vs maximize durability vs maximum performance), patroni has this too apparently

postgres has too many knobs already: durability knobs per transaction. semantics here are unclear. (this was reviving previous commit sequence number discussion)

goal is to not lose transactions. flushlsn/replaylsn/receievelsn how does ANY 2 replication choose what to promote to?

how do you prepare for promote? change to read only? fence? failover vs switchover. fencing mechanism differs by environment, is inherently tribal (some rely on undocumented AWS behavior with EIP, azure you need to poweroff/terminate/destroy vm)

when people own their own hardware, they might rely on actual reboot/poweroff of node. Quorum can get around this a bit by being able to maintain consistency without fencing

what data loss guarantees do we have? can we at least know that data loss didn't happen usually?

extreme end: no accepting new wal until quorum regained

Replication Slot Management

  • Logical slot HA (failover=true on standby)

logical slot failover slots have problems where you have to use physical slots & if slot isn't advancing that can block switchover. wal slot grows indefinitely if using pg_logical_emit_message non-transactionally on system with no transactions

if slot not advancing impacts vacuum etc. maintenance script needs to generate activity

Right now there's the issue that you need physical slots. You have slot sync running on the side; logical slots don't copy until there's activity:

  • Maintenance — can't fail back because there wasn't activity to advance slot
  • Not good quality of life
  • Running systems can have keepalive or background activity
  • If slot is not advancing, there are bigger issues
  • There should be background activity that causes slots to advance

Standby Topics

  • Hot standby feedback and bloat
  • Autovac truncation

Missing First Principles

what is an overarching idea to help this situation? ALTER SYSTEM READ ONLY. standby-quorum fencing. or protocol where primary cooperates independently in split-brain scenarios (needs way for primary to know this is happening)

ALTER SYSTEM READ ONLY?

  • For switchover — ability to make primary read-only

Not really any cluster awareness in Postgres?

  • Writer has: sync_standby_names, replication_status
  • Standby does not know

standby doesn't know what's going on in rest of cluster, just what wal it receives. pg_nodes? replicated sync_standby_names (but need name mapping)?

Global table? pg_nodes or something?

Should value of sync_standby_names be replicated? Need primary_conninfo too.

Could standbys agree a primary is unavailable?


What Other Optimizer Stats Do We Want

Out of scope:

  • join statistics (separate topic)
  • adaptive statistics as the query is running (previous session)

Ideas:

  • (Peter E.) statistics about what is in memory (replaces effective_cache_size, use mincore (portability?))
    • might get obsolete with AIO
  • collect stats in streaming (incremental) way, for large tables there is a 3 mio. row limit
    • nr. of buckets, sample size is currently coupled
    • dynamic sampling (like in Oracle?)
    • disable sampling
  • (David R.) getting statistics from indexes has been problematic
    • might be better via extension points
  • JOB (Join Order Benchmark) 2x better with more complete statistics
  • cardinality hints might be coming
  • risk-based costing, cardinality distributions
    • DR: model cost of one extra row
  • Corey: store multiple versions of stats (like MS SQL Server)
    • use it for triggering random sampling
  • special statistics for string operations (LIKE '%foo%')
  • (Maxime) statistics for aggregates output
  • (Feike) statistics on system (to inform cost parameters)
    • DR: we do not model enough (e.g. tuple deforming)
    • F: systems go through cycles, wants to adjust by time of day
  • ML-based tuning
  • Corey: percentage of toasted columns (makes hash tables more expensive, order of predicate processing)
    • column width
    • function cost could scale with datum width