PgCon 2023 Developer Unconference

This is the PGCon 2023 Unconf Page

Unconference schedule by Time and Room

Time             DMS 1160                  DMS 1120      DMS 1140
Fri 10:00-11:00  Choice of topics and votes
Fri 11:15-12:15  Logical Replication       pg_hint_plan  Kernel Hacker
Fri 12:15-13:15  Lunch
Fri 13:15-14:15  pg_upgrade                Build System  Cluster Awareness
Fri 14:15-14:30  Coffee Break
Fri 14:30-15:30  Distributed Transactions  Vectors       Index Prefetching
Fri 15:30-15:45  Break
Fri 15:45-16:45  Future of PGCon           Table AMs     Partitioning

Notes/Minutes

pg_upgrade

Notes copied from memory by Mmeent

Statistics ready after pg_upgrade

pg_upgrade doesn't upgrade (or otherwise populate) the statistics for databases, resulting in bad query plans (effectively downtime) until all tables have been analyzed. Several solutions were discussed:

  • Upgrading of statistics data
    • Not always possible due to differences in format; differences in how the data is gathered and what the fields mean may also result in problematic data
      • But any data might be better than no data
  • (Administrative) write access to pg's statistics
    • This would allow for manual insertion of statistics, which can be useful in normal administrative work, too. E.g. dumping/restoring problematic stats from prod clusters to dev clusters for debugging.
    • Providing a sample dataset to generate the table's statistics on would be a potential solution, too. E.g. select pg_import_statistics(table=>'nsp.table'::regclass, source=>'tmp_sample_table'::regclass)
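
A minimal sketch of what that could look like. pg_import_statistics() is a hypothetical function floated in the discussion (nothing like it exists today); the only option right now is to re-gather statistics after the upgrade:

  -- Today: statistics must be rebuilt from scratch after pg_upgrade,
  -- e.g. by analyzing every table in every database.
  ANALYZE;

  -- Hypothetical future API (from the discussion above): import statistics
  -- computed over a sample table, or dumped from another cluster, instead
  -- of re-gathering them.
  SELECT pg_import_statistics(table  => 'nsp.table'::regclass,
                              source => 'tmp_sample_table'::regclass);
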
pg_upgrade phases pain points

We can probably improve throughput by grouping operations per database, instead of having each sequential phase reconnect to every database just to execute a single statement there.

pg_dump/pg_restore pain points

pg_dump's catalog dump process is single-threaded and is what moves the catalogs. This means the critical path for any upgrade is limited (among other things) by how fast we can pg_dump/pg_restore the schema of the largest database. Maybe we can improve on that by dumping and restoring parts of the catalogs concurrently.

Large Objects

Large objects are currently migrated across connections, which is quite expensive. Maybe we can use something like the --link pg_upgrade option here as well?

binary upgrade of the catalogs

Instead of the pg_dump/pg_restore cycle, maybe we could directly upgrade the catalog files, similar to how ALTER TABLE ... ADD COLUMN [...] DEFAULT ... patches catalog metadata in place (sketched below).
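
For reference, the mechanism alluded to is the "fast default" path added in PG11: the new column's default is stored as catalog metadata (pg_attribute.atthasmissing/attmissingval) instead of the table being rewritten - the analogy being that a binary catalog upgrade would likewise patch metadata in place rather than dump and reload it. Table and column names below are illustrative:

  -- Does not rewrite the table: the default is recorded in the catalogs
  -- and applied lazily whenever pre-existing tuples are read.
  ALTER TABLE measurements ADD COLUMN unit text DEFAULT 'celsius';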

  • This does have issues: we would still need to apply changes to the catalogs that we currently have no way to process. We don't maintain a log of catalog changes to apply during version upgrades - we dump user-created entries and restore them into the new version. To do binary upgrades, we'd need to track (and apply) changes such as:
    • added/removed/renamed procedures
    • added/removed arguments for procedures
    • added/removed/renamed columns (especially when added/removed in places that are not the end of the tuple)
    • etc.
  • Special care would need to be taken to make sure we'd still be able to bootstrap the system - I'm not sure how we'd be able to get this to work without issues.
Concurrent upgrade

Right now, no user connections are allowed at all while pg_upgrade is doing its job, which means that there will be significant downtime if the catalogs are large.

  • Work on more operations in parallel or while the old cluster is still running:
    • We could lock all catalogs on the old cluster and prepare the new cluster using dump/restore, so that this doesn't have to be done after shutting down access to the old cluster.
    • Adding to the point above, we could do a lot of the linking of relation data before we shut down access to the old cluster. This would still need to do cross-validation that new and/or deleted relation data segments are correctly reflected in the new cluster, but it should (amongst others) reduce the time spent on linking files in the new cluster's data dir.
  • There were arguments that this was (Mmeent: specific negative term forgotten), but the counterargument is that we essentially already do all of this, just without the user having access to the old PostgreSQL instance.

Build System

Notes from Peter E.:

Andres: The Meson build system is available in PG16. The problem is that you can't build from a distribution tarball with Meson, because the tarball has prebuilt files in the source directory, and Meson cannot handle that. The easiest way forward would be to not generate any distprep files anymore and make the tarball from just the files in git.

Joe Conway: [This was a separate unconference proposal, but was merged into this one.] There are now some stricter requirements in some places for being able to track the origin of software more completely. Generating the release tarball in a semi-obscure way with a lot of generated files violates the principle.

The room agreed that the distprep step is an outdated approach and we should just get rid of it. This would add bison, flex, and perl as build dependencies, as well as whatever is needed for the documentation. For the documentation, it was suggested to provide a download of the built HTML files for those who don't want to or can't build them locally.

Plan for PG17:

  • HTML download on web site
  • get rid of make distprep
  • remove MSVC build system
  • create release tarball using git archive

Longer term:

  • support building extensions using pkg-config, instead of pgxs
  • This will take some time until most extensions are updated.
  • Maybe at some point remove all makefiles except the ones needed for pgxs.

(See also last year: https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Meson_new_build_system_proposal)

Signing releases, commits

  • no commit signing for now
    • not clear what that would help
  • tag signing yes
  • not clear about how to manage keys
    • check with packagers and other projects for best practices
  • once we have the keys figured out, we can also start signing tarballs
    • this doesn't necessarily need to start with a major release

Cirrus CI

  • Andres has done some work for NetBSD, OpenBSD tasks
  • No good way to get AIX, Solaris and non-mainstream CPUs in there. So buildfarm continues to be necessary.
  • We could pay extra to build mingw faster/more often. Money could be available, but someone would have to manage the relationship.
  • initdb caching would help more with test run times on Windows, so focus on that for now.

Table AMs

Notes from Peter E.:

  • Jacob C. gave a quick presentation about challenges using the table AM API, thoughts about how to evolve it
  • Mark D. gave many details about his experiences implementing several table AMs

Various problems, issues, potential improvements:

  • issues with understanding the precise purpose and scope of the table AM API
    • for example, hard to do in-memory, column store
    • column store and other AMs would need a lot of executor changes
  • max. number of tuples per page hardcoded with heap assumptions in tidbitmap.c (work in progress to relax that)
  • storing something other than heaptuples affects WAL, starts working with PG15 (custom resource managers)
  • idea: push projections into table AM API (can be done with custom scans now)
  • custom kind of TID (like primary key) is hard/impossible to support with indexes
    • but after tidbitmap changes, you can use the full 6 bytes
  • table AMs that don't store heaptuples should also define custom tuple table slots
  • the main internal API that requires heaptuples (instead of tuple slots) is triggers
  • idea: (some) quals pushed down into table AMs (same or similar as index AMs)
    • would have runtime costs to decide that (planner hook?)
  • push down rest of visibility map checks into heapam (currently used for bitmap heap and index-only)
  • for testing pushdown or other non-heap features in core, create a wrapper around heap that accepts them (like heap2; see the sketch at the end of this section)
  • table AMs shouldn't hard code their own OIDs (so they can be reinstalled for testing)
  • bulk/vector/batch operations on tuples
  • terminology should be cleaned up (heap vs table) (many thousand lines of code)
    • maybe clean up the highest impact areas
    • some code in btree is actually specific to heap, like some locking, some index-only scan stuff, poorly documented [Peter G.]
  • maybe make autovacuum customizable

Everybody left encouraged to make progress on these issues. ;-)
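
As a concrete illustration of the heap2 wrapper idea above: registering an alternative table AM is already possible at the SQL level. A minimal sketch, reusing the built-in heap handler (a real test wrapper would point at its own C handler that delegates to heap while exercising pushdown or other new API paths); names are illustrative:

  -- Register a second table AM backed by the existing heap handler.
  CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;

  -- Tables can then opt into it explicitly.
  CREATE TABLE t_heap2 (id int, payload text) USING heap2;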

pg_hint_plan

Notes from Michael P.:

Michael P.: Short presentation about the status of the project and own upcoming plans:

  • Latest release done in January 2023, dropping support for v10.
  • Next release is planned for September/October 2023, around the time PostgreSQL 16 is released, so both are available together.
  • Development work in progress for the upcoming release:
    • Bug fix focus.
    • Plug into the module a yacc/lex layer to retrieve hints from queries, replacing the current code and removing weird cases where a value in a WHERE clause could be considered a valid hint (spoiler: it should not). The WIP patch relies on psql's lexer, largely simplified, to get the work done. (An example hint comment is shown after this list.)
    • Plug in a second yacc/lex parser for the hints retrieved from the query strings. This would be kept separate from the first one to simplify maintenance if any grammar rules change upstream.
    • Possibility to disable indexes in queries (DisableIndex), as well as the possibility to define an index list for bitmap index scans, index scans and index-only scans.
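
For context, pg_hint_plan reads its hints from a specially formatted comment at the start of the query (or from its hint table); a minimal example of the kind of hint string the new parsers have to extract and interpret (table and index names are illustrative):

  /*+ SeqScan(a) IndexScan(b b_pkey) NestLoop(a b) */
  SELECT *
    FROM a
    JOIN b ON a.id = b.a_id
   WHERE a.val = 10;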

Areas of improvements that have been mentioned:

  • Force PGC_POSTMASTER GUCs in SET hints.
  • Store the query ID in the hint table rather than the normalized query (see the example after this list). Per Horiguchi-san, we'd still need to keep track of the normalized query to be able to look at it without pg_stat_statements.
  • Enforce starting relation for group node in join path.
  • Provide the ability to force a full executor/planner tree. This may not be something pg_hint_plan itself could do, as plan reproducibility is a larger problem when plans are shared among backends.
  • Storage and reuse of plans generated by the planner. Similar to last point.
  • Take as input (EXPLAIN ANALYZE output, query string, the plan I would like) => generate a list of hints that could be used.
  • Force row estimates for simple scans inside hints (Lukas F.). This may prove challenging, as pg_hint_plan does not go that deep into the planner internals and there are no specific hooks that allow this level of granularity.
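
For reference, pg_hint_plan's existing hint table keys hints on the normalized query text; the proposal above would key them on the query ID instead. Current usage looks roughly like this (values are illustrative):

  -- Requires pg_hint_plan.enable_hint_table = on.
  -- Hints are applied by matching the normalized query string.
  INSERT INTO hint_plan.hints (norm_query_string, application_name, hints)
      VALUES ('SELECT * FROM t WHERE id = ?;', '', 'SeqScan(t)');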

Index Prefetching

Notes from Tomasz R.:

The main prefetching goal is to increase performance: we ask (the OS or PostgreSQL) for a page in advance to avoid waiting, so that it is ready when we need it later.

This is only about prefetching heap pages; most index pages should already be in RAM, so trying to prefetch them won't bring performance gains and will cost syscalls. OTOH it might be useful for index-only scans, to prefetch leaf pages so they are ready when we want to check their predicates.

A bitmap index scan already uses prefetching when building the bitmap, but other index scan types don't. Prefetching's ultimate goal should be to bring bitmap-heap-scan-level performance to all index types. It should be possible for all indexes except GIN: it is lossy and we cannot reliably get a tuple from it. The other option would be to overuse bitmap scans, but this might lead to mis-optimized queries, especially when ordering matters. There are also differences in how duplicates are stored: 12 and earlier keep them in any order, 13 and later in increasing ctid order. Also, a bitmap scan is sometimes not possible at all, for example for GiST distance queries.
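
For reference, the existing bitmap-scan prefetching is driven by effective_io_concurrency; a quick way to observe the behaviour the patch wants to extend to plain index scans (table and query are illustrative):

  -- Prefetch distance used by bitmap heap scans (0 disables prefetching).
  SET effective_io_concurrency = 32;

  -- A plan using Bitmap Index Scan + Bitmap Heap Scan issues prefetch
  -- requests for the heap pages collected in the bitmap; the equivalent
  -- plain Index Scan currently does not.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM orders WHERE customer_id BETWEEN 1000 AND 2000;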

When we follow an index, most of the I/O is random, and most of the algorithm/code is synchronous, which means we need to wait for each page to be read. So the more operations we need to perform on a given page/tuple, the bigger the final gain.

The original version of the patch fetched all links from the leaf pages: the rationale was that it would behave a bit like a sequential scan, reading a continuous stream of pages, just with an additional level of indirection: not heap pages in order, but the heap pages pointed to by the index. With more realistic queries, though, we don't use all those pages: we have SORT, WHERE, LIMIT... This means we most probably won't need many of those (already-read) pages, and reading them is wasteful. We might get information from the planner about how many pages we'll (or might) need, and behave as during a bitmap scan, using incremental ramp-up; in any case, reading the right-link of a leaf page (i.e. reading the next page) makes sense in most cases.

The first solution was a bit naive, prefetching all tuples in the order they appear in the index. Now there is a small cache to avoid prefetching the same page twice, plus pinning/unpinning of those pages. A small cache (16-32 entries) is enough to avoid the overhead of re-fetching (and thrashing) while still allowing a simple search through it. We also don't want to prefetch too much, so as not to overwhelm the kernel's I/O subsystem: we want to leave enough bandwidth for WAL flushes, checksumming, other background workers, etc. In particular, do not prefetch too much at the beginning of execution, so as not to increase latency for the first row.

Benchmark results for TPC-H: some queries are faster, q2 is slower (a regression). We get similar results when the data is cached. OTOH most of the gains (but also the regression!) disappear with parallelism.

Does cardinality matter for those cases? 1-key or many keys? Is there a case for bitmap scan if this patch is applied? Maybe for AND/OR queries.

Code changes: struct Prefetcher and struct BTScanPosData; the API changes by adding prefetching to the index AM.

Prefetching is executed in index_getnext. Should we move this outside of the index access methods? Getting the block is specific to the index; it might also be table-specific. We might need to push information to higher levels (of the executor) during OLTP query execution.

We might want to change the cost model (like for bitmap scans); there were differing opinions on that change. We consume the same amount of resources in total, but OTOH we avoid waiting on I/O and use more CPU for pinning/unpinning and other page management.

The current patch changes 34 files. It will be larger once the scope increases discussed above are taken into consideration.

Wider discussion about possible optimisations and code instrumentation

Prefetching into shared buffers: we don't have an API to check whether a page has already been read, so we need to manually check all buffers. It would be nice to have something like:

  prefetch_page_which_was_previously_in_buffer(36);

Also, an API that feeds the calling code a stream of already-pinned pages, to avoid spending CPU cycles on pinning/unpinning.

Check whether the prefetch window is large enough. Check (via the kernel?) how many cache hits and misses there are, to know how much prefetching we need. The kernel already prefetches some of the pages: we might know better and therefore want to disable that prefetching for our use case. Would this be a nice case for direct I/O?

Do we need to check that the index hasn't changed too much, e.g. by vacuum?

Wider picture: instrumenting code to understand changes. How do we know whether the code (or a patch, or our changes) is doing what it's supposed to do? We want to avoid regressions, or find them more easily and quickly during development. Deterministic tests would help avoid falling into known, already-encountered regressions or problems, and would provide a baseline. So there is a need for better benchmarks and regression tests: the existing test setup might be too complex for hacking or refactoring. The problem is not only nondeterministic timing: we need to remove as much variability as possible and avoid result contamination across subsequent test runs.

Logical Replication of DDLs

PGCon 2023 presentation slides: https://wiki.postgresql.org/images/d/dd/20230602-DDL_Replication.pdf

Notes from PeterS:

Zane: overview of use cases

  • use case #1
    • major version upgrade.
    • Using LR for a staging environment, but unable to replicate DDL
    • those who want to upgrade want all the DDLs
    • might also want to auto-fix syntax
  • use case #2
    • consolidate multiple databases
    • maybe only need DDL replication for tables
    • name mapping is desirable to avoid clashes
  • use case #3
    • heterogeneous replication
    • PG to Redshift
    • currently cannot replicate DDL

Zane: Current patch is on top of the existing LR infrastructure.

  • capture DDLs using event triggers (see the sketch after this list)
  • append a new WAL record; it is decoded and transferred
  • the subscriber reconstructs the command (+ optional mapping or transformation), then executes it
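
A rough sketch of the capture side using the existing event trigger machinery (the actual patch adds dedicated infrastructure for this; the function and trigger names here are hypothetical):

  -- Fire after each DDL command and hand the deparsed command(s) to a
  -- function; the real patch would serialize them into a structured
  -- (JSON) form and emit them into WAL for logical decoding.
  CREATE FUNCTION capture_ddl() RETURNS event_trigger
  LANGUAGE plpgsql AS $$
  DECLARE
      cmd record;
  BEGIN
      FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
          RAISE NOTICE 'captured DDL: % on %',
              cmd.command_tag, cmd.object_identity;
      END LOOP;
  END;
  $$;

  CREATE EVENT TRIGGER capture_ddl_trigger
      ON ddl_command_end EXECUTE FUNCTION capture_ddl();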

Zane: Design choices

  • How to capture?
    • triggers?
  • What kind of payload format to put in the WAL?
    • cmd string?
    • not all are schema-qualified --> force search path, security implications
    • let the user supply a search path?
    • unable to have schema mapping
  • JSON
    • Alvaro pointed us to ddl deparsing --> JSON blob representation
    • flexible
    • could fully schema qualify everything
    • problem is the maintenance/test burden
  • Explanation of diagram
  • Explanation of JSON

Q. Dave: Able to use Slony?

  • James: possible to do it now
  • Amit: synchronization is difficult
  • Amit: ok we can do it, but we want it in core
  • Zane: manually is OK

Q. Mark. Are you only interested in the parser or post-parse hooks?

  • Amit: we already have hooks at command time
  • Mark: No, I mean I can currently have parse-hooks with nonsense cmd that executes fine even though PG does not understand
  • Amit: we are building from parse-tree and catalog
  • Zane: we get some info from parse-tree and some from catalog so it depends

Q. Suzuki-san: At the subscriber side should schema mapper be universal or ...? worried mapping will depend on the publisher side. (e.g. if the table is in a different tablespace)

  • Amit: think this can be changed
  • Euler: same problem as global obj?
  • Amit: we should not just ignore the tablespace
  • Jonathan: ignoring tablespace stuff needs to be an option but not a default
  • Euler: same issue for other objects?
  • Amit: eventually aim to replicate everything
  • Amit: currently only tables... later want more objects
  • Suzuki-san: how about sequence etc.. is it too limiting for the first implementation
  • Amit: scope is small otherwise patch is too big
  • Amit: 1st tables; 2nd dependent objects on tables; 3rd other stuff
  • Q: triggers/functions should go together?
  • Amit: we are trying to mimic what pg_dump does (e.g tables are just tables) ... so we can use pg_dump for the initial sync

Zane: there is a DDL option for enabling it (a hypothetical example follows the list below)

  • granularity
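
The option under discussion lives at the publication level; a hypothetical example of what enabling table-level DDL replication might look like (the exact option name and accepted values are still being worked out in the patch):

  -- Hypothetical syntax: replicate DDL for tables alongside their data.
  CREATE PUBLICATION mypub FOR ALL TABLES WITH (ddl = 'table');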

Q. Kuroda-san: Are we ignoring drop?

  • Zane: no, DDL means all create/alter/drop
  • Amit: In 1st phase we want minimal functionality
  • Euler: maybe have specific 'classes' of objects (to avoid long lists in the future)
  • Amit: 3 phases as above 1st only tables; 2nd table dependencies ...
  • Euler: e.g want procedures
  • Amit: we will specify later that it will ONLY be the tables, not the dependencies
  • Euler: it is better if we have all the dependencies.
  • Amit: we are trying to follow the pg_dump model. so when you say table you get only table etc...
  • Amit: ddl replication and initial sync should behave the same
  • Mark: will you have initial sync + ddl?
  • Amit: yes

Zane: The logical logging format (slide)

  • cmd string
  • structure format as JSON

Q James. Do they have to be mutually exclusive? Should also include the original string

  • Amit: yes we can do that. Maybe more burden on the size in the WAL
  • James. likes the format, but would be nice to include the *original* command as a reference to know what happened
  • infosec people want to see just the command
  • Jonathan: Good idea - have cake and eat it too
  • Amit: Do you want it as optional?
  • James: no, do it by default.. it is not much of size
  • Zane: JSON is about 11x original
  • Amit: there are some reasons why not to use the original string...
  • James: right, but not saying shouldn't be there for applying.
  • Amit: we might change the JSON on the fly
  • PeterS: including the original in the same place can make testing easier
  • James: downstream I want to see the original command for logging (eg just for putting it in a log somewhere)
  • Mark: what about different storage defaults timescale/heap?
  • Amit: that will be the user responsibility if defaults are different
  • Jeff: where do we get the format?
  • Zane: it is hardcoded in the decoding
  • Jeff: %I %d etc means ??
  • Amit: just to represent the information: %I is an identifier and %T is a type name
  • Suzuki-san: can we write different decoders for other use cases or is there a single decoder?
  • Amit: not now. it is hardcoded
  • Zane: you can use different output plugin

Q. Jeff: also considering GUCS that matter?

  • Amit: eg. ??
  • Jeff: maybe include it in the JSON format so you can know what it was at the publisher side
  • Amit: it would be a long list
  • Jeff: why long? only the ones that affect the cmd...
  • Amit: what to do with GUC - set them?
  • Jeff: not necessarily but maybe could do stuff with them
  • Chris: GUC might be at the system level...
  • Jeff: it is just an idea
  • Amit: not considered - we can think about it.
  • Jeff: imagine only a handful of these.

Amit: Summary so far:

  • we want a JSON format
  • We want to log the original cmd within it
  • trigger mechanism is OK

Q. Sawada-san: How to do mapping?

  • Amit: subscriber side may have some options
  • Sawada: ... (inaudible question)
  • Amit: will give a list of names as key/value pairs that can be changed
  • Amit: mapping has to be done *before* executing the command

Q. Suzuki-san: JSON is essentially the same as a parse tree. It changes from version to version, so we need to know the structure and handle the differences between each format. Maybe a lot of work

  • Zane: can use JSON schema for version tracking. We need to look into it.
  • Amit: we need to consider this problem
  • Suzuki-san: JSON needs version
  • Euler: can check the version. Not sure if it is ok.
  • Amit: maybe the protocol version needs to change if JSON changes

Zane: Initial Schema Sync

  • issues
  • suggest using pg_dump or API

Zane: Wrap-up

  • References + wiki (slide)
  • Amit: see the wiki for the summary
  • Amit: we are trying to use pg_dump in the apply worker for the initial sync to reduce complexity
  • Amit: this builds in some dependency on the packaging

Amit: Wrap-up

  • please send all suggestions to -hackers.
  • Slides can be put on the wiki
  • Notes also to be put on the wiki