PGCon 2022 Developer Unconference
This is the PGCon 2022 Unconf Page
- 1 Unconference schedule by Time and Room
- 2 Notes/Minutes
- 2.1 PostgreSQL vs the block size
- 2.2 Removing SPI from RI trigger implementation
- 2.3 Logical Replication High Replay Lag
- 2.4 State of the SQL Standard
- 2.5 Improving Statistics Accuracy
- 2.6 Logical Replication Origin Filtering and Consistency
- 2.7 Observability gaps
- 2.8 Meson new build system proposal
- 2.9 Reducing heap alignment losses
- 2.10 Improving Indexing Performance
- 2.11 Benchmarking Round Table
Unconference schedule by Time and Room
|Time|Stream 1|Stream 2|Stream 3|
|---|---|---|---|
|Wed 10:00-11:00|Session Pitches and Scheduling|||
|Wed 11:15-12:15|PostgreSQL vs the block size|Removing SPI from RI trigger implementation|Logical Replication High Replay Lag|
|Wed 13:15-14:15|State of the SQL Standard|Improving Statistics Accuracy|Logical Replication Origin Filtering and Consistency|
|Wed 14:15-14:30|Coffee Break|||
|Wed 14:30-15:30|Observability gaps|Meson new build system proposal|Reducing heap alignment losses|
|Wed 15:45-16:45|Improving Indexing Performance|Benchmarking Round Table||
PostgreSQL vs the block size
- Presentation including these benchmarks:
- Charts showing correlations between block sizes and performance for WAL and data block sizes
- David Steele: Surprising that smaller block sizes perform well, not larger. WAL checksums are optimized for fixed-size blocks, among other optimizations.
- Different block sizes for different objects?
- would require multiple shared buffer areas or some solution like that
- Other areas would be complex -- full page writes?
- It’s not just code structure -- replacing macros with variables
- Some code correctness depends on these things being constant: TOAST threshold, index tuple size, max tuples per page
- More analogous to segment size than WAL size
- As long as default is 8k then the risk to users is minimized. They won’t be impacted unless they actually change block sizes
- Bruce: do we know why the block sizes have these effects?
- David Steele: sheer number of blocks can cause extra effort
- Large blocks get higher data density and more sequential i/o
- Block sizes that don’t match hardware blocks cause extra i/o (especially SSDs and raid arrays)
- Effects on cache hit ratio -- surprisingly it seems large/small(?) blocks have higher cache hit rate
- Full page writes are more expensive for larger blocks
- Can we maybe keep track of dirtiness of portions of the blocks to handle large block sizes
- Would JIT help optimise the code that depends on block size
- Having to statically allocate memory to different shared buffer pools for different block sizes would be bad for users
- Direct I/O may have a large effect and illuminate the reasons for the performance behaviours
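As a back-of-the-envelope illustration of the data-density point above, a simplified sketch of tuples per page at different block sizes (the header and line-pointer sizes are approximations, and real pages also MAXALIGN tuples and may reserve special space):

```python
# Simplified model of why block size affects data density: usable bytes per
# page after the fixed page header, and tuples per page for a given tuple
# size. Numbers are illustrative only.
PAGE_HEADER = 24    # approximate page header size, before line pointers
ITEM_POINTER = 4    # one line pointer per tuple

def tuples_per_page(block_size, tuple_size):
    usable = block_size - PAGE_HEADER
    return usable // (tuple_size + ITEM_POINTER)

for bs in (4096, 8192, 16384, 32768):
    n = tuples_per_page(bs, 121)  # a 121-byte tuple, header included
    waste = bs - PAGE_HEADER - n * (121 + ITEM_POINTER)
    print(f"{bs}: {n} tuples, {waste} bytes left over")
```

Larger blocks amortize the fixed page header over more tuples, but also leave a different amount of unusable tail space per page.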
Removing SPI from RI trigger implementation
Amit's "what he heard them say" summary notes:
- Should the planner be skipped in some triggers and not others? Does it really have to be an SQL string that’s passed to the planner if it is not to be skipped?
- Isn’t the plan space small in either case (scanning the PK table to look up a given referencing row being inserted, or scanning the FK table to look up a given referenced row being deleted)? So how bad would a bespoke mini-planner be that outputs, not a plan tree, but, say, just the OID of the index to be scanned?
- Peter E:
- Creating Query tree by hand might not be such a great idea given past experience
- Full-blown planner might be needed after all to handle things like foreign keys using exclusion constraints someday
- CASCADE/SET actions would need *some* executor behavior, but maybe we can reuse what logical replication workers use to insert/update received tuples, because they already support doing things like firing triggers, updating indexes, etc.
- Why don’t we try getting rid of only the SPI layer from ri_triggers.c before starting to rewrite the underlying checks?
- Amit’s patch didn’t get in basically because of the undesirability of having 2 implementations / sets of semantics depending on which side of the foreign key a trigger is on, most prominently the semantic differences in privilege checking and in the snapshot used for scanning
- Real overhead might be the executor (init, shutdown), so maybe we could cache the EState/PlanState tree across tuples being checked (Heikki mentions EvalPlanQual() doing a similar thing)
- Using statement triggers with their transition tables, not row triggers, might be the idea to pursue
- Let’s avoid too many bespoke planner/executor mini-implementations (maintenance burden, bugs)
- Row-level triggers are fine for inserts, but not deletes
- Let’s just stop doing the foreign key checks in triggers, but integrate them into the insert/update/delete’s plan node (ModifyTable)
- SPI overhead is noticeable in the insert/update of referencing tables, but not always in the update/delete of referenced table
Additional notes from Peter E.:
- problem is FK performance
- proposed patch (hardcodes index scan) gives 2x performance on FK inserts
- transition tables use tuple stores, high overhead, maybe just record ctids instead
- previous experiments with using statement-level triggers had mixed results, high overhead in some cases
- idea: store parse trees in the trigger/catalog
- execReplication.c wasn't written for performance or simplicity; it exists because proper integration was too hard. It should perhaps be removed rather than used as an example to do more like it.
Logical Replication High Replay Lag
- walsender needs very many cpu cycles to send out logical replication data
- walsender is in CPU contention with regular backends trying to execute transactions
- perhaps improve things with cgroups?
- slide deck with benchmarks highlighting some of those issues https://www.postgresql.eu/events/pgconfde2022/sessions/session/3773/slides/297/benchmarking_four_replication_solutions-julian_markwort-pgconf_de-2022-05-13.pdf
- use niceness or something like that (should also work on other POSIX-like OSes)?
- have you tried profiling the walsender process?
- maybe some obvious inefficiencies
- need to look at different workloads (longer transactions, bulk loads)
- need to consider platform compatibility (do we want to create platform-specific solutions, e.g. cgroups in postgresql?)
- improve/speed up/parallelize the replay process (could profit both binary and logical replication)
- limit the load through a pooler
- check for obvious improvements in walsender
- start discussion whether a cgroup solution in core would be desired
- experiment with little script that regularly checks walsender and cgroup assignments and tries to fix them
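The "little script" idea above could look something like this minimal sketch (Linux-only, scans /proc rather than using any PostgreSQL API, and assumes permission to renice the walsender processes; a cgroup variant would assign the PIDs to a CPU-limited cgroup instead):

```python
# Minimal sketch of a watchdog that finds walsender processes by scanning
# /proc and lowers their CPU priority. Run periodically, e.g. from cron.
import os

def find_walsenders(proc="/proc"):
    pids = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\x00", b" ").decode(errors="ignore")
        except OSError:
            continue  # process exited while we were scanning
        if "walsender" in cmdline:
            pids.append(int(entry))
    return pids

def renice(pids, niceness=10):
    for pid in pids:
        try:
            os.setpriority(os.PRIO_PROCESS, pid, niceness)
        except (PermissionError, ProcessLookupError):
            pass  # not ours to renice, or already gone

if __name__ == "__main__":
    renice(find_walsenders())
```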
State of the SQL Standard
Notes from Peter E.
- New SQL standard expected in early 2023
- So this would be a topic for PostgreSQL 16.
- Some discussion about open access to standards. That's a big political issue, but there is some small amount of discussion and movement.
- Questions about which unimplemented SQL features are most requested or how to pick a feature to work on -- Follow normal development process: Pick something you find worthwhile. The fact that it's in the standard can be one supporting argument, but not the sole one.
- Should we mark features that are unimplemented but that we find obsolete (so will probably never implement)? [seemed like a welcome idea]
- How to participate in SQL standards process? -- Contact your national ISO member, they will direct you.
- Some discussion about the SQL/MDA standard. Could be interesting for PostGIS and related uses. -- We probably need better handling for large values for that.
- There was a thought that we need to support efficient partial updates. But the feedback was that for example for large satellite images you only need to support replacing the whole value.
- Any work on standardizing full-text search? -- no
- Any work on casts with defaults? -- No, but seems like a plausible topic.
- Peter wants to make a permanent wiki page with this information.
- Any questions about details can be sent to Peter Eisentraut.
Improving Statistics Accuracy
- Presentation of notes from Matthias van de Meent: [url]
- Example of cross-column correlation: city, zip
- Cross column statistics require manual intervention
- HLL Sketches
- Theta Sketches
- Available as a library under the Apache License
- Apache License doesn’t come with patent licenses
- Don’t go looking for patents, just avoid anything that may be encumbered.
- Most sketches are designed for streaming large data sets -- our requirements are much lower and we usually have a sample in advance so we can use a simpler sketch.
- Possibly bloom filters?
- How do sketches avoid requiring manual selection of column sets?
- Sketches for separate columns can be combined to implement intersection, union, etc.
- Are these sketches similar to bitmap indexes, brin indexes, etc?
- Not really because they are based on samples
- They’re aiming to be very small ~ order 1k buckets per value
- Discussion of data structures used in sketches
- Simple data structure for discussion:
- A list of bitmaps, one for each value in the sample; each bit of a bitmap represents a row from the sample and indicates whether that row had that value
- A search for e.g. city=Montreal, country=Canada could combine the bitmaps for city and country to calculate the expected selectivity
- We can worry about the size less if it’s possible to enable/disable per column
- Sampling for statistics is “fully random”
- Fear about sampling seeing inconsistent data between different columns
- Do we need a different “deterministic” sampling to avoid this?
- But we sample whole rows, so the values within a row should be consistent?
- Changing the sampling method would be an isolated change but it would be hard to tie it to the type of statistics being gathered.
- …(some discussion about issues with sampling)
- Better to keep any sampling changes separate from statistics data structure changes
- Some sketches could be combined across inheritance tree
- Does this work with sampling when tables are very different sizes
- When combining sketches from each child table we would have to keep a subset of the samples where that size is based on the size of the child tables
- Best to take a very simple sketch data structure (e.g. bitmap) for a first implementation
- More sophisticated data structures solve different problems, and we need to identify the problems before choosing one
- HLL answers a different question than we need -- number of distinct values -- and if we store exact tid then we’re not really using HLL for what it’s good for
- Bloom filters would probably be a better second step but start with a simple bitmap data structure (https://mobile.twitter.com/pvh/status/979391936690667520)
- Counting Bloom Filters, Cuckoo Filters, … many variants for solving different problems with different pros and cons
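The simple per-value bitmap structure discussed above can be sketched in a few lines; this is a toy model over a tiny hand-made sample, not a proposed implementation:

```python
# Toy version of the per-sample-row bitmap sketch: one integer bitmap per
# observed value, where bit i says whether sample row i had that value.
# The joint selectivity of city=Montreal AND country=Canada is estimated
# by ANDing the two bitmaps and counting the surviving bits.
def build_bitmaps(sample, column):
    bitmaps = {}
    for i, row in enumerate(sample):
        v = row[column]
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

def selectivity(sample, predicates):
    """predicates: dict mapping column name -> required value."""
    mask = (1 << len(sample)) - 1  # start with all sample rows
    for column, value in predicates.items():
        mask &= build_bitmaps(sample, column).get(value, 0)
    return bin(mask).count("1") / len(sample)

sample = [
    {"city": "Montreal", "country": "Canada"},
    {"city": "Montreal", "country": "Canada"},
    {"city": "Paris", "country": "France"},
    {"city": "Toronto", "country": "Canada"},
]
print(selectivity(sample, {"city": "Montreal", "country": "Canada"}))  # 0.5
```

Because the bitmaps are combined per column set at query time, no manual selection of column combinations is needed, unlike extended statistics today.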
Logical Replication Origin Filtering and Consistency
- while benchmarking different synchronous_commit settings on the primary, noticed performance oddities when waiting for the replica to confirm the flush
- benchmarks and some notes here: 
- turns out, the feedback messages (write, flush, apply) are designed for binary streaming replication
- feedbacks are sent at different points in time in logical replication
- write feedback: when the transaction is received from the network
- replay feedback: when the replaying transaction is finished
- flush feedback: sent after the replayed transaction is committed when synchronous_commit=on on the subscriber. When it is off, the apply worker regularly checks the flush state of recently replayed transactions on the subscriber side and then sends the appropriate feedback. This latter behaviour leads to very unpredictable and high latencies.
- documentation of synchronous_commit setting in CREATE SUBSCRIPTION is spotty. It should explain exactly how this works, and that it -- just like in regular user sessions -- overrides the "global" synchronous_commit setting in the subscriber database instance.
- the table about consistency is only correct for binary streaming replication or when the subscriber is configured for synchronous_commit=on (or higher)
- "have never thought about these issues" (regarding differences between feedback for different mechanisms)
- not easy for newcomers to wrap their heads around, especially when already accustomed to streaming replication, where these feedback levels originated
- should be addressed for cases where logical replication is the only HA mechanism
- issues came up during other discussions already
- feedback sending could be definitely improved
- add a separate version of the table, covering the specifics of logical replication, to the logical replication documentation
- update terminology in table (standby <-> replica): the table is only appropriate for physical standbys/replicas
- It is very easy to filter out transactions that originated somewhere other than locally. All local transactions have their origin set to 0 (unless changed in the session).
- Handle infinite recursion in logical replication setup: 
- the origin id is determined when a subscription is created
- based on the oid of the origin instance
- origins should be referred by origin name, but pglogical only allows filtering through a list of origin ids.
- improvement idea: use names for origin filtering instead of id
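The origin-filtering rule above (forward only locally originated changes, i.e. origin 0) reduces to a one-line filter; a toy model with hypothetical change dicts, not the actual output-plugin API:

```python
# Toy model of origin filtering for bidirectional logical replication:
# keep only changes whose replication origin is 0 (locally originated),
# so replayed changes are not sent back to the node they came from.
def changes_to_forward(changes):
    return [c for c in changes if c["origin"] == 0]

changes = [
    {"xid": 101, "origin": 0},  # local transaction: forward it
    {"xid": 102, "origin": 3},  # replayed from another node: skip it
]
print(changes_to_forward(changes))  # only xid 101 survives
```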
Observability gaps
- PG has lots of metrics, but to access them you have to connect to PG and issue SQL queries; not great for system administrators
- Simpler to have metrics made available in similar systems that they use for other things (OS, networking stack, etc)
- Open Metrics
- Flattened relational model, more like a data warehouse; labels/dimensions and values/facts
- Limited querying ability
- State of the art is that an agent is required on each DB system and then that connects to PG to run SQL queries
- This approach causes a lot of headaches
- Agent has to also be monitored
- Mappings between PG and the system done by the agent author, who isn't a PG developer
- Agent is always behind due to not being part of PG core, so when core adds something, the agent author then has to go update their agent code
- Cross-version issues
- Agents end up having to be supplemented by queries specific to particular systems
- PG can make it difficult because certain metrics are not easy to be flattened out
- Everyone has to develop their own mappings and their own alerts
- Ideal world
- Standard set of metrics from PG that would automatically be exported
- Automatically updated with new PG releases
- PG talk native Prometheus / Open Metric protocol to allow service discovery, etc
- A challenge is that not everyone has the same set of metrics that they'd like to see
- Different metrics are interesting to different people
- Per query/object/session metrics
- Summarized vs. instantaneous
- Lots of different views
- Different kinds of users
- One big database, one big application
- Simple database but have thousands of them
- DBA who isn't concerned with queries but concerned with storage / uptime
- OS metrics + PG metrics
- How metrics are published
- Different agents publishing different metrics potentially at different frequencies and measurements
- Each filesystem vs. each user vs. error / high-level monitoring
- Methods for getting in-depth, but there's a base level of monitoring for all systems
- In production it's not always possible to get in-depth data
- Concern about too much overhead in some cases for PG core to provide ways to monitor things because not everyone wants that monitoring
- When a problem is found, where is the rest of the data?
- Different in PG vs. other RDBMSes: no repository in PG for holding performance statistics over time
- Have to emit these stats for other systems to be able to store and track this
- In other RDBMS, in-depth analysis able to be done within the system, but PG doesn't provide this
- Means that a lot more data ends up being exported to other systems which has downsides
- When working on a grafana dashboard
- Had to deploy a lot of queries through the agent
- Dashboard assumed that data was structured in a certain way with certain labels
- ideally be able to just have things "work"
- Existing agents don't export even the "base" level of things that should be exported for all systems
- Backend publisher for metrics to spit out things in a standard format (Open Metrics)
- Don't need a client system that specifically knows about PG
- Security / permissions around metrics are also a concern
- Security issues don't go away even if a metric system is integrated into PG
- Reconnecting to the database a lot of times is necessary because some metrics are per-DB and have to connect to each DB
- Monitoring systems are not happy having to connect to a lot of DBs
- Maybe with native support, wouldn't have to connect to each DB, which would make it easier for monitoring
- Hard to design dashboards at high level to get overview of all DBs, all clusters, and how they are behaving
- Have to go to a different dashboard to drill-down
- One cluster may have 30 DBs and want an aggregate, but also to know where things come from in order to drill down
- Many dashboards are for an individual database and it's hard to figure out ways to present data for hundreds/thousands of databases and for only a few
- Key performance metrics for single DB in single cluster
- Mapping of replica and figuring out which is primary and so on
- Dashboards not built to understand primary vs. replica and the queries which go against each
- Replica and primary can change over time too
- Different databases may not agree on who is primary vs. not
- Slony provided a way to see topology and could follow changes of primary vs. replica
- Data from pg_stat_replication / pg_replication_slots makes it difficult to reconstruct the topology
- Slot names arbitrary
- Hard to match up with other nodes
- Wanted to detect if a node wasn't a replica and thought another node should be connected to it but that isn't really tracked
- Due to polling not happening at the same time, didn't end up being consistent
- Data for replication isn't really designed for monitoring the entire topology from a single view
- Meant for monitoring *this* DB
- Not able to see if the other nodes are consistent with this node though
- Not too hard for a human to interpret, but names are strings while the monitoring system is expecting floating-point metrics
- Some tables have richer data than can be represented in monitoring systems, such as replication data
- Useful to be able to define your own metrics
- Having exporting built into PG might allow DBAs to define this
- If we build something into PG, we should try to make it extensible
- System monitoring is useful and would be good to have consistent across all of them
- Might not need normal SQL access to get this information
- Maybe get all data from the stats system without needing to actually have a transaction open
- With agent, monitoring breaks if we run out of connections or ability to open transactions
- Monitoring can break due to locks
- Monitoring should be low-level pulling data out of shared memory
- Having the flexibility of SQL is part of the issue in some cases
- end-to-end monitoring can be good
- Users don't have monitoring of error rates and rollback and disk i/o errors even though that should just be part of the base monitoring
- Maybe have end-to-end monitoring be in a separate tool from the 'base' PG metrics
- Should there be an agent?
- Maybe not
- Ability to grab data w/o connecting to a database
- Maybe would be good
- Grabbing data across all databases
- Likes the idea about extensibility
- Kubernetes / Kernel allow you to grab out whatever you want with hooks and such
- Maybe PG should have similar
- Possibly without having actual SQL-level queries
- Stats system which is in shared memory and on disk
- Most things from PG stats gathered through this mechanism
- Outside of the usual transaction system
- Tempting to think that everything that is needed is in the stats system
- pg_stat_activity and pg_stat_replication mostly just export shared memory info
- Not sure how to observe with ebpf what is happening inside PG, as you'd have to understand all the PG data structures and such
- Making it configurable through a mechanism similar to ebpf would allow users to implement monitoring of things like subtransactions without causing overhead for users which don't need that
- Good to allow folks to turn things on/off in terms of what they want
- ebpf is great because it lets folks get things that they didn't think to get originally, not just because of the overhead concern
- PG stats don't have much OS data
- PG has how many blocks PG read but not how many blocks the kernel read
- Not good portable interfaces (either old and not great or new and not portable)
- pg_stat_statements might have getrusage data, for example
- getrusage isn't super useful due to being pretty "old fashioned" and not portable to Windows
- Way to get arbitrary stats from ebpf and in an extension to fill in more metrics
- Not able to make PG depend on ebpf but maybe could support it through an extension to support more stats?
- Get things like how much bandwidth/latency on the network
- Using an extension
- added extra columns to existing tables, those would automatically show up, wouldn't have to change agent's queries and add labels, et al
- Could have a standard set of alerts defined
- Adding columns to existing tables could cause issues
- What I want to achieve:
- Turn on PG and it all just works without an agent and with some standard dashboard that contains everything useful
- Per-object metrics doesn't really scale for folks with lots of databases
- What is the patch that is being worked on?
- Old patch w/ getrusage
- Added getrusage calls to EXPLAIN
- begin/end of each node, calls to getrusage
- Adds blocks read/written, system time, user time, page faults, and other metrics from getrusage
- Side point / example of additional metrics that would be interesting to be able to add and store/return those values
- pg_stat_statements could potentially also do this, maybe also pg_stat_tables but not sure how that would work
- Maybe call getrusage at start/end of every access to a table?
- Same general logic might work better if it's able to be done by querying the kernel (using ebpf?)
- More general project is to get native prometheus out of PG directly w/o an agent
- General dashboards and monitoring could then rely on the same data to be presented by PG
- More metrics that are added would need to be done in some kind of extensible way
- Looking from engine's / PG's perspective understand how much memory is being used by queries
- Sometimes want to quickly know what's going on without going into all of the memory contexts
- Which backend is using the most memory
- Maybe use pg_cancel_backend() to kill that backend and keep PG running without running out of memory
- hash mem context added which could end up using a lot of memory
- Having more data on a per-backend basis can be helpful and see how processes are taking memory in a dashboard
- Maybe can't show it always but could provide a function to collect it?
- Folks are testing things out with ebpf and tracepoints and see how many times we are actually calling palloc/malloc
- Trying to get data from glibc/kernel for this
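For illustration, this is the kind of Prometheus/OpenMetrics text exposition format an in-core exporter might emit; the metric name and labels here are invented for the example, not an existing exporter's schema:

```python
# Render counters in the Prometheus text exposition format: a HELP and TYPE
# comment per metric, then one sample line per label combination.
def render(metrics):
    lines = []
    for name, help_text, mtype, samples in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in samples:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

print(render([
    ("pg_xact_commit_total", "Transactions committed", "counter",
     [({"datname": "postgres"}, 41234),
      ({"datname": "app"}, 98765)]),
]))
```

The flattened labels/values shape is what makes per-database metrics exportable without a monitoring system connecting to each database individually.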
Meson new build system proposal
Notes from Peter E.:
- Current development information can be found via the commit fest entry: https://commitfest.postgresql.org/38/3395/
- We discussed the rationale for this project and demoed the current patch.
- Open issues include:
- support for extensions (alternative to pgxs); several approaches were discussed:
- write new meson.build files for every extension
- generate pgxs.mk by meson during build
- provide a conversion tool for extension makefiles to meson.build
- Is something like Coverity supported? -- probably yes, just specify a different compiler binary
- Niche platforms such as Solaris and AIX need to be tested.
- We hope to replace the Windows build system by Meson for PG16.
- The Autoconf build system will stay around in parallel for some time; details TBD.
Reducing heap alignment losses
- Heap alignment losses add up to 43.25% in bad cases - [bool, bigint]...
- Automated column packing on CREATE TABLE ... or ALTER TABLE ADD COLUMN [, ... ] could save space, but current APIs would not allow that without deviating from the SQL standard, as we currently only support logical order == physical order == catalog insertion order.
- HeapTupleGetNAtts() is a monotonically increasing table schema version. That makes it a potential discriminator between different physical layouts of tuples when the physical layout of tuples only changes if at least a new column is added.
- Reordering columns would potentially require (expensive) table rewrites, but new column insertions in "waste space" could work through the above API.
- CLUSTER could potentially rewrite tuples, but that'd require significant catalog changes during CLUSTER and serious tuple processing. That's a lot of extra CPU and effort for something a DBA might already do well.
- Previous work on the topic: https://www.postgresql.org/message-id/flat/20141209174146.GP1768%40alvh.no-ip.org
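The loss for [bool, bigint] can be reproduced with a small padding calculator using the pg_type typalign codes (this sketch ignores the tuple header and the final MAXALIGN of the whole tuple): [bool, bigint] needs 16 bytes of data area while [bigint, bool] needs only 9, i.e. 7 of 16 bytes are padding.

```python
# Compute the heap tuple data-area size for a given physical column order.
# typalign codes as in pg_type: 'c' = 1, 's' = 2, 'i' = 4, 'd' = 8 bytes.
ALIGN = {'c': 1, 's': 2, 'i': 4, 'd': 8}

def data_size(columns):
    """columns: list of (typlen, typalign) pairs, in physical order."""
    off = 0
    for typlen, typalign in columns:
        a = ALIGN[typalign]
        off = (off + a - 1) // a * a  # pad so this column starts aligned
        off += typlen
    return off

bool_col = (1, 'c')    # bool: 1 byte, char-aligned
bigint_col = (8, 'd')  # bigint: 8 bytes, double-aligned

print(data_size([bool_col, bigint_col]))  # 16: 7 padding bytes after bool
print(data_size([bigint_col, bool_col]))  # 9: no padding needed
```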
Improving Indexing Performance
- Btree performance improved in last five versions due to deduplication, eager deletion of tids from tree
- SPGist, GIN, etc don’t have similar optimizations
- Currently only Btree and SPGist are presorted before loading. Presorting for hash is currently being worked on; GIN can be improved significantly
- BRIN indexes could use presorting to do a top-N sort without reading all blocks
- Some work needed to separate the work between index access method interfaces and executor nodes
- The executor is not as fully general and flexible as it seems today -- it’s more in bed with index am than it may appear to a modern reader
- Even old nodes like IndexScan->index am are paired
- Modern examples are things like KNN search
- Ideal for Top-N sort but ….
- Another way to look at it would be “if all we have is a BRIN index, what would we have to do to do a regular ordered index scan?”
- Ideally we would need an incremental tuplesort -- kind of like exposing the old Tapesort algorithm
- Building GIST index-like tree structure on top of BRIN-min-max
- Would this be more appropriate for a more fully general BRIN-like index for non-clustered data distribution
- This would optimize finding the interesting BRIN ranges more efficiently than scanning the whole BRIN index
- Would be especially helpful for the search above
- Hash can do the same optimization as btree for optimistic deletion
- Thumbs up from Peter G
- Why does this need to be copy/pasted into each index am, is there some generic api that could give this for free to any index am?
- Presorted index build for hash indexes
- Being worked on now
- In Btree: `where owner_fk =any/in and created_at < ? order by created_at` with a btree on <owner_fk, created_at>
- Is this skip-scan? Loose index scan?
- Sort of, but you need to be able to merge results from different owner_fk values
- MDAM paper has a framework that can help with this
- One of the assumptions of MDAM is that you don’t have to do duplicate elimination. So it’s equivalent to a btree order
- There’s some duplication here -- this is very similar to MergeAppend -- but it seems impractical to have the planner be able to rewrite this that way
- There’s another similarity with GIN indexes which has a lot of the same logic where it has multiple scans of the same index with skip scans
- The GIN index is really the same as a single btree with optimizations like leading prefix elimination and efficient optimizations to store duplicates
- GIN also only supports lossy index scans
- It would be nice if the GIN index which already does all these skip scan tricks were lifted out so any query could use the same algorithms
- Peter G: It seems the consensus is that this logic should live in btree
- Does this mean it will be hard to make use of for many different types of queries? It’s no harder for the planner to recognize the pattern and use an index am method than it would be for it to construct a plan tree which implements it.
- It’s not easy to cost skip scans using statistics
- Especially when the ANY/IN is some generic subquery
- But with a LIMIT the costing doesn’t need to be very accurate
- But what if the IN list has a large number of values and they all produce no matches? Then the LIMIT doesn’t help
- There is another plan that looks very different
- The plan could take the top-N records from each set and then sort afterwards. This has a better worst case but a much worse best case
- Peter G: this sounds compelling
- This makes me ask again whether these algorithms shouldn’t be in the executor/planner where it can make these decisions based on costs and statistics
- But the AM can make the decision on the fly when it sees the data arriving instead of making the decision early
- Does Santa Claus Exist?
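The `owner_fk = ANY(...) ORDER BY created_at` case above amounts to merging one sorted stream per owner_fk value, much as a MergeAppend would; a toy model where the "index" is a dict of presorted per-owner lists:

```python
# For WHERE owner_fk = ANY(...) ORDER BY created_at LIMIT k with a btree on
# (owner_fk, created_at): descend once per owner_fk value, then merge the
# per-owner streams (each already sorted by created_at) and stop after k rows.
import heapq
from itertools import islice

def top_k(index, owner_fks, k):
    """index: dict owner_fk -> ascending list of created_at values,
    standing in for one btree scan per owner_fk prefix."""
    streams = [index.get(fk, []) for fk in owner_fks]
    return list(islice(heapq.merge(*streams), k))

index = {1: [1, 4, 9], 2: [2, 3, 10]}
print(top_k(index, [1, 2], 4))  # [1, 2, 3, 4]
```

The merge only pulls as many entries from each stream as the LIMIT needs, which is why the worst case (no matches) is the problematic one discussed above.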
Benchmarking Round Table
- Peter: It's hard to keep up with what benchmarks are considered good, which ones are discredited.
- Grant: Need to distinguish between micro and macro benchmarks.
- using pgbench, sysbench, hammerdb
- Mark: working on updating my TPC kits
- GSoC project to update TPC-E kit
- some work on DS in progress
- Mark: PG doesn't do well in TPC-H yet
- All the TPC specs are open. (Don't have to be a member to read them.)
- Peter: LDBC is a good example of an open source benchmark.
- Julian: [benchmarking replication] When there is a deadlock, pgbench just fails. Could be smarter about retrying.
- Mark: TPC specs contain details about how retries are allowed.
- Peter: Some refinement for pgbench in this area could be useful.
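The retry behaviour discussed for pgbench amounts to a loop like this generic sketch; `run_transaction` and `is_retryable` are placeholders, not pgbench internals:

```python
# Retry a transaction on deadlock (or serialization failure) up to a bounded
# number of attempts, re-raising the error once the budget is exhausted or
# the error is not one a retry can fix.
def run_with_retries(run_transaction, is_retryable, max_tries=5):
    for attempt in range(1, max_tries + 1):
        try:
            return run_transaction()
        except Exception as e:
            if not is_retryable(e) or attempt == max_tries:
                raise
```

A benchmark that counts retried transactions separately (as the TPC specs require) would also record `attempt` alongside the result.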