FOSDEM/PGDay 2023 Developer Meeting
FOSDEM 2023 Developer Meeting schedule by Time and Room
|Wed 8:30-9:00||Welcome and Introductions|
|Wed 9:00-9:30||Improving index performance|
|Wed 9:30-10:00||Extensions & SMGR|
|Wed 10:30-11:00||XLog Format|
|Wed 11:00-11:30||Page Format|
|Wed 11:30-12:00||ResourceOwner Patch|
|Wed 12:00-12:30||ICU / Collations|
|Wed 13:00-13:30||Improving wait events|
|Wed 13:30-14:00||Extensions & Stats|
|Wed 14:00-17:00||v16 Patch Triage|
Improving index performance
Matthias has been working on improving index performance and is concerned about the level of interest in B-Tree performance improvements. The idea is to improve how we use the data on the page: with multiple columns we can sometimes skip comparing the first few columns because they're likely to be equal. Would like to know if it's worth continuing on that; responses have come mainly from Peter G but not others, and he wants to gauge interest. nbtree performance improvements, specialization on .. PGConf.EU presentation showed improvements are still possible. Sort on index key ranges with a min/max index instead of a whole-table sort; faster top-N sort with BRIN. Tomas is working on this.
Andres- Are these changes attacking the most common performance issues? B-Tree index performance is more about CPU overhead and less about how data is stored. We build search keys from scratch and the code seems designed for cache misses. There are very basic optimizations we should be doing to improve btree performance; the benefit from how data is stored is constrained by these other issues. We don't keep the block numbers anywhere useful, so we have to get the block number all the time, and that is horrible for performance due to cache misses and cache lines.
Matthias- I see your point, not something I had been looking at when I started working on it.
Heikki- These are orthogonal changes
Andres- Don't find the structural changes as interesting due to the CPU overhead, et al
Matthias- I get that these changes could be done too, and they would complement each other.
Peter E- Initial patches didn't have good performance numbers or clear improvements and so wasn't clear how it was going to help. Selling the patch better would help get interest.
Matthias- Lots of people have SSDs and fast storage and on single keys and in those cases these patches don't really help. Much of the work is making sure that these cases don't degrade while improving the multi-key cases.
Heikki- Are these useful on their own?
Matthias- Yes, the patches are useful on their own. Improvement in multi-key indexes while not degrading the default case. Makes it complicated.
Jeff- What was your motivation to work on this?
Matthias- We had a really large index across three columns at a prior company which was really slow to do lookups on. We couldn't use btree deduplication for $reasons. Was thinking "why is this so slow?" and it was largely because the attribute had to be compared for every column, but we don't need to do that in every case because we know that the first columns are the same. Improvements seen of 10-20% on index insertion and lookup.
Jeff- That sounds pretty compelling, but from the thread it was hard to guess what the improvements in the patch were.
Matthias- A 31-text-column index, which goes into the compare path, plus one uuid case, gave 200-400% improvements because we can skip the earlier columns.
Heikki- You can construct cases which can show the improvement.
Jeff- Constructed cases aren't very compelling but actual use cases which show strong 10-20% improvement are a much better way to sell the patch.
Jeff- Which collation provider was being used?
Matthias- The non-default collation because it's more expensive which helps demonstrate the improvement, but even with the default collation there were improvements.
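The comparison-skipping Matthias describes can be sketched as follows. This is an illustrative reconstruction of the "skip leading attributes known to be equal" idea, not the actual nbtree patch; all names and types are invented for the sketch. The key observation: if the first and last tuples of a sorted page agree with the search key on their first k attributes, every tuple in between agrees too, so per-tuple comparisons can start at attribute k.

```c
#include <assert.h>

#define NATTS 3

typedef struct { int atts[NATTS]; } IndexTuple3;

/* Compare key vs. tuple, starting at attribute 'skip' (0 = compare all). */
static int tuple_cmp_from(const int *key, const IndexTuple3 *tup, int skip)
{
    for (int a = skip; a < NATTS; a++)
    {
        if (key[a] < tup->atts[a]) return -1;
        if (key[a] > tup->atts[a]) return 1;
    }
    return 0;
}

/* How many leading attributes of the key equal the tuple's? */
static int shared_prefix(const int *key, const IndexTuple3 *tup)
{
    int a = 0;
    while (a < NATTS && key[a] == tup->atts[a])
        a++;
    return a;
}

/* Binary search for the first tuple >= key on a sorted page, skipping
 * attributes proven equal on both page fences. */
static int page_binsearch(const int *key, const IndexTuple3 *tups, int ntups)
{
    int k1 = shared_prefix(key, &tups[0]);
    int k2 = shared_prefix(key, &tups[ntups - 1]);
    int skip = k1 < k2 ? k1 : k2;   /* safe: lexicographic sandwiching */
    int lo = 0, hi = ntups;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (tuple_cmp_from(key, &tups[mid], skip) <= 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}

/* Tiny demo: 10 tuples (1,1,0)..(1,1,9); look up (1,1,5).  All tuples
 * share the first two attributes with the key, so only atts[2] is compared. */
static int demo_lookup(void)
{
    IndexTuple3 tups[10];
    int key[NATTS] = {1, 1, 5};

    for (int i = 0; i < 10; i++)
    {
        tups[i].atts[0] = 1;
        tups[i].atts[1] = 1;
        tups[i].atts[2] = i;
    }
    return page_binsearch(key, tups, 10);
}
```

This matches the case described above: a multi-column index where the leading columns are mostly equal, so most of the per-comparison work is redundant.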
Andres & Matthias discussion about better approaches to scanning and constants.
Matthias- In PG14, Peter G committed some changes to btree that try to delete items on the page at split time, to avoid a split. May be able to implement the same for the other index types, which could improve their performance. Don't have time to work on it currently, but someone else could.
Andres- Index insertion path improvements by doing a pre-sort which can help a lot. No reason to not do a pre-sort when doing batch inserts. Not able to do it in every case due to triggers and such but in many cases it could be done and would help performance.
Mark Dilger- What is causing the improvement?
Andres- Just try it and you'll see the improvement.
Heikki- The table will also immediately be sorted and so you don't have to CLUSTER. If index could keep track of recent inserts then it could order them and insert them.
Andres- Like GIN fast insert.
Heikki- Yes, but better. Could be done pretty simply in the index access method..
Andres- Not sure doing it in index access method is best as it could be the same code copied a bunch of times, better to do it higher up..
Heikki- Yeah, better to have a batch insert access method function
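Andres's pre-sort suggestion can be illustrated with a toy model. Counting "page switches" here stands in for the buffer lookups and pins a real index insert path would do; all names and sizes below are invented for the sketch, not PostgreSQL code.

```c
#include <assert.h>
#include <stdlib.h>

#define KEYS_PER_PAGE 4

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* How many times does the target index page change across the insert
 * stream?  Fewer switches = better locality for the insert path. */
static int count_page_switches(const int *keys, int n)
{
    int switches = 0;
    for (int i = 1; i < n; i++)
        if (keys[i] / KEYS_PER_PAGE != keys[i - 1] / KEYS_PER_PAGE)
            switches++;
    return switches;
}

/* Demo: the same batch of keys, unsorted vs. pre-sorted.  Returns how
 * many page switches the pre-sort eliminated. */
static int demo_presort(void)
{
    int keys[8] = {13, 2, 9, 0, 15, 6, 3, 11};
    int before = count_page_switches(keys, 8);   /* every insert jumps */

    qsort(keys, 8, sizeof(int), cmp_int);
    int after = count_page_switches(keys, 8);    /* consecutive inserts
                                                    mostly share a page */
    return before - after;
}
```

As noted above, this can't be done in every case (triggers and such), but for plain batch inserts the sort is pure win for locality.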
Bruce- Folks get frustrated, and sometimes the people working on things may feel like a 'lone ranger'. Is there anything we can do to try to avoid that? Particularly with the more specialized stuff: the farther down you go, the more people's eyes may glaze over. We encourage people to take on these hard problems, but it may seem like nobody cares about them. People really do care; a lot of people are interested, and contributors shouldn't feel that people don't care. Anything we can do to try to improve on that?
Heikki- Some folks have said that working on PG can have an impact on mental health due to frustration of working on PG.
Bruce- Not a topic here but feels like something we could work on improving.
Matthias- The commitfest topics and patch names aren't very descriptive of what expertise is needed to review the patches. Maybe could add "area" such as "indexes" or "access methods" or such to the commitfest system.
Peter E- How are those defined, could we change them?
Magnus- Might be able to be modified by CF Admin or superuser.
Peter E- Always wished for adding categories because lots of patches end up under Misc.
Matthias- I don't mind the topic part but also need the distinction of what part of the code is being modified and what expertise is needed.
Peter E- How would you do it? Maybe tags?
Matthias- Maybe general areas would help.
Andres- Doubt the commitfest app is where the issue is here.
Matthias- Maybe could make it easier to find patches.
Peter E- Have to sell your patches.
Heikki- Problem of contributors getting frustrated and going away. People post patches they feel are brilliant but then don't get any feedback. Or people put in time and effort, don't hear anything for months, and then the result is that we don't want it.
Mark- I'm also working on index improvement but going at it in a different direction, changing heap code and not index code to make improvements and so it isn't clear if there's an overlap there or not.
Extensions & SMGR
Matthias- At my company we would like to be an extension and not a fork, but we bind deeply into the SMGR APIs. How open would PG be to making SMGR available to external users, to avoid having to be a fork?
Peter E- Way back to 6.0 we had this
Matthias- We resurrected that
Peter E- Also looked into this; you just have to come up with a way to do it, but everything today is hard-coded. Maybe like tablespace with a local override or something.
Andres- Has to work at all times including in an inconsistent state because we use it during recovery and therefore can't look at catalogs, et al. Worried about code complexity for core PG where PG is just making things easier for forks or other projects but making it too complicated for core without any benefit.
Heikki- Let's imagine opening up smgr read and smgr write
Peter E- That or md.c functions?
Heikki- Does it make a difference?
Andres- Could use compiler flags to override that using linker magic.
Mark- Change the whole cluster?
Heikki- Yes, across the whole cluster.
Mark- Can't really do that from an extension then because you need to init the DB and load the extension first.
Andres- You would only be able to do this at initdb if for the whole cluster
Heikki- Also use md.c for temporary files and things instead too.
Andres- How are you handling figuring out when to use what?
Heikki- Hack in smgr.c to pass in a flag to indicate the kind of table
Matthias- Passed in the type of relation to then control how we access the files
Heikki- Or we could have it a level above to specify the functions to access
Mark- This is for storage as a service, so you could just swap out the whole storage manager
Matthias- We are modifying the functions that call into the storage manager to call our hooks so we don't use the smgr storage structs
Heikki- We have hacked to pass in what kind of table it is
Jeff- Could we do it at tablespace level instead
Matthias- We don't currently force all temporary tables to a particular place
Heikki- May have an issue with what Andres said about being able to recover before being able to do catalog lookups
Andres- yeah, not sure that tablespace makes sense for this.
Heikki- Other things you could do: maybe encryption or compression, not sure if that would work. Another cool thing would be to do something with backup, fetching the data on demand from something like pgbackrest.
Peter E- Maybe instead of storing on disk, store out in the cloud somewhere. Not easy to play with the changes since there isn't a good API
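A minimal sketch of what an overridable storage-manager API could look like, assuming a function-pointer table similar in spirit to the f_smgr table smgr.c already keeps internally (hard-wired to md.c today). All types, names, and sizes below are simplified stand-ins, not a proposed PostgreSQL API; the in-memory array plays the role of md.c's file-backed storage.

```c
#include <assert.h>
#include <string.h>

typedef struct SmgrOps
{
    void (*smgr_read)(int blocknum, char *buf);
    void (*smgr_write)(int blocknum, const char *buf);
} SmgrOps;

/* Default "md.c-like" implementation backed by a flat in-memory array. */
#define BLCKSZ 16
static char md_store[8][BLCKSZ];

static void md_read(int blk, char *buf)        { memcpy(buf, md_store[blk], BLCKSZ); }
static void md_write(int blk, const char *buf) { memcpy(md_store[blk], buf, BLCKSZ); }

static const SmgrOps md_ops = { md_read, md_write };

/* The active storage manager.  An extension would install its own table
 * here; as discussed above, in core this would have to happen before
 * recovery runs, since smgr must work without catalog access. */
static const SmgrOps *active_smgr = &md_ops;

/* Demo: write a block through the dispatch table and read it back. */
static int demo_smgr(void)
{
    char page[BLCKSZ] = "hello";
    char back[BLCKSZ];

    active_smgr->smgr_write(3, page);
    active_smgr->smgr_read(3, back);
    return strcmp(back, "hello") == 0;
}
```

Swapping `active_smgr` to point at a different table is the whole mechanism; the open question from the discussion is how and when that swap is allowed to happen (initdb-time, per-cluster, per-relation-kind).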
Jeff- What about the non-smgr file access that's going on.
Heikki- All of the SLRUs; we work around that by not doing things with them, and they are still stored locally. Would be cool if they also went through the smgr API.
Mark- How would that work? Multiple nodes connected..
Heikki- When you start up PG, you need a base backup to start from with control file and other little things and the SLRUs. Just restore those as-is. SLRUs are pretty small so that is ok. All of the relations are stored in the cloud and get accessed through the SMGR API. Only one writer to handle the sync issues.
Mark- This isn't multi-master?
Heikki- No, this isn't multi-master. All the table and index accesses go through the SMGR API. There are some places where we had to modify the code, for index builds. One thing causing us trouble is that some writes through SMGR are not WAL-logged: GiST builds the index first and then WAL-logs everything after. smgr write for us is a no-op, and on read we reconstruct from WAL. Maybe add an assertion that all modified pages are actually WAL-logged.
Andres- Hint bits?
Heikki- Think we just throw those changes away now but we could possibly do better.
Andres- Sounds like you need to do the work to propose a patch and then we can see if the complexity is really bad or not.
Matthias- There is not zero external interest in this.
Mark- Sounds interesting and we would be interested in seeing a patch for it.
Heikki- One thing that bothers me about md.c and smgr.c: the way they treat the relation forks is kinda ugly. md.c needs to know about all the forks, and there's an array of segments per fork. Would be better if md.c didn't have to deal with that and instead dealt with one fork at a time; smgr.c could maybe deal with forks.
Andres- That doesn't seem quite right. You would have to somehow group the data together. Looked at that problem at one point and maybe we could have completely different relfilenodes for forks instead and have one fork for main relation and a different relfilenode for the VM would make things simpler.
Peter E- Set of forks are hard-coded in a lot of places.
Heikki- Larger relfilenode we could do that
Matthias- 56-bit relfilenode would allow us to have room for that
Peter E- What about other kinds of forks that other kinds of storage might want to add, maybe as an extension or maybe not, not everything may need the free space map, etc.
Heikki- Working on column-store before would like to have a separate relation fork or something like that for each column.
Jeff- Might be harder to do that but at least having more than one ... there's a difference between allowing a few extra forks vs. allowing potentially hundreds.
Andres- But what about init forks, which are weird magic; may have to come up with something different for that. Init forks in a separate directory, maybe; then you wouldn't have to iterate over everything, and it could work and would allow having initial data.
Peter E- forks are used by different parts of the system such as init forks being hooked into the crash recovery system
Heikki- init fork change would be good to do independently.
Andres- had a patch to allow you to associate a given relation with multiple relfilenodes in pg_class but didn't quite get it all done
Mark- As Andres mentioned, we don't want to only accept things into core that are for forks, but there's a chicken-and-egg problem when it comes to TAMs. It's baked into core that we have heaps: lots of code doesn't talk about tables, it talks about heaps, with lots of assumptions. Mainly interested in the group trying to address that chicken-and-egg issue for TAMs, or seeing what the better way to go about it is. The company worked on zHeap for a while and was excited, but not sure if it's ever coming back; there's a tendency to think of new TAMs as where devs go into the wilderness to pass away. We have developed and released two TAMs to solve specific performance problems in production, and are working on new TAMs with different on-disk formats. Anyone else working on TAMs is likely to want this too. How do we share these improvements in core for these other TAMs? There are some functions that still call the heap functions directly and don't go through the TAM API; if you plug in a different TAM and something accesses the heap functions directly, you end up with bugs. We created the Contrary AM, which intentionally stores everything differently on disk; if you use the Contrary AM, you can find where functions that access the heap directly will explode and break. Happy to contribute that back, but not just that: we could write TAMs which provide better performance but which aren't heap. Should we have a contrib module to make sure that TAMs don't get broken?
Heikki- TAMs should be how everything access, but those are bugs if they access the heap directly and should be fixed.
Mark- Extensions exist out there or forks which access the heap directly and that's where the bugs are.
Andres- For a beginscan could check that what was given is actually a heap and so there should be a catch happening but if we don't have that then we could add that.
Jeff- Could we add preprocessor magic to detect when someone calls the heap functions directly, not through a TAM, and throw an error at compile time?
Andres- Not sure we could do that and that might cause more harm than it would help.
Heikki- What extensions are doing that?
Peter E- pglogical is doing that.
Heikki- Why does it do that?
Peter E- Fixable but could build more scaffolding to avoid new extensions doing that.
Alvaro- How much is there between heap and the Contrary AM?
Mark- When you have a loadable module, you expect it to continue working across a minor upgrade; you don't expect the module to be updated at the same time as the core code. Suppose you copy the heap code, make it compile, and add it as a contrib TAM identical to heap. If you upgrade from 15.0 to 15.1 it should be fine, but you don't know for sure: the community may change the heap code in a way that isn't compatible. Maybe we could have something in contrib which runs in the buildfarm to detect such changes.
Andres- What minor version changes has this happened in? We've only had like three of these recently, just those?
Mark- Yes, just those.
Heikki- Is this just for the Contrary AM, or are there real cases where heap changes in minor versions have caused issues?
Peter E- May only happen once a year or once every three years, but when it does happen it's very traumatizing...
Mark- We have several customers with heap performance problems and they keep asking for fixes: holding open a transaction with lots of updates leads to lots of tuple versions across pages and lots of index bloat. If you build a solution to that problem and give it to the customer, you're very nervous that this new AM won't survive minor version updates. Would like a better guarantee that this won't be an issue in minor version upgrades.
Andres- Don't recall a case in core where we have had an issue with this and sounds like isn't a core issue really but is an issue in extension. Not obvious how we could test for this in core because it's an issue in something external. Maybe something that tests for signatures.
Mark- Does anyone else that wants in on TAMs... want to spend maybe a year working on a new TAM to allow push-down and contribute it to the community? Others interested in predicate push-down stuff?
Jeff- Yes, interested in that.
Andres- Heap could be interested in this too
Heikki- A few things left in core assume there is something like heap. ANALYZE assumes you have blocks. Would be nice if there was a function for that in the TAM API. Was expecting that to be brought up.
Peter E- Not really going into fully different things with this- things are still blocks.
Heikki- bitmap heap also assumes blocks still
Mark- If you take a bit of space to say what kind of page it is, then your scan could skip blocks it isn't interested in. Don't really need an API change for that. Will TABLESAMPLE land on the wrong kind of block, maybe; is that an issue?
Heikki- That seems like it may be an issue, yes.
Andres- Some pretty easy API changes for bitmap heap scans but after that you're going to have to have something that's block shaped.
Heikki- With bitmap heap scan, the problem is that it degrades and becomes lossy and you have to scan all tuples on a particular block.
Matthias- Becomes lossy quite quickly, could imagine TAM that has many many line pointers per page / TIDs, so bitmap heap scans likely to always be lossy which could be pretty annoying.
Andres- Someone working on patch to use radix for vacuum and that could maybe be used for bitmap heap scans, not a small change.
Matthias- Not sure if happy with this part of TAMs, but also a patch to do batch inserts for things like insert into select
Mark- Would be fantastic. Some of the code assumes the number of tuples per page, which doesn't really work because I don't want to store the header over and over again; we end up sorting the data as it comes in to improve performance and improve compression.
Andres- Is that a question of TAM or executor code?
Matthias- Both; right now we only batch insert in COPY, and only with full batches. There is a patch proposed to make batch inserts possible in the TAM and to allow buffering for later insertion; that could be used to reduce WAL size and improve compression performance. It's an old patch that hasn't been updated in quite some time, about a year. "New TAM for multi and single inserts" is the patch.
Jeff- A lot more infrastructure could be provided for conditional push-down around things like parameterization. Ideally a TAM would advertise the columns that are interesting for parameterization, and then the planner could generate and cost those paths. That could be a combinatorial explosion; the planner would need to handle pruning it. Would be useful infrastructure. Can be done with custom plans now, but a lot of TAMs would want this, so it would be good common infrastructure.
Heikki- Does that predicate push-down make sense...
Jeff- Rather than do the predicate push-down, let the TAM return some other structure to the executor and let it handle it there? I think that's a good point, but the horizon for making that fully generic is pretty far out. As an extension author just trying to get something running, you'd have to do that with a custom scan. To do some simple predicate push-down you have to invent a whole bunch of things in a custom scan, and that's a pretty long path when you may have a simple structure which lets you rule out a bunch of uninteresting rows very quickly. Instead, being able to say "parameterize me" and provide that path to the executor would be simpler and faster to have.
Andres- Possible issue with taking out exclusive locks in part of this but we could probably work around that to make it better and possibly provide a substantial speedup.
Jeff- If we rearrange some of this, we may be able to rework how index parameterization is done. Don't have a lot of details there but essentially if you have a TAM with predicate push-down, looks a lot like a nested loop index scan from the point of view of the planner.
Heikki- Could we use the exact same scankey infra for TAM that we do for index scans..?
Andres- Today it's different
Heikki- but you'd want it to be the same
Jeff- Don't have a way to do that today
Heikki- Have a heapscan key today in the TAM but we don't really use it today.
Mark- If you ask for all rows where ID=5, you go to the index to get it, you don't really ask that of the TAM. There's an opportunity there for improvement if we also pass that through the TAM to eliminate things that the TAM has to do.
Jeff- Thinking of the TAM to do less, would have to write a lot of custom scan code but could be done with custom scan node and TAM today, all doable but it should be common infra.
XLog Format
Matthias- Complaint: it's large. We need 44 bytes just to change one page; 24 bytes are the xlog header, and the remaining bytes are used in identifying the actual page, plus related overhead. Don't think we should have that much overhead for changing a page. There are changes we could make to reduce that overhead, if we are willing to make them; a few have been discussed on the lists before. Looking at what changes we actually can make: for instance, transaction IDs are included, but there are few cases where we actually need the transaction ID. Indexes are not aware of transactions and don't use them, so maybe we can eliminate transaction IDs from index updates. The record length currently uses 4 bytes, but many records need only a byte or so to store their length, and we should be able to reuse the rest. The potential problem is that decoding may become fairly expensive.
Peter E- Send a patch!
Matthias- Would have, but ran into issues with the decoding/unpacking of the struct. When a record is split across pages because it's too large or just didn't fit, right now we copy the whole record into a separately allocated buffer, then checksum it, then decode it. For large records we have the overhead of reading the record twice, which is very expensive. Should be able to do decoding and checksumming in one pass; there are many opportunities to figure out that the record doesn't fit, and the checksum at the end is good for validating the record. We don't need to checksum the full record before we start decoding.
Heikki- All comes down to performance; send a patch and benchmark it. More principled question: do we need to store the XID on every change, for $reasons like debugging perhaps?
Alvaro- Used by pg_rewind?
Matthias- Not every xlog record gets generated in backends that have local transaction ID
Andres- today we have generic handling of it but changing that means we need to make sure we copy transaction ID to everywhere that needs it
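For reference, the fixed per-record header under discussion is a 24-byte struct in current PostgreSQL. The sketch below is modeled on XLogRecord in xlogrecord.h, with stdint types standing in for PostgreSQL's typedefs; the remaining ~20 bytes of the 44-byte example come from the block-reference headers that identify the page being changed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct XLogRecordSketch
{
    uint32_t xl_tot_len;   /* total record length, the 4-byte field that
                              could often be a varint instead */
    uint32_t xl_xid;       /* xact id; rarely meaningful for index records */
    uint64_t xl_prev;      /* pointer to the previous record */
    uint8_t  xl_info;      /* flag bits */
    uint8_t  xl_rmid;      /* resource manager id */
    /* 2 bytes of padding here */
    uint32_t xl_crc;       /* CRC of the whole record */
} XLogRecordSketch;
```

Dropping xl_xid for index records and shrinking xl_tot_len, as proposed above, attacks the two fields with the clearest slack in this layout.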
Heikki- Would it be ok to compress the WAL itself? Add some kind of field to the page header for the relfilenode so that it is only stored once instead of every time, or maybe reference the previous WAL record, to avoid having to save the relfilenode over and over.
Andres- Locking for that would be awful and adding intra-record stuff would be bad. Maybe reference the page header for multiple records within a page and could do that opportunistically maybe
Matthias- We build the record before we know where the record is going to go
Andres- already able to modify the checksum and we could change where it goes
Heikki- Changing the size, though, is very different from recalculating the CRC, because it impacts the records after it. Comes down to the xlog insert path, as that has to be highly concurrent; assuming we could make it work, we could do it. Kind of wasteful that we store the CRC, et al, on every record but then flush the records in a larger buffer; maybe we should have a frame or page that contains multiple records.
Peter E- but then you still have to go back, concerned about the size or..?
Heikki- People concerned about size yes, though mostly about FPIs, but still
Peter E- Decoding speed vs insert speed and size
Andres- Variable-width record length, maybe... pg_waldump --stat was pretty reasonable on a big workload with variable record length; could optimize that further. If the data can be organized to make decoding cheap then it could help, but we have low-hanging fruit like storing 4-byte integers where 3 bytes could be used, and just generally reducing alignment losses, et al. WAL was more than 60% zeros or something like this.
Matthias- there are some alignment places where there are zeros..
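The variable-width record length idea can be sketched with a standard varint encoding: 7 data bits per byte, high bit meaning "more bytes follow". This is an illustration of the size/decoding-cost trade-off being discussed, not a proposed WAL format.

```c
#include <assert.h>
#include <stdint.h>

/* Encode v into out; returns the number of bytes used (1-5 for uint32). */
static int varint_encode(uint32_t v, uint8_t *out)
{
    int n = 0;
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        out[n++] = b | (v ? 0x80 : 0);   /* set high bit if more follows */
    } while (v);
    return n;
}

/* Decode a varint from in; *len receives the number of bytes consumed. */
static uint32_t varint_decode(const uint8_t *in, int *len)
{
    uint32_t v = 0;
    int shift = 0, n = 0;
    uint8_t b;
    do {
        b = in[n++];
        v |= (uint32_t)(b & 0x7f) << shift;
        shift += 7;
    } while (b & 0x80);
    *len = n;
    return v;
}

/* Demo: a typical small record length (44 bytes) round-trips in 1 byte
 * instead of the fixed 4 the current header spends on xl_tot_len. */
static int demo_varint(void)
{
    uint8_t buf[5];
    int enc = varint_encode(44, buf);
    int dec;
    uint32_t v = varint_decode(buf, &dec);

    return (v == 44 && enc == 1 && dec == 1);
}
```

The decode loop's data-dependent branching is exactly the "decoding may become fairly expensive" concern raised above.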
Heikki- We could possibly just compress the WAL when we write it.
Matthias- Compress record or stream?
Heikki- Compress the whole stream.
Andres- Constantly end up flushing the last page multiple times which is pretty bad for OLTP workloads
Heikki- Have to have an append-only compression algorithm
Peter E- If you do that then you can only compress the flush size and that may not make sense
Heikki- Not as good as compressing the file afterwards but still could be better
Alvaro- Proposing using compression to avoid improving the base layer of how we write WAL?
Heikki- Can be an efficient way to address it
Andres- One thing this reminded me of, for AIO: with network storage and slow storage, partially filled pages can't be overwritten concurrently, and that becomes an issue. You can have multiple IOs for different pages in flight concurrently, but not for the last page in WAL. Got performance improvements by constantly using new pages to parallelize it, but that blows the WAL up really big, so it isn't good, though it does improve performance. Don't really see how to improve on the way partial pages happen. Reduce the block size to 4k instead of 8k: 8k doesn't give us any benefits and makes partial pages much more common, and filesystems typically have 512-byte or 4k granularity, so with 8k blocks you just add overhead and read/modify/write cycles.
Heikki- Would be great to get rid of the WAL header entirely..
Peter E- Only drawback is that you have more records split
Matthias- Still need the segment header but that's ok
Heikki- Think you're replying to me and just talking about the size change and yeah
Andres- if you start to write WAL that's not page-aligned then performance suffers really badly.
Heikki- Makes the decoding and encoding of WAL more complicated as you have to keep track of those headers
Andres- You can't read as randomly from the WAL because you don't know where things start and end, but wouldn't know that without the header. Two-phase commit stuff ...
Heikki- Reduce the page size, and just flush half a page rather than a full page, and get the same benefit?
Tomas- You will still modify the page multiple times and dirtied the page multiple times
Heikki- Keep xlog size at 8k but at xlog flush then flush at 4k
Peter E- Then you introduce a different idea of what a page is
Andres- Don't see a real advantage to having 8k page size; doesn't seem advantageous.
Heikki- You're saying it's useful to have 8k page size WAL?
Andres- No, I don't see it being useful at all
Heikki- As far as I'm concerned maybe we should have 32k or 16MB page size instead of having it be smaller to reduce page header overhead
Andres- Able to determine end of WAL more easily
Matthias- When we recover to a broken record, usually we expect it to be because of a split page write, and that's where we know the WAL ends. Kind of important because, as was mentioned with 56-bit relfilenodes, we might write WAL all the way back to a prior flush
Andres- Would be pretty bad to have to reflush a 1GB segment
Matthias- Right now, in the page header, the wrong LSN will be seen
Heikki- If it's not a recycled earlier segment but is a new segment...
Page Format
Stephen- Comments about a larger field for the AES-GCM auth tag.
Matthias- Right now we have the normal page format, which is used and available for various AMs. There are ideas about TDE which want to change the format to reserve some space for an auth tag or other things like extended checksums. I think that space should not be at the exact end of the page while we are in memory; we don't care about it on disk, but in memory that is prime real estate.
Discussion about page changes to store extended checksums or 64bit XID or auth tag.
Tomas- 4k pages can greatly improve performance in OLTP workloads
Andres- Case where low-cache hit ratio, shared buffers isn't enough to fit things
Tomas- yes, in cases where shared buffers isn't large enough
Andres- Might be more a factor of line pointers distance to the tuples
Matthias- Shouldn't be an issue if huge pages are being used
Andres- If huge pages used then yes but with prefetching there's heuristics that may not work if it's too far
Tomas- When it fits into shared buffers it doesn't seem to make meaningful difference.
Andres- That makes sense and is mainly write volume and not specific to SSDs really
Tomas- SSDs have erase blocks but they're split into pages but the faster you write the faster you write into erase blocks, generate more work. Definitely it's a combination of multiple parameters.
Andres- Dirty write-back with SSDs where they go back and write out blocks in a row to avoid having to make the system read in a page and then write it out can make a huge difference.
Peter E- How can we get more people to use checksums?
Andres- Right now there's a big performance hit from using checksums in some cases because WAL logging of hint bits causing a performance hit.
Matthias- Idea to reduce hint bit writes in the WAL from changes with checksums enabled. There is no split page potential? Even if there is a torn page, you still redo the record.
Heikki- Partial full-page?
Andres- Typically is going to be a full-page change anyway so doesn't really help
Matthias- For freezing which is a common case, changing a lot of things on the page from visible to frozen, just changes bits, the meaningful bits aren't being changed
Heikki- Point is when you write the WAL record, you'd have to say modify this bit or that bit
Matthias- organization of the page doesn't change
Matt- Who is "we"?
Matthias- The information on the page: while freezing we aren't changing the data of the page, so whatever torn bytes there are won't change the bytes that are meaningful. The reason for torn-page protection is that the line pointer array may change and the tuples may be getting changed, but if we don't change where the tuples are then a torn page shouldn't be an issue
Andres- not sure that that's true, maybe for hint bits but not for freezing
Matthias- even for freezing, it's only sets of bits that are being updated, fairly certain that we can improve on this. Think there are places where maybe we don't need to push out a full page image.
Andres- Don't think it's enough, still need to do a full page write or need a second LSN
Matthias- We do have space in the page header bits, only have a couple of bits used
Andres- As soon as you have checksums this is all gone
Heikki- This is for checksums too, you do a partial full-page write that's not as large, but seems hard to pull off
Matthias- Yes, extremely difficult but not impossible.
Heikki- What if we stop writing those hint bits?
Andres- the performance hit is very large if you just don't write them at all
Peter E- Vacuum-ish kind of process that does it sometimes still
Andres- the SLRU lookups kill you immediately in terms of performance
Jeff- Maybe a tiny cache that would help?
Andres- Maybe could win a lot with a tiny hash table cache. Could cache xid to parent to win a lot to help with subtrans too.
Heikki- Have a better SLRU system and cache but isn't going to be as good as hint bits on the tuple
Bruce- Have a scratch space for a table, always have an extra dead page in the table, instead of writing page 10, you write it into the dead space and have that space in the table. The reason we have trouble is because we can't go back to the old version of the table.
Andres- Then have really hard problems with possibly returning double tuples
Heikki- So whenever you write the page to disk you write to double-buffer area and it's an alternative which has downsides but is possible
Andres- No need to log hint bit changes immediately, so maybe we could batch them and reduce the xlog overhead of WAL-logged hint bits.
ResourceOwner Patch
Heikki- Have a patch to change the way ResourceOwners work internally and make them usable from extensions. Not using them currently in any extensions, but they'd be there for others to use if useful. Patch made some changes to how ResourceOwners work. Objection from Andres- with Heikki's patch, resources are released in random order. May have made an exception for locks.
Heikki- pgcrypto wants to track some things in ResourceOwners, when wrote that code was very painful because couldn't use them. There's callbacks but they're really difficult to use. Hard for an extension to leverage ResourceOwners from extension. In core we have 10-15 uses for ResourceOwners and there's a lot of boiler-plate code that could be eliminated.
Andres- Performance regression due to ResourceOwner getting bigger which isn't good
Heikki- Hard part of the patch was to keep performance good, because ResourceOwners are in the critical path. Objection: in the current code resources are released in a specific order, but with the patch they get released in arbitrary order.
Alvaro- Why do you care?
Andres- There may be some dependencies in there and error handling needs to mark the page and has to happen before un-pinning the page during io.
Alvaro- Can we create more phases
Heikki- We were discussing that; maybe having a priority number or such. Not convinced that should be necessary. Would like to take a look at where that's being done. Second objection: if you need to remember a resource in a critical section, you have to first call an enlarge function to reserve a slot for it. That mechanism is currently per resource kind. Changed it so that there is just one ResourceOwner enlarge call instead of one per resource kind. Difference is that if you want to reserve one buffer pin and one tuple descriptor and then enter a critical section, that doesn't work, because with the patch you can only reserve one slot. Argument is that the distance between reserving the slot and using it should be kept very small anyway, because it's already dangerous: other calls in between might use the slot by accident. New patch just reserves one slot instead of one slot per resource kind.
Peter E- You could just allow reserving more than one
Heikki- Yes, if you know how many you might need then you could do that. If there's any serious code between the reservation and using the slot, it's very hard to be sure.
Andres- Right now we always increase by power of 2 and that's part of the reason it's hard to find off-by-one errors. Maybe change to have a counter/check to make sure that you aren't going past how many.
Heikki- Maybe have a way to return what was reserved and then have the use of that pass in the value of which was reserved and throw an error if that's an issue.
Andres- There should be only a few places that need more than one.
Peter E- Seems solvable.
Heikki- Third, for some kinds of resources you could do it differently: instead of an array and hash, keep track of resources in a linked list. Some resources have structs you could embed a list node in, and that could be faster.
Andres- That could also make it safer. Ran into this for AIO for WAL insert and every AIO for WAL insert had to be reserved and is in critical section and you can't allocate memory there. Not applicable to all kinds of resources but does work for some.
Heikki- Maybe linked list approach could be used for basically all of these cases instead since very few cases where there aren't structs. Maybe everything could use structs?
Andres- Convert most things to list then maybe could be better. Patch by Rowley to get rid of all special cases by using dlist(?).
Heikki- For buffer pins could we have a local buffer struct
Andres- Have to allocate it. Could do something like existing resource for hints and just use dlists or lists for everything else.
Heikki- Yeah, maybe I'll try that approach.
Andres- Maybe combine dlist approach with patch approach by storing allocation in dlist head so that can store header, dlist head, inside resourceowner unless it's needed and then one or two things that are needed, might be best of both worlds.
Heikki- I'll play around with that if I get a chance.
ICU / Collations
Jeff- Issues with collations. Right city to discuss it in. One thing is PG is pretty unified in the direction users are guided in, in terms of the way things should be done. Integer timestamps are better than float, et al. Should we be doing the same thing with ICU vs libc? Should we make a decision there? Are we not going to express an opinion? Of course, even if we try to not make a decision, leaving the default as-is or changing the default is itself a decision. Would we eventually like to pick one way and go with it, or stay on the fence?
Peter E- Would like to move towards making ICU be used as the default.
Andres- Hard dependency?
Peter E- That's the problem. You can change the initdb default but it'll fail if ICU isn't compiled in. What do you do then?
Tomas- What about use ICU by default if it's built-in?
Peter E- Is that a good answer? Already is environment-dependent and so maybe it wouldn't be that different. Might be better as it would be a better default instead of getting it from environment.
Heikki- Locale itself still depends on where it's running
Peter E- Want to get rid of locales but that's kind of independent.
Andres- Just make it a hard dependency?
Alvaro- Are there other collation providers? Microsoft?
Andres- Microsoft has ICU available but not the default for things
Peter E- Collation provider concept, back of mind- there's a native API on MacOS which could have been another choice but there's no practical benefit. Doesn't seem like there's actually a bunch of different APIs, just the legacy one and the ICU one and not really interest in other.
Dave- What about platforms which don't have ICU? Are there such?
Andres- Don't think there really are any such. Built PG on a bunch of platforms last year and pretty much all have ICU for a long time. May not be available by default on some systems but it's available. No extra dependencies on MacOS currently to build and some appreciate that.
Peter E- When on new platform sometimes it's nice to be able to git clone and build and avoid ICU because ICU is big to download and build.
Heikki- Agreed that isn't great, maybe have an option to not have any collations in that case?
Peter E- Question is if we want to nudge users to use that stuff.
Andres- Switch the default to use if available which would at least allow devs to not have to worry about it but generally it gets used.
Alvaro- Is this something we could do for 16?
Peter E- What we are talking about right now as it's an easy thing to just change the default.
Jeff- If we feel ICU is the right thing, we've been using it for a while and we have found some issues with it and it isn't perfect but generally my feeling is that it's a better path than libc and if the project feels that way then we should start nudging people in that direction.
Joe- Not all of PG locale functionality is handled by ICU. lowercase/uppercase, C-type operations...
Jeff- lowercase/uppercase do use ICU, but there are some scattered cases of other things being used; the strxfrm call when making histograms as an example, maybe.
Peter E- Question about how to instrument these things to catch such cases would be good to figure out. tsearch uses it, on list to fix but isn't very interesting.
Jeff- Those scattered places ... there are details we should work to figure out and address those cases
Heikki- Even if we don't change the default, we should fix those cases anyway. Is there a reason to not use ICU?
Matthias- because you're only using the C locale?
Peter E- Then just say to use that?
Matthias- But you could build a smaller binary with having just the C lib
Heikki- Is there a performance reason to use ICU over the C lib?
Jeff- In my tests it's been better
Joe- There was a regression introduced by the glibc maintainers where they made a change saying it wouldn't cause a performance issue, but it actually does for multi-byte. ICU is faster if you have a lot of UTF8 multi-byte characters vs. libc. Big regression in recent versions of glibc.
Jeff- If they fix that problem then in theory libc could be faster, but if they don't fix that then ICU blows away glibc.
Joe- Scattered calls to glibc locale dependent functions in PG core that aren't going to ICU, concern about switching to ICU due to that
Jeff- In terms of actually what the user sees, should all be handled correctly. The cases pointed out shouldn't be user-facing. A lot of those cases are with libc collation provider and not with ICU, though there were some calls that may need to be looked at. If there are user-facing issues then that's a bug that should be addressed. Assuming we can address the bugs...
Matt- Independent of performance, a libc upgrade that changes sort order breaks indexes, etc. ICU would make it easier to detect/address that?
Jeff- No way to change from one collation to another today and so have to keep same ICU version. But, there are potential advantages to using ICU because it's a separated library that you could manage the versioning of instead of being tied to libc.
Joe- Isn't just indexes. If you have FDWs and are running on machines with different versions of glibc, then you'll have problems in that case too. Recent case of the mysql FDW where a join wasn't working because the collation for mysql was different.
Matt- Haven't seen a strong advantage to ICU vs glibc because that's the same problem between the two.
Andres- Not really a nice way to load multiple versions of libc, but you could do that more easily with ICU.
Peter E- Or you could just keep the same ICU version around generally instead of having to upgrade it, like you have to upgrade glibc due to $reasons. We could move things forward at least, not a panacea.
Joe- Did a project where extracted out of glibc the locale code into a separate library to use and freeze the collation at a particular collation that way. Link PG to that library instead of the actual glibc library.
Tomas- Wouldn't want to get stuck on one collation, as there are improvements that happen. If there's a new glibc version, how difficult would it be to update that?
Joe- Was able to test it extracting 2.17 and 2.26 and the way extracted was able to work for both. Could be extended to build a different version if needed.
Tomas- How would that work? We would decide when building a major version?
Joe- Think it would be something that the packagers would have to handle. Same issue with ICU.
Jeff- Have working code to allow change ICU library at runtime so users could change to a new version of ICU. Could help users prevent issues with the library changing out from under them. Based on prototypes that Tomas provided earlier. Might not go into 16 but the code works. Also have prototype code which allows doing something similar for libc. Packagers could build against later version and then users would be able to choose version at initdb time. Packagers would then package up multiple versions and make them available concurrently and keep them all forever and users could then choose the one for them and keep it static.
Tomas- One of the problems we have is people upgrading OS where new server has new glibc and they don't realize their indexes have gotten broken. This would be a solution to that by installing the old compat library.
Joe- Have to do that before they do anything.
Tomas- Is there a way to track the version and on server start check the version and refuse to start if needed.
Joe- PG15 may emit a warning?
Jeff- That's a different thing, collversion, but that's different from the collation library version. Simplest proposal: at initdb time you could pass a flag saying which is needed, and then have that tracked, doing initialization and setting up the collation from that provider.
... further discussion over lunch including about loading multiple ICU versions concurrently, tracking collation version in the catalog, allowing to build new indexes concurrently with existing, et al
Improving Wait Events
Bertrand- Add more details to wait events; for example, buffer content waits could have the relfilenode and other information included. Won't be the same details for each wait event, so we have different data we want to add and a different number of details. For buffer content we might have 3 additional details; for a checkpoint we might have 2 extra items. Store this in the session dynamically and then be able to return the data from pg_stat_activity or otherwise. Issue with consistency: currently we have wait event and wait event type in an int32 and that is always consistent, but if additional information is included then not sure how to keep it consistent.
Andres- Issue with overhead too. Wait events were added because they're cheap, but adding this other info adds a lot of additional overhead in certain code paths.
Bertrand- Yes, have to consider that.
Andres- Still have to store all the source data with each different wait event. Not sure how to do that without making it much more expensive.
Bertrand- Have to see how to measure it; maybe able to make it lossy to address that cost.
Peter E- How would you expose it beyond just pg_stat_activity? The more detail you add, including whatever extensions may want to include, the more you have to consider how to display it. How are people supposed to use it?
Andres- Depends on the details. On content lock on a btree page is very different from contention on content lock for heap page.
Alvaro- Is a wait event the best way to store this information? Maybe could write to a ring buffer and read that from the application side to build a history of what's been going on, instead of having to poll that information.
Peter E- Could still be the wait event API and just store it in a different place
Andres- We use wait events in critical sections where you can't allocate things and can't do anything serious because heavy locks are being held. Making wait events much more expensive isn't going to be acceptable.
Peter E- Is consistency really required in all these cases, maybe we don't need to have it be completely consistent? Maybe just write 4 int32 fields without locking around them all at once and maybe it's fine to do that independently.
Andres- Have to be careful to not make it vastly more expensive. When you read you need to know if what you read is actually valid or not. You can't read it without knowing if it's actually reasonable or not.
Bertrand- Over 10s you have a bunch of different wait events; if some aren't valid maybe that's ok, but you need to know whether a given one is valid.
Peter E- Would be really frustrating if you get the wrong data and take action based on incorrect data.
Bertrand- Is it worth it to spend time on this?
Peter E- Could do some simple tests by making value 8 bytes (maybe larger) and putting a spinlock around it and see how expensive it is.
Andres- 8 bytes probably fine
Stephen- 8 bytes not enough though to track this
Andres- 8 bytes atomic on nearly all platforms these days, isn't on like armv7 but probably not a big deal there
Peter E- Is 8 bytes enough?
Bertrand- Depends on the wait event
Peter E- Maybe say "here are the things we're thinking about adding: this, this and this", and if it fits in 8 bytes then maybe ok, but if it doesn't then we may need another idea.
Andres- Suspect overhead is going to be too high and will need a completely different mechanism because collecting all that data for all wait events and most of the time it's not going to be needed and is just expensive.
Stephen- Maybe pull together other information at the same time when polling rather than putting it all in the wait event
Andres- Yeah, maybe store the buffer ID; then there's a good chance you'll be able to figure out what it is without having to store everything into the wait event.
Bertrand- Sticking to 8-byte only might be ok
Andres- store the buffer ID and then get the rest from shared buffers and doesn't introduce a lot of overhead by default. Some of this is maybe solved by dtrace instead possibly.
Bertrand- Most of the time when you have a wait event you have to guess; you don't know what is actually happening. Would like to know if it's always the same relation and see what's going on in aggregate. If the database is waiting on something but you don't know what it's waiting on, that's not as helpful.
Andres- Maybe infer from other information and not try to have everything answered by data provided through wait event
Alvaro- Would be good to see what the system is doing in a more granular way, but wait events aren't the best way to get at that; maybe there should be a completely different approach which could be turned on/off for a specific operation without causing too much overhead. Don't know if things like a branch testing a flag would be too expensive. Storage of the performance data is going to have to be something completely separate from wait events; idea of using a ring buffer to store that data instead.
Andres- In that case you have to be even more careful about storing the data, because you're constantly writing the cache lines where that ring buffer data is stored, and it ends up being a significant amount of system memory.
Peter E- Probes need to be put at a different level
Andres- If you want that granularity, that's what wait events are for; that's why wait events are there.
Alvaro- Maybe where dtrace points are
Mark- Could perhaps clear things out every so often
Andres- If data-dependent branches are added to collect these numbers, the overhead is going to be way higher. With the ring buffer idea, it would have to be a very small ring buffer to avoid too much overhead, and you'd have to poll very, very often, and that all ends up as a lot of overhead.
Mark- Don't have a specific design but if you overwrite regularly a particular place
Andres- but then you have a memory barrier and you get a stall if you don't have that data in L1. As soon as you do any reads it gets much more expensive and you need to do reads in the data collection path.
Extensions & Stats
Bertrand- Folks are working on this and want to make sure the idea is generally supported. Idea is to allow extensions to add stats into the system.
Andres- Add infra to add stats at runtime. pg_stat_statements has its own storage but if we added the last bit of extensibility to the shared memory stats system then maybe wouldn't be needed.
Bertrand- If you want to reset stats then maybe have it in a different file..
Peter E- Storage happens just at shutdown
Andres- When you store it on disk need to keep track of what extension added the stats. With different files maybe you have an easier time detecting which file goes with which extensions stats.
Peter E- Same issue as always, extensions have to register themselves somehow
Andres- Maybe just the extension name would be fine; just write to disk on shutdown, maybe a bit more space but probably not an issue really. One thing I'd like to change in the stats system would be to make it crash safe, because on crash we run into problems where vacuum doesn't do anything initially because the stats were lost. Maybe store the stats at the redo LSN on checkpoint; they might be slightly dated but would otherwise be correct.
Heikki- Is there a case where having old stats is worse than new stats..?
Matthias- What about truncation or such?
Andres- That should create a new relfilenode and that should be ok
Heikki- what if stats say it doesn't need vacuum but changes since last time make it so that it does need to
Andres- Today that problem already exists, this would make things at least better a bit. Could change most of the stats to include relfilenode and then use that when doing replay maybe and should solve truncate problem too. Not sure if there is semantic issue with that.
Vik- If autovacuum sees 4 zeros then maybe it should select that table for analyze
Andres- on stat reset then autovacuum goes crazy and that could cause an issue.
Vik- Also issue with failover where we don't have stats
Andres- If we add relfilenode to stats key then that would help with failover too and you could count the number of inserts and updates and such and keep that and that would be better than zero. May also be able to serialize stats into xlog during checkpoint maybe and that works maybe for inserts and updates but not for selects because those are actually different on the replicas vs. the primary.
Peter E- Maybe the standby only replays the things that it trusts or which it should, but that could be messy.
Alvaro- What about WAL size increase
Andres- We used to write out the stats a whole bunch and it wasn't an issue really. Have seen cases with really huge stats but that was a very exceptional case. We could do something like a summary at commit time maybe, and that might make reconciliation in memory easier.
Tomas- Only write out the things that changed since the last time...
Andres- Could add a change counter or such to the in-memory so that we could know when we need to pass things along
Heikki- Maybe add columns to pg_class to track
Andres- Update so frequently that it could be a problem. Have to have some kind of per-database background worker of some kind perhaps.
Tomas- When we store last vacuumed / last analyzed, maybe store at that moment..
Andres- but at that point it's probably not useful, wouldn't end up triggering another vacuum because it was just run. Logging to WAL at checkpoint time is easy but if you are doing catalog updates then you have to connect to each database to update those catalogs if it's in pg_class, etc.
Tomas- If we had columns in pg_class, what use-case would that solve that regular logging of stats wouldn't? If we log in WAL stats, would that give you everything you need. Benefit of pg_class would be that you wouldn't lose stats because they're WAL'd.
Heikki- You could do it differently
Tomas- Losing stats is a pretty common issue. Two different approaches to the problem- WAL log stats directly vs. updating pg_class. Seems like storing stats in WAL would be better. Once in a while flush modified stats to WAL.
Andres- Just changing the key to the relfilenode likely would help but wouldn't deal with insert/abort but could handle that by logging more info during abort. With relfilenode we could track enough to get close enough value.
Heikki- We're talking about 3 approaches. 1) Never WAL-log stats, but instead calculate stats based on the WAL data seen; issue: stats on primary vs. replica could drift apart. Basically dead reckoning; might get far off.
Tomas- How would this work? pg_stats has a lot of other stuff and so you wouldn't have background writer info or checkpointer
Andres- Why do you want that from the primary when you're on the replica?
Tomas- Just spit-balling, maybe there's other types of stats that do make sense
Heikki- Number of stats that you'd want separate on the replica from the primary like seq scans, index scans, etc. Second proposal, instead of dead reckoning, you dump the whole stats file to the WAL on a regular basis and that could be very large which seems like an issue.
Andres- The size doesn't seem that bad
Stephen- Maybe store into new place on replica and pull into place on promotion
Andres- Requires handling of the stats differently quite a bit possibly
Heikki- 3) Put stats in pg_class directly, not everything but important things
Andres- That seems like it would be very hard
Peter E- Saying we only care about these stats for these specific reasons, but not other stats, which isn't great because people care about the other information. Maybe is ok but maybe not.
Heikki- With any of these schemes, it seems like we would want to separate these
Andres- Think we agree that trying to keep some stats after crash would be good
Peter E- We can't hard-code too much stuff if we want to keep stats system extensible
Andres- Extending stats comes with a bunch of things to be added but shouldn't be too hard to keep extensible even with these ideas.
Container sets (arrays, row types, etc)
Vik- Range types: we have a couple by default, but otherwise you have to create your own range type, and other types get created each time. How can we have multi-set values without having to create new types to do it? What would it take to get multi-sets?
Peter E- Create them on the fly
Heikki- Do the same thing as row types, there's permanent row types and also dynamic row types
Peter E- Want to create a multi-set field with integer and maybe create that type on the fly.
Vik- Yes, but just in queries it would be nice to create multi-set without having to create a whole new type
Andres- We don't necessarily need a different pg_type if we can put some encoded into typenum maybe
Vik- What about nesting?
Heikki- If you think of them as records, think it works
Peter E- arrays of multi-sets?
Andres- Is nesting necessary?
Vik- Multi-sets useful, maybe not nesting but maybe, multi-set of arrays
Andres- If you can get a lot of things without implementing a lot of crazy stuff then maybe it could be done.
Peter E- Is it really a problem to create types?
Andres- Could end up bloating pg_type a lot
Peter E- Where is bloat coming from?
Andres- Row types for tables in pg_class end up adding up a lot
Alvaro- Also need to consider pg_depend, pg_shdepend for owner
Peter E- Bloat in pg_type ends up coming from every table having an entry; maybe create a new base type.. Creating 5 is maybe not that bad since we're creating 2 already. Maybe we just say we don't support ranges on table row types.
Heikki- Multi-set of record would make sense
Vik- Yeah. Main issue is knowing about these things on the fly and not necessarily having to put something into pg_type
Andres- Maybe copy approach from records, might not be too hard except for nesting case.
v16 Patch Triage
session variables, LET command -- Tomas- Will be talking to Pavel about this patch. Did a review of it, planning to commit it, biggest question is if it's really a useful feature or not. I think it is. Patch in pretty good shape. Joe- Like the feature but not sure why it has taken so long. Alvaro- Has gone through several rewrites. Jeff- Risk of running afoul of SQL standard? Tomas- Don't think there really is. Heikki- Are there concerns about what happens if it changes in the middle of a query or..? Tomas- Having session variables that's accessible instead of GUC. It's not transactional. Vik- Why not just use a table? Heikki- Seems like a temp table with only one row? Stephen- Issues with temp tables being constantly created/dropped can't use on standby, etc. Peter E- May look into the standard and see. Tomas- Maybe not good to get into details on the patch right now. Heikki- Looking at patch now.
Remove self join on a unique column -- Tomas- patch seems correct but is hard to convince myself that people actually write joins like this. Stephen- Because of ORMs. Tomas- Doesn't add much overhead.
Avoid hiding shared filesets in pg_ls_tmpdir (pg_ls_* functions for showing metadata ...) -- Alvaro- Need to do something here but not sure if this is the thing to do. Andres- doesn't show directories but parallel operation have directories and this wasn't updated and so semantics are not entirely clear.
Make message at end-of-recovery less scary -- Andres- Idea of patch is quite useful, needs a good bit of polish based on last review. Not sure if that's changed more recently. Vik- Not just wording? Andres- No, 300-line patch and some of that is tests but is more than just wording. Currently we hide errors for example in some places and you get a useless message at the end and to fix that there are structural changes needed. Heikki- seems pretty narrow, if WAL recovery ends due to invalid length but it could end for a variety of reasons depending on if WAL recycled or not. Andres- On primary shouldn't get that and almost always zero out the page. On the standby we should but we don't zero out the pages and that is causing bugs and we should start doing that. Wrote a patch for that but some details are really hard to get right there.
More scalable multixacts buffers and locking -- Andres- Not sure if there is agreement that this is a good idea, because people want to move SLRUs into shared buffers and then this idea wouldn't make sense. Matthias- When is that going to happen? Is a band-aid but could help. Andres- But is a band-aid that could cause really weird performance impacts. Needs a lot of work to figure out the access patterns and such. If it was just a config without massive downsides then it would be ok, but it isn't that.
pg_dump - read data for some options from external file -- Peter E- Don't personally like it but if someone wants to commit it then it should be fine. Stephen- Dislike having a whole new file format but whatever.
CREATE INDEX CONCURRENTLY on partitioned table -- Matthias- Just like normal CONCURRENTLY but on a partitioned table. Heikki- Great feature if we can have it. Are there concerns? Andres- Not sure how the code can be correct, but maybe missing something. Opens a memory context and then calls existing concurrent code and expects snapshots to work across that but that can't really work so don't see how it could be correct...
Function to log backtrace of postgres processes -- Peter E- Not sure if that's useful? Andres- Wished for this many times. Disagreements on list with this currently. Peter E- Maybe patch tries to do too much? Heikki- Every background worker has to be modified. Peter E- Probably isn't great that it requires that and maybe that's part of the issue. Andres- Does that to make it safe to use in signal handlers, but it can't actually do it safely. Whole reason it's done as a shared preload library, but that's not guaranteed to work because of how ELF works. Heikki- Any way to do it safely? Peter E- If you want to call it from a signal handler because you're stuck somewhere.. Andres- Does it really need to be called from a signal handler? If we use latch waits in more places and use that approach instead of trying to do it from a signal handler, then it may work. Heikki- Tom commented that surely this is unsafe to do from a signal handler. Stephen- Seems like the general feeling is that this should be RWF as needing to be redone to not do this in a signal handler.
pg_stat_statements and "IN" conditions -- Tomas- About normalization of the strings, variable number of values in the IN list instead of generating each entry it would normalize into smaller number. Feature seems useful where big IN list completely swamps the system. Andres- Adds a GUC? Seems unnecessary. Tomas- Can imagine cases where different numbers generate different plans, can understand why a GUC. In that case we are not differentiating between different types of queries. Peter E- The code looks very straight-forward here. Tomas- Anyone think we shouldn't have the feature or maybe we don't need to even have the GUC? Vik- Feature seems useful and we should just always enable it. Peter E- Discussion of query jumbling or if we need a switch and this might be something where we may want to have control over. Tomas- We should make the same decision between this patch and query jumbling. If it's hidden behind a GUC or internal function that says jumble one way or another.. Peter E- May be a good release to try putting this into when we're breaking things already and see what happens, if you break it, break it big. Maybe we should just do it. Andres- Anyone know why this adds a new field to struct location len for merge? Tomas- wants to track the original location to the unjumbled. Stephen- Seems like folks are generally in favor of this, maybe even without having a GUC.
Fix pg_rewind race condition just after promotion -- Heikki- Completely forgot about this and am looking back through it. Just haven't gotten around to actually committing. Heikki will commit (haha, but probably).
Faster pglz compression -- Tomas- Looks ready, plan to commit it. Difficult to understand but a good improvement. Heikki- Not sure about why to bother but don't see a downside. Tomas- People still do use pglz a lot, so.
Parallel Hash Full Join -- Alvaro- Munro says planning to get this in shortly ... in November. Heikki- We want it. Seems to include bug fixes that should be committed?
On client login event trigger -- Heikki- What have the problems been with it? Andres- If you screw it up you can never log in again, which was one of the issues. At some point there was work on a GUC to disable it to allow you to get in.. Not sure if that was added, but without it there's no way to log into the system. Alvaro- Very much wanted feature. Heikki- What do people want to do with this? Matthias- Possibly useful to set variables on login, or to record in a table that a user logged in. Peter E- Maybe wait until after an event trigger disable GUC so that this can be bypassed if there is a bug or issue with it.
Consider parallel for LATERAL subqueries having LIMIT/OFFSET -- Tomas- I may be able to take a look, but it's difficult to reason about whether it's correct or not; would be good to get Tom's input, but will take a look. Alvaro- Tom said he didn't know how it could be safe. Tomas- I'll read it, maybe learn something, and try to figure out whether it could be done.
pg_stat_statements: Track statement entry timestamp -- Andres- A lot of complexity. Tomas- The idea of tracking when an entry was added makes sense, because if you have two entries for two tables and one has very large numbers and the other very low numbers, does that mean one is more active? Might just be because of which one is newer, not which is really more active. Makes sense. But then it adds a lot of complexity by adding in a lot of ways to reset things. Concerned about some of the changes. Andres- A lot of overhead has been added lately and not sure it's good to add more. Tomas- To do reasonable analysis you need to keep the deltas anyway, so not sure this is really helpful. Peter E- If you have a data set where you care about tracking, then likely you'll have entries for years and so it isn't really that useful. Matthias- Even the latest entry isn't that hard to derive by checking deltas across time. Tomas- Only issue with keeping regular snapshots is that it doesn't work for min/max latency for the query, because once you get a spike you'll never see the new min/max in the following period, but even so, not sure it makes sense to keep an entry timestamp.
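The delta-based analysis mentioned above -- periodically snapshotting cumulative counters and diffing -- can be sketched like this (a toy model mapping queryid to call count, not the view's actual schema):

```python
def snapshot_delta(prev, curr):
    # prev/curr map queryid -> cumulative call count as read from two
    # successive snapshots of the stats view. A queryid missing from
    # prev is either genuinely new or was evicted and re-entered;
    # without an entry timestamp those two cases look identical, which
    # is the ambiguity the patch is trying to address.
    return {qid: calls - prev.get(qid, 0) for qid, calls in curr.items()}

prev = {101: 1000, 202: 5}
curr = {101: 1010, 202: 5, 303: 7}   # 303 appeared between snapshots
print(snapshot_delta(prev, curr))    # {101: 10, 202: 0, 303: 7}
```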
psql - refactor echo code -- Peter E- Added myself to review it and will do so.
pg_stats and range statistics -- Tomas- Did review of this. What it does is that we don't currently track range statistics and only problem with that is how we read and print the histogram and if there is a way to do that in pure SQL or if we need special functions for it. I will continue working on it and reviewing it and hopefully will make progress.
pgbench: using prepared BEGIN statement in a pipeline could cause an error -- Alvaro- problem is that we prepare the whole thing, but maybe we didn't want it to change how we are measuring latency and a different method was proposed but the author hasn't changed it accordingly. Alvaro will comment that it wasn't updated to new approach.
Add system view tracking shared buffer actions -- Andres planning to commit it, issue with tablespace tests but should be able to resolve in next few days.
Using each rel as both outer and inner for anti-joins -- Tomas- Will make us consider more planning options. Currently only consider one way and this could allow other ways to be considered. Andres- Turns a lot of nested loop antijoins into hash antijoins which seems good. Tomas- Seems pretty reasonable..
Dynamic result sets from procedures -- Peter E- Patch held up for a long time to get the display of multiple result sets, due to psql needing it. Working on adding more capabilities and tests to check the extended protocol; found some issues and that's in progress of being fixed. Not sure if this patch will land any time soon, but it needs more tests and is a useful feature. Functionality is part of the standard. Heikki- Changes the protocol? Peter E- Kind of; the protocol kind of just works today, but maybe needs to be more explicit to make sure that everything works. Have a patch to make it work with JDBC that's pretty small.
Add foreign-server health checks infrastructure --
Parallelize correlated subqueries that execute within each worker -- Tomas- On list comments that it's unsafe but the discussion was side-tracked about discussion about how parameters passed to parallel workers. Not sure ... Not just about parallel subqueries or parallelism in general but also about how parameters are passed in general. Not sure what the conclusion is. Andres- Not close to being committable due to commented out warnings and seems to be WIP.
postgres_fdw: commit remote (sub)transactions in parallel during pre-commit -- Andres- looks partially committed? Heikki- Not sure why this needs to be configurable? Andres- Seems to maybe have a lot of duplicated cases that shouldn't be needed between commit/abort?
Update relfrozenxmin when truncating temp tables -- Andres- Every version seems to get more complicated ...
functions to compute size of schemas/AMs (and maybe \dn++ and \dA++) -- Matthias- Would like more verbose options in the backslash commands. Peter E- Not sure why we'd want this ++. Maybe have + for more details, but don't want to compute the size every time. Stephen- Maybe have a cache or stats for the size of things to make them less expensive to query. Andres- A lot of work to actually keep a correct answer for size in shared memory. Andres- Don't really see the point; maybe just reject it.
disallow HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_LOCKED_ONLY -- Andres- Just explicitly forbids a combination of bits that shouldn't be allowed. Mark- Mainly to pick up on corruption, but not sure that it's actually not allowed to happen, and pg_upgrade makes things very difficult because we couldn't be 100% sure that this is an error. Seems probably right but not 100% sure. Andres- What does this actually get us though? Would it really catch corruption? Mark- Unless there is a way to prove that this really won't happen, we can't commit it. Heikki- Same as having asserts to check things. Tomas- In some other cases we have realized that there was corruption due to invalid bits being set, so this could be useful. Don't know how long it has been broken though. Peter E- Maybe add this to amcheck, but don't add an assertion, as that's only in development builds anyway and won't help with corruption detection. Mark- Idea is to add an assertion to the code that matches what amcheck checks, so if anyone decides to use such a bit pattern it gets realized that both need to be changed. Mark- Not willing to try to guarantee that this can't happen. Heikki- Maybe put it into amcheck and use that to see if it does happen in the field. Tomas- It might scare people for no reason, though, if it turns out not to be an issue, and people who hit it might not report it. Heikki- Another thing is that it seems it could report a problem on pg_upgrade'd clusters which were valid, and therefore this shouldn't go in because of that. Maybe that doesn't kill the whole patch, but maybe it does.
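The check under discussion amounts to asserting that two infomask bits are never set together: a locker-only xmax has no commit status to hint. A toy model (bit values patterned after htup_details.h, used here for illustration only):

```python
# Illustrative infomask bits -- check htup_details.h for the real values.
HEAP_XMAX_LOCK_ONLY = 0x0080
HEAP_XMAX_COMMITTED = 0x0400
FORBIDDEN = HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_COMMITTED

def infomask_valid(infomask: int) -> bool:
    # Both bits set at once would indicate corruption -- or, per the
    # pg_upgrade concern above, a leftover from an old release where
    # the combination may have been written legitimately.
    return (infomask & FORBIDDEN) != FORBIDDEN

print(infomask_valid(HEAP_XMAX_LOCK_ONLY))                        # True
print(infomask_valid(HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_COMMITTED))  # False
```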
In-place persistence change of a relation (fast ALTER TABLE ... SET LOGGED with wal_level=minimal) -- Peter E- Seems to add a lot of code and doesn't seem really worth it. Andres- Should have removed minimal WAL long ago. Tomas- Lots of bugs there and people don't seem interested in fixing them..? Andres- Some of them fixed, but those fixes sometimes added other bugs. Tomas- Maybe make minimal-WAL-level improvements conditional on other things. Andres- Don't see a real use-case for wal_level=minimal. Tomas- Wouldn't use it for important data ... Peter E- Even if we don't like wal_level=minimal, it's a legit point that we could optimize this, but the code is large and adds things to check what we changed, etc. Tomas- For the patch author it makes sense because it can be helpful. Heikki- Patch also changes how relation rewrite is done, to just use FPIs instead of a bunch of heap inserts, and that could be better. Maybe get rid of the wal_level=minimal stuff but keep the other changes. Andres- Not sure if this is really safe to do this way, though, e.g. in rollback. Tomas- Should at least be split into two patches if it actually works: one for the non-wal-level-minimal part and one for the wal-level-minimal optimization.
Speed up releasing of locks -- Matthias- Good idea. Andres- Needs a bit more work and possible small slowdowns. Removing a lot of weird code. Doable for 16 if time is put into it.
Add log messages when replication slots become active and inactive -- Tomas- seems like a simple patch if we want them, which seems reasonable we can have it. Doable for 16 if we want them.
Daitch-Mokotoff soundex -- Tomas- Seems like a simple patch and will take a look and probably will commit it.
reduce impact of lengthy startup and checkpoint tasks -- Andres- Have serious doubts about it making things better for xid wraparound and other things. Good idea in theory but need much more pared down set of things as said on thread. Pretty large change and probably not for 16 unless a committer picks it up and spends a lot of time on it.
Add Amcheck option for checking unique constraints in btree indexes -- Mark- Responded on a couple of things, author submitted new patches, waiting for Peter G to see if he wants it. Probably doable if Peter G has time to review it.
pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory -- Probably can go in as a bugfix?
Error "initial slot snapshot too large" in create replication slot -- Andres- still couldn't figure out how to do much better than the current state. Not sure if anything new has happened.
AcquireExecutorLocks() and run-time pruning -- Tomas- Amit is working on it and getting feedback from Tom and so seems to be in progress.
64-bit SLRU page numbers (independent part of 64-bit XIDs) -- Heikki- Seems like a good idea but not sure about the implementation. Peter E- Not filled with confidence about it. Heikki- Is this independently useful? Matthias- Good to have in before 64-bit XIDs because it reduces that patch's size. Andres- Could be useful with a different AM. Matthias- Also applies to all the SLRUs, so it might help MultiXact too. Peter E- Asked why this is helpful; may have been changed? Matthias- Preparation for 64-bit XIDs. Peter E- Question on the whole patchset: things seem to get added and then reverted within the patch series, which is a bit confusing. Heikki- Looking at the patch, it doesn't change more things..? There's some complicated logic in dealing with wraparound, and switching to 64-bit numbers should help with that, but this patch doesn't seem to take advantage of it. Andres- Would have to make pg_upgrade quite a bit more complicated to make it work. Heikki- There is a pg_upgrade part of the patch. Andres- Not likely to make it into 16.
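The "complicated logic in dealing with wraparound" is the modular comparison that 32-bit counters require; with 64-bit numbers a plain `<` suffices. A sketch of the 32-bit rule (modeled loosely on PostgreSQL's TransactionIdPrecedes, simplified and not the actual code):

```python
def precedes_32bit(a: int, b: int) -> bool:
    # Compare two 32-bit counters on a circle: a precedes b when the
    # signed 32-bit difference a - b is negative. This is the kind of
    # wraparound-aware test that 64-bit numbering replaces with a < b.
    diff = (a - b) & 0xFFFFFFFF
    return diff > 0x7FFFFFFF

# After wraparound, the numerically "smaller" value is the newer one.
print(precedes_32bit(0xFFFFFFF0, 0x10))  # True: 0xFFFFFFF0 came first
print(precedes_32bit(0x10, 0xFFFFFFF0))  # False
```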
Pluggable toaster -- Andres- Don't see it going anywhere, and the idea of content-aware toasting is very complicated; the patch adds a whole bunch of infra. Vik- Like the idea, but it's very complicated. Heikki- Like the idea of making the toaster better. Matthias- For certain data types, specialized compression would be really good. Tomas- Seems like this isn't the right place to be putting this infra. Andres- Want to compress json to get rid of keys, but want to do that for everything, not just toasted data, so this seems like the wrong place to do it. Tomas- Seems like the wrong level. Want a dictionary, use that for the data type, then compress, and then it can be toasted like usual. Matthias- Not really a good way to make this available. Andres- This doesn't really get you much farther. Heikki- Seems actually more reasonable than I thought. Toasting has two parts: the compression, and slicing the data into tuples and putting them in the toast table. Matthias- This does both and tries to work with the data type to make the output more performant to access; a json value is deconstructed into multiple tuples following its structure. There have been some really compelling performance improvements using this. Heikki- Maybe it can be split along those two pieces. Very unlikely for 16 due to lack of consensus.
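The "slicing data into tuples" half of toasting can be sketched as plain chunking, each chunk becoming one row in the toast table keyed by (chunk_id, sequence number). The chunk size below is illustrative; the real one is derived from the block size:

```python
TOAST_CHUNK_SIZE = 1996  # illustrative, not the exact server constant

def toast_slice(value: bytes):
    # Split an oversized datum into fixed-size chunks; only the last
    # chunk may be short. Content-aware toasters would instead split
    # along the value's internal structure (e.g. json fields).
    return [value[i:i + TOAST_CHUNK_SIZE]
            for i in range(0, len(value), TOAST_CHUNK_SIZE)]

chunks = toast_slice(b"x" * 5000)
print([len(c) for c in chunks])  # [1996, 1996, 1008]
```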
Add pg_stat_session -- Peter E- Seems reasonable to make it into 16 as long as agreement about usefulness. Needs someone to look at it. Maybe move some things from pg_stat_activity to here?
Allow parallel plan for referential integrity checks -- Mark- Robert marked it as unsafe because he wasn't sure if it was safe. Author doesn't seem to have time. Needs someone to pick it up or it should be closed or punted to 17.
warn if GUC set to an invalid shared library -- Seems to need some cleanup? Hopefully someone can look at it, not a lot of code and could probably make it if worked on.
add guc: hugepages_active -- Seems reasonable.
Time-delayed logical replication subscriber -- Peter E- seems to be getting worked on, could be done in time.
Add non-blocking version of PQcancel -- Heikki- Seems like a good idea in principle. Peter E- Being worked on and plausible for 16.
Add LZ4 compression in pg_dump -- Tomas- seems almost ready and have been reviewing it, probably good enough, likely for 16.
Move SLRU data into the regular buffer pool -- Andres- Probably not for 16 at this point. Heikki- Concern about performance. Matthias- Performance seems ok. Heikki- probably not going to make 16 just because it's quite large. Andres- Deletes more code than it adds at least.
doc: PQexecParams binary handling example for REAL data type -- Peter E- being worked on, should be fine.
Support logical replication of DDL commands -- Not likely to make it to 16 as it's quite large.
Skip replicating the tables specified in except table option -- Alvaro- seems like it needs some work, not sure if it'll be ok for 16.
Data is copied twice when specifying both child and parent table in publication -- Sounds like a bug?
Perform streaming logical transactions by background workers -- Partially committed?
Fix dsa_free() to re-bin segment -- bug fix?
pg_rewind: warn when checkpoint hasn't happened after promotion -- Heikki- Looking at it; not a large patch, seems sane and probably could make it.
generate_series in selected timezone, date_add in selected timezone -- no opinions
New hooks in the connection path -- Bertrand will update to remove hook which seems contentious and hopefully the rest is ok to go in.
Check consistency of GUC defaults between .sample.conf and pg_settings.boot_val -- Andres- Good idea but was a competing patch, not sure which way will go.
nbtree performance improvements through specialization on key shape -- Matthias- Needs some cleanup. Andres- Seems like too large a patch to make it in. Matt- Ask Peter G to review it? Matthias- Not sure how to make it much better than how it is.
Add sortsupport for range types and btree_gist -- Jeff- I can probably take a look and see. Not sure what state it's in.
Reducing planning time when tables have many partitions -- Alvaro- Rowley has been working on it.
CI and test improvements --
Transparent column encryption -- Peter E- feels like it's complete.. Want to try and get it in and make it acceptable
Switching XLog source from archive to streaming when primary available -- Andres- pretty reasonable patch, haven't looked at details but having a config option for this seems reasonable and could probably go in for 16.
An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication -- Andres- Not sure why this needs to be solved..? Tomas- Seems backwards from what the streaming sync replication does? We commit locally and then wait and so there shouldn't be more than one waiting. Andres- Busy loop which doesn't seem good? Not sure about this one.
Minimal logical decoding on standbys -- Bertrand- A lot of activity with feedback from Robert and Andres. Andres- Good chance that at least some of it could make it into 16.
Compression dictionaries for JSONB -- Alvaro- Related to the toasting patch? Heikki- Why do this just for jsonb? Matthias- Specifically implemented for jsonb, but should make it possible for others too? Don't think it will make 16 because I don't have the bandwidth and there aren't a lot of others interested. Tomas- Think it actually does build the infra. Main problem I had with the patch is that I tried to measure the benefit to show improvement and had trouble seeing consistent improvements. Not sure if that was my problem, but we need to decide whether we want to do this or pluggable toaster or what. Andres- It uses typmod, which seems like a no-go for this? Tomas- For data-type-specific / column-specific compression we need context to identify the dictionary, and the patch is using typmod for that, which doesn't seem good. Heikki- Seems like it would belong better in the toaster. Alvaro- Maybe have a half-hour discussion with the devs around these things. Andres- Seems to require the dictionary be specified, which doesn't seem good. Tomas- Just an initial implementation; in the future there would be a process to handle doing that. Not likely for 16 just because these questions need to be figured out and discussed more. These are ok in a POC but not good enough to go in yet. Andres- Maybe the dictionary goes into pg_attribute or another context. Does not seem likely for 16.
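The basic idea -- replacing repeated JSON object keys with small codes from a per-column dictionary stored once -- can be sketched as follows (a toy model; nothing like the patch's actual on-disk format):

```python
def compress_keys(doc, dictionary):
    # Replace known object keys with small integer codes, recursing
    # into nested objects and arrays. The dictionary is shared by the
    # whole column, so each long key's text is stored only once.
    # Unknown keys pass through unchanged.
    if isinstance(doc, dict):
        return {dictionary.get(k, k): compress_keys(v, dictionary)
                for k, v in doc.items()}
    if isinstance(doc, list):
        return [compress_keys(v, dictionary) for v in doc]
    return doc

dictionary = {"customer_name": 0, "shipping_address": 1}
doc = {"customer_name": "Ada", "shipping_address": {"customer_name": "Ada"}}
print(compress_keys(doc, dictionary))  # {0: 'Ada', 1: {0: 'Ada'}}
```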
ALTER TABLE SET ACCESS METHOD on partitioned tables -- Seems small enough and useful enough that could make it for 16.
Add SPLIT PARTITION/MERGE PARTITIONS commands -- Alvaro- Definitely want this but not sure we are going to be able to make it for 16.
Fix assertion failure with barriers in parallel hash join -- Bug fix?
Support load balancing in libpq -- Andres- Not a very large patch. Tomas- Reasonable and may be able to make it in for 16.
Add JIT deform_counter --
Amcheck verification of GiST and GIN --
Use fadvise in wal replay -- Andres- reject it. Tomas- Whole assumption is readahead is disabled, but if readahead is enabled then this is always worse. Nothing to solve here really.
Let libpq reject unexpected authentication requests -- Andres- doesn't address issue with peer, at least. Solve some problems maybe. Need to be clear in the documentation what it is actually doing. Could possibly make 16.
Support % wildcard in extension upgrade scripts -- Andres- Think this was pretty much rejected?
Fix recovery conflict SIGUSR1 handling -- bug fix
pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum -- bug fix
Add 64-bit XIDs into PostgreSQL 16 -- Not gonna make it for 16.
Eliminating SPI from RI triggers -- Alvaro- Seems not likely to happen due to people being too busy with other things. Tomas- Was updated though? Maybe.
Add initdb option to initialize cluster with non-standard xid/mxid/mxoff. -- For testing 64bit patch but could be useful for other things. Mainly for testing.
Testing autovacuum wraparound -- Andres- Not planning on working on it really because we lack infra to do it without problems.
Improve dead tuple storage for lazy vacuum -- Andres- Making progress, not sure it'll be ready.
USAGE privilege on PUBLICATION --
explain analyze rows=%.0f --
Fix alter subscription concurrency errors --
ALTER TABLE and CLUSTER fail to use a BulkInsertState for toast tables --
Cygwin cleanup --
logical decoding and replication of sequences, take 2 --
doc: mention CREATE+ATTACH PARTITION as an alternative to CREATE..PARTITION OF --
Add index scan progress to pg_stat_progress_vacuum --