PgCon 2020 Developer Unconference/tableAM next steps
As part of merging PostgreSQL v12 into Greenplum, we are using the new table AM interface to refactor Greenplum's append-optimized row- and column-oriented access methods. This exercise led us to recognize certain limitations of the existing table AM interface. The discussion centered on identifying these limitations and thinking about how to mitigate them.
- 1 DML operations
- 2 SMGR pluggability
- 3 Ability to recognize a table AM by inspecting file in datadir
DML operations
A table AM may need to maintain state across insert/update/delete operations:
- variable length blocks
- no shared buffers, custom buffering mechanism or just use OS buffer cache
- bulk compression, may need to accumulate tuples in buffer
There is consensus that a table AM needs to keep state during DML operations. Andres: in-tree AMs as well may benefit from state, e.g. pin the last visibility map page.
Current table AM interface
- table_tuple_insert(Relation, TupleTableSlot, cid, options, bistate)
The bistate parameter is quite heap-specific: it assumes shared buffers are used. It is an undesirable historical remnant and should be removed from the insert API.
Where to keep this state, who takes the responsibility?
- ResultRelInfo seems like a good candidate, but exactly where the state lives is less important than the interface itself.
- Lifetime of the state is from ExecInitModifyTable to ExecEndModifyTable.
- The table AM implementation is responsible for creating and destroying the state.
The more important question is: what should the interface look like?
A new interface can be introduced along the lines of the *beginscan, *getnext, *endscan interfaces. E.g. a table_scan_begin_modify() interface can be used to create the state and return a pointer to an AM-specific (opaque) struct. table_scan_end_modify() would destroy the state.
Bulk insert / multi-insert interface
How about an API that lets the AM accumulate tuples and return the list of TIDs once they are written to storage in one batch? E.g. get the last 1000 buffered TIDs. The batch size needs to be negotiated.
Should the tuples be buffered by the caller or by the AM? How to deal with triggers, foreign keys, etc.? This needs to be thought through carefully.
Current COPY multi-insert implementation is such that the caller of the multi-insert API ensures that the memory context used to buffer the tuples is not destroyed prematurely.
A generic modification layer that can be leveraged by COPY, INSERT ... SELECT (and DDLs?) would be desirable. It would potentially reduce complexity in nodeModifyTable.c.
SMGR pluggability
- Beyond RelFileNode? The current pluggable SMGR allows for a sync API implementation other than md.c. However, what gets recorded in XLOG and pending-delete lists is a RelFileNode.
- Some storage layers may need additional information, such as a segment number, to uniquely identify a file to be sync'ed or unlinked. E.g. Greenplum's append-optimized storage allows for variable-length, non-contiguous segment numbers for a single relation's segment files.
- register_unlink_segment() accepts a segment number as a parameter. Is that not enough?
- A convincing articulation of the limitation could not be made; a patch / discussion will be initiated on pgsql-hackers regarding this.
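The identifier gap under discussion can be shown with two structs. The first mirrors the fields of PostgreSQL's RelFileNode (tablespace, database, relation); the second is a purely hypothetical extension carrying the segment number that such AMs would need recorded alongside it:

```c
/* Mirrors the shape of PostgreSQL's RelFileNode (relfilenode.h):
 * what XLOG and pending-delete lists record today. */
typedef struct RelFileNode
{
    unsigned spcNode;           /* tablespace */
    unsigned dbNode;            /* database */
    unsigned relNode;           /* relation */
} RelFileNode;

/* Hypothetical extension: AMs with non-contiguous, variable-length
 * segment numbering need the segment number carried along too,
 * since it cannot be derived from the RelFileNode alone. */
typedef struct RelFileNodeSeg
{
    RelFileNode node;
    int         segno;          /* which segment file to sync/unlink */
} RelFileNodeSeg;
```

For heap, segment numbers are a contiguous function of file size, so RelFileNode suffices; for append-optimized storage they are not, which is the limitation to be articulated on pgsql-hackers.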
Ability to recognize a table AM by inspecting file in datadir
- pg_basebackup performs checksum validation, and it would be nice to make that work for non-heap table AMs' data files.
- Tools such as pg_basebackup may not have access to the catalog and are currently unable to identify the table AM / layout of a file in the datadir.
- Metadata fork: pg_basebackup, for example, could invoke AM-specific callbacks based on it.
- Adding more forks is not desirable; we already have too many. How about a metadata page at the beginning of each file? One idea for addressing the upgrade implications of this proposal is to have pg_upgrade add the metadata page during upgrade.
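Nothing like this exists in PostgreSQL today, but a metadata page might carry little more than an AM identifier and a layout version; a hypothetical sketch (all field names invented):

```c
#include <string.h>

/* Hypothetical first-page layout letting catalog-less tools such as
 * pg_basebackup identify the owning table AM and pick the right
 * checksum/validation logic for the rest of the file. */
typedef struct FileMetaPage
{
    char     am_name[64];       /* e.g. "heap", "ao_row", "ao_column" */
    unsigned layout_version;    /* bumped by pg_upgrade when added/changed */
} FileMetaPage;
```

A tool would read this fixed-size header, dispatch on am_name, and skip the page when computing data checksums.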
Thank you everyone for a nice discussion! Special thanks to Soumyadeep Chakraborty for taking notes.