PGCon 2020 Developer Unconference/tableAM next steps

As part of merging PostgreSQL v12 into Greenplum, we are using the new table AM interface to refactor Greenplum's append-optimized row- and column-oriented access methods. This exercise led us to realize certain limitations of the existing table AM interface. The discussion centered on recognizing these limitations and thinking about how to mitigate them.

DML operations

A table AM may need state for insert/update/delete.

  • variable-length blocks
  • no shared buffers; a custom buffering mechanism, or just the OS buffer cache
  • bulk compression may require accumulating tuples in a buffer

There is consensus that a table AM needs to keep state during DML operations. Andres: in-tree AMs may benefit from such state as well, e.g. to keep the last visibility map page pinned.
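
As a concrete illustration, the kind of state a non-heap AM may want to carry across a statement could look like the following (a minimal sketch; every name here is invented, nothing below is existing PostgreSQL or Greenplum API):

    #include "postgres.h"
    #include "executor/tuptable.h"  /* TupleTableSlot */
    #include "storage/fd.h"         /* File */
    #include "utils/rel.h"          /* Relation */

    /*
     * Hypothetical AM-private insert state, created once per DML statement.
     * It illustrates the bullets above: a buffer of tuples awaiting bulk
     * compression, and a segment file handled outside shared buffers.
     */
    typedef struct ColumnarInsertState
    {
        Relation        rel;          /* target relation */
        MemoryContext   tupcontext;   /* owns the buffered tuples */
        TupleTableSlot **buffered;    /* tuples accumulated for compression */
        int             nbuffered;    /* number of tuples currently buffered */
        File            cur_segfile;  /* current variable-length segment file,
                                       * written via the OS buffer cache */
    } ColumnarInsertState;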

Current table AM interface

  • table_tuple_insert(Relation, TupleTableSlot, CommandId, options, bistate)

The bistate parameter is quite heap-specific; it assumes that shared buffers are used. It is an undesirable historical remnant and should be removed from the insert API.
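
For reference, the corresponding callback in struct TableAmRoutine looks like this (quoted from src/include/access/tableam.h as of v12; note the heap-specific BulkInsertStateData pointer):

    /* from struct TableAmRoutine, src/include/access/tableam.h (v12) */
    void        (*tuple_insert) (Relation rel, TupleTableSlot *slot,
                                 CommandId cid, int options,
                                 struct BulkInsertStateData *bistate);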

Where should this state be kept, and who takes responsibility for it?

  • ResultRelInfo - seems like a good candidate, although exactly where the state lives is a secondary concern.
  • The lifetime of the state spans from ExecInitModifyTable to ExecEndModifyTable.
  • The table AM implementation is responsible for creating and destroying the state.

The more important question is what the interface should look like.

A new interface can be introduced along the lines of the existing *beginscan, *getnext, *endscan interfaces. E.g. a table_scan_begin_modify() call could create the state and return a pointer to an AM-specific (opaque) struct; table_scan_end_modify() would destroy the state.
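
A minimal sketch of what such an interface could look like (hypothetical signatures only, not committed API):

    #include "postgres.h"
    #include "executor/tuptable.h"  /* TupleTableSlot */
    #include "utils/rel.h"          /* Relation */

    /* Opaque, AM-defined state; each AM supplies its own definition. */
    typedef struct TableModifyState TableModifyState;

    /*
     * Hypothetical entry points mirroring table_beginscan()/table_endscan().
     * begin would be called from ExecInitModifyTable(), end from
     * ExecEndModifyTable().
     */
    extern TableModifyState *table_scan_begin_modify(Relation rel);
    extern void table_scan_end_modify(TableModifyState *state);

    /*
     * DML calls would then carry the AM state instead of a heap-specific
     * bistate; the name below is invented for illustration.
     */
    extern void table_tuple_insert_with_state(TableModifyState *state,
                                              TupleTableSlot *slot,
                                              CommandId cid, int options);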

Bulk insert / multi-insert interface

How about an API that lets the AM accumulate tuples and return the list of TIDs once they have been written to storage? E.g. fetch the last 1000 buffered TIDs. The batch size needs to be negotiated between the caller and the AM.

Should the tuples be buffered by the caller or by the AM? How to deal with triggers, foreign keys, etc. needs to be thought through carefully.

The current COPY multi-insert implementation relies on the caller of the multi-insert API to ensure that the memory context used to buffer the tuples is not destroyed prematurely.

A generic modification layer that can be leveraged by COPY, INSERT ... SELECT (and DDL?) would be desirable. It would potentially reduce complexity in nodeModifyTable.c.
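
One possible shape for the batched part of such a layer, building on the hypothetical TableModifyState sketched above (purely illustrative; none of these functions exist):

    #include "postgres.h"
    #include "executor/tuptable.h"  /* TupleTableSlot */
    #include "storage/itemptr.h"    /* ItemPointerData */

    /*
     * Queue one tuple for insertion.  The AM copies the tuple into its own
     * memory context, so the caller's slot can be reused immediately and
     * the caller need not keep its own buffering context alive.
     */
    extern void table_multi_insert_queue(TableModifyState *state,
                                         TupleTableSlot *slot,
                                         CommandId cid, int options);

    /*
     * Flush buffered tuples to storage.  Returns the number of tuples
     * written and, in *tids, their assigned TIDs, so the caller can fire
     * AFTER ROW triggers, enqueue foreign-key checks, etc. for the batch.
     */
    extern int table_multi_insert_flush(TableModifyState *state,
                                        ItemPointerData **tids);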

SMGR pluggability

  - Beyond RelFileNode?  The current pluggable SMGR work allows for a sync
    API implementation other than md.c.  However, what gets recorded in XLOG
    and in pending-delete lists is a RelFileNode.
  - Some storage layers may need additional information, such as a segno, to
    uniquely identify a file to be sync'ed or unlinked.  E.g. Greenplum's
    append-optimized storage allows for variable-length, non-contiguous
    segment numbers for a single relation's segment files.
  - register_unlink_segment() accepts a segment number as a parameter.  Is
    that not enough?  (See the FileTag definition below.)
  - A convincing articulation of the limitation could not be made; a patch /
    discussion will be initiated on pgsql-hackers regarding this.
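
For context, the sync request machinery has identified files with a tag that already carries a segment number since v12:

    #include "postgres.h"
    #include "storage/relfilenode.h"

    /*
     * Quoted from src/include/storage/sync.h (v12).  segno is part of the
     * tag used for fsync/unlink requests, whereas XLOG records and
     * pending-delete entries carry only a RelFileNode.
     */
    typedef struct FileTag
    {
        int16       handler;        /* SyncRequestHandler value, saving space */
        int16       forknum;        /* ForkNumber, saving space */
        RelFileNode rnode;
        uint32      segno;
    } FileTag;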


Ability to recognize a table AM by inspecting a file in the datadir

  - pg_basebackup performs checksum validation, and it would be nice to make
    that work for non-heap table AMs' data files.
  - Tools such as pg_basebackup may not have access to the catalog and are
    currently unable to identify the table AM / layout of a file in the
    datadir.
  - Metadata fork: pg_basebackup, for example, could invoke AM-specific
    callbacks.
  - Adding more forks is not desirable; we already have too many.  How about
    a metadata page at the beginning of each file?  Ideas for addressing the
    upgrade implications of this proposal include having pg_upgrade add the
    metadata page during upgrade; a hypothetical layout is sketched below.
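
A hypothetical shape for such a metadata page (invented for illustration; no such structure exists today):

    #include "postgres.h"

    #define TABLE_AM_META_MAGIC 0x54414D30    /* "TAM0"; invented value */

    /*
     * Hypothetical metadata page stored at the start of each data file, so
     * that tools without catalog access (pg_basebackup, offline checkers)
     * can identify the owning table AM and its on-disk layout version.
     */
    typedef struct TableAmMetaPage
    {
        uint32      magic;      /* TABLE_AM_META_MAGIC */
        uint32      version;    /* layout version; pg_upgrade could stamp
                                 * this when adding the page to old files */
        NameData    amname;     /* table AM name, e.g. "heap" */
        uint32      flags;      /* AM-specific, e.g. checksum scheme */
    } TableAmMetaPage;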

Thank you everyone for a nice discussion! Special thanks to Soumyadeep Chakraborty for taking notes.