This page contains a design draft of the heapam refactoring. It seems to be a good place to keep the API proposal.
This article is pretty large, but I want to discuss all the details (and take all objections into account) before beginning this colossal refactoring. As I see it, the refactoring is not intended to bring any revolutionary ideas, just to clarify our vision of data storage manipulations and provide clear interfaces.
Disclaimer: Please, don’t complain that the names are bad. I know that, but I just needed some stubs to write this design draft. I’m open to any rational suggestions regarding the naming, but it’s not the thing I consider really important at this stage of the discussion. The most important feedback I need is fundamental objections against the refactoring, overlooked implementation details, and possible conflicts with current development.
I divided this message into two parts. The first describes the current state of the code, and the second presents the new interface proposal.
Looking forward to your comments. Please, don’t hesitate to point out the weak points of the proposal and ask for clarification.
Levels of abstraction
Let’s begin. Many of you have seen presentations that explain in detail how a query travels through PostgreSQL: parser, planner, optimizer and so on. But when it comes to the executor, we only say something like “And then the query is executed.” What really happens at this stage? Both DDL and DML queries mostly operate on data stored in tables. I’d separate the levels of abstraction this way:
1. Relation management.
Here we process DDL queries: ExecDropStmt, and so on. We perform all catalog changes, refresh caches and fill in the state structures to pass them further. Besides, we also set and track all dependencies between relations, decide whether we should update the indexes defined on our relation, and so on. In the case of DML, we also prepare various structures to manage the flow of tuples. For example, we set up a proper DestReceiver. IMHO, this part of the code needs to be cleaned up. In the attachments you can find README files for the src/backend/commands and src/backend/catalog directories. I’m sure you will understand my concerns. Besides vague naming and file locations, many functions violate the layering. For example, functions in analyze.c iterate over relation pages directly. It means that we have to keep this code in sync with heapam, and change it each time we want to change anything in the heapam page layout. I wonder, why not delegate this job to the next level? And so on.
2. Data access management.
When we work with an index, the AM does all the work on data layout. You can get an idea of the index AM interface by reading the pg_am documentation and the file src/include/access/amapi.h. It provides a pluggable interface and makes it easy to add new index structures. Primary data is managed by heapam (see src/backend/access/heap), which is the only available data structure for data with MVCC information. For many historical reasons this code is complex and tangled. And that is exactly the part I’m going to reorganize.
3. Buffer management.
This layer is responsible for managing shared buffers, and also local buffers for temporary relations. On request from the upper level, it provides a buffer that is ready for use. I think this code is isolated pretty well in src/backend/storage/buffer, and I’m not going to change anything in this area.
4. File system operations.
And finally the lowest level, which dispatches all file system operations for Postgres. Long story short, Postgres knows nothing about its own files without the relfilenode mapping. There are many advantages and disadvantages to this approach. But what I see is that this particular problem severely limits the development of backup and data verification utilities. Colleagues of mine are working on improvements in this area, so we want to synchronize our efforts.
Part 2. I propose a new interface for the primary data management level. Inspired by the index amapi, I’d like to implement an API for primary access methods. Let’s call it PAM (primary access method) in this proposal.
Interface flags and variables:
|pamname||Name of the PAM. For now, only heapam is available. Possible future options are LSM, inmemoryheap, readonlyheap, fixedsizeheap, etc.|
|pamisreadwriteonly||If the PAM doesn’t support update, delete and alter, many things can be done more easily and efficiently. Use case: log tables, MATERIALIZED VIEWS.|
|pamisreadonly||If the PAM doesn’t support any operations except multiinsert (loading data into a table using COPY FROM or CREATE TABLE AS), we can get rid of MVCC fields and compact the data. Use case: archive tables, dictionaries, non-updatable MATERIALIZED VIEWS.|
|pamsupportsvarlena||Does this PAM support varlena fields? Again, we can optimize the page layout for some tables.|
|pamisinmemory||Does this PAM implement in-memory storage (i.e. has no functionality to write data to disk)?|
|pamhaswallog||Does this PAM have Xlog functionality, or does it only provide UNLOGGED and TEMPORARY tables?|
|pamsupportsindexes||Can we create a secondary index on a relation of this kind?|
|pamsupportscatalog||Can we create system catalog relations using this PAM? I suggest forbidding any new PAMs for system catalog tables, since we want the catalog to be stable and reliable. Anyway, creation of system tables is hardcoded in the bootstrap process and no one will be able to change their PAM after that.|
|pamsupporttoast||Can we create a toast table using this PAM? The same reasoning applies as for pamsupportscatalog.|
|pamsupportssequence||Can we create a sequence using this PAM?|
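To make the flag table more concrete, here is a minimal sketch in C of how such a descriptor could look, loosely modeled on the way index AMs describe themselves in src/include/access/amapi.h. The struct name PamRoutine, the two helper functions, and the flag values assigned to heapam and to the archive PAM are assumptions made for illustration only; none of them exist in PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor; the field names follow the flag table above. */
typedef struct PamRoutine
{
    const char *pamname;
    bool        pamisreadwriteonly;
    bool        pamisreadonly;
    bool        pamsupportsvarlena;
    bool        pamisinmemory;
    bool        pamhaswallog;
    bool        pamsupportsindexes;
    bool        pamsupportscatalog;
    bool        pamsupporttoast;
    bool        pamsupportssequence;
} PamRoutine;

/* Assumed flag values for the existing heap: it supports everything. */
static const PamRoutine heapam_routine = {
    .pamname = "heapam",
    .pamisreadwriteonly = false,
    .pamisreadonly = false,
    .pamsupportsvarlena = true,
    .pamisinmemory = false,
    .pamhaswallog = true,
    .pamsupportsindexes = true,
    .pamsupportscatalog = true,
    .pamsupporttoast = true,
    .pamsupportssequence = true,
};

/* Assumed flag values for an imaginary load-once archive PAM. */
static const PamRoutine archive_routine = {
    .pamname = "readonlyheap",
    .pamisreadwriteonly = false,
    .pamisreadonly = true,
    .pamsupportsvarlena = true,
    .pamisinmemory = false,
    .pamhaswallog = true,
    .pamsupportsindexes = true,
    .pamsupportscatalog = false,
    .pamsupporttoast = false,
    .pamsupportssequence = false,
};

/* Hypothetical check the relation manager could run before CREATE INDEX. */
static bool
pam_can_create_index(const PamRoutine *pam)
{
    assert(pam != NULL);
    return pam->pamsupportsindexes;
}

/* A read-only PAM can drop MVCC fields, as the pamisreadonly row suggests. */
static bool
pam_needs_mvcc_fields(const PamRoutine *pam)
{
    assert(pam != NULL);
    return !pam->pamisreadonly;
}
```

With a descriptor like this, the relation management level would only consult flags and would never need to know how a particular PAM lays out its pages.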
Interface functions (all other routines should be hidden inside the PAM implementation):
Do we really need this function, or is it a job for the relation manager?
|pamaddcolumn||ALTER TABLE ADD COLUMN|
|pamdropcolumn||ALTER TABLE DROP COLUMN|
|pamrescan, pamendscan, pamgetnext||SELECT .. FROM TABLE|
|pamfetch||SELECT .. FROM TABLE. Fetch a particular tuple by its TID.|
|paminsert||INSERT INTO TABLE|
|pammultiinsert (optional)||COPY TABLE FROM ...|
|pamdelete (optional)||DELETE FROM TABLE|
|pamupdate (optional)||UPDATE TABLE SET ...|
|pamvacuum (optional)||VACUUM table, autovacuum|
|pamrebuild (pamvacuumfull)||VACUUM FULL, CLUSTER|
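The distinction between required and optional functions in the table above can be sketched as a table of function pointers in which an optional entry may be NULL. This is only a toy under stated assumptions: ToyRelation, PamFunctions, the callback signatures and the pam_insert()/pam_delete() wrappers are all hypothetical, and a real interface would work with Relation and HeapTuple rather than an int array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 8

/* Toy stand-in for a relation: a fixed array of integer "tuples". */
typedef struct ToyRelation
{
    int    values[TOY_CAPACITY];
    size_t ntuples;
} ToyRelation;

/* Hypothetical callback signatures. */
typedef bool (*paminsert_function) (ToyRelation *rel, int value);
typedef bool (*pamdelete_function) (ToyRelation *rel, size_t tid);

typedef struct PamFunctions
{
    paminsert_function paminsert;   /* required */
    pamdelete_function pamdelete;   /* optional, NULL if unsupported */
} PamFunctions;

/* PAM-private insert: append the value if there is room. */
static bool
toy_insert(ToyRelation *rel, int value)
{
    if (rel->ntuples >= TOY_CAPACITY)
        return false;
    rel->values[rel->ntuples++] = value;
    return true;
}

/* An append-only PAM simply leaves pamdelete unset. */
static const PamFunctions toy_appendonly_pam = {
    .paminsert = toy_insert,
    .pamdelete = NULL,
};

/* Upper-level wrapper: a required callback must always be present. */
static bool
pam_insert(const PamFunctions *pam, ToyRelation *rel, int value)
{
    assert(pam->paminsert != NULL);
    return pam->paminsert(rel, value);
}

/* Upper-level wrapper: report failure when the PAM cannot delete. */
static bool
pam_delete(const PamFunctions *pam, ToyRelation *rel, size_t tid)
{
    if (pam->pamdelete == NULL)
        return false;
    return pam->pamdelete(rel, tid);
}
```

In a real backend the unsupported case would presumably raise an error at the relation management level instead of returning false, but the dispatch pattern would be the same.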
|cluster.c||CLUSTER a table on an index. This is now also used for VACUUM FULL.
rebuild_relation() - creates a transient PAM relation, physically copies the PAM data and swaps the relation files. See the functions below:
make_new_heap() - Has nothing to do with heap. Creates a new cataloged transient table that will be filled with new data during CLUSTER, ALTER TABLE, REFRESH MATVIEW and similar operations.
copy_heap_data() - Has nothing to do with heap. Receives tuples one by one via index_getnext() or heap_getnext(), skips all tuples that do not satisfy the vacuum snapshot, then reforms the tuples and rewrites them in reform_and_rewrite_tuple().
reform_and_rewrite_tuple() calls rewrite_heap_tuple() (see src/backend/access/heap/rewriteheap.c), which inserts the tuple on a page according to the heapam algorithm.
|copy.c||Implements the COPY utility command.
CopyFrom() - after all preparations, calls heap_insert(). Or, in case we have a batch of buffered heap tuples, calls heap_multi_insert() via CopyFromInsertBatch. CopyTo() - receives tuples via heap_getnext().
|indexcmds.c||POSTGRES define and remove index code.
ReindexIndex() -> reindex_index()
ReindexTable() -> reindex_relation() -> reindex_index(). Maybe it’s a place for optimisation? Now, if we want to rebuild N indexes on a table, we perform index_build() and underlying table scan, N times. Maybe it’s possible to do it in one pass, building the indexes in parallel? Although it seems to be quite complicated...
|tablecmds.c||Commands for creating and altering table structures and settings.
TODO This file is too big. It has a lot of unrelated stuff that should be moved to other files.
1. RemoveRelations() - implements DROP TABLE, DROP INDEX, DROP SEQUENCE, DROP VIEW, DROP MATERIALIZED VIEW, DROP FOREIGN TABLE. Actually it's only a wrapper for performMultipleDeletions(), which lives in src/backend/catalog/dependency.c. And again, it does not cover every kind of relation: TOAST tables and RELKIND_COMPOSITE_TYPE are handled somewhere else. TOAST tables cannot be dropped directly, and composite types use DROP TYPE instead of DROP relation. See ExecDropStmt().
2. ExecuteTruncate() - does all the work of relation truncation; can perform MVCC-unsafe truncation via heap_truncate_one_rel().
3. copy_relation_data() - copies the physical file, block by block, when the tablespace has been changed. TODO move it closer to storage management.
|vacuum.c||The postgres vacuum cleaner.
This file now includes only control and dispatch code for the VACUUM and ANALYZE commands. Regular VACUUM is implemented in vacuumlazy.c, ANALYZE in analyze.c, and VACUUM FULL is a variant of CLUSTER, handled in cluster.c. vac_truncate_clog() also lives here; after a number of checks it calls TruncateCLOG, TruncateCommitTs and TruncateMultiXact.
|vacuumlazy.c||Concurrent ("lazy") vacuuming.
lazy_scan_heap() and lazy_vacuum_heap() should call interface functions of heapam (like lazy_vacuum_index() does), instead of keeping fully heap-specific code in this file. The same applies to lazy_vacuum_page(), lazy_check_needs_freeze(), count_nondeletable_pages() and heap_page_is_all_visible().
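The delegation suggested for vacuumlazy.c can be sketched as a generic driver that only calls a pamvacuum callback, so that all knowledge of the tuple layout stays inside the PAM. This is a toy under stated assumptions: ToyRelation, PamVacuumStats, the callback signature and vacuum_rel() are hypothetical and stand in for the real Relation, statistics structures and lazy vacuum code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TUPLES 16

/* Toy stand-in for a relation: each slot has a value and a "dead" mark. */
typedef struct ToyRelation
{
    int    values[TOY_MAX_TUPLES];
    bool   dead[TOY_MAX_TUPLES];
    size_t ntuples;
} ToyRelation;

/* Hypothetical statistics reported back to the generic vacuum code. */
typedef struct PamVacuumStats
{
    size_t tuples_removed;
    size_t tuples_remaining;
} PamVacuumStats;

typedef void (*pamvacuum_function) (ToyRelation *rel, PamVacuumStats *stats);

/* PAM-private: only this function knows how tuples are laid out. */
static void
toy_heap_vacuum(ToyRelation *rel, PamVacuumStats *stats)
{
    size_t kept = 0;

    for (size_t i = 0; i < rel->ntuples; i++)
    {
        if (rel->dead[i])
        {
            stats->tuples_removed++;
            continue;
        }
        /* Compact live tuples towards the start of the array. */
        rel->values[kept] = rel->values[i];
        rel->dead[kept] = false;
        kept++;
    }
    rel->ntuples = kept;
    stats->tuples_remaining = kept;
}

/*
 * Generic driver: it never touches the layout itself. Returns false when
 * the PAM provides no vacuum callback (pamvacuum is optional).
 */
static bool
vacuum_rel(pamvacuum_function pamvacuum, ToyRelation *rel,
           PamVacuumStats *stats)
{
    assert(rel != NULL && stats != NULL);
    stats->tuples_removed = 0;
    stats->tuples_remaining = rel->ntuples;

    if (pamvacuum == NULL)
        return false;

    pamvacuum(rel, stats);
    return true;
}
```

If the layout inside toy_heap_vacuum() changes, vacuum_rel() stays untouched, which is the property the refactoring is after for vacuumlazy.c.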