Table partitioning
Background
Status Quo
Starting in PostgreSQL 10, we have declarative partitioning. With it, there is dedicated syntax to create range and list *partitioned* tables and their partitions. PostgreSQL 11 addressed various limitations that existed with the usage of partitioned tables in PostgreSQL, such as the inability to create indexes, row-level triggers, etc. on the partitioned parent table. PostgreSQL 11 also added hash partitioning.
It is still possible to use the older methods of partitioning if need to implement some custom partitioning criteria (other than the range and list methods that declarative partitioning natively supports), or if the limitations of declarative partitioned tables are seen as hindering. See PostgreSQL Partitioning for details. There are some 3rd party plugins that simplify the (manual) task/triggers, etc. see bottom of this page. Although declarative partitioning in PostgreSQL 10 reduces a lot of manual steps, such 3rd party plugins still offer features that the core system does not provide.
See the various blogs out there describing both the new declarative partitioning and the older inheritance-based implementation.
Resolved Issues
- SELECT, UPDATE, DELETE (in 8.2) : They can be handled with constraint_exclusion.
- TRUNCATE (in 8.4) : TRUNCATE for a parent table is expanded into child tables.
- ANALYZE (in 9.0) : ANALYZE to compute such stats for tables that have subclasses
- MAX()/MIN() (in 9.1) : Smarter partition detection.
- NO INHERIT constraints (in 9.2) make it possible to define a constraint only on the parent such that it will always be excluded; declarative partitioning always excludes the parent without any additional configuration
- Tuples inserted into the parent partitioned table are automatically routed to the leaf partitions (PostgreSQL 10)
- Executor-stage partition pruning or faster child table pruning or parallel partition processing added (PostgreSQL 11)
- Hash partitioning (PostgreSQL 11)
- UPDATEs that cause rows to move from one partition to another (PostgreSQL 11)
- Routing tuples to partitions that are foreign tables (PostgreSQL 11)
- Locally partitioned indexes on partitioned tables, which can be UNIQUE if they key includes partition key (PostgreSQL 11)
- FOREIGN KEY on partitioned tables referencing regular tables (PostgreSQL 11)
- A "catch-all" / "fallback" / "default" partition (PostgreSQL 11)
- Subset of Row triggers on the partitioned parent tables (PostgreSQL 11)
- Partitioned parent tables can now be referenced in foreign key relationships (PostgreSQL 12)
Limitations (of declarative partitioning as of PostgreSQL 12)
- No support for "splitting" or "merging" partitions using dedicated commands
- No support for automatic creation of partitions (e.g. for values not covered)
Overviews of Project Goals
List discussions
- (2005-05) Table Partitioning, Part 1
- (2007-03) Auto creation of Partitions
- (2007-04) Re: Auto Partitioning Patch - WIP version 1
- (2008-01) Dynamic Partitioning using Segment Visibility Maps
- (2008-01) Named vs Unnamed Partitions
- (2008-01) Storage Model for Partitioning
- (2008-01) Declarative partitioning grammar
- (2008-10) Auto-Partitioning patch discussion
- (2009-03) Partitioning feature
- (2009-05) Transparent table partitioning in future version of PG?
- (2009-07) Comments on automatic DML routing and explicit partitioning subcommands
- (2009-10) Patch for automated partitioning
- (2009-11) Syntax for partitioning
- (2009-11) Partitioning support for COPY
- (2010-01) Partitioning syntax
- (2010-07) Scalability of the planner with non trivial number of partitions
- (2011-07) New partitioning WAS: Check constraints on partition parents only?
- (2014-08) On partitioning
- (2015-02) Partitioning WIP patch
- (2015-08) Declarative partitioning
- (2016-08) Declarative partitioning - another take
Possible Directions
Oracle-Style
Allow users to declare their intention with partitioned tables. Ie, declare what the partition key is and what range or values are covered by each partition.
I think this would mean two new types of relation. One "meta-table" that acts like a view, in that it doesn't have an attached filenode. It would also have some kind of meta data about the partition key but no view definition, it would act like parent tables in nested table structure do now. The other would be "partition" which would be a separate namespace from tables and would have attached information about what values of the partition key it covered.
Pros:
- Makes it more reasonable to handle inserts automatically since the structure is explicit and doesn't require making logical deductions.
- More idiot-proof, ie you can't set up nonsensical combinations of constraints.
- Consistent with other databases and DBA expectations.
Cons:
- Less flexible, you can't set up arbitrary non-traditional structures such as having some data in the parent table or having extra columns in some children.
Background:
- Oracle CREATE TABLE syntax
- Partitioning in Oracle 10g
- Partition management in Oracle 10g
- Partition management in Oracle 11g including interval partitions
- MySQL partitioning
DB2-Style
DB2 uses modifier clauses in the CREATE TABLE statement for partitioning. It includes a native form of sharding in the same implementation
Clause in the CREATE TABLE statement | DB2 feature name |
---|---|
DISTRIBUTE BY HASH | DPF - Database Partitioning Feature |
ORGANIZE BY DIMENSION | MDC - Multidimensional Clustering |
PARTITION BY RANGE | TP - Table partitioning |
The clauses in any combination to achieve the desired effect. (cfr. https://www.ibm.com/developerworks/data/library/techarticle/dm-0608mcinerney/)
- DPF splits into "database partitions"(we would call them shards). "Each database partition has its own set of computing resources, including CPU and storage. In a DPF environment, each table row is distributed to a database partition according to the distribution key specified in the CREATE TABLE statement. When a query is processed, the request is divided so each database partition processes the rows that it is responsible for."
- MDC enables rows with similar values across multiple dimensions to be physically clustered together on disk. This clustering allows for efficient I/O for typical analytical queries. For example, all rows where Product='car', Region='East', and SaleMonthYear='Jan09' can be stored in the same storage location, known as a block.
- TP is what we know as "range partitioning" or "list partitioning", and is implemented in a very similar way as what Postgres currently has: "the user can manually define each data partition, including the range of values to include in that data partition." (and MDC automatically allocates storage for it). "Each TP partition is a separate database object (unlike other tables which are a single database object). Consequently, TP supports attaching and detaching a data partition from the TP table. A detached partitions becomes a regular table. As well, each data partition can be placed in its own table space, if desired."
The key point seems to be that all three features are orthogonal among them, and can be added at table creation time as well as later on. Moreover, sharding is made a first-class citizen and directly supported by the DB. ISTM that we could leverage an evolved version of postgres_fdw (plus some code borrowed from pg_shard and/or PL/Proxy) to this effect.
MQTs (materialized query tables) ---what we call materialized views--- are also subject to partitioning (apparently, also to sharding) directly.
Syntax Examples:
CREATE TABLE orders(id INT, shipdate DATE, …) PARTITION BY RANGE(shipdate) ( STARTING '1/1/2006' ENDING '12/31/2006' EVERY 3 MONTHS )
Auto-partitioning by interval is nice to have ...
CREATE TABLE orders(id INT, shipdate DATE, …) PARTITION BY RANGE(shipdate) ( PARTITION q4_05 STARTING MINVALUE, PARTITION q1_06 STARTING '1/1/2006', PARTITION q2_06 STARTING '4/1/2006', PARTITION q3_06 STARTING '7/1/2006', PARTITION q4_06 STARTING '10/1/2006' ENDING ‘12/31/2006' )
This is equivalent to "VALUES LESS THAN"(technically VALUES GREATER THAN) and includes a limit
The partition manipulation syntax (here, addition) is nice, too:
ALTER TABLE orders ATTACH PARTITION q1_07 STARTING '01/01/2007' ENDING '03/31/2007' FROM TABLE neworders
References:
- https://www.ibm.com/developerworks/data/library/techarticle/dm-0608mcinerney/
- http://www.ibm.com/developerworks/data/library/techarticle/dm-0605ahuja2/
MySQL-style
Fairly basic, supports RANGE, LIST and HASH
CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE) ENGINE=INNODB PARTITION BY HASH( MONTH(tr_date) ) PARTITIONS 6;
References:
Trigger-based
First attempts to support auto-partitioning have been made using triggers.
- avoid specific languages such as pgpsql that requires 'CREATE LANGUAGE'
- performance of C trigger 4 to 5 times faster than pgpsql
- insert/copy returns 0 rows when all rows have been routed by trigger from master to child tables
- chaining triggers allow tunable behavior in case of rows not matching any partition: add an error trigger, move to an overflow table, create new partitions dynamically
- constraint_exclusion does not work well with prepared statements. It might possible to convert CHECKs to One-Time Filter plan nodes if the condition is a variable.
Active Work In Progress
Syntax
Syntax is proposed at "Syntax for partitioning", second version. The syntax resembles Oracle and MySQL. See also Todo#Administration (Simplify ability to create partitioned tables).
-- create partitioned table and child partitions at once. CREATE TABLE parent (...) PARTITION BY [ RANGE | LIST ] ( key ) [ opclass ] [ ( PARTITION child { VALUES LESS THAN { ... | MAXVALUE } -- for RANGE | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST } [ WITH ( ... ) ] [ TABLESPACE tbs ] [, ...] ) ] ; -- add a partition key to a table. ALTER TABLE parent PARTITION BY [ RANGE | LIST ] ( key ) [ opclass ] [ (...) ] ; -- create a new partition on a partitioned table. CREATE PARTITION child ON parent VALUES ... ; -- add a table as a partition. ALTER TABLE parent ATTACH PARTITION child VALUES ... ; -- Remove a partition as a normal table. ALTER TABLE parent DETACH PARTITION child ;
Internal representation
On-disk structure is included in the "Syntax for partitioning" patch. On-memory structure will be proposed in a future patch
On-disk structure
A new system table "pg_partition" added. Partition keys are stored in it.
CREATE TABLE pg_catalog.pg_partition ( partrelid oid NOT NULL, -- partitioned table oid partopclass oid NOT NULL, -- operator class to compare keys partkind "char" NOT NULL, -- kind of partition: RANGE or LIST partkey text, -- partition key expression PRIMARY KEY (partrelid), FOREIGN KEY (partrelid) REFERENCES pg_class (oid), FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid) ) WITHOUT OIDS ;
A new column "inhvalues" are added into pg_inherits. Partition values for each partition are stored in it.
ALTER TABLE pg_class.pg_inherits ADD COLUMN inhvalues anyarray ;
- RANGE partition has an upper value of the range in inhvalues.
- LIST partition has an array with multiple elements in inhvalues.
- An overflow partition has an empty array in inhvalues.
- A normal inherited table has a NULL in inhvalues.
On-memory structure
A cached list of partitions are sorted by partition values and stored in the relcache for the parent table. Changes to the partitions would need to invalidate parent caches to ensure the cache is accurately maintained.
Operations
INSERT
INSERT TRIGGERs will be replaced with specialized tuple-routing feature using on-memory structure. Tuples will be routed in O(log N). It also solve "0 row affected" problem in INSERT TRIGGERs.
SELECT, UPDATE, DELETE
CHECK constraints continue to be used for a while.
It would be improved using on-memory structure; instead of CHECK constraints for each child tables, we can use a sorted list in the parent table. Constraint exclusion can be in O(log N) order instead of O(N) of now.
VACUUM, CLUSTER, REINDEX
We don't expand those commands for now, but they might have to be expanded like as TRUNCATE.
Future improvements
They are hard to fix in 9.0, but should continue to be improved in the future releases.
Syntax
- Support SPLIT and MERGE for existing partitions. See also Kedar's patch
- Support UPDATE of partition keys and values.
- Support adding a partition between existing partitions. It requires SPLIT feature.
- Support sub-partitions.
- Support some partition kinds for GIS types. For example, "PARTITION BY GIST" holds partition keys as a GiST tree in on-memory structure.
- Support HASH partitions. Each partition could be a FOREIGN TABLE in SQL/MED. In other words, it is PL/Proxy integration.
- Support CREATE TABLE AS -- CREATE TABLE tbl PARTITION BY ... AS SELECT ...;
Executor
- SELECT FOR SHARE/UPDATE for parent tables.
- Prepared statements that uses partition keys in place holders.
- An idea is to convert check constraints into One-Time_Filter [1]
- Unique constraint over multiple partitions, when each partition has a unique index on set/superset of partition keys
- Unique constraints over multiple partitions in the general case (typically called as "global index").
Planner
- Optimization for min/max, LIMIT + ORDER BY, GROUP BY on partition keys.
- Optimization when constraint exclusion are used with stable or volatile functions. It is a very common case that the partition key is timestamp and compared with now().
- Join optimization for two partitioned tables.
Third-Party Tools
PG Partition Manager
- Project Home Page
- This is an extension that automates time & serial based partitioning (basically does interval partitioning setting up the right triggers for you).
- Handles setting up, partitioning existing data, dropping unneeded child tables, & undoing partitioning.