Table partitioning

From PostgreSQL wiki

Jump to: navigation, search

Contents

Background

Status Quo

Currently we allow the user to (manually) create arbitrary nested tables with arbitrary constraints and then the planner tries to detect at run-time which child tables are candidates for the query. See PostgreSQL Partitioning for details. There are some 3rd party plugins that simplify the (manual) task/triggers, etc. see bottom of this page.

Today, at create time you create a master table, children that inherit from it (and how they are partitioned), separate indices for each child table, and create an insert trigger so that new rows are inserted to the appropriate (child) table (and/or more aggressive measures, such as allowing updates to the partitioned key [by default, updates to rows' partitioned key leave them in the same partition, possibly in error], dynamically allocating new child tables [be careful with race conditions], etc. see the various blogs out there).

Resolved Issues

Limitations

  • We never exclude the parent table
  • We don't handle INSERTs on the parent table
  • It requires a lot of manual work to set it up

Overviews of Project Goals

List discussions

Possible Directions

Oracle-Style

Allow users to declare their intention with partitioned tables. Ie, declare what the partition key is and what range or values are covered by each partition.

I think this would mean two new types of relation. One "meta-table" that acts like a view, in that it doesn't have an attached filenode. It would also have some kind of meta data about the partition key but no view definition, it would act like parent tables in nested table structure do now. The other would be "partition" which would be a separate namespace from tables and would have attached information about what values of the partition key it covered.

Pros:

  • Makes it more reasonable to handle inserts automatically since the structure is explicit and doesn't require making logical deductions.
  • More idiot-proof, ie you can't set up nonsensical combinations of constraints.
  • Consistent with other databases and DBA expectations.

Cons:

  • Less flexible, you can't set up arbitrary non-traditional structures such as having some data in the parent table or having extra columns in some children.

Background:


DB2-Style

DB2 uses modifier clauses in the CREATE TABLE statement for partitioning. It includes a native form of sharding in the same implementation

Clause in the CREATE TABLE statement DB2 feature name
DISTRIBUTE BY HASH DPF - Database Partitioning Feature
ORGANIZE BY DIMENSION MDC - Multidimensional Clustering
PARTITION BY RANGE TP - Table partitioning

The clauses in any combination to achieve the desired effect. (cfr. https://www.ibm.com/developerworks/data/library/techarticle/dm-0608mcinerney/)

- DPF splits into "database partitions"(we would call them shards). "Each database partition has its own set of computing resources, including CPU and storage. In a DPF environment, each table row is distributed to a database partition according to the distribution key specified in the CREATE TABLE statement. When a query is processed, the request is divided so each database partition processes the rows that it is responsible for."

- MDC enables rows with similar values across multiple dimensions to be physically clustered together on disk. This clustering allows for efficient I/O for typical analytical queries. For example, all rows where Product='car', Region='East', and SaleMonthYear='Jan09' can be stored in the same storage location, known as a block.

- TP is what we know as "range partitioning" or "list partitioning", and is implemented in a very similar way as what Postgres currently has: "the user can manually define each data partition, including the range of values to include in that data partition." (and MDC automatically allocates storage for it). "Each TP partition is a separate database object (unlike other tables which are a single database object). Consequently, TP supports attaching and detaching a data partition from the TP table. A detached partitions becomes a regular table. As well, each data partition can be placed in its own table space, if desired."

The key point seems to be that all three features are orthogonal among them, and can be added at table creation time as well as later on. Moreover, sharding is made a first-class citizen and directly supported by the DB. ISTM that we could leverage an evolved version of postgres_fdw (plus some come borrowed from pg_shard and/or PL/Proxy) to this effect.


MQTs (materialized query tables) ---what we call materialized views--- are also subject to partitioning (apparently, also to sharding) directly.

Syntax Examples:

CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
(
   STARTING '1/1/2006' ENDING '12/31/2006' 
   EVERY 3 MONTHS
)

Auto-partitioning by interval is nice to have ...


CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
(
   PARTITION q4_05 STARTING MINVALUE,
   PARTITION q1_06 STARTING '1/1/2006',
   PARTITION q2_06 STARTING '4/1/2006',
   PARTITION q3_06 STARTING '7/1/2006',
   PARTITION q4_06 STARTING '10/1/2006' 
   ENDING ‘12/31/2006'
)

This is equivalent to "VALUES LESS THAN"(technically VALUES GREATER THAN) and includes a limit

The partition manipulation syntax (here, addition) is nice, too:

ALTER TABLE orders
   ATTACH PARTITION q1_07
   STARTING '01/01/2007'
   ENDING   '03/31/2007'
   FROM TABLE neworders


References:

MySQL-style

Fairly basic, supports RANGE, LIST and HASH

CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)
   ENGINE=INNODB
   PARTITION BY HASH( MONTH(tr_date) )
   PARTITIONS 6;

References:

Trigger-based

First attempts to support auto-partitioning have been made using triggers.

  • avoid specific languages such as pgpsql that requires 'CREATE LANGUAGE'
  • performance of C trigger 4 to 5 times faster than pgpsql
  • insert/copy returns 0 rows when all rows have been routed by trigger from master to child tables
  • chaining triggers allow tunable behavior in case of rows not matching any partition: add an error trigger, move to an overflow table, create new partitions dynamically
  • constraint_exclusion does not work well with prepared statements. It might possible to convert CHECKs to One-Time Filter plan nodes if the condition is a variable.

Active Work In Progress

Syntax

Syntax is proposed at "Syntax for partitioning", second version. The syntax resembles Oracle and MySQL. See also Todo#Administration (Simplify ability to create partitioned tables).

-- create partitioned table and child partitions at once.
CREATE TABLE parent (...)
PARTITION BY [ RANGE | LIST ] ( key ) [ opclass ]
[ (
     PARTITION child
       {
           VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
         | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
       }
       [ WITH ( ... ) ] [ TABLESPACE tbs ]
     [, ...]
  ) ] ;

-- add a partition key to a table.
ALTER TABLE parent PARTITION BY  [ RANGE | LIST ] ( key ) [ opclass ] [ (...) ] ;

-- create a new partition on a partitioned table.
CREATE PARTITION child ON parent VALUES ... ;

-- add a table as a partition.
ALTER TABLE parent ATTACH PARTITION child VALUES ... ;

-- Remove a partition as a normal table.
ALTER TABLE parent DETACH PARTITION child ;

Internal representation

On-disk structure is included in the "Syntax for partitioning" patch. On-memory structure will be proposed in a future patch

On-disk structure

A new system table "pg_partition" added. Partition keys are stored in it.

CREATE TABLE pg_catalog.pg_partition
(
   partrelid   oid    NOT NULL, -- partitioned table oid
   partopclass oid    NOT NULL, -- operator class to compare keys
   partkind    "char" NOT NULL, -- kind of partition: RANGE or LIST
   partkey     text,            -- partition key expression

   PRIMARY KEY (partrelid),
   FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
   FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
)
WITHOUT OIDS ;

A new column "inhvalues" are added into pg_inherits. Partition values for each partition are stored in it.

ALTER TABLE pg_class.pg_inherits ADD COLUMN inhvalues anyarray ;
  • RANGE partition has an upper value of the range in inhvalues.
  • LIST partition has an array with multiple elements in inhvalues.
  • An overflow partition has an empty array in inhvalues.
  • A normal inherited table has a NULL in inhvalues.

On-memory structure

A cached list of partitions are sorted by partition values and stored in the relcache for the parent table. Changes to the partitions would need to invalidate parent caches to ensure the cache is accurately maintained.

Operations

INSERT

INSERT TRIGGERs will be replaced with specialized tuple-routing feature using on-memory structure. Tuples will be routed in O(log N). It also solve "0 row affected" problem in INSERT TRIGGERs.

SELECT, UPDATE, DELETE

CHECK constraints continue to be used for a while.

It would be improved using on-memory structure; instead of CHECK constraints for each child tables, we can use a sorted list in the parent table. Constraint exclusion can be in O(log N) order instead of O(N) of now.

VACUUM, CLUSTER, REINDEX

We don't expand those commands for now, but they might have to be expanded like as TRUNCATE.

Future improvements

They are hard to fix in 9.0, but should continue to be improved in the future releases.

Syntax

  • Support SPLIT and MERGE for existing partitions. See also Kedar's patch
  • Support UPDATE of partition keys and values.
  • Support adding a partition between existing partitions. It requires SPLIT feature.
  • Support sub-partitions.
  • Support some partition kinds for GIS types. For example, "PARTITION BY GIST" holds partition keys as a GiST tree in on-memory structure.
  • Support HASH partitions. Each partition could be a FOREIGN TABLE in SQL/MED. In other words, it is PL/Proxy integration.
  • Support CREATE TABLE AS -- CREATE TABLE tbl PARTITION BY ... AS SELECT ...;

Executor

  • SELECT FOR SHARE/UPDATE for parent tables.
  • Prepared statements that uses partition keys in place holders.
    • An idea is to convert check constraints into One-Time_Filter [1]
  • Unique constraint over multiple partitions, when each partition has a unique index on set/superset of partition keys
  • Unique constraints over multiple partitions in the general case (typically called as "global index").

Planner

  • Optimization for min/max, LIMIT + ORDER BY, GROUP BY on partition keys.
  • Optimization when constraint exclusion are used with stable or volatile functions. It is a very common case that the partition key is timestamp and compared with now().
  • Join optimization for two partitioned tables.

Third-Party Tools

PG Partition Manager

  • Project Home Page
  • This is an extension that automates time & serial based partitioning (basically does interval partitioning setting up the right triggers for you).
  • Handles setting up, partitioning existing data, dropping unneeded child tables, & undoing partitioning.
Personal tools