Table partitioning

From PostgreSQL wiki

Revision as of 18:56, 21 June 2011 by Dfetter (Talk | contribs)

Jump to: navigation, search

Contents

Background

Status Quo

Currently we allow the user to create arbitrary nested tables with arbitrary constraints and then the planner tries to detect at run-time which child tables are candidates for the query. See PostgreSQL Partitioning for details.

Resolved Issues

Limitations

  • We never exclude the parent table
  • We don't handle INSERTs on the parent table
  • It requires a lot of manual work to set it up

Overviews of Project Goals

List discussions

Possible Directions

Oracle-Style

Allow users to declare their intention with partitioned tables. Ie, declare what the partition key is and what range or values are covered by each partition.

I think this would mean two new types of relation. One "meta-table" that acts like a view, in that it doesn't have an attached filenode. It would also have some kind of meta data about the partition key but no view definition, it would act like parent tables in nested table structure do now. The other would be "partition" which would be a separate namespace from tables and would have attached information about what values of the partition key it covered.

Pros:

  • Makes it more reasonable to handle inserts automatically since the structure is explicit and doesn't require making logical deductions.
  • More idiot-proof, ie you can't set up nonsensical combinations of constraints.
  • Consistent with other databases and DBA expectations.

Cons:

  • Less flexible, you can't set up arbitrary non-traditional structures such as having some data in the parent table or having extra columns in some children.

Background:

Trigger-based

First attempts to support auto-partitioning have been made using triggers.

  • avoid specific languages such as pgpsql that requires 'CREATE LANGUAGE'
  • performance of C trigger 4 to 5 times faster than pgpsql
  • insert/copy returns 0 rows when all rows have been routed by trigger from master to child tables
  • chaining triggers allow tunable behavior in case of rows not matching any partition: add an error trigger, move to an overflow table, create new partitions dynamically
  • constraint_exclusion does not work well with prepared statements. It might possible to convert CHECKs to One-Time Filter plan nodes if the condition is a variable.

Active Work In Progress

Syntax

Syntax is proposed at "Syntax for partitioning", second version. The syntax resembles Oracle and MySQL. See also Todo#Administration (Simplify ability to create partitioned tables).

-- create partitioned table and child partitions at once.
CREATE TABLE parent (...)
PARTITION BY [ RANGE | LIST ] ( key ) [ opclass ]
[ (
     PARTITION child
       {
           VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
         | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
       }
       [ WITH ( ... ) ] [ TABLESPACE tbs ]
     [, ...]
  ) ] ;

-- add a partition key to a table.
ALTER TABLE parent PARTITION BY  [ RANGE | LIST ] ( key ) [ opclass ] [ (...) ] ;

-- create a new partition on a partitioned table.
CREATE PARTITION child ON parent VALUES ... ;

-- add a table as a partition.
ALTER TABLE parent ATTACH PARTITION child VALUES ... ;

-- Remove a partition as a normal table.
ALTER TABLE parent DETACH PARTITION child ;

Internal representation

On-disk structure is included in the "Syntax for partitioning" patch. On-memory structure will be proposed in a future patch

On-disk structure

A new system table "pg_partition" added. Partition keys are stored in it.

CREATE TABLE pg_catalog.pg_partition
(
   partrelid   oid    NOT NULL, -- partitioned table oid
   partopclass oid    NOT NULL, -- operator class to compare keys
   partkind    "char" NOT NULL, -- kind of partition: RANGE or LIST
   partkey     text,            -- partition key expression

   PRIMARY KEY (partrelid),
   FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
   FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
)
WITHOUT OIDS ;

A new column "inhvalues" are added into pg_inherits. Partition values for each partition are stored in it.

ALTER TABLE pg_class.pg_inherits ADD COLUMN inhvalues anyarray ;
  • RANGE partition has an upper value of the range in partkey.
  • LIST partition has an array with multiple elements in partkey.
  • An overflow partition has an empty array in partkey.
  • A normal inherited table has a NULL in partkey.

On-memory structure

A cached list of partitions are sorted by partition values and stored in the relcache for the parent table. Changes to the partitions would need to invalidate parent caches to ensure the cache is accurately maintained.

Operations

INSERT

INSERT TRIGGERs will be replaced with specialized tuple-routing feature using on-memory structure. Tuples will be routed in O(log N). It also solve "0 row affected" problem in INSERT TRIGGERs.

SELECT, UPDATE, DELETE

CHECK constraints continue to be used for a while.

It would be improved using on-memory structure; instead of CHECK constraints for each child tables, we can use a sorted list in the parent table. Constraint exclusion can be in O(log N) order instead of O(N) of now.

VACUUM, CLUSTER, REINDEX

We don't expand those commands for now, but they might have to be expanded like as TRUNCATE.

Future improvements

They are hard to fix in 9.0, but should continue to be improved in the future releases.

Syntax

  • Support SPLIT and MERGE for existing partitions. See also Kedar's patch
  • Support UPDATE of partition keys and values.
  • Support adding a partition between existing partitions. It requires SPLIT feature.
  • Support sub-partitions.
  • Support some partition kinds for GIS types. For example, "PARTITION BY GIST" holds partition keys as a GiST tree in on-memory structure.
  • Support HASH partitions. Each partition could be a FOREIGN TABLE in SQL/MED. In other words, it is PL/Proxy integration.
  • Support CREATE TABLE AS -- CREATE TABLE tbl PARTITION BY ... AS SELECT ...;

Executor

  • SELECT FOR SHARE/UPDATE for parent tables.
  • Prepared statements that uses partition keys in place holders.
    • An idea is to convert check constraints into One-Time_Filter [1]
  • Unique constraint over multiple partitions, when each partition has a unique index on set/superset of partition keys
  • Unique constraints over multiple partitions in the general case (typically called as "global index").

Planner

  • Optimization for min/max, LIMIT + ORDER BY, GROUP BY on partition keys.
  • Optimization when constraint exclusion are used with stable or volatile functions. It is a very common case that the partition key is timestamp and compared with now().
  • Join optimization for two partitioned tables.
Personal tools