Built-in Sharding

Introduction

There are over a dozen forks of Postgres which implement sharding. While many of these forks have been successful, they often lag behind the community release of Postgres. By implementing sharding in community Postgres, this feature will be available to all users in current releases of Postgres. This should greatly increase the adoption of community Postgres in environments that need high write scaling or have very large databases.

One great challenge to implementing sharding in Postgres is achieving this goal with minimal code changes. Most of the sharding forks of Postgres require a volume of changes to the community code that would be unacceptable to the general Postgres community, many of whom don't need sharding. With the advent of Foreign Data Wrappers (FDW), it is now possible to consider a built-in sharding implementation which could be accomplished with an acceptable level of code changes.

The basic design of this possible FDW-based sharding solution is based on the work done by Postgres-XC, which was developed by NTT for almost ten years. Postgres-XL is a more flexible implementation of this design. Citus incorporates ideas from both projects and provides sharding without forking from Postgres.

Enhance Existing Features

Implementing built-in sharding with minimal code changes requires improving several existing Postgres features:

  • Done? Improve the FDW infrastructure and postgres_fdw. Specifically, good performance requires that all reasonable operations be pushed to the foreign shards. In Postgres 9.6, joins, sorts, updates, and deletes are pushed to foreign servers. Aggregate pushdown will be supported in Postgres 10. Foreign tables can already participate in inheritance.
  • Done? Postgres 10 will include an overhaul of partitioning for single-node use to improve performance and enable more optimizations, e.g. executor-based partition pruning. This will be used for sharding as well; a sketch combining the new partitioning syntax with postgres_fdw follows this list.
  • Add parallelism so FDW requests to different shards can be issued in parallel, allowing parallel shard execution. This could perhaps be done by making multiple asynchronous libpq connections.
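
The pieces already available today give a feel for the intended design: a locally partitioned table whose partitions are postgres_fdw foreign tables living on separate shard servers. The sketch below is a minimal, hypothetical example (the server, table, and column names are invented) using features present since Postgres 10; it is not the proposed built-in syntax.

    -- Hypothetical two-node setup; shard1 and the table names are invented.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.example.com', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');

    -- Parent table partitioned by the sharding key.
    CREATE TABLE orders (
        order_id    bigint NOT NULL,
        customer_id bigint NOT NULL,
        total       numeric
    ) PARTITION BY RANGE (customer_id);

    -- One partition per shard; this one is a foreign table on shard1.
    CREATE FOREIGN TABLE orders_shard1
        PARTITION OF orders FOR VALUES FROM (1) TO (1000000)
        SERVER shard1 OPTIONS (table_name 'orders');

    -- Partition pruning picks the right shard, and the WHERE clause is
    -- pushed down; EXPLAIN VERBOSE shows the generated "Remote SQL".
    EXPLAIN (VERBOSE, COSTS OFF)
        SELECT * FROM orders WHERE customer_id = 42;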

Improving these features has value independent of sharding.

New Subsystems

Additional subsystems will need to be developed:

  • Allow lookup tables to be replicated to all shards to enable more join pushdown. The copying itself can be accomplished with logical replication (done), but the optimizer must know the tables have been replicated to implement pushdown; a sketch of the replication step follows this list.
  • Write a submodule that uses the new partitioning system tables to create and issue FDW queries matching the submitted query.
  • Write a submodule to collect the results of those FDW queries and return them to the user.
  • Write a global transaction manager that lets shards commit transactions atomically and efficiently. This could perhaps be built on prepared transactions, with some component, e.g. an XA-style transaction manager, to clean up prepared transactions left open after a crash; the underlying two-phase commit primitive is sketched after this list.
  • Write a global snapshot manager which allows shards to see a consistent snapshot. (Will SERIALIZABLE transaction mode avoid cross-node snapshot conflicts? Does pg_export_snapshot() or hot_standby_feedback help?) This is also needed for consistent multi-shard backups.
  • Write a user API to create, manage, and report on shards.
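
As a concrete illustration of the lookup-table item above, logical replication (available since Postgres 10) can keep a copy of a small reference table on every shard. The names below are hypothetical, and the table must already exist with the same definition on each subscribing shard; what is still missing is planner knowledge that the copies are identical.

    -- On the coordinator (publisher): publish the lookup table.
    CREATE PUBLICATION lookup_pub FOR TABLE country_codes;

    -- On each shard (subscriber): subscribing copies the existing rows
    -- and then streams further changes.
    CREATE SUBSCRIPTION lookup_sub
        CONNECTION 'host=coordinator.example.com dbname=app'
        PUBLICATION lookup_pub;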

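The transaction-manager item above builds on an existing primitive: two-phase commit via prepared transactions. The fragment below is a hand-driven sketch of what a coordinator would do per shard (the transaction identifier and table are invented); a real global transaction manager would generate the identifiers, drive both phases, and resolve in-doubt transactions after a crash.

    -- Run on each shard by the coordinator; the shards must have
    -- max_prepared_transactions set to a non-zero value.
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    PREPARE TRANSACTION 'gtx_1';

    -- Once every shard has prepared successfully, commit everywhere:
    COMMIT PREPARED 'gtx_1';

    -- If any shard failed to prepare, roll back the rest instead:
    -- ROLLBACK PREPARED 'gtx_1';
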
Use Cases

There are four possible use cases with increasing requirements:

  1. Cross-node read-only queries on read-only shards using aggregate queries, e.g. data warehouse:

    This is the simplest to implement, as it requires neither a global transaction manager nor a global snapshot manager, and the number of rows returned from the shards is minimal because of the aggregates.

  2. Cross-node read-only queries on read-only shards using non-aggregate queries:

    This will stress the coordinator, which must collect and process many returned rows, and will show how well the FDW transfer mechanism scales; a small tuning sketch follows this list.

  3. Cross-node read-only queries on read/write shards:

    This will require a global snapshot manager to make sure the shards return consistent data; the single-node snapshot primitive that might be extended for this is sketched after this list.

  4. Cross-node read-write queries:

    This will require a global snapshot manager and global transaction manager.
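
For the row-transfer concern in use case 2, one existing knob is the postgres_fdw fetch_size option, which controls how many rows are fetched from a shard per round trip. The server and table names below continue the hypothetical setup sketched earlier:

    -- Raise the batch size for an entire shard server ...
    ALTER SERVER shard1 OPTIONS (ADD fetch_size '10000');

    -- ... or for a single foreign table.
    ALTER FOREIGN TABLE orders_shard1 OPTIONS (ADD fetch_size '50000');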

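Use cases 3 and 4 hinge on the global snapshot manager. Today, Postgres can only synchronize snapshots between sessions on a single node, via pg_export_snapshot() and SET TRANSACTION SNAPSHOT (the mechanism parallel pg_dump uses). The sketch below shows that primitive; the open question is how to provide an equivalent guarantee across shards.

    -- Session A: export the snapshot of an open transaction.
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();  -- returns an id such as '00000003-0000001B-1'

    -- Session B (same node): adopt that snapshot so both sessions see
    -- exactly the same data.
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
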
Previous Documentation

Slides with diagrams of an early proposal are available. A meeting about clustering was held in 2015.