Scaleout Design

From PostgreSQL wiki
Jump to: navigation, search

Note: this is just a memo for discussion.



  • OLTP
  • analytics (with MPP parallelism)
  • Mixed

Scalability limit

  • Number of concurrent transactions: ??
  • Amount of data: Petabytes
  • Maximum number of nodes: ?? (Oracle RAC= 100, MySQL Cluster = 48, CockroachDB = thousands(goal)/10s(in practice))

Single node performance

The performance of transactions which only access one node in the cluster is almost the same as that of transactions which access a non-clustered node.

Cluster architecture

Distributed architecture (big crossroads #1)

  • Shared nothing
  • Shared disk
  • New architecture based on separation of compute and storage (e.g. Amazon Aurora)


Shared disk architecture

File:PostgreSQL shared disk scaleout EN.pdf

Server roles

Minimize single point of failure or contention. This doesn't mean to exclude a central management node.

  • Multiple data nodes: Store data, accept client connections and execute database requests.
  • Single management node: manages node information, system catalog, sequences.

Online cluster resizing without stopping applications

  • Add/delete nodes online
  • Rolling hardware/software maintenance: stop, patch, and restart nodes one by one

Cluster membership

  • Centrally control which nodes can join the cluster for security
  • Do we incorporate some existing membership management software like JGroups?

System catalog management

  • Applications can run DDL on any node. One central node stores the catalog, other nodes fetch necessary data from there and cache it in its memory.
  • Applications can run DDL on any node. DDL statements are executed on all nodes, synchronizing data with 2PC. (Postgres-XL)
  • Applications can only run DDL on one central node. The central node pushes those DDLs to other nodes, tracking which DDLs each node has executed. (Oracle Sharding)

If we replicate catalog on an SQL basis, Oid values for the same schema object differ on each node. Could it be inconvenient?

Connection pooling for inter-node communication

FDW or non-FDW (big crossroads #2)

Current limitations and potential barriers of FDW are:

  • Lacks support for INSERT statements with an ON CONFLICT DO UPDATE clause
  • User and privilege management
  • Sequences
  • Session variables
  • Oid and ctid: psqlODBC's cursor update/delete uses either Oidor primary key. If they are not available, it resorts to ctid.
  • Scrollable cursor across multiple nodes is OK?
  • libpq overhead: conversion of data between text and binary formats
  • Only IMMUTABLE built-in data types, operators, and functions are executed on remote servers
  • Transactions on remote servers use REPEATABLE READ even when the local transaction uses READ COMMITTED

Data sharding and placement

Unit of distribution

  • Partition and subpartition defined by the user
  • Automatic range partitioning into smaller units isn't necessary like Bigtable's tablets and HBase's regions?

Automatic and manual distribution

  • Automatic: Postgres determines which records to place on which nodes
  • Manual: The user controls which records to place on which nodes

Distribution method

  • Distribute by consistent hash
  • Distribute by range
  • Distribute by list: geographic distribution
  • Composite distribution: combine list/range and hash, e.g. list-partition by country to distribute data to nearby regions, and then hash-subpartition data across multiple nodes in each region
  • Distribute by replication: all nodes hold the same copy of a table, e.g. master data like the product catalog

Update of replicated tables

  • Log shipping replication
  • Statement-based replication (Postgres-XL)
  • Quorum-based replication (MySQL Group Replication, MySQL/MariaDB Galera Cluster)
  • Materialized view-based replication (Oracle Sharding: the default lag is 60 seconds, which is configurable)

Data redistribution

Shuffle data among existing and newly added nodes when distributing by consistent hash.

Move partitions to a more powerful node when distributing by list or range.

Online data relocation with minimal impact on application performance.

Distributed transaction


Two-phase commit among PostgreSQL servers is being developed, which is described here.

Consistency and isolation

Decentralized algorithm based on Clock-SI is being developed here.

Deadlock detection and resolution

We want to avoid a central node with heavy duty for lock management.

  • Continuously monitor all nodes for transactions stuck in deadlock (Citus: described here)
  • Communicate only with relevant nodes upon deadlock_timeout to build a distributed wait-for graph, and check for cycles in the wait-for graph


autovacuum processes on idle node need to be sensitive to updates on other nodes to prevent XID wraparound.

High availability


  • Maximum number of replicas (MySQL Cluster = 2, Amazon Aurora = 6)
  • Unit: whole database cluster ($PGDATA + tablespaces)
  • Read-only replica is enough? Do we want read-write replica like MySQL Group Replication and MySQL/MariaDB Galera Cluster?

Failure detection and failover

Split brain

Backup and recovery based on WAL

  • Unit: the entire cluster is enough?
  • Backup storage: assume shared storage to accumulate backups from all nodes (NFS/CIFS, object storage, cloud storage)
  • How do we name (label) each backup of the entire cluster?
  • Parallel execution across nodes
  • Recovery to preserve consistency among nodes: PITR needs to prevent a half-baked distributed transaction, which is replayed up to its commit on one node and not on other nodes

Disaster recovery

  • Cluster-to-cluster asynchronous replication topology
    • One-to-one: a single node on one cluster collects all updates and send them to one node on another cluster (MySQL Cluster)
    • Many-to-many: each node on one cluster sends its updates to a corresponding node on another cluster
  • Consistency among nodes, as in the case of PITR within the local data center

Client connection

Workload routing

Route connections to the nodes that have necessary data, based on some data specified during a connection request. Oracle Sharding allows applications to specify sharding key values upon connection.

Route read-only connections to the read replicas.

To eliminate overhead, directly connect to the database without passing through intermediate router software after establishing a connection.

Connection failover

Avoid attempts to connect to failed nodes.

Connection load balancing

Balance the number of connections among nodes which have the same data.


Cluster-wide startup and shutdown with a single command


  • Inter-node communication, e.g. count and amount of send/recv per session/database
  • Locks with node info
  • Logging of detected distributed deadlocks
  • Cluster-wide statistics views (pg_gstat_*) like Oracle's GV$ views

Unit of maintenance operation

Both per-partition and the whole table/index across all partitions on multiple nodes (REINDEX, ALTER TABLE, VACUUM, ANALYZE, TRUNCATE, CLUSTER)

Centralized log management

Do we leave it up to external log management software?