BDR User Guide

From PostgreSQL wiki

Revision as of 08:21, 8 March 2013 by Simon


BDR stands for Bi-Directional Replication.

Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; the implementation lead is now Andres Freund, both from 2ndQuadrant. The wider 2ndQuadrant team has made various additional development contributions, with reviews and input from other community developers.

At the PgCon2012CanadaInCoreReplicationMeeting an initial version of the design was presented. A presentation containing the reasons leading to the current design and a prototype of it, including preliminary performance results, is available here.


Project Overview and Plans

Project Aims

  • in core
  • fast
  • reusable individual parts (see below), usable by other projects (slony, ...)
  • basis for easier sharding/write scalability
  • wide geographic distribution of replicated nodes

High Level Planning


Fundamental changes have been made as part of 9.3 to support BDR, with a total of 16 separate commits on these and other smaller aspects:

  • background workers
  • xlogreader implementation
  • pg_xlogdump

A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.

The exact mechanism for dissemination is not yet announced; the key objective is to develop the code as core/contrib modules. There is no long-term plan for code to exist outside of core.


Objective to implement main BDR features into core Postgres.


Additional features based upon feedback

Aspects of BDR

Bi-Directional Replication consists of a number of related features

  • Logical Log Streaming Replication - getting data from one master to another.
  • Global Sequences - ability to support sequences that work globally across a set of nodes
  • Conflict Detection & Resolution (options)
  • DDL Replication via Event Triggers

Taken together these features will allow replication in both directions for any pair of servers. We could call this "multi-master replication", but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.

Note that these features aren't "clustering" in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.

User Guide

Logical Log Streaming Replication

Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to physical log streaming replication from a user perspective - the main difference is that the receiving server is also a full master database that can make its own changes. The master that sends data is known as the upstream master and the master that receives data is known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.

The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, which is why we call this "logical log replication" and which makes change capture on the upstream master much more efficient than trigger based replication. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.

One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.
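
The connection arithmetic above can be sketched in a few lines. This is only an illustration; the function name `connections_needed` is hypothetical and not part of BDR. The bi-directional case assumes one connection per database in each direction.

```python
def connections_needed(databases: int, bidirectional: bool = False) -> int:
    """Return the number of LLSR connections between two servers.

    One connection is needed per replicated database, per direction.
    """
    if databases < 0:
        raise ValueError("databases must be non-negative")
    directions = 2 if bidirectional else 1
    return databases * directions


# Two servers with 50 replicated databases each.
print(connections_needed(50))                      # one direction only
print(connections_needed(50, bidirectional=True))  # both directions
```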

Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.

Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:

  • pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)
  • textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)
  • BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.

Replication of DML changes

All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead of physical WAL, but it adds message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.

(TRUNCATE is not yet implemented.)

LOCK statements are not replicated (possible future feature).

Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.

DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.

UPDATEs that change the Primary Key of a table will be replicated correctly.
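
To see why UPDATEs and DELETEs need a unique key while INSERTs do not, consider the toy model below. It is not BDR's apply code: the change-record layout (`op`, `old_key`, `new`) and the function name `apply_change` are assumptions made for illustration only.

```python
def apply_change(rows, change, key="id"):
    """Apply one replicated row change to `rows` (a list of dicts), in place."""
    op = change["op"]
    if op == "INSERT":
        # An INSERT simply adds the new row; no existing row must be found.
        rows.append(dict(change["new"]))
        return
    if op not in ("UPDATE", "DELETE"):
        raise ValueError(f"unsupported operation: {op}")

    # UPDATE and DELETE must identify exactly one existing row by its key.
    matches = [i for i, row in enumerate(rows) if row[key] == change["old_key"]]
    if len(matches) != 1:
        raise LookupError(
            f"{op}: expected one row with {key}={change['old_key']!r}, "
            f"found {len(matches)}"
        )
    index = matches[0]
    if op == "UPDATE":
        # The new row image may carry a different key value.
        rows[index] = dict(change["new"])
    else:
        del rows[index]


rows = []
for change in [
    {"op": "INSERT", "new": {"id": 1, "name": "a"}},
    {"op": "INSERT", "new": {"id": 2, "name": "b"}},
    {"op": "UPDATE", "old_key": 1, "new": {"id": 10, "name": "a"}},  # key changes
    {"op": "DELETE", "old_key": 2},
]:
    apply_change(rows, change)
print(rows)
```

Without a unique key the lookup step could match zero or several rows, which is why the sketch raises an error in that case.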

All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.

All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.

Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. "identical servers", as with physical replication.

A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.

Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.

Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).
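
The buffering and ordering rules described above (accumulate per transaction, send at commit, never send aborted work) can be modelled with a small sketch. The class `ChangeBuffer` and its methods are hypothetical names for illustration, not BDR internals, and the model ignores spilling to disk.

```python
from collections import defaultdict


class ChangeBuffer:
    """Toy model of per-transaction change accumulation."""

    def __init__(self):
        self._pending = defaultdict(list)  # xid -> row changes seen so far
        self.sent = []                     # (xid, changes) in commit order

    def record(self, xid, change):
        """Remember a change made by an in-progress transaction."""
        self._pending[xid].append(change)

    def commit(self, xid):
        """Send a transaction's changes downstream, all at commit time."""
        self.sent.append((xid, self._pending.pop(xid, [])))

    def abort(self, xid):
        """Discard the changes of an aborted transaction."""
        self._pending.pop(xid, None)


buf = ChangeBuffer()
buf.record(1, "INSERT a")
buf.record(2, "INSERT b")
buf.record(3, "UPDATE c")
buf.record(1, "DELETE a")
buf.commit(2)  # xid 2 commits first, so it is sent first
buf.abort(3)   # xid 3 never reaches the downstream master
buf.commit(1)
print(buf.sent)
```

Note that transaction 1 started first but is sent second, because ordering follows commit sequence rather than start order.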

Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.

SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.

In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.

Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.

Table definitions and DDL replication

DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.
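
The name-based mapping can be pictured as two lookups: upstream OID to qualified name, then qualified name to the downstream identifier. The sketch below uses plain dictionaries as stand-ins for the catalogs; the OID values and the function name `map_relation` are invented for illustration.

```python
def map_relation(upstream_oid, upstream_catalog, downstream_catalog):
    """Translate an upstream table OID into the matching downstream OID."""
    schema, table = upstream_catalog[upstream_oid]   # OID -> (schema, table)
    try:
        return downstream_catalog[(schema, table)]   # (schema, table) -> OID
    except KeyError:
        raise LookupError(f"no downstream table {schema}.{table}") from None


# Made-up OIDs: the same qualified name has different OIDs on each node.
upstream_catalog = {
    16384: ("public", "mytable"),
    16390: ("myschema", "mytable"),
}
downstream_catalog = {
    ("public", "mytable"): 24576,
    ("myschema", "mytable"): 24600,
}

print(map_relation(16384, upstream_catalog, downstream_catalog))
```

Because the lookup key is the schema-qualified name, the search_path in effect when the original statement ran plays no part.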

This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.

In general, "exact match" is the best guide. Current details (subject to change) are

  • Secondary indexes may differ between nodes
  • Constraints must match for BDR.
  • Storage parameters must match.
  • Table-level parameters, e.g. fillfactor, autovacuum may differ
  • Inheritance must be the same

Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.

Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.

Selective Replication (Table/Row-level filtering)

LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.

Other Terminology

(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.

Similarly, the terms Origin, Provider and Subscriber only work with one Origin.


Upstream master

  • wal_level = 'logical'
  • max_logical_slots = X
  • max_wal_senders = Y # Y = max_logical_slots plus any physical streaming requirements
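
The relationship between the upstream parameters can be written out as a small helper. This is a sketch only: the function name `upstream_settings` is hypothetical, and it assumes each physical standby needs exactly one WAL sender.

```python
def upstream_settings(logical_slots: int, physical_standbys: int = 0) -> dict:
    """Return upstream master settings for the given replication needs."""
    if logical_slots < 0 or physical_standbys < 0:
        raise ValueError("counts must be non-negative")
    return {
        "wal_level": "logical",
        "max_logical_slots": logical_slots,
        # Y = max_logical_slots plus any physical streaming requirements
        "max_wal_senders": logical_slots + physical_standbys,
    }


for name, value in upstream_settings(3, physical_standbys=1).items():
    print(f"{name} = {value!r}")
```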

Downstream master

  • shared_preload_libraries = 'bdr'
  • bdr.connections="name_of_upstream_master" # list of upstream master nodenames
  • bdr.<nodename>.dsn = 'dbname=postgres' # connection string for connection from downstream to upstream master
  • (Also need a parameter like bdr.<nodename>.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)

wal_keep_segments should be set to a value that allows for some downtime of server/network.

New/Changed Parameter Reference

bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.<nodename>.dsn entry

bdr.<nodename>.dsn - "data source name" - connection info for connecting to upstream master

max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start

wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.


As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.

There are no parameters for tuning transfer latency.

The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.

A variant of hot_standby_feedback could be implemented also, though would likely need renaming.

The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.

Operational Issues and Debugging

In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which may be re-tried.


Some new/changed views are available for monitoring activity

  • pg_stat_replication
  • pg_stat_logical_decoding
  • pg_stat_logical_replication

Object statistics are updated normally on downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).

  • pg_stat_user_tables
  • pg_statio_user_tables

Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.

  • pg_stat_user_indexes
  • pg_statio_user_indexes

Bi-Directional Replication Use Cases

Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.

Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing a chain in which the upstream master sends to a physical standby, which in turn sends to a downstream master.

Simple multi-master pair

  • "HA Cluster"
    • Server "Alpha" - Master
    • Server "Beta" - Master
  • Alpha
    • wal_level = 'logical'
    • max_logical_slots = 3
    • max_wal_senders = 4 # Y = max_logical_slots plus any physical streaming requirements
    • wal_keep_segments = 5000
    • shared_preload_libraries = 'bdr'
    • bdr.connections="beta" # list of upstream master nodenames
    • bdr.beta.dsn = 'dbname=postgres' # connection string for connection from downstream to upstream master
  • Beta
    • wal_level = 'logical'
    • max_logical_slots = 3
    • max_wal_senders = 4 # Y = max_logical_slots plus any physical streaming requirements
    • wal_keep_segments = 5000
    • shared_preload_libraries = 'bdr'
    • bdr.connections="alpha" # list of upstream master nodenames
    • bdr.alpha.dsn = 'dbname=postgres' # connection string for connection from downstream to upstream master

HA and Logical Standby

Downstream masters allow users to create temporary tables, so they can be used as reporting servers.

  • "HA Cluster"
    • Server "Alpha" - Current Master
    • Server "Beta" - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names
    • Server "Gamma" - "Logical Standby" - downstream master

Very High Availability Multi-Master

A typical configuration for remote multi-master would then be:

  • Site 1
    • Server "Alpha" - Master - feeds changes to Beta using physical streaming with sync replication
    • Server "Beta" - Physical Standby - feeds changes to Gamma using logical streaming
  • Site 2
    • Server "Gamma" - Master - feeds changes to Delta using physical streaming with sync replication
    • Server "Delta" - Physical Standby - feeds changes to Alpha using logical streaming

Bandwidth between Site 1 and Site 2 is minimised

3-remote site simple Multi-Master

BDR supports "all to all" connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)

  • Site 1
    • Server "Alpha" - Master - feeds changes to Gamma, Echo using logical streaming
  • Site 2
    • Server "Gamma" - Master - feeds changes to Alpha, Echo using logical streaming
  • Site 3
    • Server "Echo" - Master - feeds changes to Alpha, Gamma using logical streaming

Using node names that match port numbers, for clarity

  • config for 5440:
    • port = 5440
    • bdr.connections='node_5441,node_5442'
    • bdr.node_5441.dsn='port=5441 dbname=postgres'
    • bdr.node_5442.dsn='port=5442 dbname=postgres'
  • config for 5441:
    • port = 5441
    • bdr.connections='node_5440,node_5442'
    • bdr.node_5440.dsn='port=5440 dbname=postgres'
    • bdr.node_5442.dsn='port=5442 dbname=postgres'
  • config for 5442:
    • port = 5442
    • bdr.connections='node_5440,node_5441'
    • bdr.node_5440.dsn='port=5440 dbname=postgres'
    • bdr.node_5441.dsn='port=5441 dbname=postgres'
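
Settings of this shape are repetitive, so they can be generated. The sketch below assumes all nodes run on one host, are named node_<port>, and replicate the same database; the function name `mesh_config` is hypothetical.

```python
def mesh_config(ports, dbname="postgres"):
    """Return {port: [config lines]} for an all-to-all mesh on one host."""
    configs = {}
    for port in ports:
        # Every node lists all the other nodes as its upstream masters.
        peers = [p for p in ports if p != port]
        names = ",".join(f"node_{p}" for p in peers)
        lines = [f"port = {port}", f"bdr.connections='{names}'"]
        # Each name in bdr.connections needs a matching dsn entry.
        for p in peers:
            lines.append(f"bdr.node_{p}.dsn='port={p} dbname={dbname}'")
        configs[port] = lines
    return configs


for port, lines in mesh_config([5440, 5441, 5442]).items():
    print(f"# config for {port}:")
    print("\n".join(lines))
```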

3-remote site Max Availability Multi-Master

  • Site 1
    • Server "Alpha" - Master - feeds changes to Beta using physical streaming with sync replication
    • Server "Beta" - Physical Standby - feeds changes to Gamma, Echo using logical streaming
  • Site 2
    • Server "Gamma" - Master - feeds changes to Delta using physical streaming with sync replication
    • Server "Delta" - Physical Standby - feeds changes to Alpha, Echo using logical streaming
  • Site 3
    • Server "Echo" - Master - feeds changes to Foxtrot using physical streaming with sync replication
    • Server "Foxtrot" - Physical Standby - feeds changes to Alpha, Gamma using logical streaming

Bandwidth and latency between sites is minimised.

N-site symmetric cluster replication

Symmetric cluster is where all masters are connected to each other.

N=19 has been tested and works fine.

N masters requires N-1 connections to other masters, so practical limits are <100 servers, or fewer if you have many separate databases.

The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.
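
The growth in connections can be made concrete with a short calculation. This is an illustrative sketch; the function name `mesh_connections` is hypothetical, and it assumes one connection per replicated database per upstream master.

```python
def mesh_connections(n_masters: int, databases: int = 1):
    """Return (connections per master, total connections) for a symmetric cluster."""
    if n_masters < 1 or databases < 0:
        raise ValueError("need at least one master and a non-negative database count")
    per_master = (n_masters - 1) * databases  # N-1 upstream masters each
    total = n_masters * per_master            # grows roughly as N squared
    return per_master, total


print(mesh_connections(19))               # the tested N=19 case, one database
print(mesh_connections(19, databases=50))
```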

Complex/Asymmetric Replication

A variety of options is possible.

Global Sequences


Conflict Detection & Resolution

Lock Conflicts

Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.

Concurrent activity on a row includes explicit row-level locking, locking from foreign keys, and implicit locking because of row updates or deletes.

Data Conflicts

Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).

As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. We update
