BDR User Guide

From PostgreSQL wiki

(Difference between revisions)
Jump to: navigation, search
(Operational Issues and Debugging)
(Insufficient WAL segments retained ("requested WAL segment ... has already been removed"))
Line 260: Line 260:
but a more explicit error message for this condition is planned.
but a more explicit error message for this condition is planned.
The only way to recover from this fault is to re-seed the replica database.
This fault could be prevented with feedback from the replica to the master, but this feature is not planned for the first release of BDR.
=== Monitoring ===
=== Monitoring ===

Revision as of 03:54, 3 May 2013

BDR stands for BiDrectional Replication.

Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from 2ndQuadrant. Various additional development contributions from the wider 2ndQuadrant team as well as reviews and input from other community devs.

At the PgCon2012CanadaInCoreReplicationMeeting an inital version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is available here.


Project Overview and Plans

Project Aims

  • in core
  • fast
  • reusable individual parts (see below), usable by other projects (slony, ...)
  • basis for easier sharding/write scalability
  • wide geographic distribution of replicated nodes

High Level Planning


Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects

  • background workers
  • xlogreader implementation
  • pg_xlogdump

Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.

Exact mechanism for dissemination is not yet announced; key objective is to develop code with the objective of being core/contrib modules. There is no long term plan for existence of code outside of core.


Objective to implement main BDR features into core Postgres.


Additional features based upon feedback

Aspects of BDR

Bi-Directional Replication consists of a number of related features

  • Logical Log Streaming Replication - getting data from one master to another.
  • Global Sequences - ability to support sequences that work globally across a set of nodes
  • Conflict Detection & Resolution (options)
  • DDL Replication via Event Triggers

Taken together these features will allow replication in both directions for any pair of servers. We could call this "multi-master replication", but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.

Note that these features aren't "clustering" in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc.. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.

User Guide

Initial setup

To set up BDR you first need a patched PostgreSQL that can support LR/BDR, then you need to create one or more LR/BDR senders and one or more LR/BDR receivers.

Installing the patched PostgreSQL binaries

Currently BDR is only available in builds of the 'bdr' branch on Andres Freund's git repo on PostgreSQL 9.2 and below do not support BDR, and 9.3 requires patches, so this guide will not work for you if you are trying to use a normal install of PostgreSQL.

First you need to clone, configure, compile and install like normal. Clone the sources from git:// and checkout the bdr branch.

If you have an existing local PostgreSQL git tree specify it as --reference /path/to/existing/tree to greatly speed your git clone.


mkdir -p $HOME/bdr
cd bdr
git clone git:// $HOME/bdr/postgres-bdr-src
cd postgres-bdr-src
./configure --prefix=$HOME/bdr/postgres-bdr-bin
make install
cd contrib/bdr
make install

This will put everything in $HOME/bdr, with the source code and build tree in $HOME/bdr/postgres-bdr-src and the installed PostgreSQL in $HOME/bdr/postgres-bdr-bin. This is a convenient setup for testing and development because it doesn't require you to set up new users, wrangle permissions, run anything as root, etc, but it isn't recommended that you deploy this way in production.

To actually use these new binaries you will need to:

export PATH=$HOME/bdr/postgres-bdr-bin/bin:$PATH

before running initdb, postgres, etc. You don't have to use the psql or libpq you compiled but you're likely to get version mismatch warnings if you don't.


Upstream (sender):

 wal_level = 'logical'       # Include enough info for logical replication
 max_logical_slots = X       # Number of LLSR senders + any receivers
 max_wal_senders = Y         # Y = max_logical_slots plus any physical 
                             # streaming requirements
 wal_keep_segments = 5000    # Master must retain enough WAL segments to let 
                             # replicas catch up. Correct value depends on
                             # rate of writes on master, max replica downtime
                             # allowable. 5000 segments requires 78GB
                             # in pg_xlog

Downstream (receiver):

 shared_preload_libraries = 'bdr'
 bdr.connections="name_of_upstream_master"      # list of upstream master nodenames
 bdr.<nodename>.dsn = 'dbname=postgres'         # connection string for connection
                                                # from downstream to upstream master
 bdr.<nodename>.local_dbname = 'xxx'            # optional parameter to cover the case 
                                                # where the databasename on upstream 
                                                # and downstream master differ. 
                                                # (Not yet implemented)
 bdr.<nodename>.apply_delay                     # optional parameter to delay apply of
                                                # transactions, time in milliseconds 
 bdr.synchronous_commit = ...;                  # optional parameter to set the
                                                # synchronous_commit parameter the
                                                # apply processes will be using
 max_logical_slots = X                          # set to the number of remotes

wal_keep_segments should be set to a value that allows for some downtime of server/network and for heavy bursts of write activity on the master. Keep in mind that enough disk space must be available for the WAL segments, each of which is 16MB.

Note that a server can be both sender and receiver, either two servers to each other or more complex configurations like replication chains/trees.

New/Changed Parameter Reference

  • bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.<nodename>.dsn entry
  • bdr.<nodename>.dsn - "data source name" - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication
  • max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start
  • wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.

Tip: If you see an error like:

LOG:  invalid value for parameter ...

when setting one of these parameters then your server doesn't support logical replication and will need to be patched or updated.

Troubleshooting setup

If you see the error:

FATAL:  could not access file "bdr": No such file or directory

when starting a database set up to receive BDR replication, you probably forgot to install contrib/bdr. See above.

Logical Log Streaming Replication

Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to "streaming replication" i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.

The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this "logical log replication". Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.

One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.

Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.

Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:

  • pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)
  • textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)
  • BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parse/plan/executing SQL.

Replication of DML changes

All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.

(TRUNCATE is not yet implemented, but will be implemented before the feature goes to final release).

LOCK statements are not replicated (possible future feature).

Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.

DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.

UPDATEs that change the Primary Key of a table will be replicated correctly.

The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N+ 1 are resolved to their final value and volatile or stable functions are evaluated on the master side and the resulting values are replicated. This means that any function side-effects outside the database tables (writing files, network socket activity, updating internal PostgreSQL variables, etc) will not occur on the replicas as the functions are not run again on the replica.

All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.

All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.

Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. "identical servers", as with physical replication.

A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.

Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is effeciently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.

Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).

Larger transactions scroll to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. Future enhancement will be to stream changes to downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.

SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.

NOTIFY is not supported across log based replication, either physical or logical.

In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.

From a performance and concurrency perspective the BDR apply process is similar to a normal backend. Frequent conflicts with locks from other transactions when replaying changes can slow things down and thus increase replication delay, so reducing the frequency of such conflicts can be a good way to speed things up. Any lock held by another transaction - `LOCK` statements, `SELECT ... FOR UPDATE/FOR SHARE`, or `INSERT`/`UPDATE`/`DELETE` row locks - can delay replication if the replication apply process needs to change the locked table/row.

Table definitions and DDL replication

DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.

This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.

In general, "exact match" is the best guide. Current details (subject to change) are

  • Secondary indexes may differ between nodes
  • Constraints must match for BDR.
  • Storage parameters must match.
  • Table-level parameters, e.g. fillfactor, autovacuum may differ
  • Inheritance must be the same
  • Dropped columns on master must be present on replicas
  • Custom types and enum definitions must match exactly
  • Composite types and enums must have the same oids on master and replication target
  • Extensions defining types used in replicated tables must be of the same version or fully SQL-level compatible

Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = origin.

Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.

In future it is expected that composite types and enums with non-identical oids will be converted using text output and input functions. This feature is not yet implemented.

Selective Replication (Table/Row-level filtering)

LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.

Other Terminology

(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or at could be reconfigured that way in the future.

Similarly, the terms Origin, Provider and Subcriber only work with one Origin.


As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.

There are no parameters for tuning transfer latency.

The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. Similar in many ways to setting of shared_buffers and should be increased on larger machines.

A variant of hot_standby_feedback could be implemented also, though would likely need renaming.

The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.

Operational Issues and Debugging

In LLSR there are no user-level (ie SQL visible) ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.

Insufficient WAL segments retained ("requested WAL segment ... has already been removed")

If wal_keep_segments is insufficient to meet the requirements of a replica that has fallen far behind, the master will report errors like:

ERROR:  requested WAL segment 00000001000000010000002D has already been removed

Currently the replica errors look like:

WARNING:  Starting logical replication
LOG:  data stream ended
LOG:  worker process: master (PID 23812) exited with exit code 0
LOG:  starting background worker process "master"
LOG:  master initialized on master, remote dbname=master port=5434 replication=true fallback_application_name=bdr
LOG:  local sysid 5873181566046043070, remote: 5873181102189050714
LOG:  found valid replication identifier 1
LOG:  starting up replication at 1 from 1/2D9CA220

but a more explicit error message for this condition is planned.

The only way to recover from this fault is to re-seed the replica database.

This fault could be prevented with feedback from the replica to the master, but this feature is not planned for the first release of BDR.


Some new/changed views are available for monitoring activity

  • pg_stat_replication
  • pg_stat_logical_decoding
  • pg_stat_logical_replication

Object statistics are updated normally on downstream side, which is essential to maintain autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).

  • pg_stat_user_tables
  • pg_statio_user_tables

Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identfying indexes.

  • pg_stat_user_indexes
  • pg_statio_user_indexes

Bi-Directional Replication Use Cases

Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.

Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.

Simple multi-master pair

  • "HA Cluster"
    • Server "Alpha" - Master
    • Server "Bravo" - Master
  • Alpha
    • wal_level = 'logical'
    • max_logical_slots = 3
    • max_wal_senders = 4 # Y = max_logical_slots plus any physical streaming requirements
    • wal_keep_segments = 5000
    • shared_preload_libraries = 'bdr'
    • bdr.connections="bravo" # list of upstream master nodenames
    • bdr.bravo.dsn = 'dbname=postgres' # connection string for connection from downstream to upstream master
  • Bravo
    • wal_level = 'logical'
    • max_logical_slots = 3
    • max_wal_senders = 4 # Y = max_logical_slots plus any physical streaming requirements
    • wal_keep_segments = 5000
    • shared_preload_libraries = 'bdr'
    • bdr.connections="alpha" # list of upstream master nodenames
    • bdr.alpha.dsn = 'dbname=postgres' # connection string for connection from downstream to upstream master

HA and Logical Standby

Downstream masters allow users to create temporary tables, so they can be used as reporting servers.

  • "HA Cluster"
    • Server "Alpha" - Current Master
    • Server "Bravo" - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names
    • Server "Charlie" - "Logical Standby" - downstream master

Very High Availability Multi-Master

A typical configuration for remote multi-master would then be:

  • Site 1
    • Server "Alpha" - Master - feeds changes to Bravo using physical streaming with sync replication
    • Server "Bravo" - Physical Standby - feeds changes to Charlie using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Delta using physical streaming with sync replication
    • Server "Delta" - Physical Standby - feeds changes to Alpha using logical streaming

Bandwidth between Site 1 and Site 2 is minimised

3-remote site simple Multi-Master Plex

BDR supports "all to all" connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)

  • Site 1
    • Server "Alpha" - Master - feeds changes to Charlie, Echo using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Alpha, Echo using logical streaming replication
  • Site 3
    • Server "Echo" - Master - feeds changes to Alpha, Charlie using logical streaming replication

Using node names that match port numbers, for clarity

  • config for 5440:
    • port = 5440
    • bdr.connections='node_5441,node_5442'
    • bdr.node_5441.dsn='port=5441 dbname=postgres'
    • bdr.node_5442.dsn='port=5442 dbname=postgres'
  • config for 5441:
    • port = 5441
    • bdr.connections='node_5440,node_5442'
    • bdr.node_5440.dsn='port=5440 dbname=postgres'
    • bdr.node_5442.dsn='port=5442 dbname=postgres'
  • config for 5442:
    • port = 5442
    • bdr.connections='node_5440,node_5441'
    • bdr.node_5440.dsn='port=5440 dbname=postgres'
    • bdr.node_5441.dsn='port=5441 dbname=postgres'

3-remote site simple Multi-Master Circular Replication

Simpler config uses "circular replication". This is simpler but results in higher latency for changes as the number of nodes increases.

  • Site 1
    • Server "Alpha" - Master - feeds changes to Charlie using logical streaming replication
  • Site 2
    • Server "Charlie" - Master - feeds changes to Echo using logical streaming replication
  • Site 3
    • Server "Echo" - Master - feeds changes to Alpha using logical streaming replication

Using node names that match port numbers, for clarity

  • config for 5440:
    • port = 5440
    • bdr.connections='node_5441'
    • bdr.node_5441.dsn='port=5441 dbname=postgres'
  • config for 5441:
    • port = 5441
    • bdr.connections='node_5442'
    • bdr.node_5442.dsn='port=5442 dbname=postgres'
  • config for 5442:
    • port = 5442
    • bdr.connections='node_5440'
    • bdr.node_5440.dsn='port=5440 dbname=postgres'

Regrettably this doesn't actually work yet because we don't cascade logical changes (yet).

3-remote site Max Availability Multi-Master Plex

  • Site 1
    • Server "Alpha" - Master - feeds changes to Bravo using physical streaming with sync replication
    • Server "Bravo" - Physical Standby - feeds changes to Charlie, Echo using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Delta using physical streaming with sync replication
    • Server "Delta" - Physical Standby - feeds changes to Alpha, Echo using logical streaming
  • Site 3
    • Server "Echo" - Master - feeds changes to Foxtrot using physical streaming with sync replication
    • Server "Foxtrot" - Physical Standby - feeds changes to Alpha, Charlie using logical streaming

Bandwidth and latency between sites is minimised.

Config left as an exercise for the reader.

N-site symmetric cluster replication

Symmetric cluster is where all masters are connected to each other.

N=19 has been tested and works fine.

N masters requires N-1 connections to other masters, so practical limits are <100 servers, or less if you have many separate databases.

The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A future option to limit to filter rows/tables for replication becomes essential with larger or more heavily updated databases, which is planned.

Complex/Assymetric Replication

Variety of options are possible.

Conflict Avoidance

Distributed Locking

Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.

Global Sequences

Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are "divergent errors" and are not easily resolvable.

SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.

BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.

Conflict Detection & Resolution

Lock Conflicts

Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.

By concurrent activity on a row, we include

  • explicit row level locking
  • locking from foreign keys
  • implicit locking because of row updates or deletes, from local activity or apply from other servers

Data Conflicts

Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, similar manner so that all servers end with identical results.

Concurrent updates are resolved using last-update-wins strategy using timestamps. Should timestamps be identical, the tie is broken using system identifier from pg_control (currently).

Updates and Inserts may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as "divergent conflicts". Currently, replication stops should a divergent conflict occur.

Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.

Conflicts are logged if we specify bdr.log_conflicts = on

All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in "false conflicts", where there is conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, the other discarded according to last update wins.

Changing unlogged and logged tables in same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.


As an example, lets say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.

  • We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check....
Personal tools