BDR User Guide


This page is the users and administrators guide for BDR. If you're looking for technical details on the project plan and implementation, see BDR Project.


BDR User Guide

BDR (BiDirectional Replication) is a feature being developed for inclusion in PostgreSQL core that provides greatly enhanced replication capabilities.

BDR allows users to create a geographically distributed multi-master database using Logical Log Streaming Replication (LLSR) transport. BDR is designed to provide both high availability and geographically distributed disaster recovery capabilities.

BDR is not “clustering” as some vendors use the term, in that it doesn't have a distributed lock manager, global transaction co-ordinator, etc. Each member server is separate yet connected, with design choices that allow separation between nodes that would not be possible with global transaction coordination.

Guidance on getting a testing setup established is in #Initial setup. Please read the full documentation if you intend to put BDR into production.

Logical Log Streaming Replication

Logical log streaming replication (LLSR) allows one PostgreSQL master (the "upstream master") to stream a sequence of changes to another read/write PostgreSQL server (the "downstream master"). Data is sent in one direction only over a normal libpq connection.

Multiple LLSR connections can be used to set up bi-directional replication as discussed later in this guide.

Overview of logical replication

In some ways LLSR is similar to "streaming replication" i.e. physical log streaming replication (PLSR) from a user perspective; both replicate changes from one server to another. However, in LLSR the receiving server is also a full master database that can make changes, unlike the read-only replicas offered by PLSR hot standby. Additionally, LLSR is per-database, whereas PLSR is per-cluster and replicates all databases at once. There are many more differences discussed in the relevant sections of this document.

In LLSR the data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger-based replication, which is why we call this "logical log replication". Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.

One connection is required for each PostgreSQL database that is replicated. If two servers are connected and each has 50 databases, then 50 connections are required to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.

Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE. A restart of the downstream master is also required. The upstream master only needs restarting if the max_logical_slots parameter is too low to allow a new replica to be added. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated. Setup is discussed in more detail below.

Changes are processed by the downstream master using bdr plug-ins. This allows flexible handling of replication input, including:

  • BDR apply process - applies logical changes to the downstream master. The apply process makes changes directly rather than generating SQL text and then parsing, planning and executing it.
  • Textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)
  • pg_xlogdump - examines physical WAL records and produces textual debugging output. This server program is included in PostgreSQL 9.3.

Replication of DML changes

All changes are replicated: INSERT, UPDATE, DELETE and TRUNCATE.

(TRUNCATE is not yet implemented, but will be implemented before the feature goes to final release).

Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it has overheads that mean that it doesn't always use less bandwidth than PLSR.

Locks taken by LOCK and SELECT ... FOR UPDATE/SHARE on the upstream master are not replicated to downstream masters. Locks taken automatically by INSERT, UPDATE, DELETE or TRUNCATE *are* taken on the downstream master and may delay replication apply or concurrent transactions - see Lock Conflicts.

TEMPORARY and UNLOGGED tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables. However, temporary tables remain specific to a particular session so creating a temporary table on the upstream master does not create a similar table on the downstream master.

DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to be applied at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL. A warning is issued in the downstream master's logs if the expected constraint is absent. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly, though such usage would prevent conflict detection between multiple masters, if that is considered important.
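
For example, a replicated table can be given a suitable constraint before replication is configured. This is a sketch only; the table and column names are placeholders, and the same DDL should be run on both nodes so the definitions stay identical:

-- Illustrative names only; execute on both upstream and downstream masters
ALTER TABLE public.mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);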

UPDATEs that change the value of the Primary Key of a table will be replicated correctly.

The values applied are the final values from the UPDATE on the upstream master, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value. Volatile or stable functions are evaluated on the master side and the resulting values are replicated. Consequently any function side-effects (writing files, network socket activity, updating internal PostgreSQL variables, etc) will not occur on the replicas as the functions are not run again on the replica.

All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.

All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master (see "Limitations").

The current LLSR plugin implementation uses the binary libpq protocol, so it requires that the upstream and downstream master use the same CPU architecture and word-length, i.e. "identical servers", as with physical replication. A textual output option will be added later for passing data between non-identical servers, e.g. laptops or mobile devices communicating with a central server.

Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.

Changes are applied to the downstream master in the sequence in which they were committed on the upstream master. This is a known-good serialized ordering of changes, so replication serialization failures are not theoretically possible. Such failures are common in systems that use statement-based replication (e.g. MySQL) or trigger-based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on the upstream master, additional locking on the downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later.)

Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.

SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. We always update the correct tables, whatever the setting of search_path. Values are replicated correctly irrespective of the values of bytea_output, TimeZone, DateStyle, etc.

NOTIFY is not supported across log based replication, either physical or logical. NOTIFY and LISTEN will work fine on the upstream master but an upstream NOTIFY will not trigger a downstream LISTENer.

In some cases, additional deadlocks can occur on apply. This causes an automatic retry of the apply of the replaying transaction and is only an issue if the deadlock recurs repeatedly, delaying replication.

From a performance and concurrency perspective the BDR apply process is similar to a normal backend. Frequent conflicts with locks from other transactions when replaying changes can slow things down and thus increase replication delay, so reducing the frequency of such conflicts can be a good way to speed things up. Any lock held by another transaction on the downstream master - LOCK statements, SELECT ... FOR UPDATE/FOR SHARE, or INSERT/UPDATE/DELETE row locks - can delay replication if the replication apply process needs to change the locked table/row.

Table definitions and DDL replication

DML changes are replicated between tables with matching "Schemaname"."Tablename" on both upstream and downstream masters. e.g. changes from upstream's public.mytable will go to downstream's public.mytable while changes to the upstream myschema.mytable will go to the downstream myschema.mytable. This works even when no schema is specified in the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.

This requires careful synchronization of table definitions on each node otherwise ERRORs will be generated by the replication apply process. In general, tables must be an exact match between upstream and downstream masters.

There are no plans to implement working replication between dissimilar table definitions.

Tables must meet the following requirements to be compatible for purposes of LLSR:

  • The downstream master must only have constraints (CHECK, UNIQUE, EXCLUSION, FOREIGN KEY, etc) that are also present on the upstream master. Replication may initially work with mismatched constraints but is likely to fail as soon as the downstream master rejects a row the upstream master accepted.
  • The table referenced by a FOREIGN KEY on a downstream master must have all the keys present in the upstream master version of the same table.
  • Storage parameters must match, except as allowed below
  • Inheritance must be the same
  • Dropped columns on master must be present on replicas
  • Custom types and enum definitions must match exactly
  • Composite types and enums must have the same oids on master and replication target
  • Extensions defining types used in replicated tables must be of the same version or fully SQL-level compatible and the oids of the types they define must match.

The following differences are permissible between tables on different nodes:

  • The table's pg_class oid, the oid of its associated TOAST table, and the oid of the table's rowtype in pg_type may differ;
  • Extra or missing non-UNIQUE indexes
  • Extra keys in downstream lookup tables for FOREIGN KEY references that are not present on the upstream master
  • The table-level storage parameters for fillfactor and autovacuum
  • Triggers and rules may differ (they are not executed by replication apply)

Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR (see LLSR Limitations).

Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.

In future it is expected that composite types and enums with non-identical oids will be converted using text output and input functions. This feature is not yet implemented.

LLSR limitations

The current LLSR implementation is subject to some limitations, which are being progressively removed as work progresses.

Data definition compatibility

Table definitions, types, extensions, etc must be near identical between upstream and downstream masters. See Table definitions and DDL replication.

DDL Replication

DDL replication is not yet supported.

Upstream feedback

No feedback from downstream masters to the upstream master is implemented for asynchronous LLSR, so upstream masters must be configured to keep enough WAL. See Configuration.

TRUNCATE is not replicated

TRUNCATE is not yet supported, however workarounds with user-level triggers are possible and a ProcessUtility hook is planned to implement a similar approach globally.

The safest option is to define a user-level BEFORE trigger on each table that RAISEs an ERROR when TRUNCATE is attempted.

A simple truncate-blocking trigger is:

CREATE OR REPLACE FUNCTION deny_truncate() RETURNS trigger AS $$
BEGIN
  IF tg_op = 'TRUNCATE' THEN
    RAISE EXCEPTION 'TRUNCATE is not supported on this table. Please use DELETE FROM.';
  ELSE
    RAISE EXCEPTION 'This trigger only supports TRUNCATE';
  END IF;
END;
$$ LANGUAGE plpgsql;

It can be applied to a table with:

CREATE TRIGGER deny_truncate_on_<tablename> BEFORE TRUNCATE ON <tablename>
FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate();

A PL/PgSQL DO block that queries pg_class and loops over it to EXECUTE a dynamic SQL CREATE TRIGGER command for each table that does not already have the trigger can be used to apply the trigger to all tables.
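
A sketch of such a DO block is shown below. It assumes the deny_truncate() function above has already been created, skips system and temporary schemas, and uses a single trigger name rather than one per table:

DO $$
DECLARE
  t regclass;
BEGIN
  FOR t IN
    SELECT c.oid::regclass
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'                       -- ordinary tables only
      AND n.nspname NOT LIKE 'pg\_%'            -- skip system, toast and temp schemas
      AND n.nspname <> 'information_schema'
      AND NOT EXISTS (                          -- skip tables that already have the trigger
        SELECT 1 FROM pg_trigger tg
        WHERE tg.tgrelid = c.oid AND tg.tgname = 'deny_truncate'
      )
  LOOP
    EXECUTE format('CREATE TRIGGER deny_truncate BEFORE TRUNCATE ON %s '
                   || 'FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate()', t);
  END LOOP;
END;
$$;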

Initial setup

To set up LLSR or BDR you first need a patched PostgreSQL that can support LLSR/BDR, then you need to create one or more LLSR/BDR senders and one or more LLSR/BDR receivers.

Installing the patched PostgreSQL binaries

Currently BDR is only available in builds of the 'bdr' branch on Andres Freund's git repo on git.postgresql.org. PostgreSQL 9.2 and below do not support BDR, and 9.3 requires patches, so this guide will not work for you if you are trying to use a normal install of PostgreSQL.

First you need to clone, configure, compile and install like normal. Clone the sources from git://git.postgresql.org/git/users/andresfreund/postgres.git and checkout the bdr branch.

If you have an existing local PostgreSQL git tree specify it as --reference /path/to/existing/tree to greatly speed your git clone.

Example:

mkdir -p $HOME/bdr
cd $HOME/bdr
git clone -b bdr git://git.postgresql.org/git/users/andresfreund/postgres.git $HOME/bdr/postgres-bdr-src
cd postgres-bdr-src
./configure --prefix=$HOME/bdr/postgres-bdr-bin
make install
cd contrib/bdr
make install

This will put everything in $HOME/bdr, with the source code and build tree in $HOME/bdr/postgres-bdr-src and the installed PostgreSQL in $HOME/bdr/postgres-bdr-bin. This is a convenient setup for testing and development because it doesn't require you to set up new users, wrangle permissions, run anything as root, etc, but it isn't recommended that you deploy this way in production.

To actually use these new binaries you will need to:

export PATH=$HOME/bdr/postgres-bdr-bin/bin:$PATH

before running initdb, postgres, etc. You don't have to use the psql or libpq you compiled but you're likely to get version mismatch warnings if you don't.

Parameter Reference

The following parameters are new or have been changed in PostgreSQL's new logical streaming replication.

shared_preload_libraries = ‘bdr’

To load support for receiving changes on a downstream master, the bdr library must be added to the existing ‘shared_preload_libraries’ parameter. This loads the bdr library during postmaster start-up and allows it to create the required background worker(s).

Upstream masters don't need to load the bdr library unless they're also operating as a downstream master as is the case in a BDR configuration.

bdr.connections

A comma-separated list of upstream master connection names is specified in bdr.connections. These names must be simple alphanumeric strings. They are used when naming the connection in error messages, configuration options and logs, but are otherwise of no special meaning.

A typical two-upstream-master setting might be:

bdr.connections = ‘upstream1, upstream2’

bdr.<connection_name>.dsn

Each connection name must have at least a data source name specified using the bdr.<connection_name>.dsn parameter. The DSN syntax is the same as that used by libpq so it is not discussed in further detail here. A dbname for the database to connect to must be specified; all other parts of the DSN are optional.

The local (downstream) database name is assumed to be the same as the name of the upstream database being connected to, though future versions will make this configurable.

For the above two-master setting for bdr.connections the DSNs might look like:

bdr.upstream1.dsn = 'host=10.1.1.2 user=postgres dbname=replicated_db'
bdr.upstream2.dsn = 'host=10.1.1.3 user=postgres dbname=replicated_db'

max_logical_slots

The new parameter max_logical_slots has been added for use on both upstream and downstream masters. This parameter controls the maximum number of logical replication slots - upstream or downstream - that this cluster may have at a time. It must be set at postmaster start time.

As logical replication slots are persistent, slots are consumed even by replicas that are not currently connected. Slot management is discussed in Starting, Stopping and Managing Replication.

max_logical_slots should be set to the sum of the number of logical replication upstream masters this server will have plus the number of logical replication downstream masters that will connect to it. For example, a node that replicates from two upstream masters and serves three downstream masters needs max_logical_slots of at least five.

wal_level = 'logical'

A new setting, 'logical', has been added for the existing wal_level parameter. ‘logical’ includes everything that the existing hot_standby setting does and adds additional details required for logical changeset decoding to the write-ahead logs.

This additional information is consumed by the upstream-master-side xlog decoding worker. Downstream masters that do not also act as upstream masters do not require wal_level to be increased above the default 'minimal'.

wal_level, except for the new 'logical' setting, is documented in the main PostgreSQL manual.

max_wal_senders

Logical replication hasn't altered the max_wal_senders parameter, but it is important in upstream masters for logical replication and BDR because every logical sender consumes a max_wal_senders entry.

You should configure max_wal_senders to the sum of the number of physical and logical replicas you want to allow an upstream master to serve. If you intend to use pg_basebackup you should add at least two more senders to allow for its use.

Like max_logical_slots, max_wal_senders entries don't cost a large amount of memory, so you can overestimate fairly safely.

max_wal_senders is documented in the main PostgreSQL documentation.

wal_keep_segments

Like max_wal_senders, the wal_keep_segments parameter isn't directly changed by logical replication but is still important for upstream masters. It is not required on downstream-only masters.

wal_keep_segments should be set to a value that allows for some downtime or unreachable periods for downstream masters and for heavy bursts of write activity on the upstream master.

Keep in mind that enough disk space must be available for the WAL segments, each of which is 16MB. If you run out of disk space the server will halt until disk space is freed and it may be quite difficult to free space when you can no longer start the server.

If you exceed the required wal_keep_segments, an "Insufficient WAL segments retained" error will be reported. See Troubleshooting.

wal_keep_segments is documented in the main PostgreSQL manual.

track_commit_timestamp

Setting this parameter to "on" enables commit timestamp tracking, which is used to implement last-UPDATE-wins conflict resolution.

It is also required for use of the pg_get_transaction_committime function.

Function reference

The LLSR/BDR patches add several functions to the PostgreSQL core.

pg_get_transaction_committime

pg_get_transaction_committime(txid integer): Get the timestamp at which the specified transaction, as identified by transaction ID, committed. This function can be useful when monitoring replication lag.
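
For example, to look up the commit time of a transaction whose ID you already know (the value 1827 here is purely illustrative):

SELECT pg_get_transaction_committime(1827);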

init_logical_replication

init_logical_replication(slotname name, plugin name, OUT slotname text, OUT xlog_position text) performs the same work as pg_receivellog --init, creating a new logical replication slot. Given the slot name to create and the plugin to use for the slot, it returns the fully qualified resulting slot name and the start position in the transaction logs.

This function is mostly useful for implementers of replication tools based on logical replication's features, and is not typically directly useful to end users. Starting a downstream master will automatically create a slot so you do not generally need this function when you are using BDR.
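
A minimal sketch of a direct call follows; the slot name is a placeholder, and 'bdr_output' is the plugin name shown elsewhere in this guide:

SELECT * FROM init_logical_replication('my_slot', 'bdr_output');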

stop_logical_replication

stop_logical_replication(slotname name) stops logical replication for the named slot, removing its entry from the upstream master. It has the same effect as pg_receivellog --stop, but is callable from SQL.

This function is mainly useful for removing unused slots.
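
For example, to drop an unused slot whose name was obtained from pg_stat_logical_decoding on the upstream master:

SELECT stop_logical_replication('bdr: 16384:5873181566046043070-1-16384:');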

pg_stat_bdr

This function is used by the pg_stat_bdr view. You don't need to call it directly.


bdr_sequence_alloc, bdr_sequence_options, bdr_sequence_setval

These functions are internal to BDR's distributed sequence implementation and are invoked via the sequence access methods. You use ordinary sequence manipulation functions like nextval and setval to manage distributed sequences.

Distributed Sequences

Distributed sequences, or global sequences, are sequences that are synchronized across all the nodes in a BDR cohort. A distributed sequence is more expensive to access than a purely local sequence, but it produces values that are guaranteed unique across the entire cohort.

Using distributed sequences allows you to avoid the problem of INSERT conflicts.

TODO: Demo distributed sequences
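
Until that demo is written, the following hypothetical sketch illustrates the intended usage. The USING clause and the 'bdr' access method name are assumptions about how the SeqAM plugin is selected and may differ in the current build:

-- Hypothetical syntax: how a sequence is bound to the BDR SeqAM may differ
CREATE SEQUENCE customer_id_seq USING bdr;

CREATE TABLE customer (
    id   bigint PRIMARY KEY DEFAULT nextval('customer_id_seq'),
    name text
);

-- Values are then obtained with the ordinary sequence functions
SELECT nextval('customer_id_seq');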

Configuration

Details on individual parameters are described in the parameter reference section.

The following configuration is an example of a simple one-way LLSR replication setup - a single upstream master to a single downstream master.

The upstream master (sender)'s postgresql.conf should contain settings like:

 wal_level = 'logical'       # Include enough info for logical replication
 max_logical_slots = X       # Number of LLSR senders + any receivers
 max_wal_senders = Y         # Y = max_logical_slots plus any physical 
                             # streaming requirements
 wal_keep_segments = 5000    # Master must retain enough WAL segments to let 
                             # replicas catch up. Correct value depends on
                             # rate of writes on master, max replica downtime
                             # allowable. 5000 segments requires 78GB
                             # in pg_xlog
 track_commit_timestamp = on # Not strictly required for LLSR, only for BDR
                             # conflict resolution.

Downstream (receiver) postgresql.conf:

 shared_preload_libraries = 'bdr'
 
 bdr.connections="name_of_upstream_master"      # list of upstream master nodenames
 bdr.<nodename>.dsn = 'dbname=postgres'         # connection string for connection
                                                # from downstream to upstream master
 bdr.<nodename>.local_dbname = 'xxx'            # optional parameter to cover the case 
                                                # where the databasename on upstream 
                                                # and downstream master differ. 
                                                # (Not yet implemented)
 bdr.<nodename>.apply_delay                     # optional parameter to delay apply of
                                                # transactions, time in milliseconds 
 bdr.synchronous_commit = ...;                  # optional parameter to set the
                                                # synchronous_commit parameter the
                                                # apply processes will be using
 max_logical_slots = X                          # set to the number of remotes

Note that a server can be both a sender and a receiver, allowing two servers to replicate to each other or enabling more complex configurations like replication chains/trees.

The upstream (sender) pg_hba.conf must be configured to allow the downstream master to connect for replication. Otherwise you'll see errors like the following on the downstream master:

FATAL:  could not connect to the primary server: FATAL:  no pg_hba.conf entry for replication connection from host "[local]", user "postgres"

A suitable pg_hba.conf entry for a replication connection from the replica server 10.1.4.8 might be:

 host    replication     postgres        10.1.4.8/32            trust

(the user name should match the user name configured in the downstream master's dsn. md5 password authentication is supported.)

For more details on these parameters, see Parameter Reference.

Troubleshooting

Could not access file "bdr": No such file or directory

If you see the error:

FATAL:  could not access file "bdr": No such file or directory

when starting a database set up to receive BDR replication, you probably forgot to install contrib/bdr. See above.

Invalid value for parameter

An error like:

LOG:  invalid value for parameter ...

when setting one of these parameters means your server doesn't support logical replication and will need to be patched or updated.

Insufficient WAL segments retained ("requested WAL segment ... has already been removed")

If wal_keep_segments is insufficient to meet the requirements of a replica that has fallen far behind, the master will report errors like:

ERROR:  requested WAL segment 00000001000000010000002D has already been removed

Currently the replica errors look like:

WARNING:  Starting logical replication
LOG:  data stream ended
LOG:  worker process: master (PID 23812) exited with exit code 0
LOG:  starting background worker process "master"
LOG:  master initialized on master, remote dbname=master port=5434 replication=true fallback_application_name=bdr
LOG:  local sysid 5873181566046043070, remote: 5873181102189050714
LOG:  found valid replication identifier 1
LOG:  starting up replication at 1 from 1/2D9CA220

but a more explicit error message for this condition is planned.

The only way to recover from this fault is to re-seed the replica database.

This fault could be prevented with feedback from the replica to the master, but this feature is not planned for the first release of BDR. Another alternative considered for future releases is making wal_keep_segments a dynamic parameter that is sized on demand.

Monitoring of maximum replica lag and appropriate adjustment of wal_keep_segments will prevent this fault from arising.

Couldn't find logical slot

An error like:

ERROR:  couldn't find logical slot "bdr: 16384:5873181566046043070-1-24596:"

on the upstream master suggests that a downstream master is trying to connect to a logical replication slot that no longer exists. The slot can not be re-created, so it is necessary to re-seed the downstream replica database.

Operational Issues and Debugging

In LLSR there are no user-level (i.e. SQL-visible) ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically retried.

Monitoring

The views described below are available for monitoring replication activity. Configuration and logging parameters that are useful when monitoring replication, such as log_lock_waits, are mentioned where relevant.

pg_stat_logical_decoding

The new pg_stat_logical_decoding view is specific to logical replication. It is based on the underlying pg_stat_get_logical_replication_slots function and has the following structure:

 View "pg_catalog.pg_stat_logical_decoding"
          Column          |  Type   | Modifiers 
--------------------------+---------+-----------
 slot_name                | text    | 
 plugin                   | text    | 
 database                 | oid     | 
 active                   | boolean | 
 xmin                     | xid     | 
 last_required_checkpoint | text    | 

It contains one row for every connection from a downstream master to the server being queried (the upstream master). On a standalone PostgreSQL server or a downstream-only master this view will contain no rows.

  • slot_name: An internal name for a given logical replication slot (a connection from a downstream master to this upstream master). This slot name is used by the downstream master to uniquely identify itself and is used with the pg_receivellog command when managing logical replication slots. The slot name is composed of the decoding plugin name, the upstream database oid, the downstream system identifier (from pg_control), the downstream slot number, and the downstream database oid.
  • plugin: The logical replication plugin being used to decode WAL archives. You'll generally only see bdr_output here.
  • database: The oid of the database being replicated by this slot. You can get the database name by joining on pg_database.oid.
  • active: Whether this slot currently has an active connection.
  • xmin: The lowest upstream master transaction ID this replication slot can "see", like the xmin of a transaction or prepared transaction. xmin should keep on advancing as replication continues. If xmin stops advancing that's a sign that replication has stalled, possibly leading to the accumulation of bloat in the system catalogs. You can turn this txid into the time it was committed with pg_get_transaction_committime for monitoring purposes.
  • last_required_checkpoint: The checkpoint identifying the oldest WAL record required to bring this slot up to date with the upstream master. (This column is likely to be removed in a future version).

pg_stat_bdr

The pg_stat_bdr view is supplied by the bdr extension. It provides information on a server's connection(s) to its upstream master(s). It is not present on upstream-only masters.

The primary purpose of this view is to report statistics on the progress of LLSR apply on a per-upstream master connection basis.

View structure:

        View "public.pg_stat_bdr"
       Column       |  Type  | Modifiers 
--------------------+--------+-----------
 rep_node_id        | oid    | 
 riremotesysid      | name   | 
 riremotedb         | oid    | 
 rilocaldb          | oid    | 
 nr_commit          | bigint | 
 nr_rollback        | bigint | 
 nr_insert          | bigint | 
 nr_insert_conflict | bigint | 
 nr_update          | bigint | 
 nr_update_conflict | bigint | 
 nr_delete          | bigint | 
 nr_delete_conflict | bigint | 
 nr_disconnect      | bigint | 

Fields:

  • rep_node_id: An internal identifier for the replication slot.
  • riremotesysid: The remote database system identifier, as reported by the Database system identifier line of pg_controldata /path/to/datadir
  • riremotedb: The remote database OID, ie the oid column of the remote server's pg_catalog.pg_database entry for the replicated database. You can get the database name with select datname from pg_database where oid = 12345 (where '12345' is the riremotedb oid).
  • rilocaldb : The local database OID, with the same meaning as riremotedb but with oids from the local system.

The remaining columns are statistics about this upstream master slot:

  • nr_commit: Number of commits applied to date from this master
  • nr_rollback: Number of rollbacks performed by this apply process due to recoverable errors (deadlock retries, lost races, etc) or unrecoverable errors like mismatched constraint errors.
  • nr_insert: Number of INSERTs performed
  • nr_insert_conflict: Number of INSERTs that resulted in conflicts.
  • nr_update: Number of UPDATEs performed
  • nr_update_conflict: Number of UPDATEs that resulted in conflicts.
  • nr_delete: Number of deletes performed
  • nr_delete_conflict: Number of deletes that resulted in conflicts.
  • nr_disconnect: Number of times this apply process has lost its connection to the upstream master since it was started.


This view does not contain any information about how far behind the upstream master this downstream master is. The upstream master's pg_stat_logical_decoding and pg_stat_replication views must be queried to determine replication lag.
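
For example, a per-connection summary with a readable local database name can be produced by joining on pg_database, using the columns documented above:

SELECT d.datname AS local_db,
       s.riremotesysid,
       s.nr_commit,
       s.nr_rollback,
       s.nr_insert_conflict + s.nr_update_conflict + s.nr_delete_conflict AS conflicts,
       s.nr_disconnect
FROM pg_stat_bdr s
JOIN pg_database d ON d.oid = s.rilocaldb;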

Monitoring replication status and lag

As with any replication setup, it is vital to monitor replication status on all BDR nodes to ensure no node is lagging severely behind the others or is stuck.

In the case of BDR a stuck or crashed node will eventually cause disk space and table bloat problems on other masters so stuck nodes should be detected and removed or repaired in a reasonably timely manner. Exactly how urgent this is depends on the workload of the BDR group.

The pg_stat_logical_decoding view described above may be used to verify that a downstream master is connected to its upstream master by querying it on the upstream side - the active boolean column is t if there's a downstream master connected to this upstream.

The xmin column provides an indication of whether replication is advancing; it should increase as replication progresses. You can turn this into the time the transaction was committed on the master by running pg_get_transaction_committime(xmin) on the upstream master. Since txids differ between upstream and downstream masters, running it on a downstream master with a txid from the upstream master as input would produce an error or an incorrect result.

Example:

 postgres=# select slot_name, plugin, database, active, xmin,
            pg_get_transaction_committime(xmin)
            FROM pg_stat_logical_decoding ;
-[ RECORD 1 ]-----------------+----------------------------------------
slot_name                     | bdr: 12910:5882534759278050995-1-12910:
plugin                        | bdr_output
database                      | 12910
active                        | f
xmin                          | 1827
pg_get_transaction_committime | 2013-05-27 06:14:36.851423+00

Table and index usage statistics

Statistics on table and index usage are updated normally by the downstream master. This is essential for correct function of auto-vacuum. If there are no local writes on the downstream master and stats have not been reset these two views should show matching results between upstream and downstream:

  • pg_stat_user_tables
  • pg_statio_user_tables

Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used than non-identifying indexes under workloads that perform many UPDATEs and DELETEs.

The built-in index monitoring views are:

  • pg_stat_user_indexes
  • pg_statio_user_indexes

All these views are discussed in the PostgreSQL documentation on the statistics views.
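
For example, to see which indexes the apply process (and any local activity) is exercising most heavily on a downstream master:

SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 20;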

Starting, stopping and managing replication

Replication is managed with the postgresql.conf settings described in "Parameter Reference" and "Configuration" above, and using the pg_receivellog utility command.

Starting a new LLSR connection

Logical replication is started automatically when a database is configured as a downstream master in postgresql.conf (see Configuration) and the postmaster is started. No explicit action is required to start replication, but replication will not actually work unless the upstream and downstream databases are identical within the requirements set by LLSR in the Table definitions and DDL replication section.

pg_dump and pg_restore may be used to set up the new replica's database.

Viewing logical replication slots

Examining the state of logical replication is discussed in Monitoring.

Temporarily stopping an LLSR replica

LLSR replicas can be temporarily stopped by shutting down the downstream master's postmaster.

If the replica is not started back up before the upstream master discards the oldest WAL segment required for the downstream master to resume replay (as identified by the last_required_checkpoint column of pg_catalog.pg_stat_logical_decoding), then the replica will not resume replay. The error Insufficient WAL segments retained will be reported in the upstream master's logs. The replica must be re-created for replication to continue.

Removing an LLSR replica permanently

To remove a replication connection permanently, remove its entries from the downstream master's postgresql.conf, restart the downstream master, then use pg_receivellog to remove the replication slot on the upstream master.

It is important to remove the replication slot from the upstream master(s) to prevent xid wrap-around problems and issues with table bloat caused by delayed vacuum.

Cleaning up abandoned replication slots

To remove a replication slot that was used for a now-defunct replica, find its slot name from the pg_stat_logical_decoding view on the upstream master then run:

pg_receivellog -p 5434 -h master-hostname -d dbname \
   --slot='bdr: 16384:5873181566046043070-1-16384:' --stop

where the argument to '--slot' is the slot name you found from the view.

You may need to do this if you've created and then deleted several replicas so max_logical_slots has filled up with entries for replicas that no longer exist.

Bi-Directional Replication

Bi-Directional replication is built directly on LLSR by configuring two or more servers as both upstream and downstream masters of each other.

All of the Logical Log Streaming Replication documentation applies to BDR and should be read before moving on to reading about and configuring BDR.

Bi-Directional Replication Use Cases

Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.

Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, allowing an upstream master to send to a physical standby, which in turn sends to a downstream master.

Simple multi-master pair

A simple multi-master "HA Cluster" with two servers:

  • Server "Alpha" - Master
  • Server "Bravo" - Master
Configuration

Alpha:

wal_level = 'logical'
max_logical_slots = 3
max_wal_senders = 4
wal_keep_segments = 5000
shared_preload_libraries = 'bdr'
bdr.connections="bravo"
bdr.bravo.dsn = 'dbname=dbtoreplicate'
track_commit_timestamp = on

Bravo:

wal_level = 'logical'
max_logical_slots = 3
max_wal_senders = 4
wal_keep_segments = 5000
shared_preload_libraries = 'bdr'
bdr.connections="alpha"
bdr.alpha.dsn = 'dbname=dbtoreplicate'
track_commit_timestamp = on

See Configuration for an explanation of these parameters.

HA and Logical Standby

Downstream masters allow users to create temporary tables, so they can be used as reporting servers.

"HA Cluster":

  • Server "Alpha" - Current Master
  • Server "Bravo" - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names
  • Server "Charlie" - "Logical Standby" - downstream master

Very High Availability Multi-Master

A typical configuration for remote multi-master would then be:

  • Site 1
    • Server "Alpha" - Master - feeds changes to Bravo using physical streaming with sync replication
    • Server "Bravo" - Physical Standby - feeds changes to Charlie using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Delta using physical streaming with sync replication
    • Server "Delta" - Physical Standby - feeds changes to Alpha using logical streaming

Bandwidth between Site 1 and Site 2 is minimised.

3-remote site simple Multi-Master Plex

BDR supports "all to all" connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)

  • Site 1
    • Server "Alpha" - Master - feeds changes to Charlie, Echo using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Alpha, Echo using logical streaming replication
  • Site 3
    • Server "Echo" - Master - feeds changes to Alpha, Charlie using logical streaming replication
Configuration

If you wanted to test this configuration locally you could run three PostgreSQL instances on different ports. Such a configuration would look like the following if the port numbers were used as node names for the sake of notational clarity:

Config for node_5440:

port = 5440
bdr.connections='node_5441,node_5442'
bdr.node_5441.dsn='port=5441 dbname=postgres'
bdr.node_5442.dsn='port=5442 dbname=postgres'

Config for node_5441:

port = 5441
bdr.connections='node_5440,node_5442'
bdr.node_5440.dsn='port=5440 dbname=postgres'
bdr.node_5442.dsn='port=5442 dbname=postgres'

Config for node_5442:

port = 5442
bdr.connections='node_5440,node_5441'
bdr.node_5440.dsn='port=5440 dbname=postgres'
bdr.node_5441.dsn='port=5441 dbname=postgres'

In a typical real-world configuration each server would be on the same port on a different host instead.

3-remote site simple Multi-Master Circular Replication

A simpler configuration uses "circular replication". This is simpler to set up but results in higher latency for changes as the number of nodes increases. It's also less resilient to network disruptions and node faults.

  • Site 1
    • Server "Alpha" - Master - feeds changes to Charlie using logical streaming replication
  • Site 2
    • Server "Charlie" - Master - feeds changes to Echo using logical streaming replication
  • Site 3
    • Server "Echo" - Master - feeds changes to Alpha using logical streaming replication

TODO: Regrettably this doesn't actually work yet because we don't cascade logical changes (yet).

Configuration

Using node names that match port numbers, for clarity:

Config for node_5440:

port = 5440
bdr.connections='node_5441'
bdr.node_5441.dsn='port=5441 dbname=postgres'

Config for node_5441:

port = 5441
bdr.connections='node_5442'
bdr.node_5442.dsn='port=5442 dbname=postgres'

Config for node_5442:

port = 5442
bdr.connections='node_5440'
bdr.node_5440.dsn='port=5440 dbname=postgres'

This would usually be done in the real world with databases on different hosts, all running on the same port.

3-remote site Max Availability Multi-Master Plex

  • Site 1
    • Server "Alpha" - Master - feeds changes to Bravo using physical streaming with sync replication
    • Server "Bravo" - Physical Standby - feeds changes to Charlie, Echo using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Delta using physical streaming with sync replication
    • Server "Delta" - Physical Standby - feeds changes to Alpha, Echo using logical streaming
  • Site 3
    • Server "Echo" - Master - feeds changes to Foxtrot using physical streaming with sync replication
    • Server "Foxtrot" - Physical Standby - feeds changes to Alpha, Charlie using logical streaming

Bandwidth and latency between sites is minimised.

Config left as an exercise for the reader.

N-site symmetric cluster replication

A symmetric cluster is one where all masters are connected to each other.

N=19 has been tested and works fine.

Each of N masters requires N-1 connections to the other masters, so the practical limit is under 100 servers, or less if you have many separate databases.

The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.

Complex/Asymmetric Replication

A variety of options is possible.

Conflict Avoidance

Distributed Locking

Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications as very low latency is critical for acceptable performance.

Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible but allow some types of conflict to occur and resolve them when they arise.

Global Sequences

Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are "divergent errors" and are not easily resolvable.

The SQL standard specifies Sequence objects, which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with PostgreSQL's SERIAL pseudo-type.

BDR requires sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.

Conflict Detection & Resolution

Because local writes can occur on a master, conflict detection and avoidance is a concern for basic LLSR setups as well as full BDR configurations.

Lock Conflicts

Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the table being changed and must be able to write-lock the tuple(s) it is changing. Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues with apply blocking on locks.

Concurrent activity on a row includes:

  • explicit row level locking (SELECT ... FOR UPDATE/FOR SHARE)
  • locking from foreign keys
  • implicit locking because of row UPDATEs, INSERTs or DELETEs, either from local activity or apply from other servers

Data Conflicts

Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in a consistent and idempotent manner so that all servers end up with identical results.

Concurrent updates are resolved using a last-update-wins strategy based on commit timestamps. Should the timestamps be identical, the tie is broken using the system identifier from pg_control, though this may change in a future release.

UPDATEs and INSERTs may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as "divergent conflicts". Currently, replication stops should a divergent conflict occur. The errors causing the conflict can be seen in the error log of the downstream master with the problem.

Updates which cannot locate a row are presumed to be DELETE/UPDATE conflicts. These are accepted as successful operations but in the case of UPDATE the data in the UPDATE is discarded.

All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in "false conflicts", where there is no conflict in terms of the data itself, only in terms of the row update. Such conflicts will result in just one of those changes being kept and the other discarded according to last-update-wins. It is not practical to decide at the database level when rows should be merged and when a last-update-wins strategy should be used; such decision making would require support for application-specific conflict resolution plugins.

Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.

Examples

As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.

  • We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check....