BDR User Guide

Revision as of 03:20, 1 May 2013

BDR stands for Bi-Directional Replication.

Design work began in late 2011 to explore adding new features to PostgreSQL core to support a flexible new infrastructure for replication, building upon and enhancing the existing streaming replication features added in 9.1-9.2. Initial design and project planning were by Simon Riggs; the implementation lead is now Andres Freund, both from 2ndQuadrant. There have been various additional development contributions from the wider 2ndQuadrant team, as well as reviews and input from other community developers.

At the PgCon2012CanadaInCoreReplicationMeeting an initial version of the design was presented. A presentation containing the reasons leading to the current design and a prototype of it, including preliminary performance results, is available here.


Project Overview and Plans

Project Aims

  • in core
  • fast
  • reusable individual parts (see below), usable by other projects (slony, ...)
  • basis for easier sharding/write scalability
  • wide geographic distribution of replicated nodes

High Level Planning

9.3

Fundamental changes have been made as part of 9.3 to support BDR, comprising a total of 16 separate commits covering these and other smaller aspects:

  • background workers
  • xlogreader implementation
  • pg_xlogdump

A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists outside of core.

The exact mechanism for dissemination has not yet been announced; the key objective is to develop the code so that it can become core/contrib modules. There is no long-term plan for the code to exist outside of core.

9.4

Objective to implement main BDR features into core Postgres.

9.5

Additional features based upon feedback

Aspects of BDR

Bi-Directional Replication consists of a number of related features

  • Logical Log Streaming Replication - getting data from one master to another.
  • Global Sequences - ability to support sequences that work globally across a set of nodes
  • Conflict Detection & Resolution (options)
  • DDL Replication via Event Triggers

Taken together these features will allow replication in both directions for any pair of servers. We could call this "multi-master replication", but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.

Note that these features aren't "clustering" in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.

User Guide

Logical Log Streaming Replication

Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. From a user perspective this is similar in many ways to "streaming replication", i.e. physical log streaming replication (PLSR) - the main difference is that the receiving server is also a full master database that is not read-only and can make changes. The master that sends data is also known as the upstream master, and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.

The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger-based replication, hence the name "logical log replication". Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.

One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases, then 50 connections are required to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.

Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.

Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:

  • pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)
  • textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)
  • BDR apply process - applies logical changes to the downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing SQL.

Replication of DML changes

All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.

(TRUNCATE is not yet implemented, but will be implemented before the feature goes to final release).

LOCK statements are not replicated (possible future feature).

Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.

DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master - these are likely to go at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.
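As an illustrative sketch (table and column names are hypothetical), a table intended for LLSR replication of UPDATEs and DELETEs should carry a suitable unique constraint:

    CREATE TABLE public.orders (
        order_id bigint PRIMARY KEY,  -- identifying index used by the downstream apply process
        customer text NOT NULL,
        amount   numeric
    );

    -- Without a PRIMARY KEY or UNIQUE NOT NULL constraint, INSERTs still replicate,
    -- but UPDATEs and DELETEs cannot be applied correctly on the downstream master.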

UPDATEs that change the Primary Key of a table will be replicated correctly.

The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive expressions, such as N = N + 1, are resolved to their final value, and volatile or stable functions are evaluated on the master side, with the resulting values replicated. This means that any function side-effects outside the database tables (writing files, network socket activity, updating internal PostgreSQL variables, etc) will not occur on the replicas as the functions are not run again on the replica.
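Continuing the hypothetical orders table from the previous sketch (and assuming an added modified_at column), this shows what is evaluated where:

    -- Both expressions are evaluated on the upstream master.
    UPDATE public.orders
       SET amount      = amount + 1,        -- reflexive expression, resolved to its final value
           modified_at = clock_timestamp()  -- volatile function, evaluated once on the master
     WHERE order_id = 42;

    -- Only the resulting row values are replicated; the function is not re-run on replicas,
    -- so any side-effects outside the database occur only on the upstream master.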

All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.

All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically on both upstream and downstream masters.
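For example (hypothetical type name), a user-defined type simply needs to be created identically on both nodes before replicating data that uses it:

    -- Run on both upstream and downstream masters:
    CREATE TYPE order_status AS ENUM ('new', 'paid', 'shipped');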

The current plugin is binary-only, requiring the upstream and downstream masters to use the same CPU architecture and word length, i.e. "identical servers", as with physical replication.

A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.

Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.

Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement-based replication (e.g. MySQL) or trigger-based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on the upstream master, additional locking on the downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later.)

Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.

SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.
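For example (schema and table names hypothetical), the following statements modify the same physical table and therefore produce the same replicated change, regardless of the search_path in effect on either node:

    SET search_path TO myschema;
    UPDATE mytable SET note = 'x' WHERE id = 1;           -- resolves to myschema.mytable

    RESET search_path;
    UPDATE myschema.mytable SET note = 'x' WHERE id = 1;  -- same table, same replicated change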

NOTIFY is not supported across log based replication, either physical or logical.

In some cases, additional deadlocks can occur on apply. This causes the apply of the replaying transaction to be retried.

Lock waits can cause latency problems and apply delays. This is true of any ordinary locking - LOCK statements, DDL that takes locks, SELECT ... FOR UPDATE/FOR SHARE row-level locks, and the row-level locks taken by UPDATE and DELETE. Because locks are not replicated from the master, this is only a concern for locks taken by other transactions running on the replication target server. To minimize replication delays caused by locking, try to use read-only transactions without explicit locking wherever possible, favor the use of SERIALIZABLE isolation over SELECT ... FOR SHARE, and keep write transactions that touch replicated tables short.
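A sketch of these suggestions, for transactions run on the replication target server (table and column names are hypothetical):

    -- Preferred: read-only transaction, no explicit row locks held against replicated tables
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY;
    SELECT * FROM public.orders WHERE order_id = 42;
    COMMIT;

    -- Avoid where possible: long-lived explicit row locks that can delay the apply process
    BEGIN;
    SELECT * FROM public.orders WHERE order_id = 42 FOR SHARE;
    -- ... long-running work while the row lock is held ...
    COMMIT;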

Table definitions and DDL replication

DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.

This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.

In general, "exact match" is the best guide. Current details (subject to change) are

  • Secondary indexes may differ between nodes
  • Constraints must match for BDR
  • Storage parameters must match
  • Table-level parameters, e.g. fillfactor, autovacuum, may differ
  • Inheritance must be the same
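As a hypothetical illustration of the "exact match" guideline, the table definition below would need to be identical on both nodes, while the secondary index may exist on only one of them:

    -- Must be identical on upstream and downstream masters
    CREATE TABLE myschema.mytable (
        id      bigint PRIMARY KEY,
        payload text NOT NULL
    );

    -- Secondary indexes may differ between nodes, e.g. an extra reporting index
    -- created only on the downstream master:
    CREATE INDEX mytable_payload_idx ON myschema.mytable (payload);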

Triggers and Rules are NOT executed by apply on the downstream side; this is equivalent to an enforced setting of session_replication_role = replica.
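If a particular trigger should also fire for applied changes, PostgreSQL's standard per-trigger firing modes can be used. This is a sketch with hypothetical names, assuming the apply process behaves like session_replication_role = replica as described above; whether this is desirable depends entirely on the trigger:

    -- Default ('origin') triggers do not fire during apply.
    -- ENABLE ALWAYS makes a trigger fire for both local and applied changes;
    -- ENABLE REPLICA makes it fire only for applied changes.
    ALTER TABLE myschema.mytable ENABLE ALWAYS TRIGGER mytable_audit_trigger;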

Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.

Selective Replication (Table/Row-level filtering)

LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.

Other Terminology

(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could at least be reconfigured that way in the future.

Similarly, the terms Origin, Provider and Subscriber only work with a single Origin.

Configuration

Upstream master

  • wal_level = 'logical'
  • max_logical_slots = X
  • max_wal_senders = Y # Y = max_logical_slots plus any physical streaming requirements

Downstream master

  • shared_preload_libraries = 'bdr'
  • bdr.connections="name_of_upstream_master" # list of upstream master nodenames
  • bdr.<nodename>.dsn = 'dbname=postgres' # connection string for connection from downstream to upstream master
  • bdr.<nodename>.local_dbname = 'xxx' # optional parameter to cover the case where the database name on the upstream and downstream master differ (not yet implemented)
  • bdr.<nodename>.apply_delay # optional parameter to delay apply of transactions, time in milliseconds
  • bdr.synchronous_commit = ...; # optional parameter to set the synchronous_commit setting that the apply processes will use
  • max_logical_slots = X # set to the number of remotes

wal_keep_segments should be set to a value that allows for some downtime of server/network.

New/Changed Parameter Reference

bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.<nodename>.dsn entry

bdr.<nodename>.dsn - "data source name" - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication

max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start

wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.

Tuning

As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.

There are no parameters for tuning transfer latency.

The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.

A variant of hot_standby_feedback could be implemented also, though would likely need renaming.

The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.

Operational Issues and Debugging

In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.

Monitoring

Some new/changed views are available for monitoring activity

  • pg_stat_replication
  • pg_stat_logical_decoding
  • pg_stat_logical_replication

Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, the two views below should show matching results on the upstream and downstream masters (unless stats have been reset).

  • pg_stat_user_tables
  • pg_statio_user_tables

Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.

  • pg_stat_user_indexes
  • pg_statio_user_indexes
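A simple check that can be run on both nodes to compare table activity, using the standard statistics views (no BDR-specific columns assumed):

    SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
      FROM pg_stat_user_tables
     ORDER BY relname;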

Bi-Directional Replication Use Cases

Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.

Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, allowing an upstream master to send to a physical standby which in turn sends to a downstream master.

Simple multi-master pair

  • "HA Cluster"
    • Server "Alpha" - Master
    • Server "Bravo" - Master
Configuration
  • Alpha
    • wal_level = 'logical'
    • max_logical_slots = 3
    • max_wal_senders = 4 # Y = max_logical_slots plus any physical streaming requirements
    • wal_keep_segments = 5000
    • shared_preload_libraries = 'bdr'
    • bdr.connections="bravo" # list of upstream master nodenames
    • bdr.bravo.dsn = 'dbname=postgres' # connection string for connection from downstream to upstream master
  • Bravo
    • wal_level = 'logical'
    • max_logical_slots = 3
    • max_wal_senders = 4 # Y = max_logical_slots plus any physical streaming requirements
    • wal_keep_segments = 5000
    • shared_preload_libraries = 'bdr'
    • bdr.connections="alpha" # list of upstream master nodenames
    • bdr.alpha.dsn = 'dbname=postgres' # connection string for connection from downstream to upstream master

HA and Logical Standby

Downstream masters allow users to create temporary tables, so they can be used as reporting servers.

  • "HA Cluster"
    • Server "Alpha" - Current Master
    • Server "Bravo" - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names
    • Server "Charlie" - "Logical Standby" - downstream master

Very High Availability Multi-Master

A typical configuration for remote multi-master would then be:

  • Site 1
    • Server "Alpha" - Master - feeds changes to Bravo using physical streaming with sync replication
    • Server "Bravo" - Physical Standby - feeds changes to Charlie using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Delta using physical streaming with sync replication
    • Server "Delta" - Physical Standby - feeds changes to Alpha using logical streaming

Bandwidth between Site 1 and Site 2 is minimised

3-remote site simple Multi-Master Plex

BDR supports "all to all" connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)

  • Site 1
    • Server "Alpha" - Master - feeds changes to Charlie, Echo using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Alpha, Echo using logical streaming replication
  • Site 3
    • Server "Echo" - Master - feeds changes to Alpha, Charlie using logical streaming replication
Configuration

Using node names that match port numbers, for clarity

  • config for 5440:
    • port = 5440
    • bdr.connections='node_5441,node_5442'
    • bdr.node_5441.dsn='port=5441 dbname=postgres'
    • bdr.node_5442.dsn='port=5442 dbname=postgres'
  • config for 5441:
    • port = 5441
    • bdr.connections='node_5440,node_5442'
    • bdr.node_5440.dsn='port=5440 dbname=postgres'
    • bdr.node_5442.dsn='port=5442 dbname=postgres'
  • config for 5442:
    • port = 5442
    • bdr.connections='node_5440,node_5441'
    • bdr.node_5440.dsn='port=5440 dbname=postgres'
    • bdr.node_5441.dsn='port=5441 dbname=postgres'

3-remote site simple Multi-Master Circular Replication

A simpler configuration uses "circular replication". This results in higher latency for changes as the number of nodes increases.

  • Site 1
    • Server "Alpha" - Master - feeds changes to Charlie using logical streaming replication
  • Site 2
    • Server "Charlie" - Master - feeds changes to Echo using logical streaming replication
  • Site 3
    • Server "Echo" - Master - feeds changes to Alpha using logical streaming replication
Configuration

Using node names that match port numbers, for clarity

  • config for 5440:
    • port = 5440
    • bdr.connections='node_5441'
    • bdr.node_5441.dsn='port=5441 dbname=postgres'
  • config for 5441:
    • port = 5441
    • bdr.connections='node_5442'
    • bdr.node_5442.dsn='port=5442 dbname=postgres'
  • config for 5442:
    • port = 5442
    • bdr.connections='node_5440'
    • bdr.node_5440.dsn='port=5440 dbname=postgres'

Regrettably this doesn't actually work yet, because logical changes are not yet cascaded.

3-remote site Max Availability Multi-Master Plex

  • Site 1
    • Server "Alpha" - Master - feeds changes to Bravo using physical streaming with sync replication
    • Server "Bravo" - Physical Standby - feeds changes to Charlie, Echo using logical streaming
  • Site 2
    • Server "Charlie" - Master - feeds changes to Delta using physical streaming with sync replication
    • Server "Delta" - Physical Standby - feeds changes to Alpha, Echo using logical streaming
  • Site 3
    • Server "Echo" - Master - feeds changes to Foxtrot using physical streaming with sync replication
    • Server "Foxtrot" - Physical Standby - feeds changes to Alpha, Charlie using logical streaming

Bandwidth and latency between sites is minimised.

Config left as an exercise for the reader.

N-site symmetric cluster replication

Symmetric cluster is where all masters are connected to each other.

N=19 has been tested and works fine.

N masters requires N-1 connections to other masters, so practical limits are <100 servers, or fewer if you have many separate databases.

The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.

Complex/Asymmetric Replication

A variety of options is possible.

Conflict Avoidance

Distributed Locking

Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible, but allow some types of conflict to occur and then resolve them when that happens.

Global Sequences

Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are "divergent errors" and are not easily resolvable.

The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.
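The conventional single-node form looks like this (hypothetical names); Global Sequences generalise where the values come from, not the SQL usage:

    CREATE SEQUENCE mysequence;

    CREATE TABLE public.items (
        item_id bigint NOT NULL DEFAULT nextval('mysequence'),
        name    text
    );

    -- equivalently, using the SERIAL shorthand:
    CREATE TABLE public.items2 (
        item_id serial,
        name    text
    );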

BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.

Conflict Detection & Resolution

Lock Conflicts

Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.
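For example, lock waits can be surfaced in the server log by setting log_lock_waits = on in postgresql.conf, and ungranted lock requests (such as an apply process blocked behind a concurrent local transaction) can be inspected directly:

    SELECT pid, locktype, relation::regclass AS relation, mode, granted
      FROM pg_locks
     WHERE NOT granted;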

By concurrent activity on a row, we include

  • explicit row level locking
  • locking from foreign keys
  • implicit locking because of row updates or deletes, from local activity or apply from other servers

Data Conflicts

Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, similar manner so that all servers end with identical results.

Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should the timestamps be identical, the tie is (currently) broken using the system identifier from pg_control.

Updates and Inserts may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as "divergent conflicts". Currently, replication stops should a divergent conflict occur.

Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.

Conflicts are logged if we specify bdr.log_conflicts = on

All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in "false conflicts", where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made and the other discarded, according to last-update-wins.
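A hypothetical illustration of such a false conflict, using an invented contact table: two nodes update different columns of the same row concurrently, but resolution happens at row level, so only one of the resulting row versions survives:

    -- On node A:
    UPDATE public.contact SET email = 'new@example.com' WHERE contact_id = 7;

    -- Concurrently on node B:
    UPDATE public.contact SET phone = '555-0100' WHERE contact_id = 7;

    -- After replication and last-update-wins resolution, every node keeps the whole row
    -- from just one of these updates; the other column change is discarded.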

Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes, since the unlogged tables aren't replicated.

Examples

As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.

  • We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check....
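A minimal sketch of the two tables in this example (column names are hypothetical):

    CREATE TABLE public.customer (
        customer_id bigint PRIMARY KEY,
        name        text NOT NULL
    );

    CREATE TABLE public.activity (
        activity_id bigint PRIMARY KEY,
        customer_id bigint NOT NULL REFERENCES public.customer (customer_id),
        details     text
    );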