<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://wiki.postgresql.org/skins/common/feed.css?207"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://wiki.postgresql.org/index.php?title=Special:Contributions&amp;feed=atom&amp;target=Simon</id>
		<title>PostgreSQL wiki - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="http://wiki.postgresql.org/index.php?title=Special:Contributions&amp;feed=atom&amp;target=Simon"/>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/Special:Contributions/Simon"/>
		<updated>2013-05-20T16:06:11Z</updated>
		<subtitle>From PostgreSQL wiki</subtitle>
		<generator>MediaWiki 1.15.5-2squeeze5</generator>

	<entry>
		<id>http://wiki.postgresql.org/wiki/PostgreSQL_9.3_Open_Items</id>
		<title>PostgreSQL 9.3 Open Items</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/PostgreSQL_9.3_Open_Items"/>
				<updated>2013-05-19T16:36:59Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Project Planning ==&lt;br /&gt;
See the [[PostgreSQL 9.3 Development Plan]].&lt;br /&gt;
&lt;br /&gt;
== Blockers for 9.3 ==&lt;br /&gt;
* [http://www.postgresql.org/message-id/12365.1358098148@sss.pgh.pa.us Restore protection against accidentally creating stuff in pg_catalog schema]&lt;br /&gt;
** There are a couple of ways to fix that, per thread, but we need to do something.&lt;br /&gt;
* Improve the {{messageLink|m2zjwxpk8j.fsf@2ndQuadrant.fr|event trigger API documentation?}}&lt;br /&gt;
* [http://www.postgresql.org/message-id/5188CFFA.3020209@vmware.com Fast promotion failure] - doesn't appear to be an issue with fast promotion, not sure where though...&lt;br /&gt;
&lt;br /&gt;
== Not Blockers for 9.3 ==&lt;br /&gt;
&lt;br /&gt;
* Consider whether COPY into newly created tables should ALWAYS freeze, perhaps only when checksums enabled - requested by Noah, Robert&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwHs3r2k-N7C=vWEk5e-fE7sTGWgZjbkD6X2_s0h+zqoVQ@mail.gmail.com fast promotion and log_checkpoints] - trivial cosmetic issue; not a bug of any kind&lt;br /&gt;
&lt;br /&gt;
== Meta-Issues ==&lt;br /&gt;
&lt;br /&gt;
== Resolved Issues ==&lt;br /&gt;
* Restructure ProcessUtility to fix event-trigger clobber-cache-always failures&lt;br /&gt;
* Agree on the new page checksum algorithm&lt;br /&gt;
* Fix planner {{messageLink|6546.1365701142@sss.pgh.pa.us|equivalence-class bugs}}&lt;br /&gt;
* pg_ctl's new idempotent option [http://www.postgresql.org/message-id/CAMkU=1zKGzGoDoO=u4MON8h6Q=biRL59PTZvRmR9J7uX0yKoyA@mail.gmail.com broke crash recovery cases]&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwESOme9HUmUq_jTYi8j++qP2HoZxyqXR=37zuU8tHEOkw@mail.gmail.com VACUUM breaks matview scannability state]&lt;br /&gt;
* Do something about {{messageLink|14345.1365001149@sss.pgh.pa.us|unlogged matviews}}&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwEO1GF3=7LAeyn630+PNjjkC4fnwqx_L_Vyq6Y+OsB5jg@mail.gmail.com Another assertion failure at promotion]&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAB7nPqRhuCuuD012GCB_tAAFrixx2WioN_zfXQcvLuRab8DN2g@mail.gmail.com Assertion failure when promoting node by deleting recovery.conf and restart node]&lt;br /&gt;
&lt;br /&gt;
== Long-term Issues ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:PostgreSQL 9.3]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/PostgreSQL_9.3_Open_Items</id>
		<title>PostgreSQL 9.3 Open Items</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/PostgreSQL_9.3_Open_Items"/>
				<updated>2013-05-19T14:38:01Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Not Blockers for 9.3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Project Planning ==&lt;br /&gt;
See the [[PostgreSQL 9.3 Development Plan]].&lt;br /&gt;
&lt;br /&gt;
== Blockers for 9.3 ==&lt;br /&gt;
* [http://www.postgresql.org/message-id/12365.1358098148@sss.pgh.pa.us Restore protection against accidentally creating stuff in pg_catalog schema]&lt;br /&gt;
** There are a couple of ways to fix that, per thread, but we need to do something.&lt;br /&gt;
* Improve the {{messageLink|m2zjwxpk8j.fsf@2ndQuadrant.fr|event trigger API documentation?}}&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAB7nPqRhuCuuD012GCB_tAAFrixx2WioN_zfXQcvLuRab8DN2g@mail.gmail.com Assertion failure when promoting node by deleting recovery.conf and restart node]&lt;br /&gt;
* [http://www.postgresql.org/message-id/5188CFFA.3020209@vmware.com Fast promotion failure]&lt;br /&gt;
&lt;br /&gt;
== Not Blockers for 9.3 ==&lt;br /&gt;
&lt;br /&gt;
* Consider whether COPY into newly created tables should ALWAYS freeze, perhaps only when checksums enabled - requested by Noah, Robert&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwHs3r2k-N7C=vWEk5e-fE7sTGWgZjbkD6X2_s0h+zqoVQ@mail.gmail.com fast promotion and log_checkpoints] - trivial cosmetic issue; not a bug of any kind&lt;br /&gt;
&lt;br /&gt;
== Meta-Issues ==&lt;br /&gt;
&lt;br /&gt;
== Resolved Issues ==&lt;br /&gt;
* Restructure ProcessUtility to fix event-trigger clobber-cache-always failures&lt;br /&gt;
* Agree on the new page checksum algorithm&lt;br /&gt;
* Fix planner {{messageLink|6546.1365701142@sss.pgh.pa.us|equivalence-class bugs}}&lt;br /&gt;
* pg_ctl's new idempotent option [http://www.postgresql.org/message-id/CAMkU=1zKGzGoDoO=u4MON8h6Q=biRL59PTZvRmR9J7uX0yKoyA@mail.gmail.com broke crash recovery cases]&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwESOme9HUmUq_jTYi8j++qP2HoZxyqXR=37zuU8tHEOkw@mail.gmail.com VACUUM breaks matview scannability state]&lt;br /&gt;
* Do something about {{messageLink|14345.1365001149@sss.pgh.pa.us|unlogged matviews}}&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwEO1GF3=7LAeyn630+PNjjkC4fnwqx_L_Vyq6Y+OsB5jg@mail.gmail.com Another assertion failure at promotion]&lt;br /&gt;
&lt;br /&gt;
== Long-term Issues ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:PostgreSQL 9.3]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/PostgreSQL_9.3_Open_Items</id>
		<title>PostgreSQL 9.3 Open Items</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/PostgreSQL_9.3_Open_Items"/>
				<updated>2013-05-19T14:37:19Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Not Blockers for 9.3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Project Planning ==&lt;br /&gt;
See the [[PostgreSQL 9.3 Development Plan]].&lt;br /&gt;
&lt;br /&gt;
== Blockers for 9.3 ==&lt;br /&gt;
* [http://www.postgresql.org/message-id/12365.1358098148@sss.pgh.pa.us Restore protection against accidentally creating stuff in pg_catalog schema]&lt;br /&gt;
** There are a couple of ways to fix that, per thread, but we need to do something.&lt;br /&gt;
* Improve the {{messageLink|m2zjwxpk8j.fsf@2ndQuadrant.fr|event trigger API documentation?}}&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAB7nPqRhuCuuD012GCB_tAAFrixx2WioN_zfXQcvLuRab8DN2g@mail.gmail.com Assertion failure when promoting node by deleting recovery.conf and restart node]&lt;br /&gt;
* [http://www.postgresql.org/message-id/5188CFFA.3020209@vmware.com Fast promotion failure]&lt;br /&gt;
&lt;br /&gt;
== Not Blockers for 9.3 ==&lt;br /&gt;
&lt;br /&gt;
* Consider whether message for COPY FREEZE needs to be added&lt;br /&gt;
* Consider whether COPY into newly created tables should ALWAYS freeze, perhaps only when checksums enabled&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwHs3r2k-N7C=vWEk5e-fE7sTGWgZjbkD6X2_s0h+zqoVQ@mail.gmail.com fast promotion and log_checkpoints] - trivial cosmetic issue; not a bug of any kind&lt;br /&gt;
&lt;br /&gt;
== Meta-Issues ==&lt;br /&gt;
&lt;br /&gt;
== Resolved Issues ==&lt;br /&gt;
* Restructure ProcessUtility to fix event-trigger clobber-cache-always failures&lt;br /&gt;
* Agree on the new page checksum algorithm&lt;br /&gt;
* Fix planner {{messageLink|6546.1365701142@sss.pgh.pa.us|equivalence-class bugs}}&lt;br /&gt;
* pg_ctl's new idempotent option [http://www.postgresql.org/message-id/CAMkU=1zKGzGoDoO=u4MON8h6Q=biRL59PTZvRmR9J7uX0yKoyA@mail.gmail.com broke crash recovery cases]&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwESOme9HUmUq_jTYi8j++qP2HoZxyqXR=37zuU8tHEOkw@mail.gmail.com VACUUM breaks matview scannability state]&lt;br /&gt;
* Do something about {{messageLink|14345.1365001149@sss.pgh.pa.us|unlogged matviews}}&lt;br /&gt;
* [http://www.postgresql.org/message-id/CAHGQGwEO1GF3=7LAeyn630+PNjjkC4fnwqx_L_Vyq6Y+OsB5jg@mail.gmail.com Another assertion failure at promotion]&lt;br /&gt;
&lt;br /&gt;
== Long-term Issues ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:PostgreSQL 9.3]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-05-16T13:36:32Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Replication of DML changes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;----&lt;br /&gt;
This page is the user and administrator guide for BDR. If you're looking for technical details on the project plan and implementation, see [[BDR Project]].&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
= BDR User Guide =&lt;br /&gt;
&lt;br /&gt;
BDR (BiDirectional Replication) is a feature being developed for inclusion in PostgreSQL core that provides greatly enhanced replication capabilities.&lt;br /&gt;
&lt;br /&gt;
BDR allows users to create a geographically distributed multi-master database using Logical Log Streaming Replication (LLSR) transport.&lt;br /&gt;
BDR is designed to provide both high availability and geographically distributed disaster recovery capabilities. &lt;br /&gt;
&lt;br /&gt;
BDR is not “clustering” as some vendors use the term, in that it doesn't have a distributed lock manager, global transaction co-ordinator, etc. Each member server is separate yet connected, with design choices that allow separation between nodes that would not be possible with global transaction coordination.&lt;br /&gt;
&lt;br /&gt;
Guidance on getting a testing setup established is in [[#Initial setup]]. Please read the full documentation if you intend to put BDR into production.&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows one PostgreSQL master (the &amp;quot;upstream master&amp;quot;) to stream a sequence of changes to another read/write PostgreSQL server (the &amp;quot;downstream master&amp;quot;). Data is sent in one direction only over a normal libpq connection.&lt;br /&gt;
&lt;br /&gt;
Multiple LLSR connections can be used to set up bi-directional replication as discussed later in this guide.&lt;br /&gt;
&lt;br /&gt;
=== Overview of logical replication ===&lt;br /&gt;
&lt;br /&gt;
In some ways LLSR is similar to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective; both replicate changes from one server to another. However, in LLSR the receiving server is also a full master database that can make changes, unlike the read-only replicas offered by PLSR hot standby. Additionally, LLSR is per-database, whereas PLSR is per-cluster and replicates all databases at once. There are many more differences discussed in the relevant sections of this document.&lt;br /&gt;
&lt;br /&gt;
In LLSR the data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, which makes change capture on the upstream master much more efficient than trigger-based replication and is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases, then 50 connections are required to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by not configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after &amp;lt;tt&amp;gt;CREATE DATABASE&amp;lt;/tt&amp;gt;. A restart of the downstream master is also required. The upstream master only needs restarting if the &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; parameter is too low to allow a new replica to be added. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated. Setup is discussed in more detail below.&lt;br /&gt;
&lt;br /&gt;
Changes are processed by the downstream master using &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; plug-ins. This allows flexible handling of replication input, including:&lt;br /&gt;
&lt;br /&gt;
* BDR apply process - applies logical changes to the downstream master. The apply process makes changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
* Textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_xlogdump&amp;lt;/tt&amp;gt; - examines physical WAL records and produces textual debugging output. This server program is included in PostgreSQL 9.3.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;TRUNCATE&amp;lt;/tt&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented, but will be implemented before the feature goes to final release).&lt;br /&gt;
&lt;br /&gt;
Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it has overheads that mean that it doesn't always use less bandwidth than PLSR.&lt;br /&gt;
&lt;br /&gt;
Locks taken by &amp;lt;tt&amp;gt;LOCK&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/SHARE&amp;lt;/tt&amp;gt; on the upstream master are not replicated to downstream masters. Locks taken automatically by &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;TRUNCATE&amp;lt;/tt&amp;gt; *are* taken on the downstream master and may delay replication apply or concurrent transactions - see [[#Lock Conflicts|Lock Conflicts]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;TEMPORARY&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;UNLOGGED&amp;lt;/tt&amp;gt; tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables. However, temporary tables remain specific to a particular session so creating a temporary table on the upstream master does not create a similar table on the downstream master.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to be applied at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s require some form of unique constraint, either &amp;lt;tt&amp;gt;PRIMARY KEY&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;UNIQUE NOT NULL&amp;lt;/tt&amp;gt;. A warning is issued in the downstream master's logs if the expected constraint is absent. &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s on the upstream master do not require a unique constraint in order to replicate correctly, though the lack of one would prevent conflict detection between multiple masters, if that is considered important.&lt;br /&gt;
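&lt;br /&gt;
As a minimal sketch of table definitions that meet this requirement (the table and column names are hypothetical, chosen only for illustration), either form of unique constraint below should allow replicated &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s to identify the affected rows:&lt;br /&gt;
&lt;br /&gt;
 -- Hypothetical tables; the names are illustrative only.&lt;br /&gt;
 CREATE TABLE account (&lt;br /&gt;
     account_id integer PRIMARY KEY,     -- a PRIMARY KEY satisfies the requirement&lt;br /&gt;
     balance    numeric NOT NULL&lt;br /&gt;
 );&lt;br /&gt;
 CREATE TABLE device (&lt;br /&gt;
     serial_no  text UNIQUE NOT NULL,    -- UNIQUE NOT NULL also satisfies it&lt;br /&gt;
     label      text&lt;br /&gt;
 );&lt;br /&gt;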
&lt;br /&gt;
&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s that change the value of the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the final values from the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; on the upstream master, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as &amp;lt;tt&amp;gt;N = N + 1&amp;lt;/tt&amp;gt;, are resolved to their final value. Volatile or stable functions are evaluated on the master side and the resulting values are replicated. Consequently any function side-effects (writing files, network socket activity, updating internal PostgreSQL variables, etc.) will not occur on the replicas, as the functions are not run again on the replica.&lt;br /&gt;
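&lt;br /&gt;
To illustrate the point about final values, here is a small sketch using a hypothetical table (all names are assumptions made for this example). It needs to exist on both nodes, since DDL is not replicated:&lt;br /&gt;
&lt;br /&gt;
 -- Hypothetical table, for illustration only.&lt;br /&gt;
 CREATE TABLE page_hits (&lt;br /&gt;
     page_id   integer PRIMARY KEY,&lt;br /&gt;
     hits      bigint NOT NULL,&lt;br /&gt;
     last_seen timestamptz&lt;br /&gt;
 );&lt;br /&gt;
 INSERT INTO page_hits VALUES (1, 41, now());&lt;br /&gt;
 -- Executed on the upstream master:&lt;br /&gt;
 UPDATE page_hits SET hits = hits + 1, last_seen = clock_timestamp() WHERE page_id = 1;&lt;br /&gt;
 -- The downstream master receives the resulting row values (hits = 42 and the&lt;br /&gt;
 -- timestamp computed upstream). It does not re-evaluate hits + 1 and does not&lt;br /&gt;
 -- call clock_timestamp() again.&lt;br /&gt;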
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master (see &amp;quot;Limitations&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
The current LLSR plugin implementation uses the binary libpq protocol, so it requires that the upstream and downstream masters use the same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication. A textual output option will be added later for passing data between non-identical servers, e.g. laptops or mobile devices communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding &amp;lt;tt&amp;gt;AccessExclusiveLock&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in the sequence in which they were committed on the upstream master. This is a known-good serialized ordering of changes, so replication serialization failures are not theoretically possible. Such failures are common in systems that use statement-based replication (e.g. MySQL) or trigger-based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on the upstream master, additional locking on the downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A planned enhancement is to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;SET&amp;lt;/tt&amp;gt; statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. We always update the correct tables, whatever the setting of &amp;lt;tt&amp;gt;search_path&amp;lt;/tt&amp;gt;. Values are replicated correctly irrespective of the values of &amp;lt;tt&amp;gt;bytea_output&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;TimeZone&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DateStyle&amp;lt;/tt&amp;gt;, etc.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; is not supported across log based replication, either physical or logical. &amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;LISTEN&amp;lt;/tt&amp;gt; will work fine on the upstream master but an upstream &amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; will not trigger a downstream &amp;lt;tt&amp;gt;LISTEN&amp;lt;/tt&amp;gt;er.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes an automatic retry of the apply of the replaying transaction and is only an issue if the deadlock recurs repeatedly, delaying replication.&lt;br /&gt;
&lt;br /&gt;
From a performance and concurrency perspective the BDR apply process is similar to a normal backend. Frequent conflicts with locks from other transactions when replaying changes can slow things down and thus increase replication delay, so reducing the frequency of such conflicts can be a good way to speed things up. Any lock held by another transaction on the downstream master - &amp;lt;tt&amp;gt;LOCK&amp;lt;/tt&amp;gt; statements, &amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/FOR SHARE&amp;lt;/tt&amp;gt;, or &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; row locks - can delay replication if the replication apply process needs to change the locked table/row.&lt;br /&gt;
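&lt;br /&gt;
A minimal sketch of such a conflict, run in an ordinary session on the downstream master (the table name is hypothetical):&lt;br /&gt;
&lt;br /&gt;
 -- Session on the downstream master; the table name is illustrative only.&lt;br /&gt;
 BEGIN;&lt;br /&gt;
 SELECT * FROM account WHERE account_id = 1 FOR UPDATE;&lt;br /&gt;
 -- While this transaction stays open, a replicated UPDATE or DELETE of the&lt;br /&gt;
 -- same row has to wait for the row lock. Because apply is single-threaded,&lt;br /&gt;
 -- changes queued behind it are delayed as well.&lt;br /&gt;
 COMMIT;  -- releasing the row lock lets the apply process continue&lt;br /&gt;
&lt;br /&gt;
Keeping such transactions short on the downstream master is one way to limit replication delay.&lt;br /&gt;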
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching &amp;lt;tt&amp;gt;&amp;quot;Schemaname&amp;quot;.&amp;quot;Tablename&amp;quot;&amp;lt;/tt&amp;gt; on both upstream and downstream masters, e.g. changes from upstream's &amp;lt;tt&amp;gt;public.mytable&amp;lt;/tt&amp;gt; will go to downstream's &amp;lt;tt&amp;gt;public.mytable&amp;lt;/tt&amp;gt; while changes to the upstream &amp;lt;tt&amp;gt;myschema.mytable&amp;lt;/tt&amp;gt; will go to the downstream &amp;lt;tt&amp;gt;myschema.mytable&amp;lt;/tt&amp;gt;. This works even when no schema is specified in the original SQL, since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
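&lt;br /&gt;
For example, under the assumption that a table with the hypothetical definition below has been created identically on both nodes, an unqualified statement on the upstream master is still routed by schema and table name:&lt;br /&gt;
&lt;br /&gt;
 -- Run on BOTH nodes, since DDL is not replicated (names are illustrative):&lt;br /&gt;
 CREATE SCHEMA myschema;&lt;br /&gt;
 CREATE TABLE myschema.mytable (id integer PRIMARY KEY);&lt;br /&gt;
 -- Run on the upstream master only:&lt;br /&gt;
 SET search_path = myschema, public;&lt;br /&gt;
 INSERT INTO mytable VALUES (1);   -- resolves to myschema.mytable upstream&lt;br /&gt;
 -- The row is applied to myschema.mytable on the downstream master,&lt;br /&gt;
 -- whatever search_path is in effect there.&lt;br /&gt;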
&lt;br /&gt;
This requires careful synchronization of table definitions on each node, otherwise &amp;lt;tt&amp;gt;ERROR&amp;lt;/tt&amp;gt;s will be generated by the replication apply process. In general, tables must be an exact match between upstream and downstream masters.&lt;br /&gt;
&lt;br /&gt;
There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
Tables must meet the following requirements to be compatible for purposes of LLSR:&lt;br /&gt;
&lt;br /&gt;
* The downstream master must only have constraints (&amp;lt;tt&amp;gt;CHECK&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UNIQUE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;EXCLUSION&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;FOREIGN KEY&amp;lt;/tt&amp;gt;, etc) that are also present on the upstream master. Replication may initially work with mismatched constraints but is likely to fail as soon as the downstream master rejects a row the upstream master accepted.&lt;br /&gt;
* The table referenced by a FOREIGN KEY on a downstream master must have all the keys present in the upstream master version of the same table.&lt;br /&gt;
* Storage parameters must match, except as allowed below&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
* Dropped columns on master must be present on replicas&lt;br /&gt;
* Custom types and enum definitions must match exactly&lt;br /&gt;
* Composite types and enums must have the same oids on master and replication target&lt;br /&gt;
* Extensions defining types used in replicated tables must be of the same version or fully SQL-level compatible and the oids of the types they define must match.&lt;br /&gt;
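&lt;br /&gt;
To make the first requirement concrete, the following sketch shows a mismatch that would be expected to break replication. The table is hypothetical and the two definitions differ only in the extra &amp;lt;tt&amp;gt;CHECK&amp;lt;/tt&amp;gt; constraint on the downstream side:&lt;br /&gt;
&lt;br /&gt;
 -- On the upstream master (illustrative names):&lt;br /&gt;
 CREATE TABLE item (item_id integer PRIMARY KEY, qty integer);&lt;br /&gt;
 -- On the downstream master, with a constraint the upstream lacks:&lt;br /&gt;
 CREATE TABLE item (item_id integer PRIMARY KEY, qty integer CHECK (qty &amp;gt;= 0));&lt;br /&gt;
 -- On the upstream master this row is accepted, but the downstream&lt;br /&gt;
 -- definition would reject it when the change is applied:&lt;br /&gt;
 INSERT INTO item VALUES (1, -5);&lt;br /&gt;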
&lt;br /&gt;
The following differences are permissible between tables on different nodes:&lt;br /&gt;
&lt;br /&gt;
* The table's &amp;lt;tt&amp;gt;pg_class&amp;lt;/tt&amp;gt; oid, the oid of its associated TOAST table, and the oid of the table's rowtype in &amp;lt;tt&amp;gt;pg_type&amp;lt;/tt&amp;gt; may differ;&lt;br /&gt;
* Extra or missing non-&amp;lt;tt&amp;gt;UNIQUE&amp;lt;/tt&amp;gt; indexes&lt;br /&gt;
* Extra keys in downstream lookup tables for &amp;lt;tt&amp;gt;FOREIGN KEY&amp;lt;/tt&amp;gt; references that are not present on the upstream master&lt;br /&gt;
* The table-level storage parameters for fillfactor and autovacuum&lt;br /&gt;
* Triggers and rules may differ (they are not executed by replication apply)&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR (see [[#LLSR limitations|LLSR limitations]]).&lt;br /&gt;
&lt;br /&gt;
Triggers and rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of &amp;lt;tt&amp;gt;session_replication_role = replica&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In future it is expected that composite types and enums with non-identical oids will be converted using text output and input functions. This feature is not yet implemented.&lt;br /&gt;
&lt;br /&gt;
=== LLSR limitations ===&lt;br /&gt;
&lt;br /&gt;
The current LLSR implementation is subject to some limitations, which are being progressively removed as work progresses.&lt;br /&gt;
&lt;br /&gt;
==== Data definition compatibility ====&lt;br /&gt;
&lt;br /&gt;
Table definitions, types, extensions, etc must be near identical between upstream and downstream masters. See [[#Table definitions and DDL replication|Table definitions and DDL replication]].&lt;br /&gt;
&lt;br /&gt;
==== DDL Replication ====&lt;br /&gt;
&lt;br /&gt;
DDL replication is not yet supported.&lt;br /&gt;
&lt;br /&gt;
==== Upstream feedback ====&lt;br /&gt;
&lt;br /&gt;
No feedback from downstream masters to the upstream master is implemented for asynchronous LLSR, so upstream masters must be configured to keep enough WAL. See [[#Configuration|Configuration]].&lt;br /&gt;
&lt;br /&gt;
==== TRUNCATE is not replicated ====&lt;br /&gt;
&lt;br /&gt;
TRUNCATE is not yet supported; however, workarounds with user-level triggers are possible and a ProcessUtility hook is planned to implement a similar approach globally.&lt;br /&gt;
&lt;br /&gt;
The safest option is to define a user-level BEFORE trigger on each table that RAISEs an ERROR when TRUNCATE is attempted.&lt;br /&gt;
&lt;br /&gt;
A simple truncate-blocking trigger is:&lt;br /&gt;
&lt;br /&gt;
 CREATE OR REPLACE FUNCTION deny_truncate() RETURNS trigger AS $$&lt;br /&gt;
 BEGIN&lt;br /&gt;
   IF tg_op = 'TRUNCATE' THEN&lt;br /&gt;
     RAISE EXCEPTION 'TRUNCATE is not supported on this table. Please use DELETE FROM.';&lt;br /&gt;
   ELSE&lt;br /&gt;
     RAISE EXCEPTION 'This trigger only supports TRUNCATE';&lt;br /&gt;
   END IF;&lt;br /&gt;
 END;&lt;br /&gt;
 $$ LANGUAGE plpgsql;&lt;br /&gt;
&lt;br /&gt;
It can be applied to a table with:&lt;br /&gt;
&lt;br /&gt;
 CREATE TRIGGER deny_truncate_on_&amp;lt;tablename&amp;gt; BEFORE TRUNCATE ON &amp;lt;tablename&amp;gt;&lt;br /&gt;
 FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate();&lt;br /&gt;
&lt;br /&gt;
A PL/PgSQL DO block that queries &amp;lt;tt&amp;gt;pg_class&amp;lt;/tt&amp;gt; and loops over it to &amp;lt;tt&amp;gt;EXECUTE&amp;lt;/tt&amp;gt; a dynamic SQL &amp;lt;tt&amp;gt;CREATE TRIGGER&amp;lt;/tt&amp;gt; command for each table that does not already have the trigger can be used to apply the trigger to all tables.&lt;br /&gt;
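&lt;br /&gt;
A minimal sketch of such a DO block follows. It assumes the &amp;lt;tt&amp;gt;deny_truncate()&amp;lt;/tt&amp;gt; function above already exists, and it uses a single fixed trigger name, &amp;lt;tt&amp;gt;deny_truncate_trg&amp;lt;/tt&amp;gt;, which is a hypothetical choice made here to simplify the existence check (trigger names only need to be unique per table). Adjust the schema filter to suit your database:&lt;br /&gt;
&lt;br /&gt;
 DO $$&lt;br /&gt;
 DECLARE&lt;br /&gt;
   tbl regclass;&lt;br /&gt;
 BEGIN&lt;br /&gt;
   -- Loop over ordinary, permanent user tables that lack the trigger.&lt;br /&gt;
   FOR tbl IN&lt;br /&gt;
     SELECT c.oid::regclass&lt;br /&gt;
     FROM pg_class c&lt;br /&gt;
     JOIN pg_namespace n ON n.oid = c.relnamespace&lt;br /&gt;
     WHERE c.relkind = 'r'&lt;br /&gt;
       AND c.relpersistence = 'p'&lt;br /&gt;
       AND n.nspname NOT IN ('pg_catalog', 'information_schema')&lt;br /&gt;
       AND NOT EXISTS (&lt;br /&gt;
         SELECT 1 FROM pg_trigger t&lt;br /&gt;
         WHERE t.tgrelid = c.oid AND t.tgname = 'deny_truncate_trg')&lt;br /&gt;
   LOOP&lt;br /&gt;
     -- regclass renders as a correctly quoted, schema-qualified name.&lt;br /&gt;
     EXECUTE format('CREATE TRIGGER deny_truncate_trg BEFORE TRUNCATE ON %s '&lt;br /&gt;
                 || 'FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate()', tbl);&lt;br /&gt;
   END LOOP;&lt;br /&gt;
 END;&lt;br /&gt;
 $$;&lt;br /&gt;
&lt;br /&gt;
Because tables that already carry the trigger are skipped, the block can be re-run after new tables are created.&lt;br /&gt;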
&lt;br /&gt;
=== Initial setup ===&lt;br /&gt;
&lt;br /&gt;
To set up LLSR or BDR you first need a patched PostgreSQL that can support LLSR/BDR, then you need to create one or more LLSR/BDR senders and one or more LLSR/BDR receivers.&lt;br /&gt;
&lt;br /&gt;
==== Installing the patched PostgreSQL binaries ====&lt;br /&gt;
&lt;br /&gt;
Currently BDR is only available in builds of the 'bdr' branch on Andres Freund's git repo on git.postgresql.org. PostgreSQL 9.2 and below do not support BDR, and 9.3 requires patches, so this guide will not work for you if you are trying to use a normal install of PostgreSQL.&lt;br /&gt;
&lt;br /&gt;
First you need to clone, configure, compile and install as normal. Clone the sources from &amp;lt;tt&amp;gt;git://git.postgresql.org/git/users/andresfreund/postgres.git&amp;lt;/tt&amp;gt; and check out the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; branch.&lt;br /&gt;
&lt;br /&gt;
If you have an existing local PostgreSQL git tree specify it as &amp;lt;tt&amp;gt;--reference /path/to/existing/tree&amp;lt;/tt&amp;gt; to greatly speed your git clone.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p $HOME/bdr&lt;br /&gt;
 cd $HOME/bdr&lt;br /&gt;
 git clone git://git.postgresql.org/git/users/andresfreund/postgres.git $HOME/bdr/postgres-bdr-src&lt;br /&gt;
 cd postgres-bdr-src&lt;br /&gt;
 git checkout bdr&lt;br /&gt;
 ./configure --prefix=$HOME/bdr/postgres-bdr-bin&lt;br /&gt;
 make install&lt;br /&gt;
 cd contrib/bdr&lt;br /&gt;
 make install&lt;br /&gt;
&lt;br /&gt;
This will put everything in &amp;lt;tt&amp;gt;$HOME/bdr&amp;lt;/tt&amp;gt;, with the source code and build tree in &amp;lt;tt&amp;gt;$HOME/bdr/postgres-bdr-src&amp;lt;/tt&amp;gt; and the installed PostgreSQL in &amp;lt;tt&amp;gt;$HOME/bdr/postgres-bdr-bin&amp;lt;/tt&amp;gt;. This is a convenient setup for testing and development because it doesn't require you to set up new users, wrangle permissions, run anything as root, etc, but it isn't recommended that you deploy this way in production.&lt;br /&gt;
&lt;br /&gt;
To actually use these new binaries you will need to:&lt;br /&gt;
&lt;br /&gt;
 export PATH=$HOME/bdr/postgres-bdr-bin/bin:$PATH&lt;br /&gt;
&lt;br /&gt;
before running &amp;lt;tt&amp;gt;initdb&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;postgres&amp;lt;/tt&amp;gt;, etc. You don't have to use the &amp;lt;tt&amp;gt;psql&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;libpq&amp;lt;/tt&amp;gt; you compiled but you're likely to get version mismatch warnings if you don't.&lt;br /&gt;
&lt;br /&gt;
=== Parameter Reference ===&lt;br /&gt;
&lt;br /&gt;
The following parameters are new or have been changed in PostgreSQL's new logical streaming replication.&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;shared_preload_libraries = 'bdr'&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
To load support for receiving changes on a downstream master, the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; library must be added to the existing &amp;lt;tt&amp;gt;shared_preload_libraries&amp;lt;/tt&amp;gt; parameter. This loads the bdr library during postmaster start-up and allows it to create the required background worker(s).&lt;br /&gt;
&lt;br /&gt;
Upstream masters don't need to load the bdr library unless they're also operating as a downstream master as is the case in a BDR configuration.&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
A comma-separated list of upstream master connection names is specified in &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt;. These names must be simple alphanumeric strings. They are used when naming the connection in error messages, configuration options and logs, but otherwise have no special meaning.&lt;br /&gt;
&lt;br /&gt;
A typical two-upstream-master setting might be:&lt;br /&gt;
&lt;br /&gt;
 bdr.connections = 'upstream1, upstream2'&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;bdr.&amp;amp;lt;connection_name&amp;amp;gt;.dsn&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Each connection name must have at least a data source name specified using the &amp;lt;tt&amp;gt;bdr.&amp;amp;lt;connection_name&amp;amp;gt;.dsn&amp;lt;/tt&amp;gt; parameter. The DSN syntax is the same as that used by libpq so it is not discussed in further detail here. A &amp;lt;tt&amp;gt;dbname&amp;lt;/tt&amp;gt; for the database to connect to must be specified; all other parts of the DSN are optional.&lt;br /&gt;
&lt;br /&gt;
The local (downstream) database name is assumed to be the same as the name of the upstream database being connected to, though future versions will make this configurable.&lt;br /&gt;
&lt;br /&gt;
For the above two-master setting for &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; the DSNs might look like:&lt;br /&gt;
&lt;br /&gt;
 bdr.upstream1.dsn = 'host=10.1.1.2 user=postgres dbname=replicated_db'&lt;br /&gt;
 bdr.upstream2.dsn = 'host=10.1.1.3 user=postgres dbname=replicated_db'&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
The new parameter &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; has been added for use on both upstream and downstream masters. This parameter controls the maximum number of logical replication slots - upstream or downstream - that this cluster may have at a time. It must be set at postmaster start time.&lt;br /&gt;
&lt;br /&gt;
As logical replication slots are persistent, slots are consumed even by replicas that are not currently connected. Slot management is discussed in [[#Starting, stopping and managing replication|Starting, stopping and managing replication]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; should be set to the number of logical replication upstream masters this server will have plus the number of logical replication downstream masters that will connect to it.&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;wal_level = 'logical'&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
A new setting, &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt;, has been added for the existing &amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt; parameter. &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt; includes everything that the existing &amp;lt;tt&amp;gt;hot_standby&amp;lt;/tt&amp;gt; setting does and adds to the write-ahead log the additional details required for logical changeset decoding.&lt;br /&gt;
&lt;br /&gt;
This additional information is consumed by the upstream-master-side xlog decoding worker. Downstream masters that do not also act as upstream masters do not require &amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt; to be increased above the default &amp;lt;tt&amp;gt;'minimal'&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt;, except for the new &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt; setting, is [http://www.postgresql.org/docs/current/static/runtime-config-wal.html documented in the main PostgreSQL manual].&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Logical replication hasn't altered the &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; parameter, but it is important in upstream masters for logical replication and BDR because every logical sender consumes a &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
You should configure &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; to the sum of the number of physical and logical replicas you want to allow an upstream master to serve. If you intend to use &amp;lt;tt&amp;gt;pg_basebackup&amp;lt;/tt&amp;gt; you should add at least two more senders to allow for its use.&lt;br /&gt;
&lt;br /&gt;
Like &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; entries don't cost a large amount of memory, so you can overestimate fairly safely.&lt;br /&gt;
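&lt;br /&gt;
The sizing rules for &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; amount to simple arithmetic. The Python sketch below is illustrative only: the function and argument names are hypothetical and are not part of PostgreSQL or BDR.&lt;br /&gt;
&lt;br /&gt;
```python
def size_replication_settings(upstream_masters, downstream_masters,
                              physical_replicas=0, use_pg_basebackup=False):
    """Suggest values following the sizing rules in this section.

    Hypothetical helper for illustration; not part of PostgreSQL or BDR.
    """
    # One logical slot per upstream master this server receives from,
    # plus one per downstream master that connects to this server.
    max_logical_slots = upstream_masters + downstream_masters

    # Every logical sender consumes a max_wal_senders entry, so count the
    # downstream masters served plus any physical replicas.
    max_wal_senders = downstream_masters + physical_replicas

    # The text recommends at least two extra senders for pg_basebackup.
    if use_pg_basebackup:
        max_wal_senders += 2

    return {'max_logical_slots': max_logical_slots,
            'max_wal_senders': max_wal_senders}


settings = size_replication_settings(upstream_masters=1, downstream_masters=1,
                                     physical_replicas=1, use_pg_basebackup=True)
for name, value in sorted(settings.items()):
    print('%s = %d' % (name, value))
```
&lt;br /&gt;
These are minimum values; since unused entries are cheap, rounding the results up leaves room for replicas added later.&lt;br /&gt;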
&lt;br /&gt;
&amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; is documented in [http://www.postgresql.org/docs/current/static/runtime-config-replication.html the main PostgreSQL documentation].&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Like &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt;, the &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; parameter isn't directly changed by logical replication but is still important for upstream masters. It is not required on downstream-only masters.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; should be set to a value that allows for some downtime or unreachable periods for downstream masters and for heavy bursts of write activity on the upstream master. &lt;br /&gt;
&lt;br /&gt;
Keep in mind that enough disk space must be available for the WAL segments, each of which is 16MB. If you run out of disk space the server will halt until disk space is freed and it may be quite difficult to free space when you can no longer start the server.&lt;br /&gt;
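&lt;br /&gt;
As a rough aid for choosing a value, the Python sketch below converts a &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; setting into disk space and estimates the number of segments needed to survive an outage. The helper names and the steady write rate are assumptions made for illustration; real workloads are bursty, so treat the result as a lower bound.&lt;br /&gt;
&lt;br /&gt;
```python
import math

WAL_SEGMENT_MB = 16  # segment size stated in the text


def wal_disk_space_gib(wal_keep_segments):
    """Disk space, in GiB, occupied by the retained WAL segments."""
    return wal_keep_segments * WAL_SEGMENT_MB / 1024.0


def segments_for_outage(write_rate_mb_per_s, outage_seconds):
    """Segments needed to cover an outage at a steady WAL write rate.

    A rough estimate only: add headroom for bursts of write activity.
    """
    wal_mb = write_rate_mb_per_s * outage_seconds
    return int(math.ceil(wal_mb / float(WAL_SEGMENT_MB)))


print(wal_disk_space_gib(5000))      # 78.125
print(segments_for_outage(2, 3600))  # 450
```
&lt;br /&gt;
The first result corresponds to the 78GB figure quoted for 5000 segments in the example configuration.&lt;br /&gt;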
&lt;br /&gt;
If a downstream master needs WAL that is older than what &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; retains, an &amp;quot;Insufficient WAL segments retained&amp;quot; error will be reported. See [[#Troubleshooting|Troubleshooting]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; is documented in [http://www.postgresql.org/docs/current/static/runtime-config-replication.html the main PostgreSQL manual].&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;track_commit_timestamp&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Setting this parameter to &amp;quot;on&amp;quot; enables commit timestamp tracking, which is used to implement last-UPDATE-wins conflict resolution.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Details on individual parameters are described in the [[#Parameter Reference|Parameter Reference]] section.&lt;br /&gt;
&lt;br /&gt;
The following configuration is an example of a simple one-way LLSR replication setup - a single upstream master to a single downstream master.&lt;br /&gt;
&lt;br /&gt;
The upstream master (sender)'s &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; should contain settings like:&lt;br /&gt;
&lt;br /&gt;
  wal_level = 'logical'       # Include enough info for logical replication&lt;br /&gt;
  max_logical_slots = X       # Number of LLSR senders + any receivers&lt;br /&gt;
  max_wal_senders = Y         # Y = max_logical_slots plus any physical &lt;br /&gt;
                              # streaming requirements&lt;br /&gt;
  wal_keep_segments = 5000    # Master must retain enough WAL segments to let &lt;br /&gt;
                              # replicas catch up. Correct value depends on&lt;br /&gt;
                              # rate of writes on master, max replica downtime&lt;br /&gt;
                              # allowable. 5000 segments requires 78GB&lt;br /&gt;
                              # in pg_xlog&lt;br /&gt;
&lt;br /&gt;
Downstream (receiver) &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
  shared_preload_libraries = 'bdr'&lt;br /&gt;
  &lt;br /&gt;
  bdr.connections='name_of_upstream_master'      # list of upstream master nodenames&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection&lt;br /&gt;
                                                 # from downstream to upstream master&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case &lt;br /&gt;
                                                 # where the databasename on upstream &lt;br /&gt;
                                                 # and downstream master differ. &lt;br /&gt;
                                                 # (Not yet implemented)&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.apply_delay                     # optional parameter to delay apply of&lt;br /&gt;
                                                 # transactions, time in milliseconds &lt;br /&gt;
  bdr.synchronous_commit = ...                   # optional parameter to set the&lt;br /&gt;
                                                 # synchronous_commit parameter the&lt;br /&gt;
                                                 # apply processes will be using&lt;br /&gt;
  max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
Note that a server can be both sender and receiver, either as one of two servers replicating to each other or as part of a more complex configuration such as a replication chain or tree.&lt;br /&gt;
&lt;br /&gt;
The upstream (sender) &amp;lt;tt&amp;gt;pg_hba.conf&amp;lt;/tt&amp;gt; must be configured to allow the downstream master to connect for replication. Otherwise you'll see errors like the following on the downstream master:&lt;br /&gt;
&lt;br /&gt;
 FATAL:  could not connect to the primary server: FATAL:  no pg_hba.conf entry for replication connection from host &amp;quot;[local]&amp;quot;, user &amp;quot;postgres&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A suitable &amp;lt;tt&amp;gt;pg_hba.conf&amp;lt;/tt&amp;gt; entry for a replication connection from the replica server 10.1.4.8 might be:&lt;br /&gt;
&lt;br /&gt;
  host    replication     postgres        10.1.4.8/32            trust&lt;br /&gt;
&lt;br /&gt;
(The user name should match the user name configured in the downstream master's DSN. md5 password authentication is also supported.)&lt;br /&gt;
&lt;br /&gt;
For more details on these parameters, see [[#Parameter Reference|Parameter Reference]].&lt;br /&gt;
&lt;br /&gt;
=== Troubleshooting ===&lt;br /&gt;
&lt;br /&gt;
==== Could not access file &amp;quot;bdr&amp;quot;: No such file or directory ====&lt;br /&gt;
&lt;br /&gt;
If you see the error:&lt;br /&gt;
&lt;br /&gt;
 FATAL:  could not access file &amp;quot;bdr&amp;quot;: No such file or directory&lt;br /&gt;
&lt;br /&gt;
when starting a database set up to receive BDR replication, you probably forgot to install &amp;lt;tt&amp;gt;contrib/bdr&amp;lt;/tt&amp;gt;. See above.&lt;br /&gt;
&lt;br /&gt;
==== Invalid value for parameter ====&lt;br /&gt;
&lt;br /&gt;
An error like:&lt;br /&gt;
&lt;br /&gt;
 LOG:  invalid value for parameter ...&lt;br /&gt;
&lt;br /&gt;
when setting one of these parameters means your server doesn't support logical replication and will need to be patched or updated.&lt;br /&gt;
&lt;br /&gt;
==== Insufficient WAL segments retained (&amp;quot;requested WAL segment ... has already been removed&amp;quot;) ====&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; is insufficient to meet the requirements of a replica that has fallen far behind, the master will report errors like:&lt;br /&gt;
&lt;br /&gt;
 ERROR:  requested WAL segment 00000001000000010000002D has already been removed&lt;br /&gt;
&lt;br /&gt;
Currently the replica errors look like:&lt;br /&gt;
&lt;br /&gt;
 WARNING:  Starting logical replication&lt;br /&gt;
 LOG:  data stream ended&lt;br /&gt;
 LOG:  worker process: master (PID 23812) exited with exit code 0&lt;br /&gt;
 LOG:  starting background worker process &amp;quot;master&amp;quot;&lt;br /&gt;
 LOG:  master initialized on master, remote dbname=master port=5434 replication=true fallback_application_name=bdr&lt;br /&gt;
 LOG:  local sysid 5873181566046043070, remote: 5873181102189050714&lt;br /&gt;
 LOG:  found valid replication identifier 1&lt;br /&gt;
 LOG:  starting up replication at 1 from 1/2D9CA220&lt;br /&gt;
&lt;br /&gt;
but a more explicit error message for this condition is planned.&lt;br /&gt;
&lt;br /&gt;
The only way to recover from this fault is to re-seed the replica database.&lt;br /&gt;
&lt;br /&gt;
This fault could be prevented with feedback from the replica to the master, but this feature is not planned for the first release of BDR. Another alternative considered for future releases is making wal_keep_segments a dynamic parameter that is sized on demand.&lt;br /&gt;
&lt;br /&gt;
Monitoring of maximum replica lag and appropriate adjustment of wal_keep_segments will prevent this fault from arising.&lt;br /&gt;
&lt;br /&gt;
==== Couldn't find logical slot ====&lt;br /&gt;
&lt;br /&gt;
An error like:&lt;br /&gt;
&lt;br /&gt;
 ERROR:  couldn't find logical slot &amp;quot;bdr: 16384:5873181566046043070-1-24596:&amp;quot;&lt;br /&gt;
&lt;br /&gt;
on the upstream master suggests that a downstream master is trying to connect to a logical replication slot that no longer exists. The slot cannot be re-created, so it is necessary to re-seed the downstream replica database.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level (i.e. SQL-visible) ERRORs that have special meaning. Any ERRORs generated are likely to indicate serious problems, apart from apply deadlocks, which are automatically retried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
The following views are available for monitoring replication activity:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE pg_stat_replication]&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; (described below)&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_bdr&amp;lt;/tt&amp;gt; (described below)&lt;br /&gt;
&lt;br /&gt;
The following configuration and logging parameters are useful for monitoring replication:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS log_lock_waits]&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== pg_stat_logical_replication ====&lt;br /&gt;
&lt;br /&gt;
The new &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; view is specific to logical replication. It is based on the underlying &amp;lt;tt&amp;gt;pg_stat_get_logical_replication_slots&amp;lt;/tt&amp;gt; function and has the following structure:&lt;br /&gt;
&lt;br /&gt;
  View &amp;quot;pg_catalog.pg_stat_logical_replication&amp;quot;&lt;br /&gt;
           Column          |  Type   | Modifiers &lt;br /&gt;
 --------------------------+---------+-----------&lt;br /&gt;
  slot_name                | text    | &lt;br /&gt;
  plugin                   | text    | &lt;br /&gt;
  database                 | oid     | &lt;br /&gt;
  active                   | boolean | &lt;br /&gt;
  xmin                     | xid     | &lt;br /&gt;
  last_required_checkpoint | text    | &lt;br /&gt;
&lt;br /&gt;
It contains one row for every connection from a downstream master to the server being queried (the upstream master). On a standalone PostgreSQL server or a downstream-only master this view will contain no rows.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;slot_name&amp;lt;/tt&amp;gt;: An internal name for a given logical replication slot (a connection from a downstream master to this upstream master). This slot name is used by the downstream master to uniquely identify itself and is used with the &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; command when managing logical replication slots. The slot name is composed of the decoding plugin name, the upstream database oid, the downstream system identifier (from &amp;lt;tt&amp;gt;pg_control&amp;lt;/tt&amp;gt;), the downstream slot number, and the downstream database oid.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;plugin&amp;lt;/tt&amp;gt;: The logical replication plugin being used to decode WAL archives. You'll generally only see &amp;lt;tt&amp;gt;bdr_output&amp;lt;/tt&amp;gt; here.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;database&amp;lt;/tt&amp;gt;: The oid of the database being replicated by this slot. You can get the database name by joining on &amp;lt;tt&amp;gt;pg_database.oid&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt;: Whether this slot currently has an active connection.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt;: The lowest transaction ID this replication slot can &amp;quot;see&amp;quot;, like the xmin of a transaction or prepared transaction. xmin should keep on advancing as replication continues.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;last_required_checkpoint&amp;lt;/tt&amp;gt;: The checkpoint identifying the oldest WAL record required to bring this slot up to date with the upstream master. (This column is likely to be removed in a future version).&lt;br /&gt;
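&lt;br /&gt;
The composition of &amp;lt;tt&amp;gt;slot_name&amp;lt;/tt&amp;gt; described above can be illustrated by splitting a name into its parts. The Python sketch below is hypothetical: the pattern is inferred from the example slot names shown in this document, and the exact format is an assumption that may differ between versions.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Pattern inferred from the example slot names in this document; the exact
# format is an assumption and may differ between BDR versions.
SLOT_NAME_RE = re.compile(r'^(\w+): (\d+):(\d+)-(\d+)-(\d+):$')


def parse_slot_name(slot_name):
    """Split a slot name into the components listed above."""
    match = SLOT_NAME_RE.match(slot_name)
    if match is None:
        raise ValueError('unrecognised slot name: %r' % (slot_name,))
    plugin, upstream_db, sysid, slot_no, downstream_db = match.groups()
    return {
        'plugin': plugin,
        'upstream_database_oid': int(upstream_db),
        'downstream_sysid': sysid,  # kept as text, like riremotesysid
        'downstream_slot_number': int(slot_no),
        'downstream_database_oid': int(downstream_db),
    }


parts = parse_slot_name('bdr: 16384:5873181566046043070-1-24596:')
for key in sorted(parts):
    print('%s: %s' % (key, parts[key]))
```
&lt;br /&gt;
The system identifier is kept as text so that it can be compared directly with the &amp;lt;tt&amp;gt;Database system identifier&amp;lt;/tt&amp;gt; line printed by &amp;lt;tt&amp;gt;pg_controldata&amp;lt;/tt&amp;gt;.&lt;br /&gt;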
&lt;br /&gt;
==== pg_stat_bdr ====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;pg_stat_bdr&amp;lt;/tt&amp;gt; view is supplied by the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; extension. It provides information on a server's connection(s) to its upstream master(s). It is not present on upstream-only masters.&lt;br /&gt;
&lt;br /&gt;
The primary purpose of this view is to report statistics on the progress of LLSR apply on a per-upstream master connection basis.&lt;br /&gt;
&lt;br /&gt;
View structure:&lt;br /&gt;
&lt;br /&gt;
         View &amp;quot;public.pg_stat_bdr&amp;quot;&lt;br /&gt;
        Column       |  Type  | Modifiers &lt;br /&gt;
 --------------------+--------+-----------&lt;br /&gt;
  rep_node_id        | oid    | &lt;br /&gt;
  riremotesysid      | name   | &lt;br /&gt;
  riremotedb         | oid    | &lt;br /&gt;
  rilocaldb          | oid    | &lt;br /&gt;
  nr_commit          | bigint | &lt;br /&gt;
  nr_rollback        | bigint | &lt;br /&gt;
  nr_insert          | bigint | &lt;br /&gt;
  nr_insert_conflict | bigint | &lt;br /&gt;
  nr_update          | bigint | &lt;br /&gt;
  nr_update_conflict | bigint | &lt;br /&gt;
  nr_delete          | bigint | &lt;br /&gt;
  nr_delete_conflict | bigint | &lt;br /&gt;
  nr_disconnect      | bigint | &lt;br /&gt;
&lt;br /&gt;
Fields:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;rep_node_id&amp;lt;/tt&amp;gt;: An internal identifier for the replication slot.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;riremotesysid&amp;lt;/tt&amp;gt;: The remote database system identifier, as reported by the &amp;lt;tt&amp;gt;Database system identifier&amp;lt;/tt&amp;gt; line of &amp;lt;tt&amp;gt;pg_controldata /path/to/datadir&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt;: The remote database OID, ie the &amp;lt;tt&amp;gt;oid&amp;lt;/tt&amp;gt; column of the remote server's &amp;lt;tt&amp;gt;pg_catalog.pg_database&amp;lt;/tt&amp;gt; entry for the replicated database. You can get the database name with &amp;lt;tt&amp;gt;select datname from pg_database where oid = 12345&amp;lt;/tt&amp;gt; (where '12345' is the &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt; oid).&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;rilocaldb&amp;lt;/tt&amp;gt;: The local database OID, with the same meaning as &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt; but with oids from the local system.&lt;br /&gt;
&lt;br /&gt;
''The rest of the columns are statistics about this upstream master slot'':&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_commit&amp;lt;/tt&amp;gt;: Number of commits applied to date from this master&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_rollback&amp;lt;/tt&amp;gt;: Number of rollbacks performed by this apply process due to recoverable errors (deadlock retries, lost races, etc) or unrecoverable errors like mismatched constraint errors.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_insert&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_insert_conflict&amp;lt;/tt&amp;gt;:  Number of &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_update&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_update_conflict&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_delete&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_delete_conflict&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_disconnect&amp;lt;/tt&amp;gt;: Number of times this apply process has lost its connection to the upstream master since it was started.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This view does not contain any information about how far behind the upstream master this downstream master is. The upstream master's &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;pg_stat_replication&amp;lt;/tt&amp;gt; views must be queried to determine replication lag.&lt;br /&gt;
&lt;br /&gt;
==== Monitoring replication status and lag ====&lt;br /&gt;
&lt;br /&gt;
As with any replication setup, it is vital to monitor replication status on all BDR nodes to ensure no node is lagging severely behind the others or is stuck.&lt;br /&gt;
&lt;br /&gt;
In the case of BDR a stuck or crashed node will eventually cause disk space and table bloat problems on other masters so stuck nodes should be detected and removed or repaired in a reasonably timely manner. Exactly how urgent this is depends on the workload of the BDR group.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; view described above may be used to verify that a downstream master is connected to its upstream master - the &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt; boolean column is &amp;lt;tt&amp;gt;t&amp;lt;/tt&amp;gt; if there's a downstream master connected.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt; column provides an indication of whether replication is advancing; it should increase as replication progresses. There is no simple way to turn &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt; into the time the last applied transaction was committed on the master, so it doesn't provide an indication of wall-clock lag.&lt;br /&gt;
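&lt;br /&gt;
One way to act on the &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt; columns is to compare two samples of &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; taken some time apart. The Python sketch below assumes the rows have already been fetched into dictionaries keyed by slot name; the function name and data layout are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def stalled_slots(previous, current):
    """Return names of slots that look disconnected or stuck.

    previous and current map slot_name to a dict with 'active' and 'xmin'
    keys, as sampled from pg_stat_logical_replication at two points in time.
    """
    suspects = []
    for name, row in sorted(current.items()):
        earlier = previous.get(name)
        # xmin values are only tested for equality, so xid wrap-around
        # does not affect the comparison.
        unchanged = earlier is not None and earlier['xmin'] == row['xmin']
        if not row['active'] or unchanged:
            suspects.append(name)
    return suspects


before = {'slot_a': {'active': True, 'xmin': 100},
          'slot_b': {'active': True, 'xmin': 100}}
after = {'slot_a': {'active': True, 'xmin': 120},
         'slot_b': {'active': True, 'xmin': 100}}
print(stalled_slots(before, after))
```
&lt;br /&gt;
An unchanged &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt; is only a hint: it will also be reported when the upstream master has had no write activity between the two samples.&lt;br /&gt;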
&lt;br /&gt;
To determine wall-clock replication lag an application-level ticker may be used to periodically update a timestamp in a replicated table. The difference between this timestamp on the upstream and downstream masters provides the wall-clock replication lag. For BDR one row may be added to the table for each BDR master, giving an indication of how much lag each master has relative to each other master.&lt;br /&gt;
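&lt;br /&gt;
The ticker calculation is a plain timestamp subtraction. The Python sketch below assumes the ticker timestamps have already been read from the replicated table on each server; the function names and sample values are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import datetime


def replication_lag_seconds(upstream_tick, downstream_tick):
    """Wall-clock lag: how far the downstream copy of a ticker row trails
    the value currently stored on the upstream master."""
    return (upstream_tick - downstream_tick).total_seconds()


def lag_by_master(origin_ticks, local_ticks):
    """For BDR: lag per originating master, given each master's own latest
    tick and the copy of that tick visible on the local node."""
    return dict((name, replication_lag_seconds(origin_ticks[name],
                                               local_ticks[name]))
                for name in origin_ticks)


# Hypothetical ticker values read from the same row on each server.
upstream = datetime(2013, 5, 20, 16, 6, 11)
downstream = datetime(2013, 5, 20, 16, 6, 4)
print(replication_lag_seconds(upstream, downstream))  # 7.0
```
&lt;br /&gt;
The measurement can be no finer than the ticker's update interval, so the interval should be chosen to match the lag that needs to be detected.&lt;br /&gt;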
&lt;br /&gt;
=== Table and index usage statistics ===&lt;br /&gt;
&lt;br /&gt;
Statistics on table and index usage are updated normally by the downstream master. This is essential for autovacuum to function correctly. If there are no local writes on the downstream master and stats have not been reset, these two views should show matching results between upstream and downstream:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_user_tables&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_statio_user_tables&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, identifying indexes on the downstream side may appear more heavily used than non-identifying indexes under workloads that perform &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s.&lt;br /&gt;
&lt;br /&gt;
The built-in index monitoring views are:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_user_indexes&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_statio_user_indexes&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All these views are discussed in [http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE the PostgreSQL documentation on the statistics views].&lt;br /&gt;
&lt;br /&gt;
=== Starting, stopping and managing replication ===&lt;br /&gt;
&lt;br /&gt;
Replication is managed with the &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; settings described in &amp;quot;Parameter Reference&amp;quot; and &amp;quot;Configuration&amp;quot; above, and using the &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; utility command.&lt;br /&gt;
&lt;br /&gt;
==== Starting a new LLSR connection ====&lt;br /&gt;
&lt;br /&gt;
Logical replication is started automatically when a database is configured as a downstream master in &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; (see [[#Configuration|Configuration]]) and the postmaster is started. No explicit action is required to start replication, but replication will not actually work unless the upstream and downstream databases are identical within the requirements set by LLSR in the [[#Table definitions and DDL replication|Table definitions and DDL replication]] section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;pg_dump&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;pg_restore&amp;lt;/tt&amp;gt; may be used to set up the new replica's database.&lt;br /&gt;
&lt;br /&gt;
==== Viewing logical replication slots ====&lt;br /&gt;
&lt;br /&gt;
Examining the state of logical replication is discussed in [[#Monitoring|Monitoring]].&lt;br /&gt;
&lt;br /&gt;
==== Temporarily stopping an LLSR replica ====&lt;br /&gt;
&lt;br /&gt;
LLSR replicas can be temporarily stopped by shutting down the downstream master's postmaster.&lt;br /&gt;
&lt;br /&gt;
If the replica is not started back up before the upstream master discards the oldest WAL segment required for the downstream master to resume replay, as identified by the &amp;lt;tt&amp;gt;last_required_checkpoint&amp;lt;/tt&amp;gt; column of &amp;lt;tt&amp;gt;pg_catalog.pg_stat_logical_replication&amp;lt;/tt&amp;gt;, then the replica will not resume replay. The error [[#Insufficient_WAL_segments_retained_.28.22requested_WAL_segment_..._has_already_been_removed.22.29|Insufficient WAL segments retained]] will be reported in the upstream master's logs. The replica must be re-created for replication to continue.&lt;br /&gt;
&lt;br /&gt;
==== Removing an LLSR replica permanently ====&lt;br /&gt;
&lt;br /&gt;
To remove a replication connection permanently, remove its entries from the downstream master's &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;, restart the downstream master, then use &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; to remove the replication slot on the upstream master.&lt;br /&gt;
&lt;br /&gt;
It is important to remove the replication slot from the upstream master(s) to prevent xid wrap-around problems and issues with table bloat caused by delayed vacuum.&lt;br /&gt;
&lt;br /&gt;
==== Cleaning up abandoned replication slots ====&lt;br /&gt;
&lt;br /&gt;
To remove a replication slot that was used for a now-defunct replica, find its slot name from the &amp;lt;tt&amp;gt;[[#pg_stat_logical_replication|pg_stat_logical_replication]]&amp;lt;/tt&amp;gt; view on the upstream master then run:&lt;br /&gt;
&lt;br /&gt;
 pg_receivellog -p 5434 -h master-hostname -d dbname \&lt;br /&gt;
    --slot='bdr: 16384:5873181566046043070-1-16384:' --stop&lt;br /&gt;
&lt;br /&gt;
where the argument to '--slot' is the slot name you found from the view.&lt;br /&gt;
&lt;br /&gt;
You may need to do this if you've created and then deleted several replicas so &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; has filled up with entries for replicas that no longer exist.&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional replication is built directly on LLSR by configuring two or more servers as both upstream ''and'' downstream masters of each other.&lt;br /&gt;
&lt;br /&gt;
All of the Log Level Streaming Replication documentation applies to BDR and should be read before moving on to reading about and configuring BDR.&lt;br /&gt;
&lt;br /&gt;
=== Bi-Directional Replication Use Cases ===&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
==== Simple multi-master pair ====&lt;br /&gt;
&lt;br /&gt;
A simple multi-master &amp;quot;HA Cluster&amp;quot; with two servers:&lt;br /&gt;
&lt;br /&gt;
* Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
* Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Alpha:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections='bravo'&lt;br /&gt;
 bdr.bravo.dsn = 'dbname=dbtoreplicate'&lt;br /&gt;
 track_commit_timestamp = on&lt;br /&gt;
&lt;br /&gt;
Bravo:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections='alpha'&lt;br /&gt;
 bdr.alpha.dsn = 'dbname=dbtoreplicate'&lt;br /&gt;
 track_commit_timestamp = on&lt;br /&gt;
&lt;br /&gt;
See [[#Configuration|Configuration]] for an explanation of these parameters.&lt;br /&gt;
&lt;br /&gt;
==== HA and Logical Standby ====&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;HA Cluster&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
* Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
* Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
* Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
==== Very High Availability Multi-Master ====&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site simple Multi-Master Plex ====&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes.)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
If you wanted to test this configuration locally you could run three PostgreSQL instances on different ports. Such a configuration would look like the following if the port numbers were used as node names for the sake of notational clarity:&lt;br /&gt;
&lt;br /&gt;
Config for node_5440:&lt;br /&gt;
&lt;br /&gt;
 port = 5440&lt;br /&gt;
 bdr.connections='node_5441,node_5442'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5441:&lt;br /&gt;
&lt;br /&gt;
 port = 5441&lt;br /&gt;
 bdr.connections='node_5440,node_5442'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5442:&lt;br /&gt;
&lt;br /&gt;
 port = 5442&lt;br /&gt;
 bdr.connections='node_5440,node_5441'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
In a typical real-world configuration each server would be on the same port on a different host instead.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site simple Multi-Master Circular Replication ====&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &amp;quot;circular replication&amp;quot;. It is easier to set up, but latency for changes rises as the number of nodes increases, and it is less resilient to network disruptions and node faults.&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
TODO: Regrettably this doesn't actually work yet because we don't cascade logical changes (yet).&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity:&lt;br /&gt;
&lt;br /&gt;
Config for node_5440:&lt;br /&gt;
&lt;br /&gt;
 port = 5440&lt;br /&gt;
 bdr.connections='node_5441'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5441:&lt;br /&gt;
&lt;br /&gt;
 port = 5441&lt;br /&gt;
 bdr.connections='node_5442'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5442:&lt;br /&gt;
&lt;br /&gt;
 port = 5442&lt;br /&gt;
 bdr.connections='node_5440'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
This would usually be done in the real world with databases on different hosts, all running on the same port.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site Max Availability Multi-Master Plex ====&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
Config left as an exercise for the reader.&lt;br /&gt;
&lt;br /&gt;
==== N-site symmetric cluster replication ====&lt;br /&gt;
&lt;br /&gt;
A symmetric cluster is one in which all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
==== Complex/Asymmetric Replication ====&lt;br /&gt;
&lt;br /&gt;
A variety of options are possible.&lt;br /&gt;
&lt;br /&gt;
=== Conflict Avoidance ===&lt;br /&gt;
&lt;br /&gt;
==== Distributed Locking ====&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications as very low latency is critical for acceptable performance.&lt;br /&gt;
&lt;br /&gt;
Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible but allow some types of conflict to occur and resolve them when they arise.&lt;br /&gt;
&lt;br /&gt;
==== Global Sequences ====&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using &amp;lt;tt&amp;gt;DEFAULT nextval('mysequence')&amp;lt;/tt&amp;gt;, as with PostgreSQL's &amp;lt;tt&amp;gt;SERIAL&amp;lt;/tt&amp;gt; pseudo-type.&lt;br /&gt;
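&lt;br /&gt;
As a minimal sketch of that pattern on a single node (the table and column names here are hypothetical, chosen only for illustration):&lt;br /&gt;
&lt;br /&gt;
 -- hypothetical names; a plain single-node sequence, not a Global Sequence&lt;br /&gt;
 CREATE SEQUENCE mysequence;&lt;br /&gt;
 CREATE TABLE customer (&lt;br /&gt;
   customer_id bigint PRIMARY KEY DEFAULT nextval('mysequence'),&lt;br /&gt;
   name text NOT NULL&lt;br /&gt;
 );&lt;br /&gt;
 -- customer_id is filled in from the sequence when it is omitted&lt;br /&gt;
 INSERT INTO customer (name) VALUES ('Example Ltd');&lt;br /&gt;
&lt;br /&gt;
Two nodes that each run this independently would hand out the same values, which is the uniqueness problem described above.&lt;br /&gt;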
&lt;br /&gt;
BDR requires sequences to work together across multiple nodes. This is implemented as a new &amp;lt;tt&amp;gt;SequenceAccessMethod&amp;lt;/tt&amp;gt; API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
=== Conflict Detection &amp;amp; Resolution ===&lt;br /&gt;
&lt;br /&gt;
Because local writes can occur on a master, conflict detection and avoidance is a concern for basic LLSR setups as well as full BDR configurations.&lt;br /&gt;
&lt;br /&gt;
==== Lock Conflicts ====&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the table being changed and be able to write-lock the tuple(s) being changed. Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS log_lock_waits]&amp;lt;/tt&amp;gt; facility to look for issues with apply blocking on locks.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes:&lt;br /&gt;
&lt;br /&gt;
* explicit row level locking (&amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/FOR SHARE&amp;lt;/tt&amp;gt;)&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s, &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s or &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s, either from local activity or apply from other servers&lt;br /&gt;
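&lt;br /&gt;
A possible &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; fragment for the downstream master to surface such waits is sketched below. Both parameters are standard PostgreSQL settings; the values are only an illustrative starting point:&lt;br /&gt;
&lt;br /&gt;
 # log a message when a session waits longer than deadlock_timeout for a lock&lt;br /&gt;
 log_lock_waits = on&lt;br /&gt;
 # also the delay before a deadlock check is run (1s is the PostgreSQL default)&lt;br /&gt;
 deadlock_timeout = 1s&lt;br /&gt;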
&lt;br /&gt;
==== Data Conflicts ====&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in a consistent and idempotent manner so that all servers end up with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from &amp;lt;tt&amp;gt;pg_control&amp;lt;/tt&amp;gt;, though this may change in a future release.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur. The errors causing the conflict can be seen in the error log of the downstream master with the problem.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; conflicts. These are accepted as successful operations but in the case of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; the data in the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; is discarded.&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, the other being discarded according to last-update-wins. It is not practical to decide at the database level when a row should be merged and when a last-update-wins strategy should be used; such decision making would require support for application-specific conflict resolution plugins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
==== Examples ====&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-05-16T13:32:57Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Replication of DML changes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;----&lt;br /&gt;
This page is the users and administrators guide for BDR. If you're looking for technical details on the project plan and implementation, see [[BDR Project]].&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
= BDR User Guide =&lt;br /&gt;
&lt;br /&gt;
BDR (BiDirectional Replication) is a feature being developed for inclusion in PostgreSQL core that provides greatly enhanced replication capabilities.&lt;br /&gt;
&lt;br /&gt;
BDR allows users to create a geographically distributed multi-master database using Logical Log Streaming Replication (LLSR) transport.&lt;br /&gt;
BDR is designed to provide both high availability and geographically distributed disaster recovery capabilities. &lt;br /&gt;
&lt;br /&gt;
BDR is not “clustering” as some vendors use the term, in that it doesn't have a distributed lock manager, global transaction co-ordinator, etc. Each member server is separate yet connected, with design choices that allow separation between nodes that would not be possible with global transaction coordination.&lt;br /&gt;
&lt;br /&gt;
Guidance on getting a testing setup established is in [[#Initial setup]]. Please read the full documentation if you intend to put BDR into production.&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows one PostgreSQL master (the &amp;quot;upstream master&amp;quot;) to stream a sequence of changes to another read/write PostgreSQL server (the &amp;quot;downstream master&amp;quot;). Data is sent in one direction only over a normal libpq connection.&lt;br /&gt;
&lt;br /&gt;
Multiple LLSR connections can be used to set up bi-directional replication as discussed later in this guide.&lt;br /&gt;
&lt;br /&gt;
=== Overview of logical replication ===&lt;br /&gt;
&lt;br /&gt;
In some ways LLSR is similar to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective; both replicate changes from one server to another. However, in LLSR the receiving server is also a full master database that can make changes, unlike the read-only replicas offered by PLSR hot standby. Additionally, LLSR is per-database, whereas PLSR is per-cluster and replicates all databases at once. There are many more differences discussed in the relevant sections of this document.&lt;br /&gt;
&lt;br /&gt;
In LLSR the data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
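&lt;br /&gt;
As a sketch of per-database configuration, a downstream master that replicates two of the upstream's databases might use settings like the following. The connection names, host name and database names are hypothetical; the parameter names follow the &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;dsn&amp;lt;/tt&amp;gt; settings described in the Parameter Reference:&lt;br /&gt;
&lt;br /&gt;
 # one named connection per replicated database (names are hypothetical)&lt;br /&gt;
 bdr.connections='up_sales,up_billing'&lt;br /&gt;
 bdr.up_sales.dsn='host=upstream.example.com dbname=sales'&lt;br /&gt;
 bdr.up_billing.dsn='host=upstream.example.com dbname=billing'&lt;br /&gt;
&lt;br /&gt;
Any other database on the upstream server is simply not listed and so is not replicated.&lt;br /&gt;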
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after &amp;lt;tt&amp;gt;CREATE DATABASE&amp;lt;/tt&amp;gt;. A restart of the downstream master is also required. The upstream master only needs restarting if the &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; parameter is too low to allow a new replica to be added. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated. Setup is discussed in more detail below.&lt;br /&gt;
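&lt;br /&gt;
For instance, the upstream master's &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; might reserve spare slots ahead of time so that adding a replica later does not force a restart. Only the parameter name comes from this guide; the value is arbitrary, and the comment assumes the parameter caps the number of logical replication slots:&lt;br /&gt;
&lt;br /&gt;
 # upstream master: leave headroom for replicas added later (illustrative value)&lt;br /&gt;
 max_logical_slots = 4&lt;br /&gt;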
&lt;br /&gt;
Changes are processed by the downstream master using &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; plug-ins. This allows flexible handling of replication input, including:&lt;br /&gt;
&lt;br /&gt;
* BDR apply process - applies logical changes to the downstream master. The apply process makes changes directly rather than generating SQL text and then parse/plan/executing SQL.&lt;br /&gt;
* Textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_xlogdump&amp;lt;/tt&amp;gt; - examines physical WAL records and produces textual debugging output. This server program is included in PostgreSQL 9.3.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;TRUNCATE&amp;lt;/tt&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented, but will be implemented before the feature goes to final release).&lt;br /&gt;
&lt;br /&gt;
Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it has overheads that mean that it doesn't always use less bandwidth than PLSR.&lt;br /&gt;
&lt;br /&gt;
Locks taken by &amp;lt;tt&amp;gt;LOCK&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/SHARE&amp;lt;/tt&amp;gt; on the upstream master are not replicated to downstream masters. Locks taken automatically by &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;TRUNCATE&amp;lt;/tt&amp;gt; *are* taken on the downstream master and may delay replication apply or concurrent transactions - see [[#Lock Conflicts|Lock Conflicts]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;TEMPORARY&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;UNLOGGED&amp;lt;/tt&amp;gt; tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables. However, temporary tables remain specific to a particular session so creating a temporary table on the upstream master does not create a similar table on the downstream master.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to go at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s require some form of unique constraint, either &amp;lt;tt&amp;gt;PRIMARY KEY&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;UNIQUE NOT NULL&amp;lt;/tt&amp;gt;. A warning is issued in the downstream master's logs if the expected constraint is absent. &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s on the upstream master do not require a unique constraint in order to replicate correctly, though omitting one would prevent conflict detection between multiple masters, if that was considered important.&lt;br /&gt;
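&lt;br /&gt;
A sketch of table definitions that would satisfy this requirement (table and column names are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 -- replicated UPDATEs and DELETEs can locate rows via the primary key&lt;br /&gt;
 CREATE TABLE account (&lt;br /&gt;
   account_id bigint PRIMARY KEY,&lt;br /&gt;
   name text NOT NULL&lt;br /&gt;
 );&lt;br /&gt;
 -- no primary key, but a UNIQUE constraint on a NOT NULL column&lt;br /&gt;
 CREATE TABLE country (&lt;br /&gt;
   iso_code char(2) NOT NULL UNIQUE,&lt;br /&gt;
   name text NOT NULL&lt;br /&gt;
 );&lt;br /&gt;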
&lt;br /&gt;
&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s that change the value of the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the final values from the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; on the upstream master, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value. Volatile or stable functions are evaluated on the master side and the resulting values are replicated. Consequently any function side-effects (writing files, network socket activity, updating internal PostgreSQL variables, etc) will not occur on the replicas as the functions are not run again on the replica.&lt;br /&gt;
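&lt;br /&gt;
For example, with a hypothetical table &amp;lt;tt&amp;gt;counters&amp;lt;/tt&amp;gt;, an upstream statement such as the following would be replicated as its result rather than as the expression:&lt;br /&gt;
&lt;br /&gt;
 -- run on the upstream master; table and column names are hypothetical&lt;br /&gt;
 UPDATE counters SET n = n + 1, touched_at = clock_timestamp() WHERE id = 1;&lt;br /&gt;
 -- the downstream master receives the new row values computed upstream;&lt;br /&gt;
 -- it does not re-evaluate n + 1 or call clock_timestamp() itself&lt;br /&gt;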
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master (see &amp;quot;Limitations&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
The current LLSR plugin implementation uses the binary libpq protocol, so it requires that the upstream and downstream master use the same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication. A textual output option will be added later for passing data between non-identical servers, e.g. laptops or mobile devices communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding &amp;lt;tt&amp;gt;AccessExclusiveLock&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in the sequence in which they were committed on the upstream master. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on the upstream master, additional locking on the downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;SET&amp;lt;/tt&amp;gt; statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. We always update the correct tables, whatever the setting of &amp;lt;tt&amp;gt;search_path&amp;lt;/tt&amp;gt;. Values are replicated correctly irrespective of the values of &amp;lt;tt&amp;gt;bytea_output&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;TimeZone&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DateStyle&amp;lt;/tt&amp;gt;, etc.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; is not supported across log based replication, either physical or logical. &amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;LISTEN&amp;lt;/tt&amp;gt; will work fine on the upstream master but an upstream &amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; will not trigger a downstream &amp;lt;tt&amp;gt;LISTEN&amp;lt;/tt&amp;gt;er.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes an automatic retry of the apply of the replaying transaction and is only an issue if the deadlock recurs repeatedly, delaying replication.&lt;br /&gt;
&lt;br /&gt;
From a performance and concurrency perspective the BDR apply process is similar to a normal backend. Frequent conflicts with locks from other transactions when replaying changes can slow things down and thus increase replication delay, so reducing the frequency of such conflicts can be a good way to speed things up. Any lock held by another transaction on the downstream master - &amp;lt;tt&amp;gt;LOCK&amp;lt;/tt&amp;gt; statements, &amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/FOR SHARE&amp;lt;/tt&amp;gt;, or &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; row locks - can delay replication if the replication apply process needs to change the locked table/row.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching &amp;lt;tt&amp;gt;&amp;quot;Schemaname&amp;quot;.&amp;quot;Tablename&amp;quot;&amp;lt;/tt&amp;gt; on both upstream and downstream masters, e.g. changes from upstream's &amp;lt;tt&amp;gt;public.mytable&amp;lt;/tt&amp;gt; will go to downstream's &amp;lt;tt&amp;gt;public.mytable&amp;lt;/tt&amp;gt; while changes to the upstream &amp;lt;tt&amp;gt;myschema.mytable&amp;lt;/tt&amp;gt; will go to the downstream &amp;lt;tt&amp;gt;myschema.mytable&amp;lt;/tt&amp;gt;. This works even when no schema is specified in the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful synchronization of table definitions on each node otherwise &amp;lt;tt&amp;gt;ERROR&amp;lt;/tt&amp;gt;s will be generated by the replication apply process. In general, tables must be an exact match between upstream and downstream masters. &lt;br /&gt;
&lt;br /&gt;
There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
Tables must meet the following requirements to be compatible for purposes of LLSR:&lt;br /&gt;
&lt;br /&gt;
* The downstream master must only have constraints (&amp;lt;tt&amp;gt;CHECK&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UNIQUE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;EXCLUSION&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;FOREIGN KEY&amp;lt;/tt&amp;gt;, etc) that are also present on the upstream master. Replication may initially work with mismatched constraints but is likely to fail as soon as the downstream master rejects a row the upstream master accepted.&lt;br /&gt;
* The table referenced by a FOREIGN KEY on a downstream master must have all the keys present in the upstream master version of the same table.&lt;br /&gt;
* Storage parameters must match except for as allowed below&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
* Dropped columns on master must be present on replicas&lt;br /&gt;
* Custom types and enum definitions must match exactly&lt;br /&gt;
* Composite types and enums must have the same oids on master and replication target&lt;br /&gt;
* Extensions defining types used in replicated tables must be of the same version or fully SQL-level compatible and the oids of the types they define must match.&lt;br /&gt;
&lt;br /&gt;
The following differences are permissible between tables on different nodes:&lt;br /&gt;
&lt;br /&gt;
* The table's &amp;lt;tt&amp;gt;pg_class&amp;lt;/tt&amp;gt; oid, the oid of its associated TOAST table, and the oid of the table's rowtype in &amp;lt;tt&amp;gt;pg_type&amp;lt;/tt&amp;gt; may differ;&lt;br /&gt;
* Extra or missing non-&amp;lt;tt&amp;gt;UNIQUE&amp;lt;/tt&amp;gt; indexes&lt;br /&gt;
* Extra keys in downstream lookup tables for &amp;lt;tt&amp;gt;FOREIGN KEY&amp;lt;/tt&amp;gt; references that are not present on the upstream master&lt;br /&gt;
* The table-level storage parameters for fillfactor and autovacuum&lt;br /&gt;
* Triggers and rules may differ (they are not executed by replication apply)&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR (see [[#LLSR Limitations|LLSR Limitations]]).&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of &amp;lt;tt&amp;gt;session_replication_role = replica&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In future it is expected that composite types and enums with non-identical oids will be converted using text output and input functions. This feature is not yet implemented.&lt;br /&gt;
&lt;br /&gt;
=== LLSR limitations ===&lt;br /&gt;
&lt;br /&gt;
The current LLSR implementation is subject to some limitations, which are being progressively removed as work progresses.&lt;br /&gt;
&lt;br /&gt;
==== Data definition compatibility ====&lt;br /&gt;
&lt;br /&gt;
Table definitions, types, extensions, etc must be near identical between upstream and downstream masters. See [[#Table definitions and DDL replication|Table definitions and DDL replication]].&lt;br /&gt;
&lt;br /&gt;
==== DDL Replication ====&lt;br /&gt;
&lt;br /&gt;
DDL replication is not yet supported.&lt;br /&gt;
&lt;br /&gt;
==== Upstream feedback ====&lt;br /&gt;
&lt;br /&gt;
No feedback from downstream masters to the upstream master is implemented for asynchronous LLSR, so upstream masters must be configured to keep enough WAL. See [[#Configuration|Configuration]].&lt;br /&gt;
&lt;br /&gt;
==== TRUNCATE is not replicated ====&lt;br /&gt;
&lt;br /&gt;
TRUNCATE is not yet supported, however workarounds with user-level triggers are possible and a ProcessUtility hook is planned to implement a similar approach globally.&lt;br /&gt;
&lt;br /&gt;
The safest option is to define a user-level BEFORE trigger on each table that RAISEs an ERROR when TRUNCATE is attempted.&lt;br /&gt;
&lt;br /&gt;
A simple truncate-blocking trigger is:&lt;br /&gt;
&lt;br /&gt;
 CREATE OR REPLACE FUNCTION deny_truncate() RETURNS trigger AS $$&lt;br /&gt;
 BEGIN&lt;br /&gt;
   IF tg_op = 'TRUNCATE' THEN&lt;br /&gt;
     RAISE EXCEPTION 'TRUNCATE is not supported on this table. Please use DELETE FROM.';&lt;br /&gt;
   ELSE&lt;br /&gt;
     RAISE EXCEPTION 'This trigger only supports TRUNCATE';&lt;br /&gt;
   END IF;&lt;br /&gt;
 END;&lt;br /&gt;
 $$ LANGUAGE plpgsql;&lt;br /&gt;
&lt;br /&gt;
It can be applied to a table with:&lt;br /&gt;
&lt;br /&gt;
 CREATE TRIGGER deny_truncate_on_&amp;lt;tablename&amp;gt; BEFORE TRUNCATE ON &amp;lt;tablename&amp;gt;&lt;br /&gt;
 FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate();&lt;br /&gt;
&lt;br /&gt;
A PL/PgSQL DO block that queries &amp;lt;tt&amp;gt;pg_class&amp;lt;/tt&amp;gt; and loops over it to &amp;lt;tt&amp;gt;EXECUTE&amp;lt;/tt&amp;gt; a dynamic SQL &amp;lt;tt&amp;gt;CREATE TRIGGER&amp;lt;/tt&amp;gt; command for each table that does not already have the trigger can be used to apply the trigger to all tables.&lt;br /&gt;
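&lt;br /&gt;
One way to write such a block is sketched below. It assumes the &amp;lt;tt&amp;gt;deny_truncate()&amp;lt;/tt&amp;gt; function above already exists, skips system and temporary schemas, and recognises an existing trigger by the function it calls; treat it as a starting point rather than a tested tool:&lt;br /&gt;
&lt;br /&gt;
 DO $$&lt;br /&gt;
 DECLARE&lt;br /&gt;
   r record;&lt;br /&gt;
 BEGIN&lt;br /&gt;
   FOR r IN&lt;br /&gt;
     SELECT c.oid::regclass AS tbl, c.relname&lt;br /&gt;
     FROM pg_class c&lt;br /&gt;
     JOIN pg_namespace n ON n.oid = c.relnamespace&lt;br /&gt;
     WHERE c.relkind = 'r'&lt;br /&gt;
       -- skip pg_catalog, pg_toast and pg_temp_N schemas&lt;br /&gt;
       AND n.nspname !~ '^pg_'&lt;br /&gt;
       AND n.nspname NOT IN ('information_schema')&lt;br /&gt;
       -- skip tables that already have a trigger calling deny_truncate()&lt;br /&gt;
       AND NOT EXISTS (&lt;br /&gt;
         SELECT 1 FROM pg_trigger t&lt;br /&gt;
         WHERE t.tgrelid = c.oid&lt;br /&gt;
           AND t.tgfoid = 'deny_truncate()'::regprocedure&lt;br /&gt;
       )&lt;br /&gt;
   LOOP&lt;br /&gt;
     -- %I quotes the trigger name; the regclass value is already safely quoted&lt;br /&gt;
     EXECUTE format(&lt;br /&gt;
       'CREATE TRIGGER %I BEFORE TRUNCATE ON %s FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate()',&lt;br /&gt;
       'deny_truncate_on_' || r.relname, r.tbl);&lt;br /&gt;
   END LOOP;&lt;br /&gt;
 END;&lt;br /&gt;
 $$;&lt;br /&gt;
&lt;br /&gt;
Because tables that already carry the trigger are filtered out, the block can be re-run after new tables are created.&lt;br /&gt;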
&lt;br /&gt;
=== Initial setup ===&lt;br /&gt;
&lt;br /&gt;
To set up LLSR or BDR you first need a patched PostgreSQL that can support LLSR/BDR, then you need to create one or more LLSR/BDR senders and one or more LLSR/BDR receivers.&lt;br /&gt;
&lt;br /&gt;
==== Installing the patched PostgreSQL binaries ====&lt;br /&gt;
&lt;br /&gt;
Currently BDR is only available in builds of the 'bdr' branch on Andres Freund's git repo on git.postgresql.org. PostgreSQL 9.2 and below do not support BDR, and 9.3 requires patches, so this guide will not work for you if you are trying to use a normal install of PostgreSQL.&lt;br /&gt;
&lt;br /&gt;
First you need to clone, configure, compile and install as normal. Clone the sources from &amp;lt;tt&amp;gt;git://git.postgresql.org/git/users/andresfreund/postgres.git&amp;lt;/tt&amp;gt; and check out the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; branch.&lt;br /&gt;
&lt;br /&gt;
If you have an existing local PostgreSQL git tree specify it as &amp;lt;tt&amp;gt;--reference /path/to/existing/tree&amp;lt;/tt&amp;gt; to greatly speed your git clone.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p $HOME/bdr&lt;br /&gt;
 cd $HOME/bdr&lt;br /&gt;
 # clone and check out the bdr branch in one step&lt;br /&gt;
 git clone -b bdr git://git.postgresql.org/git/users/andresfreund/postgres.git $HOME/bdr/postgres-bdr-src&lt;br /&gt;
 cd postgres-bdr-src&lt;br /&gt;
 ./configure --prefix=$HOME/bdr/postgres-bdr-bin&lt;br /&gt;
 # build and install the server, then the bdr contrib module&lt;br /&gt;
 make install&lt;br /&gt;
 cd contrib/bdr&lt;br /&gt;
 make install&lt;br /&gt;
&lt;br /&gt;
This will put everything in &amp;lt;tt&amp;gt;$HOME/bdr&amp;lt;/tt&amp;gt;, with the source code and build tree in &amp;lt;tt&amp;gt;$HOME/bdr/postgres-bdr-src&amp;lt;/tt&amp;gt; and the installed PostgreSQL in &amp;lt;tt&amp;gt;$HOME/bdr/postgres-bdr-bin&amp;lt;/tt&amp;gt;. This is a convenient setup for testing and development because it doesn't require you to set up new users, wrangle permissions, run anything as root, etc, but it isn't recommended that you deploy this way in production.&lt;br /&gt;
&lt;br /&gt;
To actually use these new binaries you will need to:&lt;br /&gt;
&lt;br /&gt;
 export PATH=$HOME/bdr/postgres-bdr-bin/bin:$PATH&lt;br /&gt;
&lt;br /&gt;
before running &amp;lt;tt&amp;gt;initdb&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;postgres&amp;lt;/tt&amp;gt;, etc. You don't have to use the &amp;lt;tt&amp;gt;psql&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;libpq&amp;lt;/tt&amp;gt; you compiled but you're likely to get version mismatch warnings if you don't.&lt;br /&gt;
&lt;br /&gt;
=== Parameter Reference ===&lt;br /&gt;
&lt;br /&gt;
The following parameters are new or have been changed in PostgreSQL's new logical streaming replication.&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;shared_preload_libraries = 'bdr'&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
To load support for receiving changes on a downstream master, the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; library must be added to the existing &amp;lt;tt&amp;gt;shared_preload_libraries&amp;lt;/tt&amp;gt; parameter. This loads the bdr library during postmaster start-up and allows it to create the required background worker(s).&lt;br /&gt;
&lt;br /&gt;
Upstream masters don't need to load the bdr library unless they're also operating as a downstream master, as is the case in a BDR configuration.&lt;br /&gt;
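&lt;br /&gt;
As an illustrative sketch, a server that already preloads another library would list both, separated by commas. The &amp;lt;tt&amp;gt;pg_stat_statements&amp;lt;/tt&amp;gt; entry here is only an assumed example of a pre-existing library, not something BDR requires:&lt;br /&gt;
&lt;br /&gt;
 shared_preload_libraries = 'pg_stat_statements,bdr'   # keep existing entries, append bdr&lt;br /&gt;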
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
A comma-separated list of upstream master connection names is specified in &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt;. These names must be simple alphanumeric strings. They are used when naming the connection in error messages, configuration options and logs, but otherwise have no special meaning.&lt;br /&gt;
&lt;br /&gt;
A typical two-upstream-master setting might be:&lt;br /&gt;
&lt;br /&gt;
 bdr.connections = 'upstream1, upstream2'&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;bdr.&amp;amp;lt;connection_name&amp;amp;gt;.dsn&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Each connection name must have at least a data source name specified using the &amp;lt;tt&amp;gt;bdr.&amp;amp;lt;connection_name&amp;amp;gt;.dsn&amp;lt;/tt&amp;gt; parameter. The DSN syntax is the same as that used by libpq so it is not discussed in further detail here. A &amp;lt;tt&amp;gt;dbname&amp;lt;/tt&amp;gt; for the database to connect to must be specified; all other parts of the DSN are optional.&lt;br /&gt;
&lt;br /&gt;
The local (downstream) database name is assumed to be the same as the name of the upstream database being connected to, though future versions will make this configurable.&lt;br /&gt;
&lt;br /&gt;
For the above two-master setting for &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; the DSNs might look like:&lt;br /&gt;
&lt;br /&gt;
 bdr.upstream1.dsn = 'host=10.1.1.2 user=postgres dbname=replicated_db'&lt;br /&gt;
 bdr.upstream2.dsn = 'host=10.1.1.3 user=postgres dbname=replicated_db'&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
The new parameter &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; has been added for use on both upstream and downstream masters. This parameter controls the maximum number of logical replication slots - upstream or downstream - that this cluster may have at a time. It must be set at postmaster start time.&lt;br /&gt;
&lt;br /&gt;
As logical replication slots are persistent, slots are consumed even by replicas that are not currently connected. Slot management is discussed in Starting, Stopping and Managing Replication.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; should be set to the number of logical replication upstream masters this server will have plus the number of logical replication downstream masters that will connect to it.&lt;br /&gt;
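&lt;br /&gt;
As a worked example of that sum, assume a hypothetical server that receives changes from two upstream masters and sends changes to three downstream masters (the counts are made up for illustration):&lt;br /&gt;
&lt;br /&gt;
 max_logical_slots = 5    # 2 upstream masters + 3 downstream masters&lt;br /&gt;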
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;wal_level = 'logical'&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
A new setting, &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt;, has been added for the existing &amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt; parameter. &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt; includes everything that the existing &amp;lt;tt&amp;gt;hot_standby&amp;lt;/tt&amp;gt; setting does and adds additional details required for logical changeset decoding to the write-ahead logs.&lt;br /&gt;
&lt;br /&gt;
This additional information is consumed by the upstream-master-side xlog decoding worker. Downstream masters that do not also act as upstream masters do not require &amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt; to be increased above the default &amp;lt;tt&amp;gt;'minimal'&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt;, except for the new &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt; setting, is [http://www.postgresql.org/docs/current/static/runtime-config-wal.html documented in the main PostgreSQL manual].&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Logical replication hasn't altered the &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; parameter, but it is important in upstream masters for logical replication and BDR because every logical sender consumes a &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
You should configure &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; to the sum of the number of physical and logical replicas you want to allow an upstream master to serve. If you intend to use &amp;lt;tt&amp;gt;pg_basebackup&amp;lt;/tt&amp;gt; you should add at least two more senders to allow for its use.&lt;br /&gt;
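&lt;br /&gt;
For instance, a hypothetical upstream master serving one physical standby and three logical replicas, with two spare senders reserved for &amp;lt;tt&amp;gt;pg_basebackup&amp;lt;/tt&amp;gt;, might be configured as follows (the replica counts are assumptions chosen for the example):&lt;br /&gt;
&lt;br /&gt;
 max_wal_senders = 6      # 1 physical + 3 logical + 2 for pg_basebackup&lt;br /&gt;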
&lt;br /&gt;
Like &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; entries don't cost a large amount of memory, so you can overestimate fairly safely.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; is documented in [http://www.postgresql.org/docs/current/static/runtime-config-replication.html the main PostgreSQL documentation].&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Like &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt;, the &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; parameter isn't directly changed by logical replication but is still important for upstream masters. It is not required on downstream-only masters.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; should be set to a value that allows for some downtime or unreachable periods for downstream masters and for heavy bursts of write activity on the upstream master. &lt;br /&gt;
&lt;br /&gt;
Keep in mind that enough disk space must be available for the WAL segments, each of which is 16MB. If you run out of disk space the server will halt until disk space is freed and it may be quite difficult to free space when you can no longer start the server.&lt;br /&gt;
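&lt;br /&gt;
The disk space implied by a given &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; can be estimated from the 16MB segment size. A minimal sketch, runnable in &amp;lt;tt&amp;gt;psql&amp;lt;/tt&amp;gt;, assuming the value 5000 used in the Configuration example (the column alias is arbitrary):&lt;br /&gt;
&lt;br /&gt;
 -- segments * 16MB per segment, expressed in bytes; cast early to avoid integer overflow&lt;br /&gt;
 SELECT pg_size_pretty(5000::bigint * 16 * 1024 * 1024) AS wal_retained;&lt;br /&gt;
&lt;br /&gt;
This works out to roughly 78GB of space that must be available for retained WAL.&lt;br /&gt;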
&lt;br /&gt;
If a downstream master falls further behind than &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; allows, an &amp;quot;Insufficient WAL segments retained&amp;quot; error will be reported. See [[#Troubleshooting|Troubleshooting]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; is documented in [http://www.postgresql.org/docs/current/static/runtime-config-replication.html the main PostgreSQL manual].&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;track_commit_timestamp&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Setting this parameter to &amp;quot;on&amp;quot; enables commit timestamp tracking, which is used to implement last-UPDATE-wins conflict resolution.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Individual parameters are described in the [[#Parameter Reference|Parameter Reference]] section.&lt;br /&gt;
&lt;br /&gt;
The following configuration is an example of a simple one-way LLSR replication setup - a single upstream master to a single downstream master.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; of the upstream master (sender) should contain settings like:&lt;br /&gt;
&lt;br /&gt;
  wal_level = 'logical'       # Include enough info for logical replication&lt;br /&gt;
  max_logical_slots = X       # Number of LLSR senders + any receivers&lt;br /&gt;
  max_wal_senders = Y         # Y = max_logical_slots plus any physical &lt;br /&gt;
                              # streaming requirements&lt;br /&gt;
  wal_keep_segments = 5000    # Master must retain enough WAL segments to let &lt;br /&gt;
                              # replicas catch up. Correct value depends on&lt;br /&gt;
                              # rate of writes on master, max replica downtime&lt;br /&gt;
                              # allowable. 5000 segments requires 78GB&lt;br /&gt;
                              # in pg_xlog&lt;br /&gt;
&lt;br /&gt;
Downstream (receiver) &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
  shared_preload_libraries = 'bdr'&lt;br /&gt;
  &lt;br /&gt;
  bdr.connections = 'name_of_upstream_master'    # list of upstream master nodenames&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection&lt;br /&gt;
                                                 # from downstream to upstream master&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case&lt;br /&gt;
                                                 # where the database name on upstream&lt;br /&gt;
                                                 # and downstream master differ.&lt;br /&gt;
                                                 # (Not yet implemented)&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.apply_delay = ...               # optional parameter to delay apply of&lt;br /&gt;
                                                 # transactions, time in milliseconds&lt;br /&gt;
  bdr.synchronous_commit = ...                   # optional parameter to set the&lt;br /&gt;
                                                 # synchronous_commit parameter the&lt;br /&gt;
                                                 # apply processes will be using&lt;br /&gt;
  max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
Note that a server can be both sender and receiver: two servers may replicate to each other, and more complex configurations such as replication chains or trees are also possible.&lt;br /&gt;
&lt;br /&gt;
The upstream (sender) &amp;lt;tt&amp;gt;pg_hba.conf&amp;lt;/tt&amp;gt; must be configured to allow the downstream master to connect for replication. Otherwise you'll see errors like the following on the downstream master:&lt;br /&gt;
&lt;br /&gt;
 FATAL:  could not connect to the primary server: FATAL:  no pg_hba.conf entry for replication connection from host &amp;quot;[local]&amp;quot;, user &amp;quot;postgres&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A suitable &amp;lt;tt&amp;gt;pg_hba.conf&amp;lt;/tt&amp;gt; entry for a replication connection from the replica server 10.1.4.8 might be:&lt;br /&gt;
&lt;br /&gt;
  host    replication     postgres        10.1.4.8/32            trust&lt;br /&gt;
&lt;br /&gt;
(The user name should match the user name configured in the downstream master's DSN. md5 password authentication is supported.)&lt;br /&gt;
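&lt;br /&gt;
If password authentication is preferred over &amp;lt;tt&amp;gt;trust&amp;lt;/tt&amp;gt;, a sketch of the equivalent setup might look like the following. It assumes the password is supplied through the standard libpq &amp;lt;tt&amp;gt;password&amp;lt;/tt&amp;gt; keyword in the DSN; the connection name &amp;lt;tt&amp;gt;upstream1&amp;lt;/tt&amp;gt;, the addresses and the password shown are placeholders:&lt;br /&gt;
&lt;br /&gt;
 # pg_hba.conf on the upstream master&lt;br /&gt;
 host    replication     postgres        10.1.4.8/32            md5&lt;br /&gt;
 &lt;br /&gt;
 # postgresql.conf on the downstream master&lt;br /&gt;
 bdr.upstream1.dsn = 'host=10.1.1.2 user=postgres password=secret dbname=replicated_db'&lt;br /&gt;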
&lt;br /&gt;
For more details on these parameters, see [[#Parameter Reference|Parameter Reference]].&lt;br /&gt;
&lt;br /&gt;
=== Troubleshooting ===&lt;br /&gt;
&lt;br /&gt;
==== Could not access file &amp;quot;bdr&amp;quot;: No such file or directory ====&lt;br /&gt;
&lt;br /&gt;
If you see the error:&lt;br /&gt;
&lt;br /&gt;
 FATAL:  could not access file &amp;quot;bdr&amp;quot;: No such file or directory&lt;br /&gt;
&lt;br /&gt;
when starting a database set up to receive BDR replication, you probably forgot to install &amp;lt;tt&amp;gt;contrib/bdr&amp;lt;/tt&amp;gt;. See above.&lt;br /&gt;
&lt;br /&gt;
==== Invalid value for parameter ====&lt;br /&gt;
&lt;br /&gt;
An error like:&lt;br /&gt;
&lt;br /&gt;
 LOG:  invalid value for parameter ...&lt;br /&gt;
&lt;br /&gt;
when setting one of these parameters means your server doesn't support logical replication and will need to be patched or updated.&lt;br /&gt;
&lt;br /&gt;
==== Insufficient WAL segments retained (&amp;quot;requested WAL segment ... has already been removed&amp;quot;) ====&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; is insufficient to meet the requirements of a replica that has fallen far behind, the master will report errors like:&lt;br /&gt;
&lt;br /&gt;
 ERROR:  requested WAL segment 00000001000000010000002D has already been removed&lt;br /&gt;
&lt;br /&gt;
Currently the replica errors look like:&lt;br /&gt;
&lt;br /&gt;
 WARNING:  Starting logical replication&lt;br /&gt;
 LOG:  data stream ended&lt;br /&gt;
 LOG:  worker process: master (PID 23812) exited with exit code 0&lt;br /&gt;
 LOG:  starting background worker process &amp;quot;master&amp;quot;&lt;br /&gt;
 LOG:  master initialized on master, remote dbname=master port=5434 replication=true fallback_application_name=bdr&lt;br /&gt;
 LOG:  local sysid 5873181566046043070, remote: 5873181102189050714&lt;br /&gt;
 LOG:  found valid replication identifier 1&lt;br /&gt;
 LOG:  starting up replication at 1 from 1/2D9CA220&lt;br /&gt;
&lt;br /&gt;
but a more explicit error message for this condition is planned.&lt;br /&gt;
&lt;br /&gt;
The only way to recover from this fault is to re-seed the replica database.&lt;br /&gt;
&lt;br /&gt;
This fault could be prevented with feedback from the replica to the master, but this feature is not planned for the first release of BDR. Another alternative considered for future releases is making wal_keep_segments a dynamic parameter that is sized on demand.&lt;br /&gt;
&lt;br /&gt;
Monitoring of maximum replica lag and appropriate adjustment of wal_keep_segments will prevent this fault from arising.&lt;br /&gt;
&lt;br /&gt;
==== Couldn't find logical slot ====&lt;br /&gt;
&lt;br /&gt;
An error like:&lt;br /&gt;
&lt;br /&gt;
 ERROR:  couldn't find logical slot &amp;quot;bdr: 16384:5873181566046043070-1-24596:&amp;quot;&lt;br /&gt;
&lt;br /&gt;
on the upstream master suggests that a downstream master is trying to connect to a logical replication slot that no longer exists. The slot cannot be re-created, so it is necessary to re-seed the downstream replica database.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level (i.e. SQL-visible) ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
The following views are available for monitoring replication activity:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE pg_stat_replication]&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; (described below)&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_bdr&amp;lt;/tt&amp;gt; (described below)&lt;br /&gt;
&lt;br /&gt;
The following configuration and logging parameters are useful for monitoring replication:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS log_lock_waits]&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== pg_stat_logical_replication ====&lt;br /&gt;
&lt;br /&gt;
The new &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; view is specific to logical replication. It is based on the underlying &amp;lt;tt&amp;gt;pg_stat_get_logical_replication_slots&amp;lt;/tt&amp;gt; function and has the following structure:&lt;br /&gt;
&lt;br /&gt;
  View &amp;quot;pg_catalog.pg_stat_logical_replication&amp;quot;&lt;br /&gt;
           Column          |  Type   | Modifiers &lt;br /&gt;
 --------------------------+---------+-----------&lt;br /&gt;
  slot_name                | text    | &lt;br /&gt;
  plugin                   | text    | &lt;br /&gt;
  database                 | oid     | &lt;br /&gt;
  active                   | boolean | &lt;br /&gt;
  xmin                     | xid     | &lt;br /&gt;
  last_required_checkpoint | text    | &lt;br /&gt;
&lt;br /&gt;
It contains one row for every connection from a downstream master to the server being queried (the upstream master). On a standalone PostgreSQL server or a downstream-only master this view will contain no rows.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;slot_name&amp;lt;/tt&amp;gt;: An internal name for a given logical replication slot (a connection from a downstream master to this upstream master). This slot name is used by the downstream master to uniquely identify itself and is used with the &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; command when managing logical replication slots. The slot name is composed of the decoding plugin name, the upstream database oid, the downstream system identifier (from &amp;lt;tt&amp;gt;pg_control&amp;lt;/tt&amp;gt;), the downstream slot number, and the downstream database oid.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;plugin&amp;lt;/tt&amp;gt;: The logical replication plugin being used to decode WAL archives. You'll generally only see &amp;lt;tt&amp;gt;bdr_output&amp;lt;/tt&amp;gt; here.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;database&amp;lt;/tt&amp;gt;: The oid of the database being replicated by this slot. You can get the database name by joining on &amp;lt;tt&amp;gt;pg_database.oid&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt;: Whether this slot currently has an active connection.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt;: The lowest transaction ID this replication slot can &amp;quot;see&amp;quot;, like the xmin of a transaction or prepared transaction. xmin should keep on advancing as replication continues.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;last_required_checkpoint&amp;lt;/tt&amp;gt;: The checkpoint identifying the oldest WAL record required to bring this slot up to date with the upstream master. (This column is likely to be removed in a future version).&lt;br /&gt;
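&lt;br /&gt;
A minimal sketch of a query against this view, run on the upstream master, that resolves the &amp;lt;tt&amp;gt;database&amp;lt;/tt&amp;gt; oid to a name by joining on &amp;lt;tt&amp;gt;pg_database.oid&amp;lt;/tt&amp;gt;. It uses only the columns listed above; the table aliases are arbitrary:&lt;br /&gt;
&lt;br /&gt;
 -- one row per logical replication slot, with the database name instead of its oid&lt;br /&gt;
 SELECT s.slot_name, d.datname, s.plugin, s.active, s.xmin&lt;br /&gt;
   FROM pg_stat_logical_replication s&lt;br /&gt;
   JOIN pg_database d ON d.oid = s.database;&lt;br /&gt;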
&lt;br /&gt;
==== pg_stat_bdr ====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;pg_stat_bdr&amp;lt;/tt&amp;gt; view is supplied by the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; extension. It provides information on a server's connection(s) to its upstream master(s). It is not present on upstream-only masters.&lt;br /&gt;
&lt;br /&gt;
The primary purpose of this view is to report statistics on the progress of LLSR apply on a per-upstream master connection basis.&lt;br /&gt;
&lt;br /&gt;
View structure:&lt;br /&gt;
&lt;br /&gt;
         View &amp;quot;public.pg_stat_bdr&amp;quot;&lt;br /&gt;
        Column       |  Type  | Modifiers &lt;br /&gt;
 --------------------+--------+-----------&lt;br /&gt;
  rep_node_id        | oid    | &lt;br /&gt;
  riremotesysid      | name   | &lt;br /&gt;
  riremotedb         | oid    | &lt;br /&gt;
  rilocaldb          | oid    | &lt;br /&gt;
  nr_commit          | bigint | &lt;br /&gt;
  nr_rollback        | bigint | &lt;br /&gt;
  nr_insert          | bigint | &lt;br /&gt;
  nr_insert_conflict | bigint | &lt;br /&gt;
  nr_update          | bigint | &lt;br /&gt;
  nr_update_conflict | bigint | &lt;br /&gt;
  nr_delete          | bigint | &lt;br /&gt;
  nr_delete_conflict | bigint | &lt;br /&gt;
  nr_disconnect      | bigint | &lt;br /&gt;
&lt;br /&gt;
Fields:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;rep_node_id&amp;lt;/tt&amp;gt;: An internal identifier for the replication slot.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;riremotesysid&amp;lt;/tt&amp;gt;: The remote database system identifier, as reported by the &amp;lt;tt&amp;gt;Database system identifier&amp;lt;/tt&amp;gt; line of &amp;lt;tt&amp;gt;pg_controldata /path/to/datadir&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt;: The remote database OID, i.e. the &amp;lt;tt&amp;gt;oid&amp;lt;/tt&amp;gt; column of the remote server's &amp;lt;tt&amp;gt;pg_catalog.pg_database&amp;lt;/tt&amp;gt; entry for the replicated database. You can get the database name with &amp;lt;tt&amp;gt;select datname from pg_database where oid = 12345&amp;lt;/tt&amp;gt; (where '12345' is the &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt; oid).&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;rilocaldb&amp;lt;/tt&amp;gt;: The local database OID, with the same meaning as &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt; but with oids from the local system.&lt;br /&gt;
&lt;br /&gt;
''The rest of the columns are statistics about this upstream master slot'':&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_commit&amp;lt;/tt&amp;gt;: Number of commits applied to date from this master&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_rollback&amp;lt;/tt&amp;gt;: Number of rollbacks performed by this apply process due to recoverable errors (deadlock retries, lost races, etc) or unrecoverable errors like mismatched constraint errors.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_insert&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_insert_conflict&amp;lt;/tt&amp;gt;:  Number of &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_update&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_update_conflict&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_delete&amp;lt;/tt&amp;gt;: Number of deletes performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_delete_conflict&amp;lt;/tt&amp;gt;: Number of deletes that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_disconnect&amp;lt;/tt&amp;gt;: Number of times this apply process has lost its connection to the upstream master since it was started.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This view does not contain any information about how far behind the upstream master this downstream master is. The upstream master's &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;pg_stat_replication&amp;lt;/tt&amp;gt; views must be queried to determine replication lag.&lt;br /&gt;
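&lt;br /&gt;
As a sketch of how the counters might be read on a downstream master, the following query sums the three conflict columns for each upstream connection. The &amp;lt;tt&amp;gt;nr_conflict&amp;lt;/tt&amp;gt; alias is made up for this example; all other names come from the view structure above:&lt;br /&gt;
&lt;br /&gt;
 -- per-upstream apply progress, with all conflict counters added together&lt;br /&gt;
 SELECT rep_node_id, riremotesysid, nr_commit, nr_rollback,&lt;br /&gt;
        nr_insert_conflict + nr_update_conflict + nr_delete_conflict AS nr_conflict,&lt;br /&gt;
        nr_disconnect&lt;br /&gt;
   FROM pg_stat_bdr;&lt;br /&gt;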
&lt;br /&gt;
==== Monitoring replication status and lag ====&lt;br /&gt;
&lt;br /&gt;
As with any replication setup, it is vital to monitor replication status on all BDR nodes to ensure no node is lagging severely behind the others or is stuck.&lt;br /&gt;
&lt;br /&gt;
In the case of BDR a stuck or crashed node will eventually cause disk space and table bloat problems on other masters so stuck nodes should be detected and removed or repaired in a reasonably timely manner. Exactly how urgent this is depends on the workload of the BDR group.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; view described above may be used to verify that a downstream master is connected to its upstream master - the &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt; boolean column is &amp;lt;tt&amp;gt;t&amp;lt;/tt&amp;gt; if there's a downstream master connected.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt; column provides an indication of whether replication is advancing; it should increase as replication progresses. There is no simple way to turn &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt; into the time the last applied transaction was committed on the master, so it doesn't provide an indication of wall-clock lag.&lt;br /&gt;
&lt;br /&gt;
To determine wall-clock replication lag an application-level ticker may be used to periodically update a timestamp in a replicated table. The difference between this timestamp on the upstream and downstream masters provides the wall-clock replication lag. For BDR one row may be added to the table for each BDR master, giving an indication of how much lag each master has relative to each other master.&lt;br /&gt;
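&lt;br /&gt;
A minimal sketch of such a ticker follows. The table name &amp;lt;tt&amp;gt;replication_ticker&amp;lt;/tt&amp;gt;, its columns and the node name &amp;lt;tt&amp;gt;alpha&amp;lt;/tt&amp;gt; are hypothetical; the table is assumed to have been created identically on every node, and the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; is assumed to be run periodically by some external scheduler:&lt;br /&gt;
&lt;br /&gt;
 -- on every node: one row per master, keyed by node name&lt;br /&gt;
 CREATE TABLE replication_ticker (&lt;br /&gt;
     node_name text PRIMARY KEY,&lt;br /&gt;
     ticked_at timestamptz NOT NULL&lt;br /&gt;
 );&lt;br /&gt;
 &lt;br /&gt;
 -- on alpha, once: create this master's row&lt;br /&gt;
 INSERT INTO replication_ticker (node_name, ticked_at) VALUES ('alpha', now());&lt;br /&gt;
 &lt;br /&gt;
 -- on alpha, periodically: advance the timestamp&lt;br /&gt;
 UPDATE replication_ticker SET ticked_at = now() WHERE node_name = 'alpha';&lt;br /&gt;
 &lt;br /&gt;
 -- on any other node: approximate wall-clock lag behind alpha&lt;br /&gt;
 SELECT node_name, now() - ticked_at AS approx_lag&lt;br /&gt;
   FROM replication_ticker&lt;br /&gt;
  WHERE node_name = 'alpha';&lt;br /&gt;
&lt;br /&gt;
The reported value includes the interval between ticks and any clock difference between the servers, so it is only an approximation.&lt;br /&gt;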
&lt;br /&gt;
=== Table and index usage statistics ===&lt;br /&gt;
&lt;br /&gt;
Statistics on table and index usage are updated normally by the downstream master. This is essential for correct function of auto-vacuum. If there are no local writes on the downstream master and stats have not been reset these two views should show matching results between upstream and downstream:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_user_tables&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_statio_user_tables&amp;lt;/tt&amp;gt;&lt;br /&gt;
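&lt;br /&gt;
One way to compare the two sides, sketched here using standard columns of &amp;lt;tt&amp;gt;pg_stat_user_tables&amp;lt;/tt&amp;gt;, is to run the same query on the upstream and downstream masters and compare the output:&lt;br /&gt;
&lt;br /&gt;
 -- row-level write counters per table; run on both servers and compare&lt;br /&gt;
 SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del&lt;br /&gt;
   FROM pg_stat_user_tables&lt;br /&gt;
  ORDER BY schemaname, relname;&lt;br /&gt;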
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used than non-identifying indexes under workloads that perform &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s.&lt;br /&gt;
&lt;br /&gt;
The built-in index monitoring views are:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_user_indexes&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_statio_user_indexes&amp;lt;/tt&amp;gt;&lt;br /&gt;
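&lt;br /&gt;
A sketch of a query that shows which indexes the apply process is exercising on a downstream master, using standard columns of &amp;lt;tt&amp;gt;pg_stat_user_indexes&amp;lt;/tt&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 -- most heavily scanned indexes first&lt;br /&gt;
 SELECT relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch&lt;br /&gt;
   FROM pg_stat_user_indexes&lt;br /&gt;
  ORDER BY idx_scan DESC;&lt;br /&gt;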
&lt;br /&gt;
All these views are discussed in [http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE the PostgreSQL documentation on the statistics views].&lt;br /&gt;
&lt;br /&gt;
=== Starting, stopping and managing replication ===&lt;br /&gt;
&lt;br /&gt;
Replication is managed with the &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; settings described in &amp;quot;Parameter Reference&amp;quot; and &amp;quot;Configuration&amp;quot; above, and using the &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; utility command.&lt;br /&gt;
&lt;br /&gt;
==== Starting a new LLSR connection ====&lt;br /&gt;
&lt;br /&gt;
Logical replication is started automatically when a database is configured as a downstream master in &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; (see [[#Configuration|Configuration]]) and the postmaster is started. No explicit action is required to start replication, but replication will not actually work unless the upstream and downstream databases are identical within the requirements set by LLSR in the [[#Table definitions and DDL replication|Table definitions and DDL replication]] section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;pg_dump&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;pg_restore&amp;lt;/tt&amp;gt; may be used to set up the new replica's database.&lt;br /&gt;
&lt;br /&gt;
==== Viewing logical replication slots ====&lt;br /&gt;
&lt;br /&gt;
Examining the state of logical replication is discussed in [[#Monitoring|Monitoring]].&lt;br /&gt;
&lt;br /&gt;
==== Temporarily stopping an LLSR replica ====&lt;br /&gt;
&lt;br /&gt;
LLSR replicas can be temporarily stopped by shutting down the downstream master's postmaster.&lt;br /&gt;
&lt;br /&gt;
If the replica is not started again before the upstream master discards the oldest WAL segment that the downstream master needs to resume replay (as identified by the &amp;lt;tt&amp;gt;last_required_checkpoint&amp;lt;/tt&amp;gt; column of &amp;lt;tt&amp;gt;pg_catalog.pg_stat_logical_replication&amp;lt;/tt&amp;gt;), the replica will not resume replay. The error [[#Insufficient_WAL_segments_retained_.28.22requested_WAL_segment_..._has_already_been_removed.22.29|Insufficient WAL segments retained]] will be reported in the upstream master's logs, and the replica must be re-created for replication to continue.&lt;br /&gt;
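&lt;br /&gt;
While a replica is stopped, its slot can be watched from the upstream master. A minimal sketch using only the columns documented for &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 -- slots whose downstream master is not currently connected&lt;br /&gt;
 SELECT slot_name, xmin, last_required_checkpoint&lt;br /&gt;
   FROM pg_stat_logical_replication&lt;br /&gt;
  WHERE NOT active;&lt;br /&gt;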
&lt;br /&gt;
==== Removing an LLSR replica permanently ====&lt;br /&gt;
&lt;br /&gt;
To remove a replication connection permanently, remove its entries from the downstream master's &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;, restart the downstream master, then use &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; to remove the replication slot on the upstream master.&lt;br /&gt;
&lt;br /&gt;
It is important to remove the replication slot from the upstream master(s) to prevent xid wrap-around problems and issues with table bloat caused by delayed vacuum.&lt;br /&gt;
&lt;br /&gt;
==== Cleaning up abandoned replication slots ====&lt;br /&gt;
&lt;br /&gt;
To remove a replication slot that was used for a now-defunct replica, find its slot name from the &amp;lt;tt&amp;gt;[[#pg_stat_logical_replication|pg_stat_logical_replication]]&amp;lt;/tt&amp;gt; view on the upstream master then run:&lt;br /&gt;
&lt;br /&gt;
 pg_receivellog -p 5434 -h master-hostname -d dbname \&lt;br /&gt;
    --slot='bdr: 16384:5873181566046043070-1-16384:' --stop&lt;br /&gt;
&lt;br /&gt;
where the argument to '--slot' is the slot name you found from the view.&lt;br /&gt;
&lt;br /&gt;
You may need to do this if you've created and then deleted several replicas so &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; has filled up with entries for replicas that no longer exist.&lt;br /&gt;
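&lt;br /&gt;
To check how close the upstream master is to that limit, a query along the following lines may help. It is a sketch that assumes every slot, connected or not, appears as one row in &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt;; the column aliases are arbitrary:&lt;br /&gt;
&lt;br /&gt;
 -- slots currently allocated, how many are idle, and the configured limit&lt;br /&gt;
 SELECT count(*) AS slots_in_use,&lt;br /&gt;
        sum(CASE WHEN active THEN 0 ELSE 1 END) AS slots_inactive,&lt;br /&gt;
        current_setting('max_logical_slots') AS max_slots&lt;br /&gt;
   FROM pg_stat_logical_replication;&lt;br /&gt;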
&lt;br /&gt;
== Bi-Directional Replication ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional replication is built directly on LLSR by configuring two or more servers as both upstream ''and'' downstream masters of each other.&lt;br /&gt;
&lt;br /&gt;
All of the Log Level Streaming Replication documentation applies to BDR and should be read before moving on to reading about and configuring BDR.&lt;br /&gt;
&lt;br /&gt;
=== Bi-Directional Replication Use Cases ===&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master: the upstream master sends to the physical standby, which in turn sends to the downstream master.&lt;br /&gt;
&lt;br /&gt;
==== Simple multi-master pair ====&lt;br /&gt;
&lt;br /&gt;
A simple multi-master &amp;quot;HA Cluster&amp;quot; with two servers:&lt;br /&gt;
&lt;br /&gt;
* Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
* Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Alpha:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections = 'bravo'&lt;br /&gt;
 bdr.bravo.dsn = 'dbname=dbtoreplicate'&lt;br /&gt;
 track_commit_timestamp = on&lt;br /&gt;
&lt;br /&gt;
Bravo:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections = 'alpha'&lt;br /&gt;
 bdr.alpha.dsn = 'dbname=dbtoreplicate'&lt;br /&gt;
 track_commit_timestamp = on&lt;br /&gt;
&lt;br /&gt;
See [[#Configuration|Configuration]] for an explanation of these parameters.&lt;br /&gt;
&lt;br /&gt;
==== HA and Logical Standby ====&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;HA Cluster&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
* Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
* Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
* Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
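&lt;br /&gt;
The following is only a sketch of how this topology might be configured, assembled from the parameters in [[#Parameter Reference|Parameter Reference]]. The host name in the DSN, the database name and the slot and sender counts are assumptions, and the physical standby settings for Bravo are omitted:&lt;br /&gt;
&lt;br /&gt;
Alpha:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 1                  # one downstream master (Charlie)&lt;br /&gt;
 max_wal_senders = 2                    # Bravo (physical) + Charlie (logical)&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 synchronous_standby_names = 'bravo'    # optional, if Bravo is a synchronous standby&lt;br /&gt;
&lt;br /&gt;
Charlie:&lt;br /&gt;
&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 max_logical_slots = 1                  # one upstream master (Alpha)&lt;br /&gt;
 bdr.connections = 'alpha'&lt;br /&gt;
 bdr.alpha.dsn = 'host=alpha dbname=dbtoreplicate'&lt;br /&gt;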
&lt;br /&gt;
==== Very High Availability Multi-Master ====&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site simple Multi-Master Plex ====&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes.)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
If you wanted to test this configuration locally you could run three PostgreSQL instances on different ports. Such a configuration would look like the following if the port numbers were used as node names for the sake of notational clarity:&lt;br /&gt;
&lt;br /&gt;
Config for node_5440:&lt;br /&gt;
&lt;br /&gt;
 port = 5440&lt;br /&gt;
 bdr.connections='node_5441,node_5442'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5441:&lt;br /&gt;
&lt;br /&gt;
 port = 5441&lt;br /&gt;
 bdr.connections='node_5440,node_5442'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5442:&lt;br /&gt;
&lt;br /&gt;
 port = 5442&lt;br /&gt;
 bdr.connections='node_5440,node_5441'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
In a typical real-world configuration each server would be on the same port on a different host instead.&lt;br /&gt;
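&lt;br /&gt;
The three per-node files above follow a regular pattern: each node lists every other node in &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; and supplies one DSN per peer. As a minimal sketch in Python, the same settings can be generated for any list of local ports. The helper name &amp;lt;tt&amp;gt;mesh_config&amp;lt;/tt&amp;gt; is hypothetical and not part of BDR; only the parameter names are taken from the example above.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_config(ports, dbname="postgres"):
    """Return {node_name: config_text} for an all-to-all test layout.

    Hypothetical helper: it only formats the settings shown above and
    does not talk to PostgreSQL.
    """
    configs = {}
    for port in ports:
        # Every other port is a peer of this node.
        peers = [p for p in ports if p != port]
        lines = ["port = %d" % port]
        lines.append("bdr.connections='%s'" % ",".join("node_%d" % p for p in peers))
        for p in peers:
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs["node_%d" % port] = "\n".join(lines)
    return configs


if __name__ == "__main__":
    for name, text in sorted(mesh_config([5440, 5441, 5442]).items()):
        print("Config for %s:" % name)
        print(text)
        print()
```
&lt;br /&gt;
For ports 5440, 5441 and 5442 this prints the same three configurations listed above. For real hosts the DSN would carry a host name instead of a distinct port.&lt;br /&gt;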
&lt;br /&gt;
==== 3-remote site simple Multi-Master Circular Replication ====&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &amp;quot;circular replication&amp;quot;. It is easier to set up, but latency for changes grows as the number of nodes increases. It's also less resilient to network disruptions and node faults.&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
TODO: Regrettably this doesn't actually work yet because we don't cascade logical changes (yet).&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity:&lt;br /&gt;
&lt;br /&gt;
Config for node_5440:&lt;br /&gt;
&lt;br /&gt;
 port = 5440&lt;br /&gt;
 bdr.connections='node_5441'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5441:&lt;br /&gt;
&lt;br /&gt;
 port = 5441&lt;br /&gt;
 bdr.connections='node_5442'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5442:&lt;br /&gt;
&lt;br /&gt;
 port = 5442&lt;br /&gt;
 bdr.connections='node_5440'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
This would usually be done in the real world with databases on different hosts, all running on the same port.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site Max Availability Multi-Master Plex ====&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
Config left as an exercise for the reader.&lt;br /&gt;
&lt;br /&gt;
==== N-site symmetric cluster replication ====&lt;br /&gt;
&lt;br /&gt;
A symmetric cluster is one in which all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to the other masters, so the practical limit is fewer than 100 servers, or lower still if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
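&lt;br /&gt;
The connection arithmetic above can be sketched in a few lines of Python. The function name &amp;lt;tt&amp;gt;mesh_connections&amp;lt;/tt&amp;gt; is hypothetical, and multiplying by the number of databases assumes one connection per replicated database between each pair of masters.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_connections(n_masters, n_databases=1):
    """Return (connections per master, connections in the whole cluster).

    Assumes a symmetric cluster and one connection per replicated
    database between each pair of masters.
    """
    per_master = (n_masters - 1) * n_databases
    return per_master, n_masters * per_master


if __name__ == "__main__":
    print(mesh_connections(19))     # the tested N=19 case, one database
    print(mesh_connections(19, 5))  # same cluster, five replicated databases
```
&lt;br /&gt;
For N=19 and a single database this gives 18 connections per master and 342 in total, which shows how quickly the total grows even though the per-master count is linear in N.&lt;br /&gt;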
&lt;br /&gt;
==== Complex/Asymmetric Replication ====&lt;br /&gt;
&lt;br /&gt;
A variety of options are possible.&lt;br /&gt;
&lt;br /&gt;
=== Conflict Avoidance ===&lt;br /&gt;
&lt;br /&gt;
==== Distributed Locking ====&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications as very low latency is critical for acceptable performance.&lt;br /&gt;
&lt;br /&gt;
Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible but allow some types of conflict to occur and resolve them when they arise.&lt;br /&gt;
&lt;br /&gt;
==== Global Sequences ====&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using &amp;lt;tt&amp;gt;DEFAULT nextval('mysequence')&amp;lt;/tt&amp;gt;, as with PostgreSQL's &amp;lt;tt&amp;gt;SERIAL&amp;lt;/tt&amp;gt; pseudo-type.&lt;br /&gt;
&lt;br /&gt;
BDR requires sequences to work together across multiple nodes. This is implemented as a new &amp;lt;tt&amp;gt;SequenceAccessMethod&amp;lt;/tt&amp;gt; API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
=== Conflict Detection &amp;amp; Resolution ===&lt;br /&gt;
&lt;br /&gt;
Because local writes can occur on a master, conflict detection and avoidance is a concern for basic LLSR setups as well as full BDR configurations.&lt;br /&gt;
&lt;br /&gt;
==== Lock Conflicts ====&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity can delay those changes because of lock waits. Use the &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS log_lock_waits]&amp;lt;/tt&amp;gt; facility to look for issues with apply blocking on locks.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes:&lt;br /&gt;
&lt;br /&gt;
* explicit row level locking (&amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/FOR SHARE&amp;lt;/tt&amp;gt;)&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s, &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s or &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s, either from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
==== Data Conflicts ====&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in a consistent and idempotent manner so that all servers end up with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from &amp;lt;tt&amp;gt;pg_control&amp;lt;/tt&amp;gt;, though this may change in a future release.&lt;br /&gt;
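&lt;br /&gt;
As a rough illustration of this rule, the Python sketch below picks the surviving version of a row that was updated on two nodes. It is not BDR's implementation: the dictionary layout and the function name are hypothetical, and letting the higher system identifier win a tie is an assumption, since the text only says that the identifier breaks the tie.&lt;br /&gt;
&lt;br /&gt;
```python
def last_update_wins(version_a, version_b):
    """Pick the surviving version of a row updated on two nodes.

    Each version is a dict with 'ts' (update timestamp), 'sysid'
    (system identifier of the originating node) and 'row' keys.
    """
    # The later timestamp wins. On an equal timestamp the comparison
    # falls back to the system identifier (higher wins: an assumption).
    return max(version_a, version_b, key=lambda v: (v["ts"], v["sysid"]))


if __name__ == "__main__":
    from_alpha = {"ts": 100, "sysid": 11, "row": {"id": 1, "qty": 5}}
    from_bravo = {"ts": 100, "sysid": 22, "row": {"id": 1, "qty": 7}}
    # Both nodes reach the same answer whichever version they see first.
    print(last_update_wins(from_alpha, from_bravo)["row"])
    print(last_update_wins(from_bravo, from_alpha)["row"])
```
&lt;br /&gt;
The important property is that the result does not depend on the order in which a node sees the two versions, so every node converges on the same row.&lt;br /&gt;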
&lt;br /&gt;
&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur. The errors causing the conflict can be seen in the error log of the downstream master with the problem.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; conflicts. These are accepted as successful operations but in the case of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; the data in the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; is discarded.&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, with the other discarded according to last-update-wins. It is not practical to decide at the database level when a row should be merged and when a last-update-wins strategy should be used; such decision making would require support for application-specific conflict resolution plugins.&lt;br /&gt;
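&lt;br /&gt;
A small Python sketch of such a false conflict follows, assuming that the winning node's whole row image replaces the other one. The function name and row layout are hypothetical and only illustrate the row-level rule described above.&lt;br /&gt;
&lt;br /&gt;
```python
def row_level_winner(version_a, version_b):
    """Keep the whole row image that carries the later timestamp."""
    return max(version_a, version_b, key=lambda v: v["ts"])["row"]


if __name__ == "__main__":
    base = {"id": 1, "name": "Ann", "city": "Oslo"}
    # Node A changes only "name"; node B changes only "city", slightly later.
    on_a = {"ts": 100, "row": dict(base, name="Anne")}
    on_b = {"ts": 101, "row": dict(base, city="Bergen")}
    print(row_level_winner(on_a, on_b))
```
&lt;br /&gt;
The surviving row carries node B's new city but the original name: node A's change is discarded even though the two updates touched different columns.&lt;br /&gt;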
&lt;br /&gt;
Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
==== Examples ====&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-05-16T13:30:50Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Replication of DML changes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;----&lt;br /&gt;
This page is the users and administrators guide for BDR. If you're looking for technical details on the project plan and implementation, see [[BDR Project]].&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
= BDR User Guide =&lt;br /&gt;
&lt;br /&gt;
BDR (BiDirectional Replication) is a feature being developed for inclusion in PostgreSQL core that provides greatly enhanced replication capabilities.&lt;br /&gt;
&lt;br /&gt;
BDR allows users to create a geographically distributed multi-master database using Logical Log Streaming Replication (LLSR) transport.&lt;br /&gt;
BDR is designed to provide both high availability and geographically distributed disaster recovery capabilities. &lt;br /&gt;
&lt;br /&gt;
BDR is not “clustering” as some vendors use the term, in that it doesn't have a distributed lock manager, global transaction co-ordinator, etc. Each member server is separate yet connected, with design choices that allow separation between nodes that would not be possible with global transaction coordination.&lt;br /&gt;
&lt;br /&gt;
Guidance on getting a testing setup established is in [[#Initial setup]]. Please read the full documentation if you intend to put BDR into production.&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows one PostgreSQL master (the &amp;quot;upstream master&amp;quot;) to stream a sequence of changes to another read/write PostgreSQL server (the &amp;quot;downstream master&amp;quot;). Data is sent in one direction only over a normal libpq connection.&lt;br /&gt;
&lt;br /&gt;
Multiple LLSR connections can be used to set up bi-directional replication as discussed later in this guide.&lt;br /&gt;
&lt;br /&gt;
=== Overview of logical replication ===&lt;br /&gt;
&lt;br /&gt;
In some ways LLSR is similar to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective; both replicate changes from one server to another. However, in LLSR the receiving server is also a full master database that can make changes, unlike the read-only replicas offered by PLSR hot standby. Additionally, LLSR is per-database, whereas PLSR is per-cluster and replicates all databases at once. There are many more differences discussed in the relevant sections of this document.&lt;br /&gt;
&lt;br /&gt;
In LLSR the data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, which makes change capture on the upstream master much more efficient than trigger-based replication and is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after &amp;lt;tt&amp;gt;CREATE DATABASE&amp;lt;/tt&amp;gt;. A restart of the downstream master is also required. The upstream master only needs restarting if the &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; parameter is too low to allow a new replica to be added. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated. Setup is discussed in more detail below.&lt;br /&gt;
&lt;br /&gt;
Changes are processed by the downstream master using &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; plug-ins. This allows flexible handling of replication input, including:&lt;br /&gt;
&lt;br /&gt;
* BDR apply process - applies logical changes to the downstream master. The apply process makes changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
* Textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_xlogdump&amp;lt;/tt&amp;gt; - examines physical WAL records and produces textual debugging output. This server program is included in PostgreSQL 9.3.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;TRUNCATE&amp;lt;/tt&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented, but will be implemented before the feature goes to final release).&lt;br /&gt;
&lt;br /&gt;
Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it has overheads that mean that it doesn't always use less bandwidth than PLSR.&lt;br /&gt;
&lt;br /&gt;
Locks taken by &amp;lt;tt&amp;gt;LOCK&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/SHARE&amp;lt;/tt&amp;gt; on the upstream master are not replicated to downstream masters. Locks taken automatically by &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;TRUNCATE&amp;lt;/tt&amp;gt; *are* taken on the downstream master and may delay replication apply or concurrent transactions - see [[#Lock Conflicts|Lock Conflicts]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;TEMPORARY&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;UNLOGGED&amp;lt;/tt&amp;gt; tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables. However, temporary tables remain specific to a particular session so creating a temporary table on the upstream master does not create a similar table on the downstream master.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to be applied at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s on the upstream master do not require a unique constraint in order to replicate correctly. &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s require some form of unique constraint, either &amp;lt;tt&amp;gt;PRIMARY KEY&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;UNIQUE NOT NULL&amp;lt;/tt&amp;gt;. A warning is issued in the downstream master's logs if the expected constraint is absent.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s that change the value of the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the final values from the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; on the upstream master, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value. Volatile or stable functions are evaluated on the master side and the resulting values are replicated. Consequently any function side-effects (writing files, network socket activity, updating internal PostgreSQL variables, etc) will not occur on the replicas, as the functions are not run again on the replica.&lt;br /&gt;
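&lt;br /&gt;
The difference from statement-based replay can be sketched in Python as follows. This is only an illustration with hypothetical names: the upstream side evaluates the expressions once, including a volatile function, and the downstream side stores the resulting values without re-evaluating anything.&lt;br /&gt;
&lt;br /&gt;
```python
import random


def upstream_update(row):
    """Run the 'statement' once and return the new row plus a change record."""
    # n = n + 1 and the volatile random() call are evaluated here only.
    new_row = dict(row, n=row["n"] + 1, token=random.random())
    change = {"key": row["id"], "new": dict(new_row)}
    return new_row, change


def downstream_apply(table, change):
    """Store the replicated values as-is; nothing is re-evaluated here."""
    table[change["key"]] = dict(change["new"])


if __name__ == "__main__":
    original = {"id": 1, "n": 41, "token": 0.0}
    replica = {1: dict(original)}
    new_row, change = upstream_update(original)
    downstream_apply(replica, change)
    print(replica[1] == new_row)
```
&lt;br /&gt;
Because the change record carries final values, the replica ends up with the same random token as the upstream row, which would not be the case if the statement were executed again downstream.&lt;br /&gt;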
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master (see &amp;quot;Limitations&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
The current LLSR plugin implementation uses the binary libpq protocol, so it requires that the upstream and downstream master use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication. A textual output option will be added later for passing data between non-identical servers, e.g. laptops or mobile devices communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding &amp;lt;tt&amp;gt;AccessExclusiveLock&amp;lt;/tt&amp;gt;.&lt;br /&gt;
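&lt;br /&gt;
The buffering behaviour described above can be sketched as a small Python function. It is a simplification with hypothetical names: changes are grouped per transaction, handed over only when that transaction commits, and dropped when it aborts. Spilling to disk is left out.&lt;br /&gt;
&lt;br /&gt;
```python
def commit_order_stream(records):
    """Group changes by transaction and emit them in commit order.

    records is a list of (kind, xid, payload) tuples where kind is
    'change', 'commit' or 'abort'.
    """
    pending = {}  # xid -> changes collected so far
    sent = []     # (xid, changes) in the order the transactions committed
    for kind, xid, payload in records:
        if kind == "change":
            pending.setdefault(xid, []).append(payload)
        elif kind == "commit":
            sent.append((xid, pending.pop(xid, [])))
        elif kind == "abort":
            pending.pop(xid, None)  # aborted work is never sent
    return sent


if __name__ == "__main__":
    wal = [
        ("change", 1, "a1"), ("change", 2, "b1"), ("change", 1, "a2"),
        ("change", 3, "c1"), ("abort", 3, None),
        ("commit", 2, None), ("commit", 1, None),
    ]
    print(commit_order_stream(wal))
```
&lt;br /&gt;
In this run transaction 2 is sent before transaction 1 because it committed first, and transaction 3 never appears.&lt;br /&gt;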
&lt;br /&gt;
Changes are applied to the downstream master in the sequence in which they were committed on the upstream master. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on the upstream master, additional locking on the downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later.)&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. Future enhancement will be to stream changes to downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;SET&amp;lt;/tt&amp;gt; statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. We always update the correct tables, whatever the setting of &amp;lt;tt&amp;gt;search_path&amp;lt;/tt&amp;gt;. Values are replicated correctly irrespective of the values of &amp;lt;tt&amp;gt;bytea_output&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;TimeZone&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DateStyle&amp;lt;/tt&amp;gt;, etc.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; is not supported across log based replication, either physical or logical. &amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;LISTEN&amp;lt;/tt&amp;gt; will work fine on the upstream master but an upstream &amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; will not trigger a downstream &amp;lt;tt&amp;gt;LISTEN&amp;lt;/tt&amp;gt;er.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes an automatic retry of the apply of the replaying transaction and is only an issue if the deadlock recurs repeatedly, delaying replication.&lt;br /&gt;
&lt;br /&gt;
From a performance and concurrency perspective the BDR apply process is similar to a normal backend. Frequent conflicts with locks from other transactions when replaying changes can slow things down and thus increase replication delay, so reducing the frequency of such conflicts can be a good way to speed things up. Any lock held by another transaction on the downstream master - &amp;lt;tt&amp;gt;LOCK&amp;lt;/tt&amp;gt; statements, &amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/FOR SHARE&amp;lt;/tt&amp;gt;, or &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; row locks - can delay replication if the replication apply process needs to change the locked table/row.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching &amp;lt;tt&amp;gt;&amp;quot;Schemaname&amp;quot;.&amp;quot;Tablename&amp;quot;&amp;lt;/tt&amp;gt; on both upstream and downstream masters, e.g. changes from upstream's &amp;lt;tt&amp;gt;public.mytable&amp;lt;/tt&amp;gt; will go to downstream's &amp;lt;tt&amp;gt;public.mytable&amp;lt;/tt&amp;gt; while changes to the upstream &amp;lt;tt&amp;gt;myschema.mytable&amp;lt;/tt&amp;gt; will go to the downstream &amp;lt;tt&amp;gt;myschema.mytable&amp;lt;/tt&amp;gt;. This works even when no schema is specified in the original SQL, since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful synchronization of table definitions on each node otherwise &amp;lt;tt&amp;gt;ERROR&amp;lt;/tt&amp;gt;s will be generated by the replication apply process. In general, tables must be an exact match between upstream and downstream masters. &lt;br /&gt;
&lt;br /&gt;
There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
Tables must meet the following requirements to be compatible for purposes of LLSR:&lt;br /&gt;
&lt;br /&gt;
* The downstream master must only have constraints (&amp;lt;tt&amp;gt;CHECK&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UNIQUE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;EXCLUSION&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;FOREIGN KEY&amp;lt;/tt&amp;gt;, etc) that are also present on the upstream master. Replication may initially work with mismatched constraints but is likely to fail as soon as the downstream master rejects a row the upstream master accepted.&lt;br /&gt;
* The table referenced by a FOREIGN KEY on a downstream master must have all the keys present in the upstream master version of the same table.&lt;br /&gt;
* Storage parameters must match except as allowed below&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
* Dropped columns on master must be present on replicas&lt;br /&gt;
* Custom types and enum definitions must match exactly&lt;br /&gt;
* Composite types and enums must have the same oids on master and replication target&lt;br /&gt;
* Extensions defining types used in replicated tables must be of the same version or fully SQL-level compatible and the oids of the types they define must match.&lt;br /&gt;
&lt;br /&gt;
The following differences are permissible between tables on different nodes:&lt;br /&gt;
&lt;br /&gt;
* The table's &amp;lt;tt&amp;gt;pg_class&amp;lt;/tt&amp;gt; oid, the oid of its associated TOAST table, and the oid of the table's rowtype in &amp;lt;tt&amp;gt;pg_type&amp;lt;/tt&amp;gt; may differ;&lt;br /&gt;
* Extra or missing non-&amp;lt;tt&amp;gt;UNIQUE&amp;lt;/tt&amp;gt; indexes&lt;br /&gt;
* Extra keys in downstream lookup tables for &amp;lt;tt&amp;gt;FOREIGN KEY&amp;lt;/tt&amp;gt; references that are not present on the upstream master&lt;br /&gt;
* The table-level storage parameters for fillfactor and autovacuum&lt;br /&gt;
* Triggers and rules may differ (they are not executed by replication apply)&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR (see [[#LLSR Limitations|LLSR Limitations]]).&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, similar to an enforced setting of &amp;lt;tt&amp;gt;session_replication_role = replica&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In future it is expected that composite types and enums with non-identical oids will be converted using text output and input functions. This feature is not yet implemented.&lt;br /&gt;
&lt;br /&gt;
=== LLSR limitations ===&lt;br /&gt;
&lt;br /&gt;
The current LLSR implementation is subject to some limitations, which are being progressively removed as work progresses.&lt;br /&gt;
&lt;br /&gt;
==== Data definition compatibility ====&lt;br /&gt;
&lt;br /&gt;
Table definitions, types, extensions, etc must be near identical between upstream and downstream masters. See [[#Table definitions and DDL replication|Table definitions and DDL replication]].&lt;br /&gt;
&lt;br /&gt;
==== DDL Replication ====&lt;br /&gt;
&lt;br /&gt;
DDL replication is not yet supported.&lt;br /&gt;
&lt;br /&gt;
==== Upstream feedback ====&lt;br /&gt;
&lt;br /&gt;
No feedback from downstream masters to the upstream master is implemented for asynchronous LLSR, so upstream masters must be configured to keep enough WAL. See [[#Configuration|Configuration]].&lt;br /&gt;
&lt;br /&gt;
==== TRUNCATE is not replicated ====&lt;br /&gt;
&lt;br /&gt;
TRUNCATE is not yet supported; however, workarounds with user-level triggers are possible, and a ProcessUtility hook is planned to implement a similar approach globally.&lt;br /&gt;
&lt;br /&gt;
The safest option is to define a user-level BEFORE trigger on each table that RAISEs an ERROR when TRUNCATE is attempted.&lt;br /&gt;
&lt;br /&gt;
A simple truncate-blocking trigger is:&lt;br /&gt;
&lt;br /&gt;
 CREATE OR REPLACE FUNCTION deny_truncate() RETURNS trigger AS $$&lt;br /&gt;
 BEGIN&lt;br /&gt;
   IF tg_op = 'TRUNCATE' THEN&lt;br /&gt;
     RAISE EXCEPTION 'TRUNCATE is not supported on this table. Please use DELETE FROM.';&lt;br /&gt;
   ELSE&lt;br /&gt;
     RAISE EXCEPTION 'This trigger only supports TRUNCATE';&lt;br /&gt;
   END IF;&lt;br /&gt;
 END;&lt;br /&gt;
 $$ LANGUAGE plpgsql;&lt;br /&gt;
&lt;br /&gt;
It can be applied to a table with:&lt;br /&gt;
&lt;br /&gt;
 CREATE TRIGGER deny_truncate_on_&amp;lt;tablename&amp;gt; BEFORE TRUNCATE ON &amp;lt;tablename&amp;gt;&lt;br /&gt;
 FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate();&lt;br /&gt;
&lt;br /&gt;
A PL/pgSQL DO block that queries &amp;lt;tt&amp;gt;pg_class&amp;lt;/tt&amp;gt; and loops over it to &amp;lt;tt&amp;gt;EXECUTE&amp;lt;/tt&amp;gt; a dynamic SQL &amp;lt;tt&amp;gt;CREATE TRIGGER&amp;lt;/tt&amp;gt; command for each table that does not already have the trigger can be used to apply the trigger to all tables.&lt;br /&gt;
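&lt;br /&gt;
An alternative to the DO block is to generate the statements outside the database. The Python sketch below builds one &amp;lt;tt&amp;gt;CREATE TRIGGER&amp;lt;/tt&amp;gt; statement per table from a list of (schema, table) pairs supplied by the caller. The helper names are hypothetical, the script does not query &amp;lt;tt&amp;gt;pg_class&amp;lt;/tt&amp;gt; itself, and it does not check whether a trigger already exists.&lt;br /&gt;
&lt;br /&gt;
```python
def quote_ident(name):
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def deny_truncate_ddl(tables):
    """Return one CREATE TRIGGER statement per (schema, table) pair.

    Assumes the deny_truncate() function shown above already exists.
    """
    statements = []
    for schema, table in tables:
        trigger = quote_ident("deny_truncate_on_" + table)
        target = quote_ident(schema) + "." + quote_ident(table)
        statements.append(
            "CREATE TRIGGER %s BEFORE TRUNCATE ON %s "
            "FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate();"
            % (trigger, target)
        )
    return statements


if __name__ == "__main__":
    for stmt in deny_truncate_ddl([("public", "mytable"), ("myschema", "mytable")]):
        print(stmt)
```
&lt;br /&gt;
The output can be reviewed and then run with &amp;lt;tt&amp;gt;psql&amp;lt;/tt&amp;gt;. Very long table names may need a shorter trigger name, because PostgreSQL truncates identifiers to 63 bytes by default.&lt;br /&gt;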
&lt;br /&gt;
=== Initial setup ===&lt;br /&gt;
&lt;br /&gt;
To set up LLSR or BDR you first need a patched PostgreSQL that can support LLSR/BDR, then you need to create one or more LLSR/BDR senders and one or more LLSR/BDR receivers.&lt;br /&gt;
&lt;br /&gt;
==== Installing the patched PostgreSQL binaries ====&lt;br /&gt;
&lt;br /&gt;
Currently BDR is only available in builds of the 'bdr' branch on Andres Freund's git repo on git.postgresql.org. PostgreSQL 9.2 and below do not support BDR, and 9.3 requires patches, so this guide will not work for you if you are trying to use a normal install of PostgreSQL.&lt;br /&gt;
&lt;br /&gt;
First you need to clone, configure, compile and install as normal. Clone the sources from &amp;lt;tt&amp;gt;git://git.postgresql.org/git/users/andresfreund/postgres.git&amp;lt;/tt&amp;gt; and check out the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; branch.&lt;br /&gt;
&lt;br /&gt;
If you have an existing local PostgreSQL git tree specify it as &amp;lt;tt&amp;gt;--reference /path/to/existing/tree&amp;lt;/tt&amp;gt; to greatly speed your git clone.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p $HOME/bdr&lt;br /&gt;
 cd $HOME/bdr&lt;br /&gt;
 git clone git://git.postgresql.org/git/users/andresfreund/postgres.git $HOME/bdr/postgres-bdr-src&lt;br /&gt;
 cd postgres-bdr-src&lt;br /&gt;
 git checkout bdr&lt;br /&gt;
 ./configure --prefix=$HOME/bdr/postgres-bdr-bin&lt;br /&gt;
 make install&lt;br /&gt;
 cd contrib/bdr&lt;br /&gt;
 make install&lt;br /&gt;
&lt;br /&gt;
This will put everything in &amp;lt;tt&amp;gt;$HOME/bdr&amp;lt;/tt&amp;gt;, with the source code and build tree in &amp;lt;tt&amp;gt;$HOME/bdr/postgres-bdr-src&amp;lt;/tt&amp;gt; and the installed PostgreSQL in &amp;lt;tt&amp;gt;$HOME/bdr/postgres-bdr-bin&amp;lt;/tt&amp;gt;. This is a convenient setup for testing and development because it doesn't require you to set up new users, wrangle permissions, run anything as root, etc, but it isn't recommended that you deploy this way in production.&lt;br /&gt;
&lt;br /&gt;
To actually use these new binaries you will need to:&lt;br /&gt;
&lt;br /&gt;
 export PATH=$HOME/bdr/postgres-bdr-bin/bin:$PATH&lt;br /&gt;
&lt;br /&gt;
before running &amp;lt;tt&amp;gt;initdb&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;postgres&amp;lt;/tt&amp;gt;, etc. You don't have to use the &amp;lt;tt&amp;gt;psql&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;libpq&amp;lt;/tt&amp;gt; you compiled but you're likely to get version mismatch warnings if you don't.&lt;br /&gt;
&lt;br /&gt;
=== Parameter Reference ===&lt;br /&gt;
&lt;br /&gt;
The following parameters are new or have been changed in PostgreSQL's new logical streaming replication.&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;shared_preload_libraries = 'bdr'&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
To load support for receiving changes on a downstream master, the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; library must be added to the existing &amp;lt;tt&amp;gt;shared_preload_libraries&amp;lt;/tt&amp;gt; parameter. This loads the bdr library during postmaster start-up and allows it to create the required background worker(s).&lt;br /&gt;
&lt;br /&gt;
Upstream masters don't need to load the bdr library unless they're also operating as a downstream master as is the case in a BDR configuration.&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
A comma-separated list of upstream master connection names is specified in &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt;. These names must be simple alphanumeric strings. They are used to name the connection in error messages, configuration options and logs, but otherwise have no special meaning.&lt;br /&gt;
&lt;br /&gt;
A typical two-upstream-master setting might be:&lt;br /&gt;
&lt;br /&gt;
 bdr.connections = 'upstream1, upstream2'&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;bdr.&amp;amp;lt;connection_name&amp;amp;gt;.dsn&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Each connection name must have at least a data source name specified using the &amp;lt;tt&amp;gt;bdr.&amp;amp;lt;connection_name&amp;amp;gt;.dsn&amp;lt;/tt&amp;gt; parameter. The DSN syntax is the same as that used by libpq so it is not discussed in further detail here. A &amp;lt;tt&amp;gt;dbname&amp;lt;/tt&amp;gt; for the database to connect to must be specified; all other parts of the DSN are optional.&lt;br /&gt;
&lt;br /&gt;
The local (downstream) database name is assumed to be the same as the name of the upstream database being connected to, though future versions will make this configurable.&lt;br /&gt;
&lt;br /&gt;
For the above two-master setting for &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; the DSNs might look like:&lt;br /&gt;
&lt;br /&gt;
 bdr.upstream1.dsn = 'host=10.1.1.2 user=postgres dbname=replicated_db'&lt;br /&gt;
 bdr.upstream2.dsn = 'host=10.1.1.3 user=postgres dbname=replicated_db'&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
The new parameter &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; has been added for use on both upstream and downstream masters. This parameter controls the maximum number of logical replication slots - upstream or downstream - that this cluster may have at a time. It must be set at postmaster start time.&lt;br /&gt;
&lt;br /&gt;
As logical replication slots are persistent, slots are consumed even by replicas that are not currently connected. Slot management is discussed in Starting, Stopping and Managing Replication.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; should be set to the number of logical replication upstream masters this server will have plus the number of logical replication downstream masters that will connect to it.&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;wal_level = 'logical'&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
A new setting, &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt;, has been added for the existing &amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt; parameter. &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt; includes everything that the existing &amp;lt;tt&amp;gt;hot_standby&amp;lt;/tt&amp;gt; setting does and adds to the write-ahead logs the additional details required for logical changeset decoding.&lt;br /&gt;
&lt;br /&gt;
This additional information is consumed by the upstream-master-side xlog decoding worker. Downstream masters that do not also act as upstream masters do not require &amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt; to be increased above the default &amp;lt;tt&amp;gt;'minimal'&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt;, except for the new &amp;lt;tt&amp;gt;'logical'&amp;lt;/tt&amp;gt; setting, is [http://www.postgresql.org/docs/current/static/runtime-config-wal.html documented in the main PostgreSQL manual].&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Logical replication hasn't altered the &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; parameter, but it is important in upstream masters for logical replication and BDR because every logical sender consumes a &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
You should configure &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; to the sum of the number of physical and logical replicas you want to allow an upstream master to serve. If you intend to use &amp;lt;tt&amp;gt;pg_basebackup&amp;lt;/tt&amp;gt; you should add at least two more senders to allow for its use.&lt;br /&gt;
&lt;br /&gt;
Like &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; entries don't cost a large amount of memory, so you can overestimate fairly safely.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt; is documented in [http://www.postgresql.org/docs/current/static/runtime-config-replication.html the main PostgreSQL documentation].&lt;br /&gt;
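&lt;br /&gt;
As a rough check of how much headroom remains, the number of connected senders can be compared with the configured limit. This sketch assumes that logical senders appear in &amp;lt;tt&amp;gt;pg_stat_replication&amp;lt;/tt&amp;gt; alongside physical ones on the patched server:&lt;br /&gt;
&lt;br /&gt;
 -- Run on the upstream master&lt;br /&gt;
 SELECT count(*) AS senders_in_use,&lt;br /&gt;
        current_setting('max_wal_senders')::int AS max_wal_senders&lt;br /&gt;
 FROM pg_stat_replication;&lt;br /&gt;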
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Like &amp;lt;tt&amp;gt;max_wal_senders&amp;lt;/tt&amp;gt;, the &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; parameter isn't directly changed by logical replication but is still important for upstream masters. It is not required on downstream-only masters.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; should be set to a value that allows for some downtime or unreachable periods for downstream masters and for heavy bursts of write activity on the upstream master. &lt;br /&gt;
&lt;br /&gt;
Keep in mind that enough disk space must be available for the WAL segments, each of which is 16MB. If you run out of disk space the server will halt until disk space is freed and it may be quite difficult to free space when you can no longer start the server.&lt;br /&gt;
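&lt;br /&gt;
The disk space implied by the current setting can be estimated with a query like the one below. It is only a sketch and assumes the default 16MB segment size mentioned above:&lt;br /&gt;
&lt;br /&gt;
 -- Approximate space that wal_keep_segments may pin in pg_xlog&lt;br /&gt;
 SELECT current_setting('wal_keep_segments') AS wal_keep_segments,&lt;br /&gt;
        pg_size_pretty(current_setting('wal_keep_segments')::bigint * 16 * 1024 * 1024) AS max_retained_wal;&lt;br /&gt;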
&lt;br /&gt;
If a replica falls further behind than &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; allows, an &amp;quot;Insufficient WAL segments retained&amp;quot; error will be reported. See [[#Troubleshooting|Troubleshooting]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; is documented in [http://www.postgresql.org/docs/current/static/runtime-config-replication.html the main PostgreSQL manual].&lt;br /&gt;
&lt;br /&gt;
==== &amp;lt;tt&amp;gt;track_commit_timestamp&amp;lt;/tt&amp;gt; ====&lt;br /&gt;
&lt;br /&gt;
Setting this parameter to &amp;quot;on&amp;quot; enables commit timestamp tracking, which is used to implement last-UPDATE-wins conflict resolution.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Details on individual parameters are described in the [[#Parameter Reference|Parameter Reference]] section.&lt;br /&gt;
&lt;br /&gt;
The following configuration is an example of a simple one-way LLSR replication setup - a single upstream master to a single downstream master.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; of the upstream master (sender) should contain settings like:&lt;br /&gt;
&lt;br /&gt;
  wal_level = 'logical'       # Include enough info for logical replication&lt;br /&gt;
  max_logical_slots = X       # Number of LLSR senders + any receivers&lt;br /&gt;
  max_wal_senders = Y         # Y = max_logical_slots plus any physical &lt;br /&gt;
                              # streaming requirements&lt;br /&gt;
  wal_keep_segments = 5000    # Master must retain enough WAL segments to let &lt;br /&gt;
                              # replicas catch up. Correct value depends on&lt;br /&gt;
                              # rate of writes on master, max replica downtime&lt;br /&gt;
                              # allowable. 5000 segments requires 78GB&lt;br /&gt;
                              # in pg_xlog&lt;br /&gt;
&lt;br /&gt;
Downstream (receiver) &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
  shared_preload_libraries = 'bdr'&lt;br /&gt;
  &lt;br /&gt;
  bdr.connections='name_of_upstream_master'      # list of upstream master nodenames&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection&lt;br /&gt;
                                                 # from downstream to upstream master&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case &lt;br /&gt;
                                                 # where the databasename on upstream &lt;br /&gt;
                                                 # and downstream master differ. &lt;br /&gt;
                                                 # (Not yet implemented)&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.apply_delay                     # optional parameter to delay apply of&lt;br /&gt;
                                                 # transactions, time in milliseconds &lt;br /&gt;
  bdr.synchronous_commit = ...                   # optional parameter to set the&lt;br /&gt;
                                                 # synchronous_commit parameter the&lt;br /&gt;
                                                 # apply processes will be using&lt;br /&gt;
  max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
Note that a server can be both sender and receiver. Two servers can replicate to each other, and more complex configurations such as replication chains or trees are also possible.&lt;br /&gt;
&lt;br /&gt;
The upstream (sender) &amp;lt;tt&amp;gt;pg_hba.conf&amp;lt;/tt&amp;gt; must be configured to allow the downstream master to connect for replication. Otherwise you'll see errors like the following on the downstream master:&lt;br /&gt;
&lt;br /&gt;
 FATAL:  could not connect to the primary server: FATAL:  no pg_hba.conf entry for replication connection from host &amp;quot;[local]&amp;quot;, user &amp;quot;postgres&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A suitable &amp;lt;tt&amp;gt;pg_hba.conf&amp;lt;/tt&amp;gt; entry for a replication connection from the replica server 10.1.4.8 might be:&lt;br /&gt;
&lt;br /&gt;
  host    replication     postgres        10.1.4.8/32            trust&lt;br /&gt;
&lt;br /&gt;
(The user name should match the user name configured in the downstream master's DSN. md5 password authentication is also supported.)&lt;br /&gt;
&lt;br /&gt;
For more details on these parameters, see [[#Parameter Reference|Parameter Reference]].&lt;br /&gt;
&lt;br /&gt;
=== Troubleshooting ===&lt;br /&gt;
&lt;br /&gt;
==== Could not access file &amp;quot;bdr&amp;quot;: No such file or directory ====&lt;br /&gt;
&lt;br /&gt;
If you see the error:&lt;br /&gt;
&lt;br /&gt;
 FATAL:  could not access file &amp;quot;bdr&amp;quot;: No such file or directory&lt;br /&gt;
&lt;br /&gt;
when starting a database set up to receive BDR replication, you probably forgot to install &amp;lt;tt&amp;gt;contrib/bdr&amp;lt;/tt&amp;gt;. See [[#Installing the patched PostgreSQL binaries|Installing the patched PostgreSQL binaries]].&lt;br /&gt;
&lt;br /&gt;
==== Invalid value for parameter ====&lt;br /&gt;
&lt;br /&gt;
An error like:&lt;br /&gt;
&lt;br /&gt;
 LOG:  invalid value for parameter ...&lt;br /&gt;
&lt;br /&gt;
when setting one of these parameters means your server doesn't support logical replication and will need to be patched or updated.&lt;br /&gt;
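&lt;br /&gt;
One way to check what a running server supports is to look at &amp;lt;tt&amp;gt;pg_settings&amp;lt;/tt&amp;gt;. This is a sketch rather than a definitive test: on a server with logical replication support, &amp;lt;tt&amp;gt;enumvals&amp;lt;/tt&amp;gt; for &amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt; should include &amp;lt;tt&amp;gt;logical&amp;lt;/tt&amp;gt;, and the other parameters should be listed at all:&lt;br /&gt;
&lt;br /&gt;
 -- Parameters missing from the result are unknown to this server&lt;br /&gt;
 SELECT name, setting, enumvals&lt;br /&gt;
 FROM pg_settings&lt;br /&gt;
 WHERE name IN ('wal_level', 'max_logical_slots', 'track_commit_timestamp')&lt;br /&gt;
 ORDER BY name;&lt;br /&gt;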
&lt;br /&gt;
==== Insufficient WAL segments retained (&amp;quot;requested WAL segment ... has already been removed&amp;quot;) ====&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; is insufficient to meet the requirements of a replica that has fallen far behind, the master will report errors like:&lt;br /&gt;
&lt;br /&gt;
 ERROR:  requested WAL segment 00000001000000010000002D has already been removed&lt;br /&gt;
&lt;br /&gt;
Currently the replica errors look like:&lt;br /&gt;
&lt;br /&gt;
 WARNING:  Starting logical replication&lt;br /&gt;
 LOG:  data stream ended&lt;br /&gt;
 LOG:  worker process: master (PID 23812) exited with exit code 0&lt;br /&gt;
 LOG:  starting background worker process &amp;quot;master&amp;quot;&lt;br /&gt;
 LOG:  master initialized on master, remote dbname=master port=5434 replication=true fallback_application_name=bdr&lt;br /&gt;
 LOG:  local sysid 5873181566046043070, remote: 5873181102189050714&lt;br /&gt;
 LOG:  found valid replication identifier 1&lt;br /&gt;
 LOG:  starting up replication at 1 from 1/2D9CA220&lt;br /&gt;
&lt;br /&gt;
but a more explicit error message for this condition is planned.&lt;br /&gt;
&lt;br /&gt;
The only way to recover from this fault is to re-seed the replica database.&lt;br /&gt;
&lt;br /&gt;
This fault could be prevented with feedback from the replica to the master, but this feature is not planned for the first release of BDR. Another alternative considered for future releases is making wal_keep_segments a dynamic parameter that is sized on demand.&lt;br /&gt;
&lt;br /&gt;
Monitoring of maximum replica lag and appropriate adjustment of wal_keep_segments will prevent this fault from arising.&lt;br /&gt;
&lt;br /&gt;
==== Couldn't find logical slot ====&lt;br /&gt;
&lt;br /&gt;
An error like:&lt;br /&gt;
&lt;br /&gt;
 ERROR:  couldn't find logical slot &amp;quot;bdr: 16384:5873181566046043070-1-24596:&amp;quot;&lt;br /&gt;
&lt;br /&gt;
on the upstream master suggests that a downstream master is trying to connect to a logical replication slot that no longer exists. The slot cannot be re-created, so it is necessary to re-seed the downstream replica database.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level (i.e. SQL-visible) ERRORs that have special meaning. Any ERRORs generated are likely to indicate serious problems of some kind, apart from apply deadlocks, which are automatically retried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
The following views are available for monitoring replication activity:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE pg_stat_replication]&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; (described below)&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_bdr&amp;lt;/tt&amp;gt; (described below)&lt;br /&gt;
&lt;br /&gt;
The following configuration and logging parameters are useful for monitoring replication:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS log_lock_waits]&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== pg_stat_logical_replication ====&lt;br /&gt;
&lt;br /&gt;
The new &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; view is specific to logical replication. It is based on the underlying &amp;lt;tt&amp;gt;pg_stat_get_logical_replication_slots&amp;lt;/tt&amp;gt; function and has the following structure:&lt;br /&gt;
&lt;br /&gt;
  View &amp;quot;pg_catalog.pg_stat_logical_replication&amp;quot;&lt;br /&gt;
           Column          |  Type   | Modifiers &lt;br /&gt;
 --------------------------+---------+-----------&lt;br /&gt;
  slot_name                | text    | &lt;br /&gt;
  plugin                   | text    | &lt;br /&gt;
  database                 | oid     | &lt;br /&gt;
  active                   | boolean | &lt;br /&gt;
  xmin                     | xid     | &lt;br /&gt;
  last_required_checkpoint | text    | &lt;br /&gt;
&lt;br /&gt;
It contains one row for every connection from a downstream master to the server being queried (the upstream master). On a standalone PostgreSQL server or a downstream-only master this view will contain no rows.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;slot_name&amp;lt;/tt&amp;gt;: An internal name for a given logical replication slot (a connection from a downstream master to this upstream master). This slot name is used by the downstream master to uniquely identify itself and is used with the &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; command when managing logical replication slots. The slot name is composed of the decoding plugin name, the upstream database oid, the downstream system identifier (from &amp;lt;tt&amp;gt;pg_control&amp;lt;/tt&amp;gt;), the downstream slot number, and the downstream database oid.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;plugin&amp;lt;/tt&amp;gt;: The logical replication plugin being used to decode WAL archives. You'll generally only see &amp;lt;tt&amp;gt;bdr_output&amp;lt;/tt&amp;gt; here.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;database&amp;lt;/tt&amp;gt;: The oid of the database being replicated by this slot. You can get the database name by joining on &amp;lt;tt&amp;gt;pg_database.oid&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt;: Whether this slot currently has an active connection.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt;: The lowest transaction ID this replication slot can &amp;quot;see&amp;quot;, like the xmin of a transaction or prepared transaction. xmin should keep on advancing as replication continues.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;last_required_checkpoint&amp;lt;/tt&amp;gt;: The checkpoint identifying the oldest WAL record required to bring this slot up to date with the upstream master. (This column is likely to be removed in a future version).&lt;br /&gt;
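&lt;br /&gt;
For example, the join on &amp;lt;tt&amp;gt;pg_database.oid&amp;lt;/tt&amp;gt; mentioned above can be written as follows. This is an illustrative query that uses only the columns listed for the view:&lt;br /&gt;
&lt;br /&gt;
 -- Run on the upstream master: one row per logical replication slot&lt;br /&gt;
 SELECT s.slot_name, d.datname, s.plugin, s.active, s.xmin&lt;br /&gt;
 FROM pg_stat_logical_replication s&lt;br /&gt;
 JOIN pg_database d ON d.oid = s.database&lt;br /&gt;
 ORDER BY s.slot_name;&lt;br /&gt;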
&lt;br /&gt;
==== pg_stat_bdr ====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;pg_stat_bdr&amp;lt;/tt&amp;gt; view is supplied by the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; extension. It provides information on a server's connection(s) to its upstream master(s). It is not present on upstream-only masters.&lt;br /&gt;
&lt;br /&gt;
The primary purpose of this view is to report statistics on the progress of LLSR apply on a per-upstream master connection basis.&lt;br /&gt;
&lt;br /&gt;
View structure:&lt;br /&gt;
&lt;br /&gt;
         View &amp;quot;public.pg_stat_bdr&amp;quot;&lt;br /&gt;
        Column       |  Type  | Modifiers &lt;br /&gt;
 --------------------+--------+-----------&lt;br /&gt;
  rep_node_id        | oid    | &lt;br /&gt;
  riremotesysid      | name   | &lt;br /&gt;
  riremotedb         | oid    | &lt;br /&gt;
  rilocaldb          | oid    | &lt;br /&gt;
  nr_commit          | bigint | &lt;br /&gt;
  nr_rollback        | bigint | &lt;br /&gt;
  nr_insert          | bigint | &lt;br /&gt;
  nr_insert_conflict | bigint | &lt;br /&gt;
  nr_update          | bigint | &lt;br /&gt;
  nr_update_conflict | bigint | &lt;br /&gt;
  nr_delete          | bigint | &lt;br /&gt;
  nr_delete_conflict | bigint | &lt;br /&gt;
  nr_disconnect      | bigint | &lt;br /&gt;
&lt;br /&gt;
Fields:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;rep_node_id&amp;lt;/tt&amp;gt;: An internal identifier for the replication slot.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;riremotesysid&amp;lt;/tt&amp;gt;: The remote database system identifier, as reported by the &amp;lt;tt&amp;gt;Database system identifier&amp;lt;/tt&amp;gt; line of &amp;lt;tt&amp;gt;pg_controldata /path/to/datadir&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt;: The remote database OID, i.e. the &amp;lt;tt&amp;gt;oid&amp;lt;/tt&amp;gt; column of the remote server's &amp;lt;tt&amp;gt;pg_catalog.pg_database&amp;lt;/tt&amp;gt; entry for the replicated database. You can get the database name by running &amp;lt;tt&amp;gt;select datname from pg_database where oid = 12345&amp;lt;/tt&amp;gt; on the remote server (where '12345' is the &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt; oid).&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;rilocaldb&amp;lt;/tt&amp;gt;: The local database OID, with the same meaning as &amp;lt;tt&amp;gt;riremotedb&amp;lt;/tt&amp;gt; but with oids from the local system.&lt;br /&gt;
&lt;br /&gt;
''The rest of the columns are statistics about this upstream master slot'':&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_commit&amp;lt;/tt&amp;gt;: Number of commits applied to date from this master&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_rollback&amp;lt;/tt&amp;gt;: Number of rollbacks performed by this apply process due to recoverable errors (deadlock retries, lost races, etc) or unrecoverable errors like mismatched constraint errors.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_insert&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_insert_conflict&amp;lt;/tt&amp;gt;:  Number of &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_update&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_update_conflict&amp;lt;/tt&amp;gt;: Number of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_delete&amp;lt;/tt&amp;gt;: Number of deletes performed&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_delete_conflict&amp;lt;/tt&amp;gt;: Number of deletes that resulted in conflicts.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;nr_disconnect&amp;lt;/tt&amp;gt;: Number of times this apply process has lost its connection to the upstream master since it was started.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This view does not contain any information about how far behind the upstream master this downstream master is. The upstream master's &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;pg_stat_replication&amp;lt;/tt&amp;gt; views must be queried to determine replication lag.&lt;br /&gt;
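&lt;br /&gt;
A query along these lines can summarise the counters described above for each upstream connection. It is a sketch that assumes the view structure shown earlier; the &amp;lt;tt&amp;gt;total_conflicts&amp;lt;/tt&amp;gt; column is simply the sum of the three conflict counters:&lt;br /&gt;
&lt;br /&gt;
 -- Run on the downstream master&lt;br /&gt;
 SELECT b.rep_node_id,&lt;br /&gt;
        b.riremotesysid,&lt;br /&gt;
        d.datname AS local_database,&lt;br /&gt;
        b.nr_commit,&lt;br /&gt;
        b.nr_rollback,&lt;br /&gt;
        b.nr_insert_conflict + b.nr_update_conflict + b.nr_delete_conflict AS total_conflicts,&lt;br /&gt;
        b.nr_disconnect&lt;br /&gt;
 FROM pg_stat_bdr b&lt;br /&gt;
 JOIN pg_database d ON d.oid = b.rilocaldb;&lt;br /&gt;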
&lt;br /&gt;
==== Monitoring replication status and lag ====&lt;br /&gt;
&lt;br /&gt;
As with any replication setup, it is vital to monitor replication status on all BDR nodes to ensure no node is lagging severely behind the others or is stuck.&lt;br /&gt;
&lt;br /&gt;
In the case of BDR a stuck or crashed node will eventually cause disk space and table bloat problems on other masters so stuck nodes should be detected and removed or repaired in a reasonably timely manner. Exactly how urgent this is depends on the workload of the BDR group.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; view described above may be used to verify that a downstream master is connected to its upstream master - the &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt; boolean column is &amp;lt;tt&amp;gt;t&amp;lt;/tt&amp;gt; if there's a downstream master connected.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt; column provides an indication of whether replication is advancing; it should increase as replication progresses. There is no simple way to turn &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt; into the time the last applied transaction was committed on the master, so it doesn't provide an indication of wall-clock lag.&lt;br /&gt;
&lt;br /&gt;
To determine wall-clock replication lag an application-level ticker may be used to periodically update a timestamp in a replicated table. The difference between this timestamp on the upstream and downstream masters provides the wall-clock replication lag. For BDR one row may be added to the table for each BDR master, giving an indication of how much lag each master has relative to each other master.&lt;br /&gt;
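&lt;br /&gt;
A minimal sketch of such a ticker is shown below. The table name &amp;lt;tt&amp;gt;replication_ticker&amp;lt;/tt&amp;gt;, its columns and the node name 'alpha' are hypothetical, and the table definition is assumed to have been created identically on every node:&lt;br /&gt;
&lt;br /&gt;
 -- Create on every node; insert the row on the node named 'alpha' only&lt;br /&gt;
 CREATE TABLE replication_ticker (&lt;br /&gt;
   node_name text PRIMARY KEY,&lt;br /&gt;
   ticked_at timestamptz NOT NULL&lt;br /&gt;
 );&lt;br /&gt;
 INSERT INTO replication_ticker VALUES ('alpha', now());&lt;br /&gt;
 &lt;br /&gt;
 -- Run periodically (for example from cron) on 'alpha'&lt;br /&gt;
 UPDATE replication_ticker SET ticked_at = now() WHERE node_name = 'alpha';&lt;br /&gt;
 &lt;br /&gt;
 -- Run on any other node to estimate its lag behind 'alpha'&lt;br /&gt;
 SELECT node_name, now() - ticked_at AS approximate_lag&lt;br /&gt;
 FROM replication_ticker&lt;br /&gt;
 WHERE node_name = 'alpha';&lt;br /&gt;
&lt;br /&gt;
The reported lag includes up to one ticker interval plus any clock skew between the servers, so it is best treated as an approximation.&lt;br /&gt;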
&lt;br /&gt;
=== Table and index usage statistics ===&lt;br /&gt;
&lt;br /&gt;
Statistics on table and index usage are updated normally by the downstream master. This is essential for correct function of auto-vacuum. If there are no local writes on the downstream master and stats have not been reset these two views should show matching results between upstream and downstream:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_user_tables&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_statio_user_tables&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, identifying indexes on the downstream side may appear more heavily used than non-identifying indexes under workloads that perform &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s.&lt;br /&gt;
&lt;br /&gt;
The built-in index monitoring views are:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_user_indexes&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_statio_user_indexes&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All these views are discussed in [http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE the PostgreSQL documentation on the statistics views].&lt;br /&gt;
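&lt;br /&gt;
As an illustration, the most heavily scanned indexes on a node can be listed with a query like this one, using the standard columns of &amp;lt;tt&amp;gt;pg_stat_user_indexes&amp;lt;/tt&amp;gt;. The limit of 20 rows is arbitrary:&lt;br /&gt;
&lt;br /&gt;
 -- Compare the output between upstream and downstream masters&lt;br /&gt;
 SELECT relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch&lt;br /&gt;
 FROM pg_stat_user_indexes&lt;br /&gt;
 ORDER BY idx_scan DESC&lt;br /&gt;
 LIMIT 20;&lt;br /&gt;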
&lt;br /&gt;
=== Starting, stopping and managing replication ===&lt;br /&gt;
&lt;br /&gt;
Replication is managed with the &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; settings described in &amp;quot;Parameter Reference&amp;quot; and &amp;quot;Configuration&amp;quot; above, and using the &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; utility command.&lt;br /&gt;
&lt;br /&gt;
==== Starting a new LLSR connection ====&lt;br /&gt;
&lt;br /&gt;
Logical replication is started automatically when a database is configured as a downstream master in &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; (see [[#Configuration|Configuration]]) and the postmaster is started. No explicit action is required to start replication, but replication will not actually work unless the upstream and downstream databases are identical within the requirements set by LLSR in the [[#Table definitions and DDL replication|Table definitions and DDL replication]] section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;pg_dump&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;pg_restore&amp;lt;/tt&amp;gt; may be used to set up the new replica's database.&lt;br /&gt;
&lt;br /&gt;
==== Viewing logical replication slots ====&lt;br /&gt;
&lt;br /&gt;
Examining the state of logical replication is discussed in [[#Monitoring|Monitoring]].&lt;br /&gt;
&lt;br /&gt;
==== Temporarily stopping an LLSR replica ====&lt;br /&gt;
&lt;br /&gt;
LLSR replicas can be temporarily stopped by shutting down the downstream master's postmaster.&lt;br /&gt;
&lt;br /&gt;
If the replica is not started back up before the upstream master discards the oldest WAL segment required for the downstream master to resume replay, as identified by the &amp;lt;tt&amp;gt;last_required_checkpoint&amp;lt;/tt&amp;gt; column of &amp;lt;tt&amp;gt;pg_catalog.pg_stat_logical_replication&amp;lt;/tt&amp;gt;, then the replica will not resume replay. The error [[#Insufficient_WAL_segments_retained_.28.22requested_WAL_segment_..._has_already_been_removed.22.29|Insufficient WAL segments retained]] will be reported in the upstream master's logs. The replica must be re-created for replication to continue.&lt;br /&gt;
&lt;br /&gt;
==== Removing an LLSR replica permanently ====&lt;br /&gt;
&lt;br /&gt;
To remove a replication connection permanently, remove its entries from the downstream master's &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;, restart the downstream master, then use &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; to remove the replication slot on the upstream master.&lt;br /&gt;
&lt;br /&gt;
It is important to remove the replication slot from the upstream master(s) to prevent xid wrap-around problems and issues with table bloat caused by delayed vacuum.&lt;br /&gt;
&lt;br /&gt;
==== Cleaning up abandoned replication slots ====&lt;br /&gt;
&lt;br /&gt;
To remove a replication slot that was used for a now-defunct replica, find its slot name from the &amp;lt;tt&amp;gt;[[#pg_stat_logical_replication|pg_stat_logical_replication]]&amp;lt;/tt&amp;gt; view on the upstream master then run:&lt;br /&gt;
&lt;br /&gt;
 pg_receivellog -p 5434 -h master-hostname -d dbname \&lt;br /&gt;
    --slot='bdr: 16384:5873181566046043070-1-16384:' --stop&lt;br /&gt;
&lt;br /&gt;
where the argument to '--slot' is the slot name you found from the view.&lt;br /&gt;
&lt;br /&gt;
You may need to do this if you've created and then deleted several replicas so &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; has filled up with entries for replicas that no longer exist.&lt;br /&gt;
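&lt;br /&gt;
Candidate slots can be found with a query like the following sketch, which lists slots that currently have no active connection. An inactive slot is not necessarily abandoned, since it may belong to a replica that is only temporarily stopped, so confirm the replica is really gone before removing the slot:&lt;br /&gt;
&lt;br /&gt;
 -- Run on the upstream master&lt;br /&gt;
 SELECT s.slot_name, d.datname, s.xmin&lt;br /&gt;
 FROM pg_stat_logical_replication s&lt;br /&gt;
 JOIN pg_database d ON d.oid = s.database&lt;br /&gt;
 WHERE NOT s.active;&lt;br /&gt;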
&lt;br /&gt;
== Bi-Directional Replication ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional replication is built directly on LLSR by configuring two or more servers as both upstream ''and'' downstream masters of each other.&lt;br /&gt;
&lt;br /&gt;
All of the Log Level Streaming Replication documentation applies to BDR and should be read before moving on to reading about and configuring BDR.&lt;br /&gt;
&lt;br /&gt;
=== Bi-Directional Replication Use Cases ===&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
==== Simple multi-master pair ====&lt;br /&gt;
&lt;br /&gt;
A simple multi-master &amp;quot;HA Cluster&amp;quot; with two servers:&lt;br /&gt;
&lt;br /&gt;
* Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
* Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Alpha:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections = 'bravo'&lt;br /&gt;
 bdr.bravo.dsn = 'dbname=dbtoreplicate'&lt;br /&gt;
 track_commit_timestamp = on&lt;br /&gt;
&lt;br /&gt;
Bravo:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections = 'alpha'&lt;br /&gt;
 bdr.alpha.dsn = 'dbname=dbtoreplicate'&lt;br /&gt;
 track_commit_timestamp = on&lt;br /&gt;
&lt;br /&gt;
See [[#Configuration|Configuration]] for an explanation of these parameters.&lt;br /&gt;
&lt;br /&gt;
==== HA and Logical Standby ====&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;HA Cluster&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
* Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
* Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
* Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
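&lt;br /&gt;
A minimal configuration sketch for this layout, reusing the parameters from the multi-master pair above. It assumes that Charlie, as the downstream master, is the node that lists the connection to Alpha. The host name, standby name and database name are placeholders, and Bravo's physical standby setup is omitted. Other settings from the pair example (such as &amp;lt;tt&amp;gt;shared_preload_libraries&amp;lt;/tt&amp;gt; on Alpha or &amp;lt;tt&amp;gt;track_commit_timestamp&amp;lt;/tt&amp;gt;) may also be required.&lt;br /&gt;
&lt;br /&gt;
Alpha:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 # optional: wait for Bravo before confirming commits (placeholder standby name)&lt;br /&gt;
 synchronous_standby_names = 'bravo'&lt;br /&gt;
&lt;br /&gt;
Charlie:&lt;br /&gt;
&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections = 'alpha'&lt;br /&gt;
 # placeholder host and database name&lt;br /&gt;
 bdr.alpha.dsn = 'host=alpha.example.com dbname=dbtoreplicate'&lt;br /&gt;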
&lt;br /&gt;
==== Very High Availability Multi-Master ====&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth use between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site simple Multi-Master Plex ====&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes.)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
If you wanted to test this configuration locally you could run three PostgreSQL instances on different ports. Such a configuration would look like the following if the port numbers were used as node names for the sake of notational clarity:&lt;br /&gt;
&lt;br /&gt;
Config for node_5440:&lt;br /&gt;
&lt;br /&gt;
 port = 5440&lt;br /&gt;
 bdr.connections='node_5441,node_5442'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5441:&lt;br /&gt;
&lt;br /&gt;
 port = 5441&lt;br /&gt;
 bdr.connections='node_5440,node_5442'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5442:&lt;br /&gt;
&lt;br /&gt;
 port = 5442&lt;br /&gt;
 bdr.connections='node_5440,node_5441'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
In a typical real-world configuration each server would be on the same port on a different host instead.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site simple Multi-Master Circular Replication ====&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &amp;quot;circular replication&amp;quot;. It is easier to set up but results in higher latency for changes as the number of nodes increases. It's also less resilient to network disruptions and node faults.&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
TODO: Regrettably this doesn't actually work yet because we don't cascade logical changes (yet).&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity:&lt;br /&gt;
&lt;br /&gt;
Config for node_5440:&lt;br /&gt;
&lt;br /&gt;
 port = 5440&lt;br /&gt;
 bdr.connections='node_5441'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5441:&lt;br /&gt;
&lt;br /&gt;
 port = 5441&lt;br /&gt;
 bdr.connections='node_5442'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5442:&lt;br /&gt;
&lt;br /&gt;
 port = 5442&lt;br /&gt;
 bdr.connections='node_5440'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
This would usually be done in the real world with databases on different hosts, all running on the same port.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site Max Availability Multi-Master Plex ====&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
Config left as an exercise for the reader.&lt;br /&gt;
&lt;br /&gt;
==== N-site symmetric cluster replication ====&lt;br /&gt;
&lt;br /&gt;
A symmetric cluster is one where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
With N masters, each master requires N-1 connections to other masters, so the practical limit is fewer than 100 servers, or lower still if you have many separate databases, because each replicated database needs its own connections.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
==== Complex/Asymmetric Replication ====&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
=== Conflict Avoidance ===&lt;br /&gt;
&lt;br /&gt;
==== Distributed Locking ====&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications as very low latency is critical for acceptable performance.&lt;br /&gt;
&lt;br /&gt;
Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible but allow some types of conflict to occur and resolve them when they arise.&lt;br /&gt;
&lt;br /&gt;
==== Global Sequences ====&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using &amp;lt;tt&amp;gt;DEFAULT nextval('mysequence')&amp;lt;/tt&amp;gt;, as with PostgreSQL's &amp;lt;tt&amp;gt;SERIAL&amp;lt;/tt&amp;gt; pseudo-type.&lt;br /&gt;
&lt;br /&gt;
BDR requires sequences to work together across multiple nodes. This is implemented as a new &amp;lt;tt&amp;gt;SequenceAccessMethod&amp;lt;/tt&amp;gt; API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
=== Conflict Detection &amp;amp; Resolution ===&lt;br /&gt;
&lt;br /&gt;
Because local writes can occur on a master, conflict detection and avoidance is a concern for basic LLSR setups as well as full BDR configurations.&lt;br /&gt;
&lt;br /&gt;
==== Lock Conflicts ====&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity can prevent those changes from being applied immediately because of lock waits. Use the &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS log_lock_waits]&amp;lt;/tt&amp;gt; facility to look for issues with apply blocking on locks.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes:&lt;br /&gt;
&lt;br /&gt;
* explicit row level locking (&amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/FOR SHARE&amp;lt;/tt&amp;gt;)&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s, &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s or &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s, either from local activity or apply from other servers&lt;br /&gt;
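&lt;br /&gt;
A minimal sketch of the logging settings suggested above for spotting an apply process that blocks on locks. Both are standard PostgreSQL parameters; the timeout value shown is the PostgreSQL default and is only illustrative.&lt;br /&gt;
&lt;br /&gt;
 # log a message whenever a session waits longer than deadlock_timeout for a lock&lt;br /&gt;
 log_lock_waits = on&lt;br /&gt;
 deadlock_timeout = 1s&lt;br /&gt;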
&lt;br /&gt;
==== Data Conflicts ====&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in a consistent and idempotent manner so that all servers end up with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved with a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from &amp;lt;tt&amp;gt;pg_control&amp;lt;/tt&amp;gt;, though this may change in a future release.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur. The errors causing the conflict can be seen in the error log of the downstream master with the problem.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; conflicts. These are accepted as successful operations but in the case of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; the data in the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; is discarded.&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, the other being discarded according to last-update-wins. It is not practical to decide at the database level when a row should be merged and when a last-update-wins strategy should be used; such decision making would require support for application-specific conflict resolution plugins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
==== Examples ====&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting</id>
		<title>PgCon 2013 Developer Meeting</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting"/>
				<updated>2013-05-13T19:16:24Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Proposed Agenda Items */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A meeting of the most active PostgreSQL developers is being planned for Wednesday 22nd May, 2013 near the University of Ottawa, prior to pgCon 2013. In order to keep the numbers manageable, this meeting is '''by invitation only'''. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org). &lt;br /&gt;
&lt;br /&gt;
Please note that this year the attendee numbers have been kept low in order to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.3 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.&lt;br /&gt;
&lt;br /&gt;
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.&lt;br /&gt;
 &lt;br /&gt;
== Time &amp;amp; Location ==&lt;br /&gt;
&lt;br /&gt;
The meeting will be from 8:30AM to 5PM, and will be in the &amp;quot;Red Experience&amp;quot; room at:&lt;br /&gt;
&lt;br /&gt;
 Novotel Ottawa&lt;br /&gt;
 33 Nicholas Street&lt;br /&gt;
 Ottawa&lt;br /&gt;
 Ontario&lt;br /&gt;
 K1N 9M7&lt;br /&gt;
 &lt;br /&gt;
Food and drink will be provided throughout the day, including breakfast from 8AM.&lt;br /&gt;
&lt;br /&gt;
[http://maps.google.ca/maps?f=q&amp;amp;source=s_q&amp;amp;hl=en&amp;amp;geocode=&amp;amp;q=novotel+ottawa&amp;amp;aq=&amp;amp;sll=49.891235,-97.15369&amp;amp;sspn=36.237851,79.013672&amp;amp;ie=UTF8&amp;amp;hq=novotel+ottawa&amp;amp;hnear=&amp;amp;ll=45.421528,-75.683699&amp;amp;spn=0.036869,0.077162&amp;amp;z=14&amp;amp;iwloc=A&amp;amp;layer=c&amp;amp;cbll=45.425741,-75.689638&amp;amp;panoid=Z4FUGnkZkdHAOkIxyjjS9Q&amp;amp;cbp=12,25.83,,0,-0.6 View on Google Maps]&lt;br /&gt;
&lt;br /&gt;
== Attendees ==&lt;br /&gt;
&lt;br /&gt;
The following people have RSVPed to the meeting (in alphabetical order, by surname):&lt;br /&gt;
&lt;br /&gt;
* Josh Berkus (secretary)&lt;br /&gt;
* Jeff Davis&lt;br /&gt;
* Andrew Dunstan&lt;br /&gt;
* Peter Eisentraut&lt;br /&gt;
* Dimitri Fontaine&lt;br /&gt;
* Andres Freund&lt;br /&gt;
* Stephen Frost&lt;br /&gt;
* Peter Geoghegan&lt;br /&gt;
* Kevin Grittner&lt;br /&gt;
* Robert Haas&lt;br /&gt;
* Magnus Hagander&lt;br /&gt;
* KaiGai Kohei&lt;br /&gt;
* Alexander Korotkov&lt;br /&gt;
* Tom Lane&lt;br /&gt;
* Fujii Masao&lt;br /&gt;
* Noah Misch&lt;br /&gt;
* Bruce Momjian&lt;br /&gt;
* Dave Page (chair)&lt;br /&gt;
* Simon Riggs&lt;br /&gt;
&lt;br /&gt;
== Proposed Agenda Items ==&lt;br /&gt;
&lt;br /&gt;
Please list proposed agenda items here:&lt;br /&gt;
&lt;br /&gt;
* 9.4 Commitfest schedule&lt;br /&gt;
* [http://wiki.postgresql.org/wiki/Parallel_Query_Execution Parallel Query Execution] (Bruce, Noah)&lt;br /&gt;
* logical changeset generation review &amp;amp; integration (Andres)&lt;br /&gt;
* utilization of upcoming non-volatile RAM device (Kaigai)&lt;br /&gt;
* pluggable plan/exec nodes (Kaigai)&lt;br /&gt;
** to offload targetlist calculation, sorting, aggregates, ...&lt;br /&gt;
* [[GIN generalization]] (Alexander)&lt;br /&gt;
* An Extensibility Roadmap (dim)&lt;br /&gt;
* Representing severity - derive severity from SQLSTATE (Peter Geoghegan - see http://www.postgresql.org/message-id/CA+TgmoZEjq7va+SfDZQwk6E4emEWThENNyxfqEGhB3iuoT1OJw@mail.gmail.com)&lt;br /&gt;
* Error logging infrastructure - store normalized statistics about errors in a circular buffer (Peter Geoghegan). Arguably this could be discussed alongside SQLSTATE item.&lt;br /&gt;
* Failback with backup (Fujii Masao - related discussion is: http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com)&lt;br /&gt;
* Volume Management (Stephen Frost - wiki page will be forthcoming before the meeting)&lt;br /&gt;
* AXLE Project - Big data analytics for Postgres (Simon Riggs) - an overview of the feature plan, how project works and what community can expect&lt;br /&gt;
&lt;br /&gt;
== Agenda ==&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;4&amp;quot; cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
!Time&lt;br /&gt;
!Item&lt;br /&gt;
!Presenter&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|08:00&lt;br /&gt;
|Breakfast&lt;br /&gt;
|&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|08:30 - 08:45&lt;br /&gt;
|Welcome and introductions&lt;br /&gt;
|Dave Page&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|08:45 - 09:45&lt;br /&gt;
|Parallel Query Execution&lt;br /&gt;
|Bruce/Noah&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|09:45 - 10:30&lt;br /&gt;
|Logical changeset generation review &amp;amp; integration&lt;br /&gt;
|Andres&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|10:30 - 10:45&lt;br /&gt;
|Coffee break&lt;br /&gt;
|&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|10:45 - 11:00&lt;br /&gt;
|Utilization of upcoming non-volatile RAM devices&lt;br /&gt;
|KaiGai&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|11:00 - 11:30&lt;br /&gt;
|Pluggable plan/exec nodes&lt;br /&gt;
|KaiGai&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|11:30 - 11:50&lt;br /&gt;
|Representing severity&lt;br /&gt;
|Peter G.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|11:50 - 12:30&lt;br /&gt;
|Error logging infrastructure&lt;br /&gt;
|Peter G.&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|12:30 - 13:30&lt;br /&gt;
|Lunch	&lt;br /&gt;
|&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|13:30 - 14:15&lt;br /&gt;
|GIN generalization&lt;br /&gt;
|Alexander&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|14:15 - 15:00&lt;br /&gt;
|An Extensibility Roadmap&lt;br /&gt;
|Dimitri&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|15:00 - 15:15&lt;br /&gt;
|Tea break&lt;br /&gt;
|&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|15:15 - 15:30&lt;br /&gt;
|9.4 Commitfest schedule&lt;br /&gt;
|All&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|15:30 - 16:45&lt;br /&gt;
|Goals, priorities, and resources for 9.4&lt;br /&gt;
|All&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|16:45 - 17:00&lt;br /&gt;
|Any other business/group photo&lt;br /&gt;
|Dave Page&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|17:00&lt;br /&gt;
|Finish&lt;br /&gt;
|	&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting</id>
		<title>PgCon 2013 Developer Meeting</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting"/>
				<updated>2013-05-13T18:14:17Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;Aplha srot&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A meeting of the most active PostgreSQL developers is being planned for Wednesday 22nd May, 2013 near the University of Ottawa, prior to pgCon 2013. In order to keep the numbers manageable, this meeting is '''by invitation only'''. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org). &lt;br /&gt;
&lt;br /&gt;
Please note that this year the attendee numbers have been kept low in order to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.3 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.&lt;br /&gt;
&lt;br /&gt;
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.&lt;br /&gt;
 &lt;br /&gt;
== Time &amp;amp; Location ==&lt;br /&gt;
&lt;br /&gt;
The meeting will be from 8:30AM to 5PM, and will be in the &amp;quot;Red Experience&amp;quot; room at:&lt;br /&gt;
&lt;br /&gt;
 Novotel Ottawa&lt;br /&gt;
 33 Nicholas Street&lt;br /&gt;
 Ottawa&lt;br /&gt;
 Ontario&lt;br /&gt;
 K1N 9M7&lt;br /&gt;
 &lt;br /&gt;
Food and drink will be provided throughout the day, including breakfast from 8AM.&lt;br /&gt;
&lt;br /&gt;
[http://maps.google.ca/maps?f=q&amp;amp;source=s_q&amp;amp;hl=en&amp;amp;geocode=&amp;amp;q=novotel+ottawa&amp;amp;aq=&amp;amp;sll=49.891235,-97.15369&amp;amp;sspn=36.237851,79.013672&amp;amp;ie=UTF8&amp;amp;hq=novotel+ottawa&amp;amp;hnear=&amp;amp;ll=45.421528,-75.683699&amp;amp;spn=0.036869,0.077162&amp;amp;z=14&amp;amp;iwloc=A&amp;amp;layer=c&amp;amp;cbll=45.425741,-75.689638&amp;amp;panoid=Z4FUGnkZkdHAOkIxyjjS9Q&amp;amp;cbp=12,25.83,,0,-0.6 View on Google Maps]&lt;br /&gt;
&lt;br /&gt;
== Attendees ==&lt;br /&gt;
&lt;br /&gt;
The following people have RSVPed to the meeting (in alphabetical order, by surname):&lt;br /&gt;
&lt;br /&gt;
* Josh Berkus (secretary)&lt;br /&gt;
* Jeff Davis&lt;br /&gt;
* Andrew Dunstan&lt;br /&gt;
* Peter Eisentraut&lt;br /&gt;
* Dimitri Fontaine&lt;br /&gt;
* Andres Freund&lt;br /&gt;
* Stephen Frost&lt;br /&gt;
* Peter Geoghegan&lt;br /&gt;
* Kevin Grittner&lt;br /&gt;
* Robert Haas&lt;br /&gt;
* Magnus Hagander&lt;br /&gt;
* KaiGai Kohei&lt;br /&gt;
* Alexander Korotkov&lt;br /&gt;
* Tom Lane&lt;br /&gt;
* Fujii Masao&lt;br /&gt;
* Noah Misch&lt;br /&gt;
* Bruce Momjian&lt;br /&gt;
* Dave Page (chair)&lt;br /&gt;
* Simon Riggs&lt;br /&gt;
&lt;br /&gt;
== Proposed Agenda Items ==&lt;br /&gt;
&lt;br /&gt;
Please list proposed agenda items here:&lt;br /&gt;
&lt;br /&gt;
* 9.4 Commitfest schedule&lt;br /&gt;
* [http://wiki.postgresql.org/wiki/Parallel_Query_Execution Parallel Query Execution] (Bruce, Noah)&lt;br /&gt;
* logical changeset generation review &amp;amp; integration (Andres)&lt;br /&gt;
* utilization of upcoming non-volatile RAM device (Kaigai)&lt;br /&gt;
* pluggable plan/exec nodes (Kaigai)&lt;br /&gt;
** to offload targetlist calculation, sorting, aggregates, ...&lt;br /&gt;
* [[GIN generalization]] (Alexander)&lt;br /&gt;
* An Extensibility Roadmap (dim)&lt;br /&gt;
* Representing severity - derive severity from SQLSTATE (Peter Geoghegan - see http://www.postgresql.org/message-id/CA+TgmoZEjq7va+SfDZQwk6E4emEWThENNyxfqEGhB3iuoT1OJw@mail.gmail.com)&lt;br /&gt;
* Error logging infrastructure - store normalized statistics about errors in a circular buffer (Peter Geoghegan). Arguably this could be discussed alongside SQLSTATE item.&lt;br /&gt;
* Failback with backup (Fujii Masao - related discussion is: http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com)&lt;br /&gt;
* Volume Management (Stephen Frost - wiki page will be forthcoming before the meeting)&lt;br /&gt;
&lt;br /&gt;
== Agenda ==&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;4&amp;quot; cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
!Time&lt;br /&gt;
!Item&lt;br /&gt;
!Presenter&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|08:00&lt;br /&gt;
|Breakfast&lt;br /&gt;
|&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|08:30 - 08:45&lt;br /&gt;
|Welcome and introductions&lt;br /&gt;
|Dave Page&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|08:45 - 09:45&lt;br /&gt;
|Parallel Query Execution&lt;br /&gt;
|Bruce/Noah&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|09:45 - 10:30&lt;br /&gt;
|Logical changeset generation review &amp;amp; integration&lt;br /&gt;
|Andres&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|10:30 - 10:45&lt;br /&gt;
|Coffee break&lt;br /&gt;
|&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|10:45 - 11:00&lt;br /&gt;
|Utilization of upcoming non-volatile RAM devices&lt;br /&gt;
|KaiGai&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|11:00 - 11:30&lt;br /&gt;
|Pluggable plan/exec nodes&lt;br /&gt;
|KaiGai&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|11:30 - 11:50&lt;br /&gt;
|Representing severity&lt;br /&gt;
|Peter G.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|11:50 - 12:30&lt;br /&gt;
|Error logging infrastructure&lt;br /&gt;
|Peter G.&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|12:30 - 13:30&lt;br /&gt;
|Lunch	&lt;br /&gt;
|&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|13:30 - 14:15&lt;br /&gt;
|GIN generalization&lt;br /&gt;
|Alexander&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|14:15 - 15:00&lt;br /&gt;
|An Extensibility Roadmap&lt;br /&gt;
|Dimitri&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|15:00 - 15:15&lt;br /&gt;
|Tea break&lt;br /&gt;
|&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|15:15 - 15:30&lt;br /&gt;
|9.4 Commitfest schedule&lt;br /&gt;
|All&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|15:30 - 16:45&lt;br /&gt;
|Goals, priorities, and resources for 9.4&lt;br /&gt;
|All&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|16:45 - 17:00&lt;br /&gt;
|Any other business/group photo&lt;br /&gt;
|Dave Page&lt;br /&gt;
&lt;br /&gt;
|- style=&amp;quot;font-style:italic;background-color:lightgray;&amp;quot;&lt;br /&gt;
|17:00&lt;br /&gt;
|Finish&lt;br /&gt;
|	&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-05-08T07:04:59Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;----&lt;br /&gt;
This page is the users and administrators guide for BDR. If you're looking for technical details on the project plan and implementation, see [[BDR Project]].&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
= BDR User Guide =&lt;br /&gt;
&lt;br /&gt;
BDR (Bi-Directional Replication) is a feature being developed for inclusion in PostgreSQL core that provides greatly enhanced replication capabilities.&lt;br /&gt;
&lt;br /&gt;
BDR allows users to create a geographically distributed multi-master database using Logical Log Streaming Replication (LLSR) transport.&lt;br /&gt;
BDR is designed to provide both high availability and geographically distributed disaster recovery capabilities. &lt;br /&gt;
&lt;br /&gt;
BDR is not “clustering” as some vendors use the term, in that it doesn't have a distributed lock manager, global transaction co-ordinator, etc. Each member server is separate yet connected, with design choices that allow separation between nodes that would not be possible with global transaction coordination.&lt;br /&gt;
&lt;br /&gt;
Guidance on getting a testing setup established is in [[#Initial setup]]. Please read the full documentation if you intend to put BDR into production.&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows one PostgreSQL master (the &amp;quot;upstream master&amp;quot;) to stream a sequence of changes to another read/write PostgreSQL server (the &amp;quot;downstream master&amp;quot;). Data is sent in one direction only over a normal libpq connection.&lt;br /&gt;
&lt;br /&gt;
Multiple LLSR connections can be used to set up bi-directional replication as discussed later in this guide.&lt;br /&gt;
&lt;br /&gt;
=== Overview of logical replication ===&lt;br /&gt;
&lt;br /&gt;
In some ways LLSR is similar to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective; both replicate changes from one server to another. However, in LLSR the receiving server is also a full master database that can make changes, unlike the read-only replicas offered by PLSR hot standby. Additionally, LLSR is per-database, whereas PLSR is per-cluster and replicates all databases at once. There are many more differences discussed in the relevant sections of this document.&lt;br /&gt;
&lt;br /&gt;
In LLSR the data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, which makes change capture on the upstream master much more efficient than trigger-based replication and is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
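&lt;br /&gt;
As a sketch of what per-database configuration might look like on a downstream master, assuming a &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; list with one &amp;lt;tt&amp;gt;bdr.NAME.dsn&amp;lt;/tt&amp;gt; entry per connection. The connection names, host name and database names are hypothetical; databases that are not listed would simply not be replicated.&lt;br /&gt;
&lt;br /&gt;
 # one named connection per replicated database (all names are placeholders)&lt;br /&gt;
 bdr.connections = 'alpha_sales,alpha_hr'&lt;br /&gt;
 bdr.alpha_sales.dsn = 'host=alpha.example.com dbname=sales'&lt;br /&gt;
 bdr.alpha_hr.dsn = 'host=alpha.example.com dbname=hr'&lt;br /&gt;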
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after &amp;lt;tt&amp;gt;CREATE DATABASE&amp;lt;/tt&amp;gt;. A restart of the downstream master is also required. The upstream master only needs restarting if the &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; parameter is too low to allow a new replica to be added. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated. Setup is discussed in more detail below.&lt;br /&gt;
&lt;br /&gt;
Changes are processed by the downstream master using &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; plug-ins. This allows flexible handling of replication input, including:&lt;br /&gt;
&lt;br /&gt;
* BDR apply process - applies logical changes to the downstream master. The apply process makes changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
* Textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_xlogdump&amp;lt;/tt&amp;gt; - examines physical WAL records and produces textual debugging output. This server program is included in PostgreSQL 9.3.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;TRUNCATE&amp;lt;/tt&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented, but will be implemented before the feature goes to final release).&lt;br /&gt;
&lt;br /&gt;
Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it has overheads that mean that it doesn't always use less bandwidth than PLSR.&lt;br /&gt;
&lt;br /&gt;
Locks taken by &amp;lt;tt&amp;gt;LOCK&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/SHARE&amp;lt;/tt&amp;gt; on the upstream master are not replicated to downstream masters. Locks taken automatically by &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;TRUNCATE&amp;lt;/tt&amp;gt; *are* taken on the downstream master and may delay replication apply or concurrent transactions - see [[#Lock Conflicts|Lock Conflicts]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;TEMPORARY&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;UNLOGGED&amp;lt;/tt&amp;gt; tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to be applied at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s on the upstream master do not require a unique constraint in order to replicate correctly. &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s require some form of unique constraint, either &amp;lt;tt&amp;gt;PRIMARY KEY&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;UNIQUE NOT NULL&amp;lt;/tt&amp;gt;. A warning is issued in the downstream master's logs if the expected constraint is absent.&lt;br /&gt;
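&lt;br /&gt;
Because &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s need an identifying constraint, it can help to check for tables that lack one before enabling replication. The following is only a sketch: the function name &amp;lt;tt&amp;gt;tables_without_pk_sql&amp;lt;/tt&amp;gt; is made up for illustration, and the query only looks for a missing &amp;lt;tt&amp;gt;PRIMARY KEY&amp;lt;/tt&amp;gt;, so a table that relies on &amp;lt;tt&amp;gt;UNIQUE NOT NULL&amp;lt;/tt&amp;gt; instead will also be listed.&lt;br /&gt;
&lt;br /&gt;
```shell
# Illustrative helper (not part of BDR): print a query that lists
# ordinary user tables with no PRIMARY KEY constraint.
tables_without_pk_sql() {
    printf '%s\n' \
        "SELECT n.nspname, c.relname" \
        "FROM pg_class c" \
        "JOIN pg_namespace n ON n.oid = c.relnamespace" \
        "WHERE c.relkind = 'r'" \
        "  AND n.nspname NOT IN ('pg_catalog', 'information_schema')" \
        "  AND NOT EXISTS (SELECT 1 FROM pg_constraint k" \
        "                  WHERE k.conrelid = c.oid AND k.contype = 'p')" \
        "ORDER BY 1, 2;"
}

# Show the query; pipe it to psql to run it against a database.
tables_without_pk_sql
```
&lt;br /&gt;
The output can be piped to &amp;lt;tt&amp;gt;psql&amp;lt;/tt&amp;gt; on both the upstream and downstream master; the connection options depend on your installation.&lt;br /&gt;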
&lt;br /&gt;
&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s that change the value of the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the final values from the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; on the upstream master, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as &amp;lt;tt&amp;gt;N = N + 1&amp;lt;/tt&amp;gt;, are resolved to their final value. Volatile or stable functions are evaluated on the master side and the resulting values are replicated. Consequently any function side-effects (writing files, network socket activity, updating internal PostgreSQL variables, etc.) will not occur on the replicas, as the functions are not run again on the replica.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master (see &amp;quot;Limitations&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
The current LLSR plugin implementation uses the binary libpq protocol, so it requires that the upstream and downstream master use the same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication. A textual output option will be added later for passing data between non-identical servers, e.g. laptops or mobile devices communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding &amp;lt;tt&amp;gt;AccessExclusiveLock&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in the sequence in which they were committed on the upstream master. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement-based replication (e.g. MySQL) or trigger-based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on the upstream master, additional locking on the downstream side could cause lock waits or deadlocks in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A planned future enhancement is to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;SET&amp;lt;/tt&amp;gt; statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. We always update the correct tables, whatever the setting of &amp;lt;tt&amp;gt;search_path&amp;lt;/tt&amp;gt;. Values are replicated correctly irrespective of the values of &amp;lt;tt&amp;gt;bytea_output&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;TimeZone&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;DateStyle&amp;lt;/tt&amp;gt;, etc.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; is not supported across log based replication, either physical or logical. &amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;LISTEN&amp;lt;/tt&amp;gt; will work fine on the upstream master but an upstream &amp;lt;tt&amp;gt;NOTIFY&amp;lt;/tt&amp;gt; will not trigger a downstream &amp;lt;tt&amp;gt;LISTEN&amp;lt;/tt&amp;gt;er.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes an automatic retry of the apply of the replaying transaction and is only an issue if the deadlock recurs repeatedly, delaying replication.&lt;br /&gt;
&lt;br /&gt;
From a performance and concurrency perspective the BDR apply process is similar to a normal backend. Frequent conflicts with locks from other transactions when replaying changes can slow things down and thus increase replication delay, so reducing the frequency of such conflicts can be a good way to speed things up. Any lock held by another transaction on the downstream master - &amp;lt;tt&amp;gt;LOCK&amp;lt;/tt&amp;gt; statements, &amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/FOR SHARE&amp;lt;/tt&amp;gt;, or &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt; row locks - can delay replication if the replication apply process needs to change the locked table/row.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching &amp;lt;tt&amp;gt;&amp;quot;Schemaname&amp;quot;.&amp;quot;Tablename&amp;quot;&amp;lt;/tt&amp;gt; on both upstream and downstream masters, e.g. changes from upstream's &amp;lt;tt&amp;gt;public.mytable&amp;lt;/tt&amp;gt; will go to downstream's &amp;lt;tt&amp;gt;public.mytable&amp;lt;/tt&amp;gt; while changes to the upstream &amp;lt;tt&amp;gt;myschema.mytable&amp;lt;/tt&amp;gt; will go to the downstream &amp;lt;tt&amp;gt;myschema.mytable&amp;lt;/tt&amp;gt;. This works even when no schema is specified in the original SQL, since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful synchronization of table definitions on each node, otherwise &amp;lt;tt&amp;gt;ERROR&amp;lt;/tt&amp;gt;s will be generated by the replication apply process. In general, tables must be an exact match between upstream and downstream masters.&lt;br /&gt;
&lt;br /&gt;
There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
Tables must meet the following requirements to be compatible for purposes of LLSR:&lt;br /&gt;
&lt;br /&gt;
* The downstream master must only have constraints (&amp;lt;tt&amp;gt;CHECK&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;UNIQUE&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;EXCLUSION&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;FOREIGN KEY&amp;lt;/tt&amp;gt;, etc) that are also present on the upstream master. Replication may initially work with mismatched constraints but is likely to fail as soon as the downstream master rejects a row the upstream master accepted.&lt;br /&gt;
* The table referenced by a FOREIGN KEY on a downstream master must have all the keys present in the upstream master version of the same table.&lt;br /&gt;
* Storage parameters must match, except as allowed below&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
* Dropped columns on master must be present on replicas&lt;br /&gt;
* Custom types and enum definitions must match exactly&lt;br /&gt;
* Composite types and enums must have the same oids on master and replication target&lt;br /&gt;
* Extensions defining types used in replicated tables must be of the same version or fully SQL-level compatible and the oids of the types they define must match.&lt;br /&gt;
&lt;br /&gt;
The following differences are permissible between tables on different nodes:&lt;br /&gt;
&lt;br /&gt;
* The table's &amp;lt;tt&amp;gt;pg_class&amp;lt;/tt&amp;gt; oid, the oid of its associated TOAST table, and the oid of the table's rowtype in &amp;lt;tt&amp;gt;pg_type&amp;lt;/tt&amp;gt; may differ;&lt;br /&gt;
* Extra or missing non-&amp;lt;tt&amp;gt;UNIQUE&amp;lt;/tt&amp;gt; indexes&lt;br /&gt;
* Extra keys in downstream lookup tables for &amp;lt;tt&amp;gt;FOREIGN KEY&amp;lt;/tt&amp;gt; references that are not present on the upstream master&lt;br /&gt;
* The table-level storage parameters for fillfactor and autovacuum&lt;br /&gt;
* Triggers and rules may differ (they are not executed by replication apply)&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR (see [[#LLSR Limitations|LLSR Limitations]]).&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of &amp;lt;tt&amp;gt;session_replication_role = replica&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In future it is expected that composite types and enums with non-identical oids will be converted using text output and input functions. This feature is not yet implemented.&lt;br /&gt;
&lt;br /&gt;
=== Parameter Reference ===&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; - list of nodes that this server will connect to. For each name listed here there must be one &amp;lt;tt&amp;gt;bdr.&amp;lt;nodename&amp;gt;.dsn&amp;lt;/tt&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;bdr.&amp;lt;nodename&amp;gt;.dsn&amp;lt;/tt&amp;gt; - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;wal_level&amp;lt;/tt&amp;gt; - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== LLSR limitations ===&lt;br /&gt;
&lt;br /&gt;
Table definitions, types, extensions, etc must be near identical between upstream and downstream masters. See [[#Table definitions and DDL replication|Table definitions and DDL replication]].&lt;br /&gt;
&lt;br /&gt;
DDL replication is not yet supported.&lt;br /&gt;
&lt;br /&gt;
No feedback from downstream masters to the upstream master is implemented for asynchronous LLSR, so upstream masters must be configured to keep enough WAL. See [[#Configuration|Configuration]].&lt;br /&gt;
&lt;br /&gt;
=== Initial setup ===&lt;br /&gt;
&lt;br /&gt;
To set up LLSR or BDR you first need a patched PostgreSQL that can support LLSR/BDR, then you need to create one or more LLSR/BDR senders and one or more LLSR/BDR receivers.&lt;br /&gt;
&lt;br /&gt;
==== Installing the patched PostgreSQL binaries ====&lt;br /&gt;
&lt;br /&gt;
Currently BDR is only available in builds of the 'bdr' branch on Andres Freund's git repo on git.postgresql.org. PostgreSQL 9.2 and below do not support BDR, and 9.3 requires patches, so this guide will not work for you if you are trying to use a normal install of PostgreSQL.&lt;br /&gt;
&lt;br /&gt;
First you need to clone, configure, compile and install as normal. Clone the sources from &amp;lt;tt&amp;gt;git://git.postgresql.org/git/users/andresfreund/postgres.git&amp;lt;/tt&amp;gt; and check out the &amp;lt;tt&amp;gt;bdr&amp;lt;/tt&amp;gt; branch.&lt;br /&gt;
&lt;br /&gt;
If you have an existing local PostgreSQL git tree specify it as &amp;lt;tt&amp;gt;--reference /path/to/existing/tree&amp;lt;/tt&amp;gt; to greatly speed your git clone.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p $HOME/bdr&lt;br /&gt;
 cd $HOME/bdr&lt;br /&gt;
 git clone git://git.postgresql.org/git/users/andresfreund/postgres.git $HOME/bdr/postgres-bdr-src&lt;br /&gt;
 cd postgres-bdr-src&lt;br /&gt;
 git checkout bdr&lt;br /&gt;
 ./configure --prefix=$HOME/bdr/postgres-bdr-bin&lt;br /&gt;
 make install&lt;br /&gt;
 cd contrib/bdr&lt;br /&gt;
 make install&lt;br /&gt;
&lt;br /&gt;
This will put everything in &amp;lt;tt&amp;gt;$HOME/bdr&amp;lt;/tt&amp;gt;, with the source code and build tree in &amp;lt;tt&amp;gt;$HOME/bdr/postgres-bdr-src&amp;lt;/tt&amp;gt; and the installed PostgreSQL in &amp;lt;tt&amp;gt;$HOME/bdr/postgres-bdr-bin&amp;lt;/tt&amp;gt;. This is a convenient setup for testing and development because it doesn't require you to set up new users, wrangle permissions, run anything as root, etc, but it isn't recommended that you deploy this way in production.&lt;br /&gt;
&lt;br /&gt;
To actually use these new binaries you will need to:&lt;br /&gt;
&lt;br /&gt;
 export PATH=$HOME/bdr/postgres-bdr-bin/bin:$PATH&lt;br /&gt;
&lt;br /&gt;
before running &amp;lt;tt&amp;gt;initdb&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;postgres&amp;lt;/tt&amp;gt;, etc. You don't have to use the &amp;lt;tt&amp;gt;psql&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;libpq&amp;lt;/tt&amp;gt; you compiled but you're likely to get version mismatch warnings if you don't.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
A simple single-master to single-replica setup is configured as follows:&lt;br /&gt;
&lt;br /&gt;
Upstream (sender) &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
  wal_level = 'logical'       # Include enough info for logical replication&lt;br /&gt;
  max_logical_slots = X       # Number of LLSR senders + any receivers&lt;br /&gt;
  max_wal_senders = Y         # Y = max_logical_slots plus any physical &lt;br /&gt;
                              # streaming requirements&lt;br /&gt;
  wal_keep_segments = 5000    # Master must retain enough WAL segments to let &lt;br /&gt;
                              # replicas catch up. Correct value depends on&lt;br /&gt;
                              # rate of writes on master, max replica downtime&lt;br /&gt;
                              # allowable. 5000 segments requires 78GB&lt;br /&gt;
                              # in pg_xlog&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Downstream (receiver) &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
  shared_preload_libraries = 'bdr'&lt;br /&gt;
  &lt;br /&gt;
  bdr.connections = 'name_of_upstream_master'    # list of upstream master nodenames&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection&lt;br /&gt;
                                                 # from downstream to upstream master&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case &lt;br /&gt;
                                                 # where the databasename on upstream &lt;br /&gt;
                                                 # and downstream master differ. &lt;br /&gt;
                                                 # (Not yet implemented)&lt;br /&gt;
  bdr.&amp;lt;nodename&amp;gt;.apply_delay                     # optional parameter to delay apply of&lt;br /&gt;
                                                 # transactions, time in milliseconds &lt;br /&gt;
  bdr.synchronous_commit = ...                   # optional parameter to set the&lt;br /&gt;
                                                 # synchronous_commit parameter the&lt;br /&gt;
                                                 # apply processes will be using&lt;br /&gt;
  max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network and for heavy bursts of write activity on the master. Keep in mind that enough disk space must be available for the WAL segments, each of which is 16MB. See &amp;quot;Insufficient WAL segments retained&amp;quot;.&lt;br /&gt;
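&lt;br /&gt;
As a rough sizing aid, the disk space implied by a given &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; value can be estimated from the 16MB segment size mentioned above. This is a minimal sketch; the helper name &amp;lt;tt&amp;gt;wal_keep_gib&amp;lt;/tt&amp;gt; is invented here, the result is rounded down to whole gigabytes, and it assumes the segment size has not been changed at build time.&lt;br /&gt;
&lt;br /&gt;
```shell
# Illustrative helper: estimate the pg_xlog space (whole GB, rounded
# down) retained for a given wal_keep_segments value.
# Assumes the 16MB segment size stated in this guide.
wal_keep_gib() {
    segments=$1
    segment_mb=16
    echo $(( segments * segment_mb / 1024 ))
}

# 5000 segments * 16MB = 80000MB, i.e. about 78GB
wal_keep_gib 5000
```
&lt;br /&gt;
For 5000 segments this gives 78, matching the figure in the comment in the sender configuration above. The real disk usage can be somewhat higher, since the server may keep additional segments for its own purposes.&lt;br /&gt;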
&lt;br /&gt;
Note that a server can be both sender and receiver, for example two servers replicating to each other, or more complex configurations like replication chains/trees.&lt;br /&gt;
&lt;br /&gt;
The upstream (sender) &amp;lt;tt&amp;gt;pg_hba.conf&amp;lt;/tt&amp;gt; must be configured to allow the downstream master to connect for replication. Otherwise you'll see errors like the following on the downstream master:&lt;br /&gt;
&lt;br /&gt;
 FATAL:  could not connect to the primary server: FATAL:  no pg_hba.conf entry for replication connection from host &amp;quot;[local]&amp;quot;, user &amp;quot;postgres&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A suitable &amp;lt;tt&amp;gt;pg_hba.conf&amp;lt;/tt&amp;gt; entry for a replication connection from the replica server 10.1.4.8 might be:&lt;br /&gt;
&lt;br /&gt;
  host    replication     postgres        10.1.4.8/32            trust&lt;br /&gt;
&lt;br /&gt;
(the user name should match the user name configured in the downstream master's dsn. md5 password authentication is supported.)&lt;br /&gt;
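&lt;br /&gt;
When several downstream masters need access, the entries can be generated rather than typed by hand. The sketch below is only an illustration: the function name &amp;lt;tt&amp;gt;hba_replication_entry&amp;lt;/tt&amp;gt; is hypothetical, and defaulting to &amp;lt;tt&amp;gt;md5&amp;lt;/tt&amp;gt; rather than &amp;lt;tt&amp;gt;trust&amp;lt;/tt&amp;gt; is a choice made here, not something BDR requires.&lt;br /&gt;
&lt;br /&gt;
```shell
# Illustrative helper: print a pg_hba.conf line that lets a single
# downstream host connect for replication.
# Arguments: user, IPv4 address, optional auth method (default md5,
# which is an assumption of this sketch).
hba_replication_entry() {
    user=$1
    addr=$2
    method=${3:-md5}
    printf 'host    replication     %s        %s/32            %s\n' \
        "$user" "$addr" "$method"
}

# Entry for the replica server used in the example above
hba_replication_entry postgres 10.1.4.8
```
&lt;br /&gt;
Passing &amp;lt;tt&amp;gt;trust&amp;lt;/tt&amp;gt; as the third argument reproduces the entry shown above. The upstream master must reload its configuration before a new entry takes effect.&lt;br /&gt;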
&lt;br /&gt;
For more details on these parameters, see [[#Parameter Reference|Parameter Reference]].&lt;br /&gt;
&lt;br /&gt;
=== Troubleshooting ===&lt;br /&gt;
&lt;br /&gt;
==== Could not access file &amp;quot;bdr&amp;quot;: No such file or directory ====&lt;br /&gt;
&lt;br /&gt;
If you see the error:&lt;br /&gt;
&lt;br /&gt;
 FATAL:  could not access file &amp;quot;bdr&amp;quot;: No such file or directory&lt;br /&gt;
&lt;br /&gt;
when starting a database set up to receive BDR replication, you probably forgot to install &amp;lt;tt&amp;gt;contrib/bdr&amp;lt;/tt&amp;gt;. See above.&lt;br /&gt;
&lt;br /&gt;
==== Invalid value for parameter ====&lt;br /&gt;
&lt;br /&gt;
An error like:&lt;br /&gt;
&lt;br /&gt;
 LOG:  invalid value for parameter ...&lt;br /&gt;
&lt;br /&gt;
when setting one of these parameters means your server doesn't support logical replication and will need to be patched or updated.&lt;br /&gt;
&lt;br /&gt;
==== Insufficient WAL segments retained (&amp;quot;requested WAL segment ... has already been removed&amp;quot;) ====&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;tt&amp;gt;wal_keep_segments&amp;lt;/tt&amp;gt; is insufficient to meet the requirements of a replica that has fallen far behind, the master will report errors like:&lt;br /&gt;
&lt;br /&gt;
 ERROR:  requested WAL segment 00000001000000010000002D has already been removed&lt;br /&gt;
&lt;br /&gt;
Currently the replica errors look like:&lt;br /&gt;
&lt;br /&gt;
 WARNING:  Starting logical replication&lt;br /&gt;
 LOG:  data stream ended&lt;br /&gt;
 LOG:  worker process: master (PID 23812) exited with exit code 0&lt;br /&gt;
 LOG:  starting background worker process &amp;quot;master&amp;quot;&lt;br /&gt;
 LOG:  master initialized on master, remote dbname=master port=5434 replication=true fallback_application_name=bdr&lt;br /&gt;
 LOG:  local sysid 5873181566046043070, remote: 5873181102189050714&lt;br /&gt;
 LOG:  found valid replication identifier 1&lt;br /&gt;
 LOG:  starting up replication at 1 from 1/2D9CA220&lt;br /&gt;
&lt;br /&gt;
but a more explicit error message for this condition is planned.&lt;br /&gt;
&lt;br /&gt;
The only way to recover from this fault is to re-seed the replica database.&lt;br /&gt;
&lt;br /&gt;
This fault could be prevented with feedback from the replica to the master, but this feature is not planned for the first release of BDR. Another alternative considered for future releases is making wal_keep_segments a dynamic parameter that is sized on demand.&lt;br /&gt;
&lt;br /&gt;
Monitoring of maximum replica lag and appropriate adjustment of wal_keep_segments will prevent this fault from arising.&lt;br /&gt;
&lt;br /&gt;
==== Couldn't find logical slot ====&lt;br /&gt;
&lt;br /&gt;
An error like:&lt;br /&gt;
&lt;br /&gt;
 ERROR:  couldn't find logical slot &amp;quot;bdr: 16384:5873181566046043070-1-24596:&amp;quot;&lt;br /&gt;
&lt;br /&gt;
on the upstream master suggests that a downstream master is trying to connect to a logical replication slot that no longer exists. The slot cannot be re-created, so it is necessary to re-seed the downstream replica database.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level (i.e. SQL-visible) ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically retried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following views are available for monitoring replication activity:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE pg_stat_replication]&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; (described below)&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_bdr&amp;lt;/tt&amp;gt; (described below)&lt;br /&gt;
&lt;br /&gt;
The following configuration and logging parameters are useful for monitoring replication:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS log_lock_waits]&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== pg_stat_logical_replication ====&lt;br /&gt;
&lt;br /&gt;
The new &amp;lt;tt&amp;gt;pg_stat_logical_replication&amp;lt;/tt&amp;gt; view is specific to logical replication. It is based on the underlying &amp;lt;tt&amp;gt;pg_stat_get_logical_replication_slots&amp;lt;/tt&amp;gt; function and has the following structure:&lt;br /&gt;
&lt;br /&gt;
  View &amp;quot;pg_catalog.pg_stat_logical_replication&amp;quot;&lt;br /&gt;
           Column          |  Type   | Modifiers &lt;br /&gt;
 --------------------------+---------+-----------&lt;br /&gt;
  slot_name                | text    | &lt;br /&gt;
  plugin                   | text    | &lt;br /&gt;
  database                 | oid     | &lt;br /&gt;
  active                   | boolean | &lt;br /&gt;
  xmin                     | xid     | &lt;br /&gt;
  last_required_checkpoint | text    | &lt;br /&gt;
&lt;br /&gt;
It contains one row for every connection from a downstream master to the server being queried (the upstream master).&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;slot_name&amp;lt;/tt&amp;gt;: An internal name for a given logical replication slot (a connection from a downstream master to this upstream master). This slot name is used by the downstream master to uniquely identify itself and is used with the &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; command when managing logical replication slots.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;plugin&amp;lt;/tt&amp;gt;: The logical replication plugin being used to decode WAL archives. You'll generally only see &amp;lt;tt&amp;gt;bdr_output&amp;lt;/tt&amp;gt; here.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;database&amp;lt;/tt&amp;gt;: The oid of the database being replicated by this slot. You can get the database name by joining on &amp;lt;tt&amp;gt;pg_database.oid&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt;: Whether this slot currently has an active connection.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;xmin&amp;lt;/tt&amp;gt;: The lowest transaction ID this replication slot can &amp;quot;see&amp;quot;. (TODO)&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;last_required_checkpoint&amp;lt;/tt&amp;gt;: The checkpoint identifying the oldest WAL record required to bring this slot up to date with the upstream master. (TODO)&lt;br /&gt;
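&lt;br /&gt;
The join on &amp;lt;tt&amp;gt;pg_database.oid&amp;lt;/tt&amp;gt; mentioned for the &amp;lt;tt&amp;gt;database&amp;lt;/tt&amp;gt; column might look like the following. This is a sketch only: the function name &amp;lt;tt&amp;gt;slot_overview_sql&amp;lt;/tt&amp;gt; is invented for illustration, and the column list assumes the view structure shown above.&lt;br /&gt;
&lt;br /&gt;
```shell
# Illustrative helper: print a query that resolves each slot's
# database oid to a database name.
slot_overview_sql() {
    printf '%s\n' \
        'SELECT s.slot_name, d.datname, s.active, s.last_required_checkpoint' \
        'FROM pg_stat_logical_replication s' \
        'JOIN pg_database d ON d.oid = s.database' \
        'ORDER BY d.datname, s.slot_name;'
}

# Show the query; pipe it to psql on the upstream master to run it.
slot_overview_sql
```
&lt;br /&gt;
Rows where &amp;lt;tt&amp;gt;active&amp;lt;/tt&amp;gt; is false are slots whose downstream master is not currently connected.&lt;br /&gt;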
&lt;br /&gt;
==== pg_stat_bdr ====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;pg_catalog.pg_stat_bdr&amp;lt;/tt&amp;gt; view ... (TODO)&lt;br /&gt;
&lt;br /&gt;
View structure:&lt;br /&gt;
&lt;br /&gt;
(TODO)&lt;br /&gt;
&lt;br /&gt;
=== Table and index usage statistics ===&lt;br /&gt;
&lt;br /&gt;
Statistics on table and index usage are updated normally by the downstream master. This is essential for the correct functioning of autovacuum. If there are no local writes on the downstream master and stats have not been reset, these two views should show matching results between upstream and downstream:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_user_tables&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_statio_user_tables&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, workloads that perform &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s may make the identifying indexes on the downstream side appear more heavily used than non-identifying indexes.&lt;br /&gt;
&lt;br /&gt;
The built-in index monitoring views are:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_stat_user_indexes&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;pg_statio_user_indexes&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All these views are discussed in [http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE the PostgreSQL documentation on the statistics views].&lt;br /&gt;
&lt;br /&gt;
=== Starting, stopping and managing replication ===&lt;br /&gt;
&lt;br /&gt;
TODO: Extension to improve this?&lt;br /&gt;
&lt;br /&gt;
==== Starting a new LLSR connection ====&lt;br /&gt;
&lt;br /&gt;
Logical replication is started automatically when a database is configured as a downstream master in &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt; (see [[#Configuration|Configuration]]) and the postmaster is started. No explicit action is required to start replication, but replication will not actually work unless the upstream and downstream databases are identical within the requirements set by LLSR in the [[#Table definitions and DDL replication|Table definitions and DDL replication]] section.&lt;br /&gt;
&lt;br /&gt;
==== Viewing logical replication slots ====&lt;br /&gt;
&lt;br /&gt;
Examining the state of logical replication is discussed in [[#Monitoring|Monitoring]].&lt;br /&gt;
&lt;br /&gt;
==== Temporarily stopping an LLSR replica ====&lt;br /&gt;
&lt;br /&gt;
LLSR replicas can be temporarily stopped by shutting down the downstream master's postmaster.&lt;br /&gt;
&lt;br /&gt;
If the replica is not started back up before the upstream master discards the oldest WAL segment required for the downstream master to resume replay, as identified by the &amp;lt;tt&amp;gt;last_required_checkpoint&amp;lt;/tt&amp;gt; column of &amp;lt;tt&amp;gt;pg_catalog.pg_stat_logical_replication&amp;lt;/tt&amp;gt;, then the replica will not resume replay. The error [[#Insufficient_WAL_segments_retained_.28.22requested_WAL_segment_..._has_already_been_removed.22.29|Insufficient WAL segments retained]] will be reported in the upstream master's logs. The replica must be re-seeded for replication to continue.&lt;br /&gt;
&lt;br /&gt;
TODO: Discuss any SQL-level, per-database functions for managing replication.&lt;br /&gt;
&lt;br /&gt;
==== Removing an LLSR replica permanently ====&lt;br /&gt;
&lt;br /&gt;
To remove a replication connection permanently, remove its entries from the downstream master's &amp;lt;tt&amp;gt;postgresql.conf&amp;lt;/tt&amp;gt;, restart the downstream master, then use &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; to remove the replication slot on the upstream master.&lt;br /&gt;
&lt;br /&gt;
TODO pending merge of downstream control functions.&lt;br /&gt;
&lt;br /&gt;
==== Cleaning up abandoned replication slots ====&lt;br /&gt;
&lt;br /&gt;
To remove a replication slot that was used for a now-defunct replica, find its slot name from the &amp;lt;tt&amp;gt;[[#pg_stat_logical_replication|pg_stat_logical_replication]]&amp;lt;/tt&amp;gt; view on the upstream master then run:&lt;br /&gt;
&lt;br /&gt;
 pg_receivellog -p 5434 -h master-hostname -d dbname \&lt;br /&gt;
    --slot='bdr: 16384:5873181566046043070-1-16384:' --stop&lt;br /&gt;
&lt;br /&gt;
where the argument to '--slot' is the slot name you found from the view.&lt;br /&gt;
&lt;br /&gt;
You may need to do this if you've created and then deleted several replicas so &amp;lt;tt&amp;gt;max_logical_slots&amp;lt;/tt&amp;gt; has filled up with entries for replicas that no longer exist.&lt;br /&gt;
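&lt;br /&gt;
If several slots need removing, it can be safer to print the commands first and review them before running anything. The following is a minimal sketch using only the &amp;lt;tt&amp;gt;pg_receivellog&amp;lt;/tt&amp;gt; options shown above; the function name &amp;lt;tt&amp;gt;drop_slot_cmd&amp;lt;/tt&amp;gt; is hypothetical and the host, port and database values are placeholders.&lt;br /&gt;
&lt;br /&gt;
```shell
# Illustrative helper: print (do not run) the pg_receivellog command
# that removes one logical replication slot.
# Arguments: host, port, database name, slot name.
drop_slot_cmd() {
    host=$1
    port=$2
    db=$3
    slot=$4
    printf "pg_receivellog -h %s -p %s -d %s --slot='%s' --stop\n" \
        "$host" "$port" "$db" "$slot"
}

# Slot names contain spaces and colons, so always quote them.
drop_slot_cmd master-hostname 5434 dbname 'bdr: 16384:5873181566046043070-1-16384:'
```
&lt;br /&gt;
Each printed line can be checked against the slot names reported by the view and then pasted into a shell once you are sure the replica is really gone.&lt;br /&gt;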
&lt;br /&gt;
== Bi-Directional Replication ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional replication is built directly on LLSR by configuring two or more servers as both upstream ''and'' downstream masters of each other.&lt;br /&gt;
&lt;br /&gt;
All of the Logical Log Streaming Replication documentation applies to BDR and should be read before moving on to reading about and configuring BDR.&lt;br /&gt;
&lt;br /&gt;
=== Bi-Directional Replication Use Cases ===&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
==== Simple multi-master pair ====&lt;br /&gt;
&lt;br /&gt;
A simple multi-master &amp;quot;HA Cluster&amp;quot; with two servers:&lt;br /&gt;
&lt;br /&gt;
* Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
* Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Alpha:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections='bravo'&lt;br /&gt;
 bdr.bravo.dsn = 'dbname=dbtoreplicate'&lt;br /&gt;
&lt;br /&gt;
Bravo:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 3&lt;br /&gt;
 max_wal_senders = 4&lt;br /&gt;
 wal_keep_segments = 5000&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections='alpha'&lt;br /&gt;
 bdr.alpha.dsn = 'dbname=dbtoreplicate'&lt;br /&gt;
&lt;br /&gt;
See [[#Configuration|Configuration]] for an explanation of these parameters.&lt;br /&gt;
&lt;br /&gt;
==== HA and Logical Standby ====&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;HA Cluster&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
* Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
* Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
* Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
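&lt;br /&gt;
A minimal configuration sketch for this layout, assuming the parameter names from the pair example above apply unchanged. The host name and the standby name &amp;lt;tt&amp;gt;bravo&amp;lt;/tt&amp;gt; are illustrative, and the exact set of parameters needed by a master that is only an upstream is an assumption:&lt;br /&gt;
&lt;br /&gt;
Alpha:&lt;br /&gt;
&lt;br /&gt;
 wal_level = 'logical'&lt;br /&gt;
 max_logical_slots = 1                # one slot for Charlie&lt;br /&gt;
 max_wal_senders = 2                  # assumed: one for Bravo (physical), one for Charlie (logical)&lt;br /&gt;
 synchronous_standby_names = 'bravo'  # optional; must match the application_name Bravo connects with&lt;br /&gt;
&lt;br /&gt;
Charlie:&lt;br /&gt;
&lt;br /&gt;
 shared_preload_libraries = 'bdr'&lt;br /&gt;
 bdr.connections = 'alpha'&lt;br /&gt;
 bdr.alpha.dsn = 'host=alpha-hostname dbname=dbtoreplicate'&lt;br /&gt;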
&lt;br /&gt;
==== Very High Availability Multi-Master ====&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
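Bandwidth between Site 1 and Site 2 is minimised, because physical replication stays within each site and only the logical change stream crosses between sites.&lt;br /&gt;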
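&lt;br /&gt;
A sketch of the logical connections for this layout, assuming that each downstream master lists its upstream servers in &amp;lt;tt&amp;gt;bdr.connections&amp;lt;/tt&amp;gt; and that a DSN may point at a physical standby, as described above. Host names are illustrative, and the physical replication settings are omitted:&lt;br /&gt;
&lt;br /&gt;
Alpha (receives Site 2's changes from Delta):&lt;br /&gt;
&lt;br /&gt;
 bdr.connections = 'delta'&lt;br /&gt;
 bdr.delta.dsn = 'host=delta-hostname dbname=dbtoreplicate'&lt;br /&gt;
&lt;br /&gt;
Charlie (receives Site 1's changes from Bravo):&lt;br /&gt;
&lt;br /&gt;
 bdr.connections = 'bravo'&lt;br /&gt;
 bdr.bravo.dsn = 'host=bravo-hostname dbname=dbtoreplicate'&lt;br /&gt;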
&lt;br /&gt;
==== 3-remote site simple Multi-Master Plex ====&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes.)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
If you wanted to test this configuration locally you could run three PostgreSQL instances on different ports. Such a configuration would look like the following if the port numbers were used as node names for the sake of notational clarity:&lt;br /&gt;
&lt;br /&gt;
Config for node_5440:&lt;br /&gt;
&lt;br /&gt;
 port = 5440&lt;br /&gt;
 bdr.connections='node_5441,node_5442'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5441:&lt;br /&gt;
&lt;br /&gt;
 port = 5441&lt;br /&gt;
 bdr.connections='node_5440,node_5442'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5442:&lt;br /&gt;
&lt;br /&gt;
 port = 5442&lt;br /&gt;
 bdr.connections='node_5440,node_5441'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
In a typical real-world configuration each server would be on the same port on a different host instead.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site simple Multi-Master Circular Replication ====&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &amp;quot;circular replication&amp;quot;. It results in higher latency for changes as the number of nodes increases, and it is also less resilient to network disruptions and node faults.&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
TODO: Regrettably this doesn't actually work yet because we don't cascade logical changes.&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity:&lt;br /&gt;
&lt;br /&gt;
Config for node_5440:&lt;br /&gt;
&lt;br /&gt;
 port = 5440&lt;br /&gt;
 bdr.connections='node_5441'&lt;br /&gt;
 bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5441:&lt;br /&gt;
&lt;br /&gt;
 port = 5441&lt;br /&gt;
 bdr.connections='node_5442'&lt;br /&gt;
 bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Config for node_5442:&lt;br /&gt;
&lt;br /&gt;
 port = 5442&lt;br /&gt;
 bdr.connections='node_5440'&lt;br /&gt;
 bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
This would usually be done in the real world with databases on different hosts, all running on the same port.&lt;br /&gt;
&lt;br /&gt;
==== 3-remote site Max Availability Multi-Master Plex ====&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
Config left as an exercise for the reader.&lt;br /&gt;
&lt;br /&gt;
==== N-site symmetric cluster replication ====&lt;br /&gt;
&lt;br /&gt;
A symmetric cluster is one where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to the other masters for every replicated database, so the practical limit is fewer than 100 servers, or fewer still if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. An option to filter rows/tables for replication, which is planned, becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
==== Complex/Asymmetric Replication ====&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
=== Conflict Avoidance ===&lt;br /&gt;
&lt;br /&gt;
==== Distributed Locking ====&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications as very low latency is critical for acceptable performance.&lt;br /&gt;
&lt;br /&gt;
Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible but allow some types of conflict to occur and resolve them when they arise.&lt;br /&gt;
&lt;br /&gt;
==== Global Sequences ====&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using &amp;lt;tt&amp;gt;DEFAULT nextval('mysequence')&amp;lt;/tt&amp;gt;, as with PostgreSQL's &amp;lt;tt&amp;gt;SERIAL&amp;lt;/tt&amp;gt; pseudo-type.&lt;br /&gt;
&lt;br /&gt;
BDR requires sequences to work together across multiple nodes. This is implemented as a new &amp;lt;tt&amp;gt;SequenceAccessMethod&amp;lt;/tt&amp;gt; API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
=== Conflict Detection &amp;amp; Resolution ===&lt;br /&gt;
&lt;br /&gt;
Because local writes can occur on a master, conflict detection and avoidance is a concern for basic LLSR setups as well as full BDR configurations.&lt;br /&gt;
&lt;br /&gt;
==== Lock Conflicts ====&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the table being changed and be able to write-lock the tuple(s) being changed. Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the &amp;lt;tt&amp;gt;[http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS log_lock_waits]&amp;lt;/tt&amp;gt; facility to look for issues with apply blocking on locks.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes:&lt;br /&gt;
&lt;br /&gt;
* explicit row level locking (&amp;lt;tt&amp;gt;SELECT ... FOR UPDATE/FOR SHARE&amp;lt;/tt&amp;gt;)&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s, &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s or &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;s, either from local activity or apply from other servers&lt;br /&gt;
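&lt;br /&gt;
A minimal sketch of the logging settings that help spot apply blocking on locks, set on the downstream master. Both are standard PostgreSQL parameters; the timeout value shown is only an illustration:&lt;br /&gt;
&lt;br /&gt;
 log_lock_waits = on      # log a message when a session waits longer than deadlock_timeout for a lock&lt;br /&gt;
 deadlock_timeout = 1s    # also the wait threshold used by log_lock_waits (1s is the default)&lt;br /&gt;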
&lt;br /&gt;
==== Data Conflicts ====&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in a consistent and idempotent manner so that all servers end up with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from &amp;lt;tt&amp;gt;pg_control&amp;lt;/tt&amp;gt;, though this may change in a future release.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt;s and &amp;lt;tt&amp;gt;INSERT&amp;lt;/tt&amp;gt;s may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur. The errors causing the conflict can be seen in the error log of the downstream master with the problem.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be &amp;lt;tt&amp;gt;DELETE&amp;lt;/tt&amp;gt;/&amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; conflicts. These are accepted as successful operations but in the case of &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; the data in the &amp;lt;tt&amp;gt;UPDATE&amp;lt;/tt&amp;gt; is discarded.&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, with the other discarded according to last-update-wins. It is not practical to decide at the database level when a row should be merged and when a last-update-wins strategy should be used; such decision making would require support for application-specific conflict resolution plugins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
==== Examples ====&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/GSoC_2013</id>
		<title>GSoC 2013</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/GSoC_2013"/>
				<updated>2013-04-08T21:04:39Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Project Ideas */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Projects ==&lt;br /&gt;
&lt;br /&gt;
The GSoC projects for 2013 will be listed here when selected:&lt;br /&gt;
&lt;br /&gt;
#&lt;br /&gt;
&lt;br /&gt;
== What is GSoC? ==&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code (GSoC) is a global program that offers student developers stipends to write code for various open source software projects. We have worked with several open source, free software, and technology-related groups to identify and fund several projects over a three month period. Since its inception in 2005, the program has brought together over 4,500 students and more than 4,000 mentors &amp;amp; co-mentors from over 85 countries worldwide, all for the love of code. Through Google Summer of Code, accepted student applicants are paired with a mentor or mentors from the participating projects, thus gaining exposure to real-world software development scenarios and the opportunity for employment in areas related to their academic pursuits. In turn, the participating projects are able to more easily identify and bring in new developers. Best of all, more source code is created and released for the use and benefit of all.&lt;br /&gt;
&lt;br /&gt;
PostgreSQL has an official summer of code page: http://www.postgresql.org/developer/summerofcode.html&lt;br /&gt;
&lt;br /&gt;
== Advice for Students ==&lt;br /&gt;
&lt;br /&gt;
We have developed the following [http://www.postgresql.org/developer/summerofcodeadvice Advice for Students] page to get you started.&lt;br /&gt;
&lt;br /&gt;
Also, students who discuss their proposals with Postgres project members *before* the application deadline are much more likely to be successful.&lt;br /&gt;
&lt;br /&gt;
== Mailing list for student questions ==&lt;br /&gt;
&lt;br /&gt;
For GSoC program questions and discussion, please subscribe to:&lt;br /&gt;
&lt;br /&gt;
http://archives.postgresql.org/pgsql-students/&lt;br /&gt;
&lt;br /&gt;
We can have non-code related discussions on this list, and help answer questions about proposal writing. &lt;br /&gt;
&lt;br /&gt;
Discussion of code should be done on the list for the specific project a student is working on. &lt;br /&gt;
&lt;br /&gt;
== IRC ==&lt;br /&gt;
&lt;br /&gt;
We have an IRC channel at #postgresql and you are welcome to ask questions there. Our GSoC Admins are: darkixion and agliodbs. Do not DM them without asking first!&lt;br /&gt;
&lt;br /&gt;
You are welcome to ping us in the channel, or ask the channel general questions about a proposal. If you do not find help in the channel, feel free to send an email to pgsql-students@postgresql.org for help.&lt;br /&gt;
&lt;br /&gt;
== Proposal Format ==&lt;br /&gt;
&lt;br /&gt;
Students are responsible for writing a proposal and submitting it to Google before the application deadline. The following outline was adapted from the [http://www.perlfoundation.org/how_to_write_a_proposal Perl Foundation open source proposal HOWTO]. A strong proposal will include:&lt;br /&gt;
&lt;br /&gt;
* Project Title&lt;br /&gt;
* Name of proposer and email&lt;br /&gt;
* Synopsis&lt;br /&gt;
* Benefits to the PostgreSQL Community&lt;br /&gt;
* Quantifiable results &lt;br /&gt;
* Project Details&lt;br /&gt;
* Inch-stones (project broken into small, distinct chunks)&lt;br /&gt;
* Project Schedule&lt;br /&gt;
* Completeness Criteria&lt;br /&gt;
* Bio&lt;br /&gt;
**Blog&lt;br /&gt;
**Github&lt;br /&gt;
**@Twitter&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Ideas ==&lt;br /&gt;
&lt;br /&gt;
Project ideas are to be added here.&lt;br /&gt;
&lt;br /&gt;
(Please can we have visibility of ideas on Hackers, to avoid overreaching what is possible in the time and working on dubious projects.)&lt;br /&gt;
&lt;br /&gt;
=== Core ===&lt;br /&gt;
* UPDATE ... RETURNING OLD [http://www.postgresql.org/message-id/20130218171259.GA26999@fetter.org link]&lt;br /&gt;
* Add RETURNING to DDL (CREATE, ALTER, DROP) and possibly DCL (GRANT, REVOKE) [http://www.postgresql.org/message-id/20130218171259.GA26999@fetter.org link]&lt;br /&gt;
&lt;br /&gt;
=== Extensions ===&lt;br /&gt;
* cube extension improvements (indexing, type support, new KNN search metrics) [http://www.postgresql.org/message-id/6A7E75B1-64DD-4F5F-A991-435E3E5A24FB@gmail.com link]&lt;br /&gt;
&lt;br /&gt;
=== Tools ===&lt;br /&gt;
* Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll &amp;amp; .dylib) [http://www.postgresql.org/message-id/1811491181.20130215163950@gf.microolap.com link]&lt;br /&gt;
* Extending MADlib functions to fill in (extrapolate) missing values in data sets [http://www.postgresql.org/message-id/B654BEBE-32D9-4670-BBB7-BC846AE5B785@gmail.com link1] [http://www.postgresql.org/message-id/511E7193.4020907@agliodbs.com link2]&lt;br /&gt;
* pg_upgrade support for Debian's pg_upgradecluster [http://www.postgresql.org/message-id/20130218213711.GA1005@awork2.anarazel.de link]&lt;br /&gt;
&lt;br /&gt;
== Project Admins ==&lt;br /&gt;
&lt;br /&gt;
* Thom Brown&lt;br /&gt;
* Josh Berkus&lt;br /&gt;
&lt;br /&gt;
== 2013 Mentors ==&lt;br /&gt;
&lt;br /&gt;
Volunteer mentors who have been active on the -hackers list:&lt;br /&gt;
* Alvaro Herrera&lt;br /&gt;
* Stephen Frost (maybe)&lt;br /&gt;
* Dimitri Fontaine&lt;br /&gt;
* Alexander Korotkov&lt;br /&gt;
* Pavel Golub&lt;br /&gt;
* David Fetter&lt;br /&gt;
* Magnus Hagander&lt;br /&gt;
* Christoph Berg&lt;br /&gt;
* Tomas Vondra&lt;br /&gt;
&lt;br /&gt;
Other volunteers who can potentially act as assistants to mentors:&lt;br /&gt;
* Atri Sharma&lt;br /&gt;
* Gilberto Castillo&lt;br /&gt;
&lt;br /&gt;
== Past Success ==&lt;br /&gt;
&lt;br /&gt;
Need to add&lt;br /&gt;
&lt;br /&gt;
== GOALS: ==&lt;br /&gt;
* usable code&lt;br /&gt;
** useful/novel ideas&lt;br /&gt;
** research projects&lt;br /&gt;
* longer term contributors&lt;br /&gt;
&lt;br /&gt;
== TODOs: ==&lt;br /&gt;
&lt;br /&gt;
* Kick-off Meeting for Community Members&lt;br /&gt;
* Update GSOC page&lt;br /&gt;
* Advertising?&lt;br /&gt;
* Blog that we're participating and seeking students&lt;br /&gt;
* Round of private emails to people who have participated in the past: Heikki, Simon, Mark, Stephen, Merlin&lt;br /&gt;
** request interest, and then follow up in asking about possible topics for students&lt;br /&gt;
* Mentor recruitment and then email to -hackers&lt;br /&gt;
** do this much later when we have some proposals in?&lt;br /&gt;
&lt;br /&gt;
* Recruitment -- no organized group effort?&lt;br /&gt;
** -announce, -general, -hackers&lt;br /&gt;
** user group lists&lt;br /&gt;
** phppgadmin/pgadmin&lt;br /&gt;
** berkeley&lt;br /&gt;
** Univ. of Maryland -- contact them?&lt;br /&gt;
&lt;br /&gt;
* Identify the commitfest that the code will be submitted to&lt;br /&gt;
&lt;br /&gt;
== Expectations ==&lt;br /&gt;
&lt;br /&gt;
* Stuff to keep students together:&lt;br /&gt;
** Regular blogging from students&lt;br /&gt;
** weekly group IRC checkin? -- two checkin times maybe?&lt;br /&gt;
&lt;br /&gt;
* Have students communicate on -hackers where appropriate (didn't really work?)&lt;br /&gt;
** Or other relevant -devel lists&lt;br /&gt;
&lt;br /&gt;
* Mailing list&lt;br /&gt;
** pgsql-students (?)  vs. -hackers (?)  maybe up to mentor?&lt;br /&gt;
** mentors mailing list -admin mailing list, berkus said?&lt;br /&gt;
** students mailing list via gsoc&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T16:04:19Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* 3-remote site simple Multi-Master Circular Replication */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, with a total of 16 separate commits on these and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features:&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. From a user perspective this is similar in many ways to &amp;quot;streaming replication&amp;quot;, i.e. physical log streaming replication (PLSR); the main difference is that the receiving server is also a full read-write master database that can make its own changes. The master that sends data is known as the upstream master and the master that receives data is known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger-based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
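&lt;br /&gt;
As a minimal sketch of per-database connections, a downstream master replicating two databases from one upstream server might carry entries like the following. The connection names &amp;lt;tt&amp;gt;alpha_sales&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;alpha_hr&amp;lt;/tt&amp;gt;, the host name and the database names are hypothetical, and it is an assumption that connection names can be chosen freely; the parameter syntax follows the configuration examples elsewhere in this guide:&lt;br /&gt;
&lt;br /&gt;
 # one bdr connection per replicated database (names are illustrative)&lt;br /&gt;
 bdr.connections = 'alpha_sales,alpha_hr'&lt;br /&gt;
 bdr.alpha_sales.dsn = 'host=alpha-hostname dbname=sales'&lt;br /&gt;
 bdr.alpha_hr.dsn = 'host=alpha-hostname dbname=hr'&lt;br /&gt;
&lt;br /&gt;
A database that should not be replicated is simply left out of the list.&lt;br /&gt;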
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead of physical WAL, though it adds message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to proceed at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
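&lt;br /&gt;
The buffering and commit-order behaviour described above can be pictured with a minimal Python sketch. The class and method names are hypothetical and are not part of BDR; spilling to disk and the wire protocol are deliberately left out.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ReorderBuffer:
    '''Hypothetical sketch: buffer changes per transaction, emit at commit.'''

    def __init__(self):
        self.pending = defaultdict(list)  # xid maps to its buffered changes
        self.sent = []                    # (xid, changes) in commit order

    def change(self, xid, row_change):
        # Changes are only accumulated here; nothing is sent yet.
        self.pending[xid].append(row_change)

    def commit(self, xid):
        # Sending happens at commit time, so output order is commit order.
        self.sent.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are dropped and never sent downstream.
        self.pending.pop(xid, None)


buf = ReorderBuffer()
buf.change(100, 'INSERT a')
buf.change(101, 'UPDATE b')
buf.change(102, 'DELETE c')
buf.change(100, 'INSERT d')
buf.commit(101)
buf.abort(102)
buf.commit(100)
print(buf.sent)
```
&lt;br /&gt;
Transaction 101 commits first and is therefore sent first, even though transaction 100 started earlier; the aborted transaction 102 never appears in the output.&lt;br /&gt;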
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...;                  # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
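&lt;br /&gt;
The pairing rule between bdr.connections and the bdr.&lt;nodename&gt;.dsn entries can be checked mechanically. The following Python sketch is only an illustration: the helper name is hypothetical, and it assumes the settings have already been read into a dictionary and that bdr.connections is a comma-separated list, as in the examples in this guide.&lt;br /&gt;
&lt;br /&gt;
```python
def missing_dsn_entries(settings):
    '''Return node names listed in bdr.connections that lack a dsn entry.'''
    names = [n.strip() for n in settings.get('bdr.connections', '').split(',')]
    return [n for n in names if n and f'bdr.{n}.dsn' not in settings]


settings = {
    'bdr.connections': 'node_5441,node_5442',
    'bdr.node_5441.dsn': 'port=5441 dbname=postgres',
}
print(missing_dsn_entries(settings))
```
&lt;br /&gt;
Here node_5442 is reported because it is listed in bdr.connections but has no matching dsn setting.&lt;br /&gt;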
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. Similar in many ways to setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results on upstream and downstream (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing a chain in which the upstream master sends to a physical standby, which in turn sends to a downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
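&lt;br /&gt;
The all-to-all settings above follow a regular pattern, so they can be generated rather than typed by hand. This Python sketch is a hypothetical helper, not a BDR tool; it assumes the node_PORT naming convention used in this example and that every node runs on the same host.&lt;br /&gt;
&lt;br /&gt;
```python
def plex_config(ports, dbname='postgres'):
    '''Build per-node settings where every node connects to all others.'''
    configs = {}
    for port in ports:
        others = [p for p in ports if p != port]
        conf = {
            'port': str(port),
            'bdr.connections': ','.join(f'node_{p}' for p in others),
        }
        for p in others:
            # One dsn entry per name listed in bdr.connections.
            conf[f'bdr.node_{p}.dsn'] = f'port={p} dbname={dbname}'
        configs[port] = conf
    return configs


for port, conf in plex_config([5440, 5441, 5442]).items():
    print(f'config for {port}:')
    for key, value in conf.items():
        print('  ' + key + ' = ' + repr(value))
```
&lt;br /&gt;
Adding a fourth port to the list produces three connections per node without any other change to the sketch.&lt;br /&gt;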
&lt;br /&gt;
=== 3-remote site simple Multi-Master Circular Replication ===&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &quot;circular replication&quot;, but this results in higher latency for changes as the number of nodes increases.&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5442'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
Regrettably this doesn't actually work yet because we don't cascade logical changes (yet).&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
Config left as an exercise for the reader.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to the other masters, so practical limits are &lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A future option to filter rows/tables for replication, which is planned, becomes essential with larger or more heavily updated databases.&lt;br /&gt;
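&lt;br /&gt;
The connection arithmetic for a symmetric cluster can be worked through with a small Python sketch. The function name is hypothetical; it assumes one connection per database per direction, as stated earlier for LLSR, and counts each direction separately.&lt;br /&gt;
&lt;br /&gt;
```python
def symmetric_cluster_connections(n_masters, n_databases=1):
    '''Count connections in a cluster where every master feeds every other.'''
    # Each master connects to the N-1 other masters, once per database.
    per_master = (n_masters - 1) * n_databases
    # Summed over all masters, counting each direction separately.
    total = n_masters * per_master
    return per_master, total


print(symmetric_cluster_connections(19))
print(symmetric_cluster_connections(19, n_databases=50))
```
&lt;br /&gt;
For the tested N=19 case with a single database this gives 18 connections per master and 342 across the cluster; with 50 databases the per-master figure rises to 900, which is why many separate databases lower the practical limit.&lt;br /&gt;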
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
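&lt;br /&gt;
The idea of handing out ranges of values to each node can be pictured with the Python sketch below. It is not the SeqAM API: both class names are hypothetical, and the shared allocator object stands in for whatever cross-node communication the Global Sequences plugin actually performs.&lt;br /&gt;
&lt;br /&gt;
```python
class RangeAllocator:
    '''Hypothetical stand-in for the cross-node agreement on ranges.'''

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        # Hand out the next unused block of values.
        start = self.next_start
        self.next_start += self.chunk_size
        return range(start, start + self.chunk_size)


class NodeSequence:
    '''Per-node sequence that draws values from its currently held range.'''

    def __init__(self, allocator):
        self.allocator = allocator
        self.values = iter(())

    def nextval(self):
        value = next(self.values, None)
        if value is None:
            # Local range exhausted: obtain a fresh, non-overlapping range.
            self.values = iter(self.allocator.allocate())
            value = next(self.values)
        return value


shared = RangeAllocator(chunk_size=3)
node_a, node_b = NodeSequence(shared), NodeSequence(shared)
a_values = [node_a.nextval() for _ in range(4)]
b_values = [node_b.nextval() for _ in range(4)]
print(a_values, b_values)
```
&lt;br /&gt;
Each node only needs to coordinate when its current range runs out, and the values drawn on different nodes never overlap, although they are not globally ordered.&lt;br /&gt;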
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, similar manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using last-update-wins strategy using timestamps. Should timestamps be identical, the tie is broken using system identifier from pg_control (currently).&lt;br /&gt;
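&lt;br /&gt;
A minimal Python sketch of last-update-wins with a tie-break follows. The Update tuple and the resolve function are hypothetical names, and the choice that the larger system identifier wins a tie is an assumption made for illustration; the text above only says that the identifier breaks the tie.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# sysid stands in for the system identifier from pg_control.
Update = namedtuple('Update', ['row', 'commit_ts', 'sysid'])


def resolve(local, remote):
    '''Pick the winner of two concurrent updates to the same row.'''
    # Later timestamp wins; on a tie the larger sysid wins (an assumption).
    return max(local, remote, key=lambda u: (u.commit_ts, u.sysid))


a = Update({'id': 1, 'name': 'from A'}, commit_ts=1000, sysid=42)
b = Update({'id': 1, 'name': 'from B'}, commit_ts=1000, sysid=77)
print(resolve(a, b).row['name'])
print(resolve(b, a).row['name'])
```
&lt;br /&gt;
The result does not depend on which update a node sees as local and which as remote, which is the property needed for all servers to end with identical results.&lt;br /&gt;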
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &quot;false conflicts&quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, the other being discarded according to last update wins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T16:01:53Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* 3-remote site Max Availability Multi-Master Plex */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, as well as reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code intended to become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &quot;clustering&quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &quot;streaming replication&quot;, i.e. physical log streaming replication (PLSR), from a user perspective. The main difference is that the receiving server is also a full master database that is not read-only and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to the downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master cause a series of row changes on the downstream master. These are likely to be applied at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
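&lt;br /&gt;
Why UPDATEs and DELETEs need an identifying key while INSERTs do not can be seen in this Python sketch. It models a table as a dictionary keyed by primary key; the function name and the change format are hypothetical, and conflict handling is left out.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_change(table, change):
    '''Apply one replicated row change to a dict keyed by primary key.'''
    kind, key, new_row = change
    if kind == 'INSERT':
        table[new_row['id']] = new_row  # no lookup of an old row is needed
    elif kind == 'UPDATE':
        table.pop(key)                  # locate the old row by its key
        table[new_row['id']] = new_row  # new key may differ from the old one
    elif kind == 'DELETE':
        table.pop(key)                  # locate the row to remove by its key


table = {}
changes = [
    ('INSERT', None, {'id': 1, 'name': 'alice'}),
    ('INSERT', None, {'id': 2, 'name': 'bob'}),
    ('UPDATE', 1, {'id': 10, 'name': 'alice'}),  # changes the primary key
    ('DELETE', 2, None),
]
for change in changes:
    apply_change(table, change)
print(table)
```
&lt;br /&gt;
A multi-row statement on the upstream master arrives as a series of such single-row changes, each of which is a keyed lookup on the downstream side.&lt;br /&gt;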
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
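&lt;br /&gt;
The name-based mapping can be pictured roughly as follows. This is a sketch under stated assumptions: the two dictionaries stand in for the system catalogs on each node, and the OIDs and the function name are made up for illustration.&lt;br /&gt;

```python
# Hypothetical catalogs: OIDs and names are made up for illustration.
upstream_catalog = {16384: ("public", "mytable"), 16390: ("myschema", "mytable")}
downstream_catalog = {("public", "mytable"): 24576, ("myschema", "mytable"): 24580}


def map_relation(upstream_oid):
    """Resolve an upstream OID to the downstream OID via schema and table name."""
    qualified_name = upstream_catalog[upstream_oid]
    if qualified_name not in downstream_catalog:
        raise LookupError("no matching table downstream: %s.%s" % qualified_name)
    return downstream_catalog[qualified_name]


print(map_relation(16384))
print(map_relation(16390))
```

The OIDs differ between nodes; only the schema-qualified name has to match.&lt;br /&gt;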
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, similar to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the database names on upstream and downstream master differ (not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...;                  # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
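&lt;br /&gt;
As a rough sizing aid, the comment on max_wal_senders above can be written as a one-line calculation. The function name is hypothetical; it only restates the rule &amp;quot;Y = max_logical_slots plus any physical streaming requirements&amp;quot;.&lt;br /&gt;

```python
def min_wal_senders(max_logical_slots, physical_standbys=0):
    """Y = max_logical_slots plus any physical streaming requirements."""
    return max_logical_slots + physical_standbys


# Three logical slots plus one physical standby need at least four senders.
print(min_wal_senders(3, physical_standbys=1))
```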
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers, and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to maintain autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master: the upstream master sends to the physical standby, which in turn sends to the downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
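&lt;br /&gt;
The three configurations above follow one pattern, so they could be generated rather than typed. The helper below is a hypothetical sketch that only reproduces the settings shown in this section (port, bdr.connections and the per-node dsn entries) and assumes every node runs on the same host.&lt;br /&gt;

```python
def plex_config(port, all_ports):
    """Return postgresql.conf lines connecting one node to every other node."""
    peers = [p for p in all_ports if p != port]
    lines = ["port = %d" % port]
    lines.append("bdr.connections='%s'" % ",".join("node_%d" % p for p in peers))
    for p in peers:
        lines.append("bdr.node_%d.dsn='port=%d dbname=postgres'" % (p, p))
    return lines


ports = [5440, 5441, 5442]
for port in ports:
    print("\n".join(plex_config(port, ports)))
    print()
```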
&lt;br /&gt;
=== 3-remote site simple Multi-Master Circular Replication ===&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &amp;quot;circular replication&amp;quot;, at the cost of higher latency for changes as the number of nodes increases.&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5442'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
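&lt;br /&gt;
The ring above can be sketched the same way: each node connects only to the next port in the list, and the last node wraps around to the first. The function name is hypothetical and all nodes are assumed to run on the same host.&lt;br /&gt;

```python
def ring_config(ports):
    """Map each port to the config lines connecting it to the next port in the ring."""
    configs = {}
    for i, port in enumerate(ports):
        upstream = ports[(i + 1) % len(ports)]   # wrap around at the end
        configs[port] = [
            "port = %d" % port,
            "bdr.connections='node_%d'" % upstream,
            "bdr.node_%d.dsn='port=%d dbname=postgres'" % (upstream, upstream),
        ]
    return configs


for port, lines in ring_config([5440, 5441, 5442]).items():
    print("\n".join(lines))
    print()
```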
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
Config left as an exercise for the reader.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to the other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
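&lt;br /&gt;
The connection arithmetic can be worked through as below. This is an illustrative calculation only; the function names are hypothetical, and it assumes one connection per replicated database for each pair of connected masters.&lt;br /&gt;

```python
def connections_per_master(n_masters, n_databases=1):
    """Connections one master makes to all the others, per the N-1 rule."""
    return (n_masters - 1) * n_databases


def connections_in_cluster(n_masters, n_databases=1):
    """Total connections across a symmetric cluster."""
    return n_masters * connections_per_master(n_masters, n_databases)


print(connections_per_master(19))      # 18
print(connections_in_cluster(19))      # 342
print(connections_in_cluster(19, 5))   # 1710
```

The total grows with the square of N, which is why the number of separate databases lowers the practical limit.&lt;br /&gt;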
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
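&lt;br /&gt;
One way to picture range-based global sequences is sketched below. This is not the SeqAM API: the classes are hypothetical, and a single in-process allocator stands in for whatever cross-node communication the real plugin uses to agree on ranges.&lt;br /&gt;

```python
import itertools


class RangeAllocator:
    """Stand-in for the cross-node agreement on which range belongs to whom."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.starts = itertools.count(1, chunk_size)

    def next_range(self):
        start = next(self.starts)
        return iter(range(start, start + self.chunk_size))


class NodeSequence:
    """Per-node sequence that draws values from its currently reserved range."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.values = allocator.next_range()

    def nextval(self):
        try:
            return next(self.values)
        except StopIteration:
            # Local range exhausted: reserve a fresh, non-overlapping one.
            self.values = self.allocator.next_range()
            return next(self.values)


allocator = RangeAllocator(chunk_size=3)
node_a = NodeSequence(allocator)
node_b = NodeSequence(allocator)
a_values = [node_a.nextval() for _ in range(4)]
b_values = [node_b.nextval() for _ in range(4)]
print(a_values, b_values)
```

Values are unique across nodes but not gap-free or globally ordered, which is the usual trade-off of handing out ranges.&lt;br /&gt;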
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, similar manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
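&lt;br /&gt;
A minimal sketch of such a resolver, assuming each row version carries a commit timestamp and the system identifier of the node that produced it. The field names are hypothetical, and the choice that the higher identifier wins a tie is an assumption; the text above does not say which direction is used.&lt;br /&gt;

```python
def resolve(local, remote):
    """Pick the winning row version: latest timestamp, then system identifier."""
    # Assumption: on equal timestamps the higher system identifier wins.
    return max(local, remote, key=lambda v: (v["commit_ts"], v["sysid"]))


node_a = {"value": "from A", "commit_ts": 1000, "sysid": 7001}
node_b = {"value": "from B", "commit_ts": 1000, "sysid": 7002}
winner = resolve(node_a, node_b)
print(winner["value"])
```

Because the comparison does not depend on argument order, every node that sees the same two versions picks the same winner.&lt;br /&gt;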
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, the other being discarded according to last-update-wins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T16:01:12Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* 3-remote site simple Multi-Master Plex */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
Exact mechanism for dissemination is not yet announced; key objective is to develop code with the objective of being core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
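&lt;br /&gt;
The connection count for a pair of servers can be worked through as below. This is only the arithmetic from the paragraph above, with a hypothetical function name; the bi-directional case assumes each server is also the downstream master of the other.&lt;br /&gt;

```python
def connections_needed(n_databases, bidirectional=False):
    """One connection per replicated database, per direction."""
    directions = 2 if bidirectional else 1
    return n_databases * directions


print(connections_needed(50))                       # 50
print(connections_needed(50, bidirectional=True))   # 100
```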
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parse/plan/executing SQL.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
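&lt;br /&gt;
A simplified picture of key-based apply, assuming the downstream table is modelled as a dictionary keyed by primary key. The change format and function name are hypothetical; a real apply process works on decoded WAL changes and index lookups, not dictionaries.&lt;br /&gt;

```python
def apply_change(table, change):
    """Apply one decoded row change to a table held as a dict keyed by primary key."""
    kind = change["kind"]
    if kind == "delete":
        del table[change["key"]]                    # locate the row by its old key
    elif kind == "update":
        del table[change["key"]]                    # the old key identifies the row
        table[change["new"]["id"]] = change["new"]  # the new key may differ
    else:
        raise ValueError("unsupported change kind: %s" % kind)


table = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}, 3: {"id": 3, "v": "c"}}
# One multi-row DELETE upstream arrives as a series of single-row changes,
# followed here by an UPDATE that changes the primary key from 3 to 30.
changes = [
    {"kind": "delete", "key": 1},
    {"kind": "delete", "key": 2},
    {"kind": "update", "key": 3, "new": {"id": 30, "v": "c"}},
]
for change in changes:
    apply_change(table, change)
print(table)
```

Without a unique key there would be no reliable way to find the row to change, which is why UPDATEs and DELETEs need one while INSERTs do not.&lt;br /&gt;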
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
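&lt;br /&gt;
The effect of shipping final values can be sketched as follows. The functions are hypothetical and model rows as dictionaries; the upstream side evaluates the expressions once, and the downstream side stores the result without re-evaluating anything.&lt;br /&gt;

```python
import random


def upstream_update(row):
    """Model of: UPDATE t SET n = n + 1, token = random()."""
    new_row = dict(row)
    new_row["n"] = row["n"] + 1          # reflexive expression resolved here
    new_row["token"] = random.random()   # volatile function evaluated once
    return new_row


def apply_downstream(table, key, replicated_row):
    """Store the replicated values as they are; nothing is re-evaluated."""
    table[key] = dict(replicated_row)


upstream = {1: {"n": 5, "token": None}}
downstream = {1: {"n": 5, "token": None}}
new_row = upstream_update(upstream[1])
upstream[1] = new_row
apply_downstream(downstream, 1, new_row)
print(downstream == upstream)
```

Statement-level replay would call random() again downstream and produce a different token; row-level apply cannot diverge in that way.&lt;br /&gt;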
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are:&lt;br /&gt;
* Secondary indexes may differ between nodes.&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor and autovacuum settings, may differ.&lt;br /&gt;
* Inheritance must be the same.&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the database name on upstream and downstream master differs (not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...                   # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry.&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to the upstream master. Security for LLSR is identical to physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory, which are reserved for each node at server start.&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That number may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to setting shared_buffers, and it should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could also be implemented, though it would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity:&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset):&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes:&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master: the upstream master sends to the physical standby, which in turn sends to the downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
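&lt;br /&gt;
The three configurations above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal Python sketch, not part of BDR; the helper name plex_config is hypothetical, and it assumes the node_PORT naming scheme and dbname=postgres used in this example.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```python
def plex_config(port, all_ports, dbname='postgres'):
    '''Return postgresql.conf lines for one node of an all-to-all plex.

    Hypothetical helper: node names follow the node_PORT scheme of the
    example, and every node is assumed to replicate the same database.
    '''
    # Every other node in the cluster is an upstream master of this node.
    peers = [p for p in all_ports if p != port]
    names = ','.join(f'node_{p}' for p in peers)
    lines = [f'port = {port}', f'bdr.connections={names!r}']
    for p in peers:
        dsn = f'port={p} dbname={dbname}'
        lines.append(f'bdr.node_{p}.dsn={dsn!r}')
    return lines


ports = [5440, 5441, 5442]
for port in ports:
    print(f'# config for {port}:')
    print('\n'.join(plex_config(port, ports)))
```
&lt;/pre&gt;
&lt;br /&gt;
Each node lists every other node in bdr.connections, which is what makes the topology all-to-all.&lt;br /&gt;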
&lt;br /&gt;
=== 3-remote site simple Multi-Master Circular Replication ===&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &amp;quot;circular replication&amp;quot;. Each node needs only one connection, but the latency for changes grows as the number of nodes increases.&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5442'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
A symmetric cluster is one in which all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
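&lt;br /&gt;
The connection arithmetic above can be checked with a few lines of Python. This is an illustrative sketch only; the function names are hypothetical, and it assumes that each master opens one connection per replicated database to every other master.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```python
def connections_per_node(n_masters, n_databases=1):
    # Each master connects to every other master once per replicated database.
    return (n_masters - 1) * n_databases


def connections_in_cluster(n_masters, n_databases=1):
    # Every one of the N masters opens its own set of connections.
    return n_masters * connections_per_node(n_masters, n_databases)


for n in (3, 19, 100):
    print(n, connections_per_node(n), connections_in_cluster(n))
```
&lt;/pre&gt;
&lt;br /&gt;
The per-node count grows linearly with N while the cluster-wide total grows quadratically, and both are multiplied by the number of replicated databases.&lt;br /&gt;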
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
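&lt;br /&gt;
To illustrate the idea of handing out ranges of values, here is a toy Python model. It is not the SeqAM API: the class names RangeAllocator and NodeSequence are hypothetical, and a single in-process allocator stands in for whatever cross-node communication the real plugin uses.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```python
class RangeAllocator:
    '''Hypothetical stand-in for the cross-node agreement on value ranges.'''

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        # Hand out the next unused, non-overlapping range (inclusive bounds).
        start = self.next_start
        self.next_start += self.chunk_size
        return start, start + self.chunk_size - 1


class NodeSequence:
    '''Per-node sequence that draws values from the range it currently holds.'''

    def __init__(self, allocator):
        self.allocator = allocator
        self.current, self.last = allocator.allocate()

    def nextval(self):
        if self.current > self.last:
            # Local range exhausted: obtain a fresh one.
            self.current, self.last = self.allocator.allocate()
        value = self.current
        self.current += 1
        return value


allocator = RangeAllocator(chunk_size=3)
node_a = NodeSequence(allocator)
node_b = NodeSequence(allocator)
print([node_a.nextval() for _ in range(4)])  # [1, 2, 3, 7]
print([node_b.nextval() for _ in range(2)])  # [4, 5]
```
&lt;/pre&gt;
&lt;br /&gt;
Values are unique across nodes but not gap-free or globally ordered, which is the usual trade-off of range-based allocation.&lt;br /&gt;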
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes:&lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved idempotently and in the same manner on every server, so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
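&lt;br /&gt;
The resolution rule can be sketched in Python as follows. This is an illustration under stated assumptions, not BDR code: the RowVersion tuple and the function name are hypothetical, and the choice that the higher system identifier wins a tie is an assumption, since the text above only says that the identifier breaks the tie.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```python
from collections import namedtuple

# One version of a row: commit timestamp, system identifier of the node
# that made the change, and the row contents (all names hypothetical).
RowVersion = namedtuple('RowVersion', ['commit_ts', 'sysid', 'data'])


def resolve_update_conflict(local, remote):
    '''Return the row version that should be kept.'''
    # The later timestamp wins; on equal timestamps the higher system
    # identifier wins (the direction of the tie-break is an assumption).
    return max(local, remote, key=lambda v: (v.commit_ts, v.sysid))


a = RowVersion(commit_ts=100, sysid=7001, data={'balance': 10})
b = RowVersion(commit_ts=100, sysid=7002, data={'balance': 20})
later = RowVersion(commit_ts=101, sysid=7001, data={'balance': 30})

# The outcome does not depend on which version is local and which is remote.
print(resolve_update_conflict(a, b) == resolve_update_conflict(b, a))
print(resolve_update_conflict(b, later).data)
```
&lt;/pre&gt;
&lt;br /&gt;
Because the rule depends only on the two versions and not on which one arrived first, every node picks the same winner.&lt;br /&gt;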
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on.&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, with the other discarded according to last-update-wins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes, since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T15:59:35Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* 3-remote site simple Multi-Master Circular Replication */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, as well as reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, with a total of 16 separate commits on these and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that is intended to become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features:&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. From a user perspective this is similar in many ways to &amp;quot;streaming replication&amp;quot;, i.e. physical log streaming replication (PLSR). The main difference is that the receiving server is also a full master database that is not read-only and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases, then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it also has message header overheads, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to run at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
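&lt;br /&gt;
The behaviour described above (accumulate changes per transaction, send at commit, never send aborted transactions) can be modelled with a small Python sketch. The class ChangeBuffer is hypothetical and greatly simplified; it ignores spilling to disk and is not how the decoding code is actually written.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict


class ChangeBuffer:
    '''Toy model: buffer changes per transaction, release them at commit.'''

    def __init__(self):
        self.pending = defaultdict(list)  # changes seen so far, keyed by xid
        self.outbox = []                  # (xid, changes) pairs in commit order

    def change(self, xid, row_change):
        # Changes from concurrent transactions arrive interleaved.
        self.pending[xid].append(row_change)

    def commit(self, xid):
        # Only at commit are the changes of a transaction sent downstream.
        self.outbox.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are dropped and never sent.
        self.pending.pop(xid, None)


buf = ChangeBuffer()
buf.change(1, 'INSERT a')
buf.change(2, 'UPDATE b')
buf.change(3, 'DELETE c')
buf.change(1, 'INSERT d')
buf.commit(2)
buf.abort(3)
buf.commit(1)
print(buf.outbox)
```
&lt;/pre&gt;
&lt;br /&gt;
Transaction 2 reaches the outbox before transaction 1 even though it started later, because the downstream master receives whole transactions in commit order.&lt;br /&gt;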
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. For example, changes from Public.MyTable will go to Public.MyTable and changes from MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified in the original SQL, since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node; otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
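&lt;br /&gt;
The name-based mapping can be pictured with a short Python sketch. The dictionaries stand in for the system catalogs of each node and the function name map_relation is hypothetical; the OID values are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
```python
def map_relation(upstream_oid, upstream_catalog, downstream_catalog):
    '''Translate an upstream table OID into the matching downstream OID.

    Each catalog is a dict of OID to (schema, table). Both are hypothetical
    stand-ins for the system catalogs of the two nodes.
    '''
    qualified_name = upstream_catalog[upstream_oid]
    by_name = {name: oid for oid, name in downstream_catalog.items()}
    if qualified_name not in by_name:
        # Mirrors the ERROR raised when table definitions are out of sync.
        raise LookupError('no matching table downstream: %s.%s' % qualified_name)
    return by_name[qualified_name]


upstream = {16384: ('public', 'mytable'), 16390: ('myschema', 'mytable')}
downstream = {24576: ('myschema', 'mytable'), 24580: ('public', 'mytable')}

print(map_relation(16384, upstream, downstream))
print(map_relation(16390, upstream, downstream))
```
&lt;/pre&gt;
&lt;br /&gt;
The OIDs differ between nodes, so only the schema-qualified name ties the two tables together; search_path plays no part.&lt;br /&gt;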
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are:&lt;br /&gt;
* Secondary indexes may differ between nodes.&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor and autovacuum settings, may differ.&lt;br /&gt;
* Inheritance must be the same.&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the database name on upstream and downstream master differ (not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...;                  # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results on upstream and downstream (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain in which the upstream master sends to a physical standby, which in turn sends to a downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
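&lt;br /&gt;
The per-node settings for an all-to-all topology follow a regular pattern: each node lists every other node in bdr.connections and has one dsn entry per peer. A minimal sketch that generates those lines, assuming the node_PORT naming used above (Python is used purely for illustration; mesh_config is a hypothetical helper, not part of BDR):&lt;br /&gt;

```python
def mesh_config(ports, dbname='postgres'):
    """Return {port: [config lines]} for an all-to-all topology.

    Hypothetical helper: it only formats the settings shown in the
    list above and does not talk to any server.
    """
    configs = {}
    for port in ports:
        # Every other node is an upstream master of this node.
        peers = [p for p in ports if p != port]
        lines = ['port = %d' % port]
        names = ','.join('node_%d' % p for p in peers)
        lines.append("bdr.connections='%s'" % names)
        for p in peers:
            lines.append(
                "bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


if __name__ == '__main__':
    for port, lines in sorted(mesh_config([5440, 5441, 5442]).items()):
        print('config for %d:' % port)
        for line in lines:
            print('  ' + line)
```

Each generated block names the node's own port once and every other port exactly once in both bdr.connections and the dsn entries, which is a quick consistency check for hand-written configs.&lt;br /&gt;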
&lt;br /&gt;
=== 3-remote site simple Multi-Master Circular Replication ===&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &amp;quot;circular replication&amp;quot;. It needs fewer connections, but the latency for changes grows as the number of nodes increases.&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5442'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
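&lt;br /&gt;
In the circular layout each node has exactly one entry in bdr.connections. A minimal sketch that reproduces the three blocks above for any list of ports, assuming each node connects to the next port in the list and the last wraps around to the first (ring_config is a hypothetical helper written in Python for illustration only):&lt;br /&gt;

```python
def ring_config(ports, dbname='postgres'):
    """Return {port: [config lines]} for a circular topology.

    Hypothetical helper: node i lists node i+1 as its only
    connection, and the last node wraps around to the first.
    """
    configs = {}
    count = len(ports)
    for i, port in enumerate(ports):
        upstream = ports[(i + 1) % count]
        configs[port] = [
            'port = %d' % port,
            "bdr.connections='node_%d'" % upstream,
            "bdr.node_%d.dsn='port=%d dbname=%s'"
            % (upstream, upstream, dbname),
        ]
    return configs


if __name__ == '__main__':
    for port, lines in sorted(ring_config([5440, 5441, 5442]).items()):
        print('config for %d:' % port)
        for line in lines:
            print('  ' + line)
```

With N nodes a change may need up to N-1 hops to reach every node, which is the latency cost mentioned above.&lt;br /&gt;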
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
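&lt;br /&gt;
The connection arithmetic can be sketched as follows, assuming one connection per replicated database per upstream master (function names are illustrative only):&lt;br /&gt;

```python
def connections_per_master(n_masters, n_databases=1):
    """Connections one master opens to its N-1 peers."""
    return (n_masters - 1) * n_databases


def total_connections(n_masters, n_databases=1):
    """Connections across the whole symmetric cluster."""
    return n_masters * connections_per_master(n_masters, n_databases)


if __name__ == '__main__':
    # The tested N=19 case with a single replicated database.
    print(connections_per_master(19))
    print(total_connections(19))
```

For N=19 and one database this gives 18 connections per master and 342 across the cluster. The total grows roughly with the square of N and linearly with the number of databases.&lt;br /&gt;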
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
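&lt;br /&gt;
The idea of handing each node its own range of values can be sketched as below. This is only an illustration under stated assumptions: RangeAllocator and NodeSequence are hypothetical names, the single in-process allocator stands in for whatever cross-node communication the real plugin performs, and none of this is the SeqAM API itself.&lt;br /&gt;

```python
class RangeAllocator:
    """Hands out non-overlapping chunks of values (hypothetical)."""

    def __init__(self, start=1, chunk_size=1000):
        self.next_start = start
        self.chunk_size = chunk_size

    def allocate(self):
        """Return the (first, last) values of a fresh chunk."""
        first = self.next_start
        self.next_start += self.chunk_size
        return first, first + self.chunk_size - 1


class NodeSequence:
    """Per-node view of a global sequence (hypothetical)."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.current = None
        self.last = None

    def nextval(self):
        # Fetch a new chunk on first use or when the chunk is spent.
        if self.current is None or self.current == self.last:
            self.current, self.last = self.allocator.allocate()
        else:
            self.current += 1
        return self.current


if __name__ == '__main__':
    shared = RangeAllocator(start=1, chunk_size=3)
    node_a = NodeSequence(shared)
    node_b = NodeSequence(shared)
    print([node_a.nextval() for _ in range(4)])
    print([node_b.nextval() for _ in range(2)])
```

Because each node draws only from chunks it has been granted, two nodes never hand out the same value, at the cost of gaps and values that are not ordered across nodes.&lt;br /&gt;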
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, similar manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
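&lt;br /&gt;
A minimal sketch of this rule, written in Python purely for illustration. The Change tuple and its field names are hypothetical, and the choice that the higher system identifier wins a tie is an assumption; the text above does not say which direction the tie-break takes.&lt;br /&gt;

```python
from collections import namedtuple

# Hypothetical record of one row version competing in a conflict.
Change = namedtuple(
    'Change', ['commit_timestamp', 'system_identifier', 'row'])


def resolve(local, remote):
    """Pick the winning change under last-update-wins.

    Later timestamp wins; on equal timestamps the higher system
    identifier wins (an assumed tie-break direction).
    """
    return max(
        local, remote,
        key=lambda c: (c.commit_timestamp, c.system_identifier))


if __name__ == '__main__':
    a = Change(100, 1, {'balance': 10})
    b = Change(100, 2, {'balance': 20})
    print(resolve(a, b).row)
```

Because the comparison depends only on values carried with the change, every node that sees the same pair of changes picks the same winner regardless of arrival order.&lt;br /&gt;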
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, the other being discarded according to last update wins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T15:55:14Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Bi-Directional Replication Use Cases */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, as well as reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code intended to become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
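&lt;br /&gt;
The connection count described above is simple arithmetic; a small sketch follows (llsr_connections is an illustrative name, and the doubling for the two-way case assumes each server is configured as the downstream master of the other):&lt;br /&gt;

```python
def llsr_connections(n_databases, bidirectional=False):
    """Connections needed between two servers.

    One connection per replicated database per direction.
    """
    directions = 2 if bidirectional else 1
    return n_databases * directions


if __name__ == '__main__':
    print(llsr_connections(50))
    print(llsr_connections(50, bidirectional=True))
```

For the 50-database example this gives 50 connections one way and 100 when changes flow both ways, so max_wal_senders and max_logical_slots need to be sized with the database count in mind.&lt;br /&gt;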
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
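&lt;br /&gt;
The buffering and ordering behaviour described in the last two paragraphs can be sketched as follows. This is an illustration only: ChangeBuffer and its methods are hypothetical names, changes are plain strings, and spilling to disk is left out.&lt;br /&gt;

```python
class ChangeBuffer:
    """Collects changes per transaction and releases them at commit.

    Hypothetical sketch: real change records, memory limits and
    spill-to-disk handling are omitted.
    """

    def __init__(self):
        self.pending = {}   # xid mapped to its changes so far
        self.sent = []      # (xid, changes) in commit order

    def record(self, xid, change):
        self.pending.setdefault(xid, []).append(change)

    def abort(self, xid):
        # Aborted transactions are dropped and never sent.
        self.pending.pop(xid, None)

    def commit(self, xid):
        # The whole transaction is sent as one unit at commit time.
        self.sent.append((xid, self.pending.pop(xid, [])))


if __name__ == '__main__':
    buf = ChangeBuffer()
    buf.record(1, 'insert a')
    buf.record(2, 'update b')
    buf.record(1, 'delete c')
    buf.commit(2)
    buf.commit(1)
    print(buf.sent)
```

Transaction 2 is sent before transaction 1 here even though transaction 1 started first, because the downstream order follows commit order rather than the order in which individual changes were made.&lt;br /&gt;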
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the database name on upstream and downstream master differ (not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...;                  # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
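&lt;br /&gt;
Since every node in an all-to-all layout lists every other node, the settings above follow a regular pattern. The following is a minimal sketch that generates them, assuming the node_PORT naming used here; the helper name is hypothetical and only the bdr.connections and bdr.NODENAME.dsn settings from this page are emitted.&lt;br /&gt;
&lt;br /&gt;
```python
def all_to_all_config(ports, dbname='postgres'):
    """Return {port: [config lines]} where every node lists every other node (illustrative helper)."""
    configs = {}
    for port in ports:
        others = [p for p in ports if p != port]
        lines = ['port = %d' % port]
        names = ','.join('node_%d' % p for p in others)
        lines.append("bdr.connections='%s'" % names)
        for p in others:
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


for port, lines in all_to_all_config([5440, 5441, 5442]).items():
    print('config for %d:' % port)
    for line in lines:
        print('  ' + line)
```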
&lt;br /&gt;
=== 3-remote site simple Multi-Master Circular Replication ===&lt;br /&gt;
&lt;br /&gt;
A simpler configuration uses &amp;quot;circular replication&amp;quot;, at the cost of higher latency for changes as the number of nodes increases.&lt;br /&gt;
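&lt;br /&gt;
The latency difference can be illustrated by counting hops. This sketch assumes a one-way ring in which each node forwards changes only to its successor, and treats every hop as costing about the same delay; the function names are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
def ring_hops(n, origin, target):
    """Hops for a change to travel from origin to target in a one-way ring of n nodes."""
    return (target - origin) % n


def worst_case_hops(n, topology):
    """Worst-case hops between two distinct nodes for 'ring' or 'all_to_all'."""
    if topology == 'all_to_all':
        return 1
    return max(ring_hops(n, 0, t) for t in range(1, n))


for n in (3, 5, 10):
    print(n, worst_case_hops(n, 'ring'), worst_case_hops(n, 'all_to_all'))
```
&lt;br /&gt;
Under these assumptions the worst case in a ring grows as N-1 hops, while an all-to-all layout always needs a single hop.&lt;br /&gt;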
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5442'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5441'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master Plex ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or less if you have many separate databases.&lt;br /&gt;
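&lt;br /&gt;
The connection counts can be worked through as follows. This sketch assumes one connection per replicated database for each upstream master, as described for LLSR; the function names are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_master(n_masters, n_databases=1):
    """Inbound connections one master needs: one per other master per database."""
    return (n_masters - 1) * n_databases


def total_connections(n_masters, n_databases=1):
    """Connections across the whole symmetric cluster."""
    return n_masters * connections_per_master(n_masters, n_databases)


print(connections_per_master(19))      # 18
print(total_connections(19))           # 342
print(connections_per_master(100, 5))  # 495
```
&lt;br /&gt;
The total grows roughly with the square of N, which is why the number of separate databases lowers the practical limit.&lt;br /&gt;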
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects, which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
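&lt;br /&gt;
The idea of handing out ranges of values per node can be sketched as below. This is a toy model only: the class names are hypothetical, a single in-process allocator stands in for the cross-node communication, and nothing here reflects the actual SeqAM API.&lt;br /&gt;
&lt;br /&gt;
```python
class RangeAllocator:
    """Hands out disjoint value ranges; stands in for cross-node coordination (hypothetical)."""

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        start = self.next_start
        self.next_start += self.chunk_size
        return start, start + self.chunk_size - 1


class NodeSequence:
    """Per-node sequence that draws values from its currently held range."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.current, self.last = allocator.allocate()

    def nextval(self):
        if self.current == self.last + 1:  # range exhausted, fetch a new one
            self.current, self.last = self.allocator.allocate()
        value = self.current
        self.current += 1
        return value


allocator = RangeAllocator(chunk_size=3)
node_a, node_b = NodeSequence(allocator), NodeSequence(allocator)
a_values = [node_a.nextval() for _ in range(4)]
b_values = [node_b.nextval() for _ in range(4)]
print(a_values)  # [1, 2, 3, 7]
print(b_values)  # [4, 5, 6, 10]
```
&lt;br /&gt;
Values are unique across nodes but not gap-free or globally ordered, which is the usual trade-off of range-based schemes.&lt;br /&gt;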
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes&lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in a consistent, idempotent manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
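&lt;br /&gt;
A minimal sketch of this rule follows. The source does not say which system identifier wins a tie, so the larger one is assumed here; the type and function names are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# commit_ts: commit timestamp of the change; sysid: system identifier of the originating node
RowVersion = namedtuple('RowVersion', ['commit_ts', 'sysid', 'values'])


def resolve_update_conflict(local, remote):
    """Last-update-wins; equal timestamps fall back to the system identifier.

    Assumption: the larger system identifier wins a tie.
    """
    return max(local, remote, key=lambda v: (v.commit_ts, v.sysid))


local = RowVersion(100, 7001, {'balance': 10})
remote = RowVersion(100, 7002, {'balance': 20})
print(resolve_update_conflict(local, remote).values)  # {'balance': 20}
print(resolve_update_conflict(remote, local).values)  # {'balance': 20}
```
&lt;br /&gt;
Because the comparison depends only on the two versions and not on which node evaluates it, every node picks the same winner.&lt;br /&gt;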
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, the other being discarded according to last-update-wins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
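&lt;br /&gt;
A false conflict can be illustrated with a toy model in which each change carries the full new row image and a timestamp. The field names and values are made up for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def resolve_row_level(change_a, change_b):
    """Keep the whole row image of the later change; the other change is discarded."""
    return max(change_a, change_b, key=lambda c: c['ts'])['row']


# Both nodes start from the same row and change different columns.
node_a_change = {'ts': 10, 'row': {'id': 1, 'name': 'Anne', 'email': 'ann@example.com'}}
node_b_change = {'ts': 11, 'row': {'id': 1, 'name': 'Ann', 'email': 'ann@example.org'}}

print(resolve_row_level(node_a_change, node_b_change))
# {'id': 1, 'name': 'Ann', 'email': 'ann@example.org'}  (the name change from node A is lost)
```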
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T15:49:10Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Data Conflicts */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, as well as reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code suitable for inclusion as core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
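&lt;br /&gt;
The accumulate-then-send behaviour can be modelled with a small buffer keyed by transaction id. This is a toy sketch with hypothetical names; it ignores spilling to disk and is not the actual decoding code.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ChangeBuffer:
    """Collects per-transaction changes and releases them only at commit (toy model)."""

    def __init__(self):
        self.pending = defaultdict(list)
        self.outbox = []  # committed transactions, in commit order

    def change(self, xid, change):
        self.pending[xid].append(change)

    def commit(self, xid):
        self.outbox.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        self.pending.pop(xid, None)  # aborted transactions are never sent


buf = ChangeBuffer()
buf.change(1, 'INSERT a')
buf.change(2, 'INSERT b')
buf.change(3, 'DELETE c')
buf.change(1, 'UPDATE a')
buf.commit(2)
buf.abort(3)
buf.commit(1)
print(buf.outbox)  # [(2, ['INSERT b']), (1, ['INSERT a', 'UPDATE a'])]
```
&lt;br /&gt;
Transaction 2 is sent first because it committed first, even though transaction 1 started earlier; transaction 3 never appears.&lt;br /&gt;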
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
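&lt;br /&gt;
The retry behaviour can be pictured as a simple loop. The exception class, the attempt limit and the function names below are assumptions made for illustration; the source does not state whether retries are bounded.&lt;br /&gt;
&lt;br /&gt;
```python
class DeadlockDetected(Exception):
    """Stand-in for the deadlock error raised during apply (hypothetical)."""


def apply_with_retry(apply_txn, max_attempts=5):
    """Re-run apply_txn after a deadlock; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return apply_txn(), attempt
        except DeadlockDetected:
            if attempt == max_attempts:
                raise


failures_left = [2]  # simulate two deadlocks before success


def flaky_apply():
    if failures_left[0] != 0:
        failures_left[0] -= 1
        raise DeadlockDetected()
    return 'applied'


print(apply_with_retry(flaky_apply))  # ('applied', 3)
```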
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
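&lt;br /&gt;
The name-based mapping can be sketched with two lookups: upstream OID to qualified name, then qualified name to downstream OID. The OIDs and helper name are hypothetical; each node assigns its own identifiers.&lt;br /&gt;
&lt;br /&gt;
```python
def downstream_relation(upstream_oid, upstream_names, downstream_oids):
    """Resolve an upstream table OID to the downstream OID via schema.table name."""
    qualified_name = upstream_names[upstream_oid]
    if qualified_name not in downstream_oids:
        raise LookupError('no matching table on downstream: %s.%s' % qualified_name)
    return downstream_oids[qualified_name]


# Hypothetical OIDs; each node assigns its own.
upstream_names = {16384: ('public', 'mytable'), 16390: ('myschema', 'mytable')}
downstream_oids = {('public', 'mytable'): 24601, ('myschema', 'mytable'): 24617}

print(downstream_relation(16384, upstream_names, downstream_oids))  # 24601
print(downstream_relation(16390, upstream_names, downstream_oids))  # 24617
```
&lt;br /&gt;
A missing table on the downstream side raises an error in this sketch, mirroring the requirement that definitions match on each node.&lt;br /&gt;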
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the database name on upstream and downstream master differ (not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...;                  # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain of upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
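&lt;br /&gt;
The per-node settings above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal Python sketch and not part of BDR: the helper name mesh_config is hypothetical, and it assumes all nodes run on one host, replicate the database postgres, and are named node_PORT as in the example.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_config(ports, dbname="postgres"):
    """Return {port: [config lines]} for an all-to-all mesh of local nodes."""
    configs = {}
    for port in ports:
        # Every node connects to all the other nodes, never to itself.
        peers = [p for p in ports if p != port]
        lines = ["port = %d" % port]
        names = ",".join("node_%d" % p for p in peers)
        lines.append("bdr.connections='%s'" % names)
        for p in peers:
            # One dsn entry is needed for each name listed in bdr.connections.
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


if __name__ == "__main__":
    for port, lines in mesh_config([5440, 5441, 5442]).items():
        print("# config for %d" % port)
        print("\n".join(lines))
```
&lt;br /&gt;
Generating the lines from one list of ports avoids copy-and-paste slips such as a node listing itself in bdr.connections.&lt;br /&gt;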
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
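&lt;br /&gt;
The connection arithmetic can be made concrete with a short Python sketch. It assumes, as stated elsewhere in this guide, that one connection is needed per replicated database in each direction; the function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_master(n_masters, n_databases=1):
    """Connections one master opens to its N-1 upstream peers."""
    return (n_masters - 1) * n_databases


def connections_in_cluster(n_masters, n_databases=1):
    """Total one-way connections in a fully connected cluster."""
    return n_masters * connections_per_master(n_masters, n_databases)


if __name__ == "__main__":
    for n, dbs in [(3, 1), (19, 1), (19, 10)]:
        print(n, dbs, connections_per_master(n, dbs), connections_in_cluster(n, dbs))
```
&lt;br /&gt;
For the tested N=19 case with a single database, this gives 18 connections per master and 342 across the cluster; with 10 databases the per-master figure rises to 180.&lt;br /&gt;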
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
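&lt;br /&gt;
The idea of handing each node its own range of values can be illustrated with a small Python model. This is only a sketch under stated assumptions: RangeAllocator stands in for whatever cross-node agreement the real plugin uses (not described here), the class names and chunk size are hypothetical, and everything runs in one process.&lt;br /&gt;
&lt;br /&gt;
```python
class RangeAllocator:
    """Stand-in for the cross-node agreement step: hands out disjoint chunks."""

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        start = self.next_start
        self.next_start += self.chunk_size
        return range(start, start + self.chunk_size)


class NodeSequence:
    """Per-node view of a global sequence, drawing values from its own chunk."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.values = iter(())

    def nextval(self):
        value = next(self.values, None)
        if value is None:
            # Local chunk exhausted: obtain a fresh range, then continue.
            self.values = iter(self.allocator.allocate())
            value = next(self.values)
        return value


if __name__ == "__main__":
    shared = RangeAllocator(chunk_size=3)
    node_a, node_b = NodeSequence(shared), NodeSequence(shared)
    print([node_a.nextval(), node_b.nextval(), node_a.nextval(), node_b.nextval()])
```
&lt;br /&gt;
Because chunks never overlap, values drawn on different nodes cannot collide, at the cost of gaps and of values that are not globally ordered.&lt;br /&gt;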
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, similar manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
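&lt;br /&gt;
The last-update-wins rule can be sketched in a few lines of Python. This is an illustration only: the RowVersion type and resolve function are hypothetical, and the direction of the tie-break (higher system identifier wins) is an assumption, since the text above does not specify it.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# One version of a row: commit timestamp, originating system identifier, data.
RowVersion = namedtuple("RowVersion", ["commit_ts", "sysid", "data"])


def resolve(local, remote):
    """Pick the winning version; every node must reach the same answer."""
    # Compare by timestamp first, then by system identifier to break ties.
    # ASSUMPTION: the higher system identifier wins a tie.
    return max(local, remote, key=lambda v: (v.commit_ts, v.sysid))


if __name__ == "__main__":
    older = RowVersion(commit_ts=100, sysid=1, data="from A")
    newer = RowVersion(commit_ts=200, sysid=2, data="from B")
    print(resolve(older, newer).data)
```
&lt;br /&gt;
The result does not depend on which version is local and which is remote, which is what lets all servers end with identical rows.&lt;br /&gt;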
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors because of primary keys, unique indexes and exclusion constraints when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, with the other discarded according to last-update-wins.&lt;br /&gt;
&lt;br /&gt;
Changing unlogged and logged tables in same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T15:45:16Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;unlogged&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N+ 1 are resolved to their final value and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
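&lt;br /&gt;
The buffering behaviour described above (accumulate per transaction, send only at commit, never send aborted work) can be modelled with a short Python sketch. The class and method names are hypothetical and do not reflect the actual implementation; spilling to disk is deliberately left out.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class CommitOrderBuffer:
    """Collects per-transaction changes and releases them in commit order."""

    def __init__(self):
        self.pending = defaultdict(list)  # changes seen so far, keyed by xid
        self.outbox = []                  # (xid, changes) pairs in commit order

    def change(self, xid, change):
        self.pending[xid].append(change)

    def commit(self, xid):
        # Only committed transactions are ever queued for sending.
        self.outbox.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted work is dropped and never reaches the downstream master.
        self.pending.pop(xid, None)


if __name__ == "__main__":
    buf = CommitOrderBuffer()
    buf.change(1, "INSERT a")
    buf.change(2, "INSERT b")
    buf.change(1, "UPDATE a")
    buf.commit(2)
    buf.abort(1)
    print(buf.outbox)
```
&lt;br /&gt;
Transaction 2 started after transaction 1 but committed first, so it is the first (and here the only) transaction sent downstream.&lt;br /&gt;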
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the databasename on upstream and downstream master differ. (Not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...;                  # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain of upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
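&lt;br /&gt;
The intended pattern in the listings above is that each node names every other node in bdr.connections and supplies one dsn entry per name. A minimal Python sketch of that pattern follows, assuming the node_PORT naming used here; the helper mesh_config is hypothetical and is not part of BDR:&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_config(port, all_ports, dbname='postgres'):
    """Return postgresql.conf lines for one node of an all-to-all mesh."""
    # Every other node is an upstream master of this node.
    peers = [p for p in all_ports if p != port]
    names = ','.join('node_%d' % p for p in peers)
    lines = ['port = %d' % port, "bdr.connections = '%s'" % names]
    # One dsn entry is needed for each name listed in bdr.connections.
    for p in peers:
        lines.append("bdr.node_%d.dsn = 'port=%d dbname=%s'" % (p, p, dbname))
    return lines


if __name__ == '__main__':
    ports = [5440, 5441, 5442]
    for port in ports:
        print('# config for %d' % port)
        print('\n'.join(mesh_config(port, ports)))
```
&lt;br /&gt;
Generating the settings this way avoids copy-and-paste slips when the number of nodes grows.&lt;br /&gt;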
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to the other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
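&lt;br /&gt;
The connection arithmetic can be sketched as follows, assuming one connection per replicated database per upstream master (as described for LLSR); the function name is illustrative only:&lt;br /&gt;
&lt;br /&gt;
```python
def connection_counts(n_masters, n_databases=1):
    """Count apply connections in a symmetric (all-to-all) cluster."""
    # Each master pulls from every other master, once per replicated database.
    per_master = (n_masters - 1) * n_databases
    # Summed over all masters; this grows quadratically with cluster size.
    total = n_masters * per_master
    return per_master, total


if __name__ == '__main__':
    for n in (2, 3, 19):
        print(n, connection_counts(n))
```
&lt;br /&gt;
For the tested N=19 case with a single database this gives 18 connections per master and 342 across the cluster.&lt;br /&gt;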
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
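&lt;br /&gt;
The idea of handing out ranges of values per node can be sketched as a toy in-process model. This is not the SeqAM API: RangeAllocator merely stands in for whatever cross-node communication agrees on ranges, and all class names and the chunk size are assumptions made for illustration:&lt;br /&gt;
&lt;br /&gt;
```python
class RangeAllocator:
    """Stands in for the cross-node agreement on who owns which values."""

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        # Hand out the next unused, non-overlapping range [start, end].
        start = self.next_start
        self.next_start += self.chunk_size
        return start, start + self.chunk_size - 1


class NodeSequence:
    """Per-node view of a global sequence."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.current = 0
        self.end = 0  # empty range, forces an allocation on first use

    def nextval(self):
        if self.current == self.end:
            # Local range exhausted: obtain a fresh range from the cluster.
            start, self.end = self.allocator.allocate()
            self.current = start - 1
        self.current += 1
        return self.current
```
&lt;br /&gt;
Because ranges never overlap, two nodes can call nextval() independently without producing the same value; the trade-off is that values are unique but not globally ordered.&lt;br /&gt;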
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, consistent manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
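&lt;br /&gt;
A minimal sketch of such a resolver is shown below. It assumes that the higher system identifier wins a tie, which the text above does not specify; RowVersion and resolve are illustrative names, not BDR functions:&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# sysid models the system identifier from pg_control; all names are illustrative.
RowVersion = namedtuple('RowVersion', ['commit_ts', 'sysid', 'values'])


def resolve(local, remote):
    """Return the winning version under last-update-wins."""
    # Compare timestamps first; the system identifier only breaks exact ties.
    # Assumption: the higher identifier wins a tie.
    return max(local, remote, key=lambda v: (v.commit_ts, v.sysid))
```
&lt;br /&gt;
The important property is that the result does not depend on argument order, so every node picks the same winner regardless of which version it saw first.&lt;br /&gt;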
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
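&lt;br /&gt;
As a rough illustration of that rule, the sketch below models a table as a Python dict keyed by primary key. The function name and the counter key are hypothetical and do not correspond to BDR internals:&lt;br /&gt;
&lt;br /&gt;
```python
def apply_update(table, key, new_values, stats):
    """Apply a replicated UPDATE to a table modelled as a dict of rows."""
    if key not in table:
        # Row is gone locally: presume a Delete/Update conflict,
        # count it and discard the update.
        stats['delete_update_conflicts'] = stats.get('delete_update_conflicts', 0) + 1
        return False
    table[key].update(new_values)
    return True
```
&lt;br /&gt;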
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on.&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, just in terms of the row update. Such conflicts will result in just one of those changes being made, the other being discarded according to last update wins.&lt;br /&gt;
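&lt;br /&gt;
A small sketch of a false conflict, assuming each update is modelled as a (commit timestamp, changed columns) pair against the same base row; the names are illustrative only:&lt;br /&gt;
&lt;br /&gt;
```python
def resolve_row_level(base_row, update_a, update_b):
    """Resolve two concurrent updates of one row at row level."""
    # Each update is a (commit_ts, changed_columns) pair against the same base row.
    winner = max(update_a, update_b, key=lambda u: u[0])
    # Only the winner is applied; the loser's columns keep their base values.
    return dict(base_row, **winner[1])


if __name__ == '__main__':
    base = {'id': 1, 'name': 'old', 'email': 'old@example.com'}
    print(resolve_row_level(base, (10, {'name': 'new'}),
                            (11, {'email': 'new@example.com'})))
```
&lt;br /&gt;
Here the change to name is lost even though it touched a different column from the winning update.&lt;br /&gt;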
&lt;br /&gt;
Changing unlogged and logged tables in the same transaction can result in apparently strange outcomes since the unlogged tables aren't replicated.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T15:41:23Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. There have been various additional development contributions from the wider 2ndQuadrant team, as well as reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback.&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead of physical WAL but adds message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
The current plugin is binary only, requiring upstream and downstream master to use the same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
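&lt;br /&gt;
The buffering behaviour described above can be sketched as a simplified in-memory model. The class and method names are hypothetical, and spilling to disk is omitted:&lt;br /&gt;
&lt;br /&gt;
```python
class ChangeBuffer:
    """Buffers per-transaction changes and releases them at commit."""

    def __init__(self):
        self.pending = {}    # xid mapped to the list of changes seen so far
        self.outbound = []   # committed transactions, in commit order

    def change(self, xid, change):
        self.pending.setdefault(xid, []).append(change)

    def abort(self, xid):
        # Aborted transactions are dropped and never reach the downstream.
        self.pending.pop(xid, None)

    def commit(self, xid):
        # A transaction is queued for sending only once its commit is seen,
        # so the outbound order is commit order, not start order.
        self.outbound.append((xid, self.pending.pop(xid, [])))
```
&lt;br /&gt;
Interleaved changes from several transactions therefore arrive downstream grouped by transaction, in the order in which those transactions committed.&lt;br /&gt;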
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
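&lt;br /&gt;
The name-based mapping can be sketched with the catalogs modelled as plain dictionaries; the real implementation consults the system catalogs, and the function name and OID values below are illustrative assumptions:&lt;br /&gt;
&lt;br /&gt;
```python
def map_relation(upstream_oid, upstream_catalog, downstream_catalog):
    """Translate an upstream table OID into the matching downstream OID."""
    # upstream_catalog: OID mapped to (schema, table), as identified from WAL.
    qualified_name = upstream_catalog[upstream_oid]
    # downstream_catalog: (schema, table) mapped to the local OID.
    # A missing entry raises KeyError, standing in for the ERROR on unmatched tables.
    return downstream_catalog[qualified_name]


if __name__ == '__main__':
    upstream = {16384: ('public', 'mytable')}
    downstream = {('public', 'mytable'): 24576}
    print(map_relation(16384, upstream, downstream))
```
&lt;br /&gt;
Because the lookup goes through the schema-qualified name, the OIDs themselves never need to match between nodes.&lt;br /&gt;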
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are:&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the database name on upstream and downstream master differs (not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...;                  # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs via non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
A symmetric cluster is one where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so the practical limit is &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
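&lt;br /&gt;
The connection arithmetic above can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes one connection per replicated database in each direction between every pair of masters.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_connections(n_masters, n_databases=1):
    """Return (per_master, total) connection counts for an all-to-all mesh.

    Assumption: each master opens one connection per replicated database
    to every other master.
    """
    per_master = (n_masters - 1) * n_databases
    total = n_masters * per_master
    return per_master, total


for n in (3, 19, 100):
    per_master, total = mesh_connections(n)
    print(f"N={n}: {per_master} connections per master, {total} in total")
```
&lt;br /&gt;
The total grows with N*(N-1), which is why the number of separate databases lowers the practical limit on N.&lt;br /&gt;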
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes:&lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in a consistent, idempotent manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
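&lt;br /&gt;
A minimal sketch of last-update-wins with a deterministic tie-break is shown below. It is not the BDR implementation: the RowVersion record and its field names are hypothetical, and the choice that the higher system identifier wins a tie is an assumption made only for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# Hypothetical record for one version of a row; field names are illustrative.
RowVersion = namedtuple("RowVersion", ["commit_ts", "sysid", "values"])


def resolve_update_conflict(local, remote):
    """Return the winning version: latest timestamp, ties broken by sysid.

    Assumption: on equal timestamps the higher system identifier wins.
    """
    # Compare (timestamp, system identifier) pairs; the row values are
    # never compared, so the result depends only on the metadata.
    return max(local, remote, key=lambda v: (v.commit_ts, v.sysid))


a = RowVersion(commit_ts=100, sysid=6001, values={"name": "from A"})
b = RowVersion(commit_ts=100, sysid=6002, values={"name": "from B"})
winner = resolve_update_conflict(a, b)
print(winner.values)
```
&lt;br /&gt;
Because the comparison uses only the timestamp and system identifier, every node that sees the same pair of versions picks the same winner regardless of the order in which they arrive.&lt;br /&gt;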
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on.&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns can result in &amp;quot;false conflicts&amp;quot;, where there is no conflict in terms of the data, only in terms of the row update. Such conflicts will result in just one of those changes being made, the other being discarded according to last-update-wins.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-12T15:32:56Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;NOTIFY&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop the code as core/contrib modules. There is no long term plan for the existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. From a user perspective this is similar in many ways to &amp;quot;streaming replication&amp;quot;, i.e. physical log streaming replication (PLSR). The main difference is that the receiving server is also a full read-write master database that can make its own changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
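&lt;br /&gt;
The buffering behaviour described above can be pictured with the following toy model. It is a sketch only, not the actual decoding code: the ChangeBuffer class and its method names are hypothetical, and spilling to disk is left out.&lt;br /&gt;
&lt;br /&gt;
```python
class ChangeBuffer:
    """Toy model: collect changes per transaction, release them at commit."""

    def __init__(self):
        self.pending = {}  # transaction id mapped to its list of changes
        self.sent = []     # (transaction id, changes) in commit order

    def change(self, xid, change):
        # Changes from interleaved transactions are kept apart by xid.
        self.pending.setdefault(xid, []).append(change)

    def commit(self, xid):
        # Only at commit does the whole transaction go downstream.
        self.sent.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are dropped and never sent.
        self.pending.pop(xid, None)


buf = ChangeBuffer()
buf.change(1, "INSERT a")
buf.change(2, "UPDATE b")
buf.change(1, "INSERT c")
buf.abort(2)
buf.change(3, "DELETE d")
buf.commit(3)
buf.commit(1)
print(buf.sent)
```
&lt;br /&gt;
Transaction 3 is sent before transaction 1 because it commits first, even though transaction 1 started earlier.&lt;br /&gt;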
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
NOTIFY is not supported across log based replication, either physical or logical.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx'            # optional parameter to cover the case where the database name on upstream and downstream master differ (not yet implemented)&lt;br /&gt;
* bdr.synchronous_commit = ...;                  # optional parameter to set the synchronous_commit parameter the apply processes will be using&lt;br /&gt;
* max_logical_slots = X                          # set to the number of remotes&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to maintain autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing a chain in which an upstream master sends to a physical standby, which in turn sends to a downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
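&lt;br /&gt;
Since every node in an all-to-all setup lists every other node, the per-node settings can be generated rather than written by hand. The helper below is a hypothetical sketch (mesh_config is not part of BDR); it only reproduces the parameter names and the node_PORT naming scheme used in the listing above.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_config(ports, dbname="postgres"):
    """Return a dict mapping each port to its list of config lines."""
    configs = {}
    for port in ports:
        # Every node connects to all the others, never to itself.
        peers = [p for p in ports if p != port]
        names = ",".join(f"node_{p}" for p in peers)
        lines = [f"port = {port}", f"bdr.connections='{names}'"]
        for p in peers:
            lines.append(f"bdr.node_{p}.dsn='port={p} dbname={dbname}'")
        configs[port] = lines
    return configs


for port, lines in mesh_config([5440, 5441, 5442]).items():
    print(f"# config for {port}")
    print("\n".join(lines))
```
&lt;br /&gt;
Generating the settings this way avoids copy-and-paste slips such as a node listing itself in bdr.connections.&lt;br /&gt;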
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
A symmetric cluster is one where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so the practical limit is &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, consistent manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using last-update-wins strategy using timestamps. Should timestamps be identical, the tie is broken using system identifier from pg_control (currently).&lt;br /&gt;
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns will result in just one of those changes being made, the other discarded.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T16:27:02Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Very High Availability Multi-Master */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, a total of 16 separate commits covering the following and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop the code so that it can become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parse/plan/executing SQL.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N+ 1 are resolved to their final value and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
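&lt;br /&gt;
The commit-time behaviour described above can be pictured with a small sketch. This is illustrative Python only, not BDR code: the class name ChangeBuffer and its methods are hypothetical, and spilling to disk is left out.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ChangeBuffer:
    """Toy model: buffer changes per transaction, emit only on commit."""

    def __init__(self):
        self.pending = defaultdict(list)  # xid -> list of row changes
        self.sent = []                    # transactions sent, in commit order

    def change(self, xid, row_change):
        # Changes are accumulated; nothing is sent yet.
        self.pending[xid].append(row_change)

    def commit(self, xid):
        # The whole transaction is sent at commit time.
        self.sent.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are dropped and never sent.
        self.pending.pop(xid, None)


buf = ChangeBuffer()
buf.change(100, "INSERT a")
buf.change(101, "UPDATE b")
buf.change(100, "DELETE c")
buf.abort(101)
buf.commit(100)
print(buf.sent)
```
&lt;br /&gt;
Only transaction 100 reaches the downstream side in this model; the changes made by the aborted transaction 101 are discarded on the upstream side.&lt;br /&gt;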
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, which is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before they are sent downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to maintain autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
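&lt;br /&gt;
The pattern behind these settings is that each node lists every other node in bdr.connections and carries one dsn entry per listed name. A minimal sketch that generates such settings follows; it assumes all nodes run on one host and are distinguished only by port, and the helper name bdr_settings is hypothetical rather than part of BDR.&lt;br /&gt;
&lt;br /&gt;
```python
def bdr_settings(ports, dbname="postgres"):
    """Return {port: [config lines]} for an all-to-all layout on one host."""
    settings = {}
    for port in ports:
        # Every node connects to all nodes except itself.
        others = [p for p in ports if p != port]
        lines = ["port = {}".format(port)]
        names = ",".join("node_{}".format(p) for p in others)
        lines.append("bdr.connections='{}'".format(names))
        for p in others:
            lines.append(
                "bdr.node_{0}.dsn='port={0} dbname={1}'".format(p, dbname)
            )
        settings[port] = lines
    return settings


for port, lines in bdr_settings([5440, 5441, 5442]).items():
    print("config for {}:".format(port))
    for line in lines:
        print("  " + line)
```
&lt;br /&gt;
Generating the entries rather than copying them by hand avoids a node accidentally listing itself or repeating another node's port.&lt;br /&gt;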
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
N masters require N-1 connections each to the other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
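&lt;br /&gt;
As a rough worked example of that arithmetic, the sketch below counts connections for a symmetric cluster. It assumes one connection per replicated database per direction, as described for LLSR earlier in this guide; the function name connection_counts is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def connection_counts(n_masters, n_databases=1):
    """Connections for a symmetric cluster, one per database per direction."""
    # Each master connects to every other master, once per database.
    per_master = (n_masters - 1) * n_databases
    # Cluster-wide total of directed connections.
    total = n_masters * per_master
    return per_master, total


print(connection_counts(19))      # the tested N=19 case, single database
print(connection_counts(3, 50))   # three masters, 50 databases each
```
&lt;br /&gt;
For N=19 and a single database this gives 18 connections per master and 342 across the cluster, which shows how quickly the total grows with N and with the number of databases.&lt;br /&gt;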
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
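&lt;br /&gt;
The idea of handing out ranges of values to each node can be sketched as follows. This is a toy model under stated assumptions, not the SeqAM API: the class names RangeAllocator and NodeSequence are hypothetical, and the shared allocator object stands in for whatever cross-node communication the real plugin performs.&lt;br /&gt;
&lt;br /&gt;
```python
class RangeAllocator:
    """Stands in for the cross-node agreement on who owns which range."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        # Hand out the next unused, non-overlapping range.
        start = self.next_start
        self.next_start += self.chunk_size
        return range(start, start + self.chunk_size)


class NodeSequence:
    """Per-node sequence that draws values from its current range."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.values = iter(())

    def nextval(self):
        for value in self.values:
            return value
        # Current range exhausted: obtain a new one, then continue.
        self.values = iter(self.allocator.allocate())
        return next(self.values)


shared = RangeAllocator(chunk_size=3)
node_a, node_b = NodeSequence(shared), NodeSequence(shared)
a_vals = [node_a.nextval() for _ in range(4)]
b_vals = [node_b.nextval() for _ in range(4)]
print(a_vals, b_vals)
```
&lt;br /&gt;
Because each node only consumes values from ranges it has been granted, two nodes never produce the same value, and a node only needs to contact the others when its current range runs out.&lt;br /&gt;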
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, consistent manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using last-update-wins strategy using timestamps. Should timestamps be identical, the tie is broken using system identifier from pg_control (currently).&lt;br /&gt;
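&lt;br /&gt;
A minimal sketch of this rule is shown below. It is not the BDR implementation: the Version tuple and the resolve function are hypothetical, and the choice that the higher system identifier wins a tie is an assumption made purely for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# commit_ts: commit timestamp; sysid: system identifier of the origin node.
Version = namedtuple("Version", ["commit_ts", "sysid", "row"])


def resolve(local, remote):
    """Pick the winner: latest timestamp, ties broken by system identifier."""
    # Assumption for this sketch: the higher system identifier wins a tie.
    return max(local, remote, key=lambda v: (v.commit_ts, v.sysid))


a = Version(commit_ts=1000, sysid=42, row={"name": "from A"})
b = Version(commit_ts=1000, sysid=77, row={"name": "from B"})
c = Version(commit_ts=1005, sysid=42, row={"name": "later from A"})

# Every node gets the same answer regardless of arrival order.
assert resolve(a, b) == resolve(b, a)
print(resolve(a, b).row, resolve(b, c).row)
```
&lt;br /&gt;
The important property is that the comparison is deterministic and symmetric, so every node picks the same winning row version no matter which change it sees first.&lt;br /&gt;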
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns will result in just one of those changes being made, the other discarded.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T16:26:29Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Very High Availability Multi-Master */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, with a total of 16 separate commits on these and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop the code so that it can become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. From a user perspective this is similar in many ways to &amp;quot;streaming replication&amp;quot;, i.e. physical log streaming replication (PLSR). The main difference is that the receiving server is also a full master database that is not read-only and can make changes. The master that sends data is known as the upstream master and the master that receives data is known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parse/plan/executing SQL.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to be applied at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
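The accumulate-then-send-at-commit behaviour described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class ChangeBuffer and its method names are hypothetical and are not BDR's actual implementation.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ChangeBuffer:
    """Toy model of per-transaction change accumulation (hypothetical, not BDR code)."""

    def __init__(self):
        self.pending = defaultdict(list)  # xid -> row changes seen so far
        self.outbox = []                  # (xid, changes) in commit order

    def change(self, xid, row_change):
        # Changes are only accumulated; nothing is sent before commit.
        self.pending[xid].append(row_change)

    def commit(self, xid):
        # At commit the whole transaction is queued, in commit order.
        self.outbox.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are dropped and never sent downstream.
        self.pending.pop(xid, None)


buf = ChangeBuffer()
buf.change(100, "INSERT customer 1")
buf.change(101, "UPDATE customer 7")
buf.change(102, "DELETE customer 9")
buf.change(100, "INSERT activity 1")
buf.commit(101)  # 101 started after 100 but commits first
buf.abort(102)
buf.commit(100)

for xid, changes in buf.outbox:
    print(xid, changes)
```

In this sketch transaction 101 is emitted before 100 because it committed first, and the aborted transaction 102 never reaches the outbox.&lt;br /&gt;
&lt;br /&gt;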
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. Similar in many ways to setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master: the upstream master sends to the physical standby, which in turn sends to the downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;bravo&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
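As a rough sketch, settings for an all-to-all set of nodes like the one above could be generated rather than typed by hand, which avoids copy-and-paste slips. The helper name mesh_config is hypothetical and not part of BDR; only the parameter names port, bdr.connections and bdr.&amp;lt;nodename&amp;gt;.dsn come from this guide, and all nodes are assumed to run on the same host with node names derived from port numbers.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_config(ports, dbname="postgres"):
    """Return {port: [config lines]} for an all-to-all set of local nodes.

    Hypothetical helper: each node lists every other node in
    bdr.connections and gets one dsn entry per listed node.
    """
    configs = {}
    for port in ports:
        peers = [p for p in ports if p != port]
        names = ",".join("node_%d" % p for p in peers)
        lines = ["port = %d" % port, "bdr.connections='%s'" % names]
        for p in peers:
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


cfg = mesh_config([5440, 5441, 5442])
for port, lines in cfg.items():
    print("# config for %d" % port)
    print("\n".join(lines))
```

Each node ends up with one dsn entry per name in its bdr.connections list, as the parameter reference requires.&lt;br /&gt;
&lt;br /&gt;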
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
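The connection arithmetic can be made concrete with a small sketch. It assumes, as stated earlier in this guide, one connection per replicated database per direction; the function names are illustrative only.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_master(n_masters, n_databases=1):
    # Each master connects to every other master, once per replicated database.
    return (n_masters - 1) * n_databases


def connections_in_cluster(n_masters, n_databases=1):
    # Every master holds its own set of connections to its upstream masters.
    return n_masters * connections_per_master(n_masters, n_databases)


for n in (2, 3, 19):
    print(n, connections_per_master(n), connections_in_cluster(n))
```

The total grows roughly with the square of N and linearly with the number of databases, which is why many separate databases lower the practical server limit.&lt;br /&gt;
&lt;br /&gt;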
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
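One way to picture range-based global sequences is the sketch below. It is an assumption-laden illustration, not the SeqAM API: RangeAllocator stands in for whatever cross-node communication hands out new ranges, and NodeSequence, nextval and chunk_size are hypothetical names.&lt;br /&gt;
&lt;br /&gt;
```python
import itertools


class RangeAllocator:
    """Stand-in for cross-node agreement: hands out disjoint value ranges."""

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size
        self._chunks = itertools.count(0)

    def next_range(self):
        start = next(self._chunks) * self.chunk_size + 1
        return iter(range(start, start + self.chunk_size))


class NodeSequence:
    """Per-node sequence that draws values from its currently held range."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.values = allocator.next_range()

    def nextval(self):
        try:
            return next(self.values)
        except StopIteration:
            # Local range exhausted: obtain a fresh one and continue.
            self.values = self.allocator.next_range()
            return next(self.values)


allocator = RangeAllocator(chunk_size=3)
node_a = NodeSequence(allocator)
node_b = NodeSequence(allocator)
a_values = [node_a.nextval() for _ in range(5)]
b_values = [node_b.nextval() for _ in range(5)]
print(a_values, b_values)
```

Values stay unique across nodes because the ranges never overlap, at the cost of gaps and of values that are not globally ordered by time.&lt;br /&gt;
&lt;br /&gt;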
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in the same idempotent manner on every node so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
&lt;br /&gt;
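A minimal sketch of last-update-wins with a deterministic tie-break follows. The Version tuple and resolve function are hypothetical names, and the direction of the tie-break (higher system identifier wins) is an assumption of this sketch rather than documented behaviour.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# sysid stands in for the system identifier from pg_control.
Version = namedtuple("Version", "commit_ts sysid row")


def resolve(local, remote):
    # Latest timestamp wins; on a tie the higher system identifier wins.
    # The direction of the tie-break is an assumption of this sketch.
    return max(local, remote, key=lambda v: (v.commit_ts, v.sysid))


on_a = Version(1000, 6001, {"id": 1, "name": "Ann", "city": "Oslo"})
on_b = Version(1000, 6002, {"id": 1, "name": "Anne", "city": "Rome"})
newer = Version(1001, 6001, {"id": 1, "name": "Anna", "city": "Oslo"})

print(resolve(on_a, on_b).row)
print(resolve(on_b, newer).row)
```

Because the comparison key is the same on every node, each node picks the same winning row no matter which version it saw first.&lt;br /&gt;
&lt;br /&gt;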
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if bdr.log_conflicts = on is specified.&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns will result in just one of those changes being made, with the other discarded.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T16:26:12Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, with a total of 16 separate commits on these and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop the code so that it can become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
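&lt;br /&gt;
As a minimal sketch of the per-database rule, assuming one bdr.connections entry is used per replicated database (the node names alpha_sales and alpha_hr, the host name and the database names are all hypothetical), a downstream master receiving two databases from the same upstream server might be configured as:&lt;br /&gt;
&lt;br /&gt;
* bdr.connections = 'alpha_sales,alpha_hr'                 # one entry per replicated database (assumed naming)&lt;br /&gt;
* bdr.alpha_sales.dsn = 'host=alpha dbname=sales'          # first connection to the upstream master&lt;br /&gt;
* bdr.alpha_hr.dsn = 'host=alpha dbname=hr'                # second connection to the upstream master&lt;br /&gt;
&lt;br /&gt;
A database that is left out of this list is simply not replicated.&lt;br /&gt;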
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it has message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master - these are likely to run at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
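&lt;br /&gt;
As an illustrative sketch (the table and column names are hypothetical), a table declared with a PRIMARY KEY on both upstream and downstream master satisfies this requirement for UPDATEs and DELETEs:&lt;br /&gt;
&lt;br /&gt;
 -- run identically on upstream and downstream master&lt;br /&gt;
 CREATE TABLE customer (&lt;br /&gt;
     customer_id bigint PRIMARY KEY,   -- identifies the row when changes are applied downstream&lt;br /&gt;
     name        text NOT NULL&lt;br /&gt;
 );&lt;br /&gt;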
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections = 'name_of_upstream_master'    # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
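&lt;br /&gt;
One way to check this, offered as a sketch and assuming that the matching results refer to the upstream and downstream copies of a table, is to run the same query on both nodes and compare the counters:&lt;br /&gt;
&lt;br /&gt;
 -- per-table row activity; compare the output from both nodes&lt;br /&gt;
 SELECT relname, n_tup_ins, n_tup_upd, n_tup_del&lt;br /&gt;
 FROM pg_stat_user_tables&lt;br /&gt;
 ORDER BY relname;&lt;br /&gt;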
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs via non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, allowing a chain in which the upstream master sends to a physical standby, which in turn sends to a downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections = 'bravo'                 # list of upstream master nodenames&lt;br /&gt;
** bdr.bravo.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Bravo&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections = 'alpha'                 # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
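&lt;br /&gt;
A possible configuration for this topology, sketched with the parameters described in the Configuration section earlier (the host name in the connection string and the slot counts are assumptions):&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 1                     # one slot for Charlie&lt;br /&gt;
** max_wal_senders = 2                       # one logical sender for Charlie plus one physical sender for Bravo&lt;br /&gt;
** synchronous_standby_names = 'bravo'       # optional; assumes Bravo connects with application_name=bravo&lt;br /&gt;
&lt;br /&gt;
* Charlie&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections = 'alpha'                 # Charlie is downstream of Alpha only&lt;br /&gt;
** bdr.alpha.dsn = 'host=alpha dbname=postgres'&lt;br /&gt;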
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Bravo using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Bravo&amp;quot; - Physical Standby - feeds changes to Charlie, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Charlie&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Charlie using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
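&lt;br /&gt;
As a worked example, assuming one replicated database per server: with N=19 each master opens 18 connections to the other masters, giving 19 x 18 = 342 connections across the cluster. With D replicated databases per server these figures are multiplied by D, so N=19 and D=5 would already need 90 connections per master.&lt;br /&gt;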
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
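&lt;br /&gt;
For reference, the single-node usage looks like this (a sketch; the table and column names are hypothetical, the sequence name follows the text above):&lt;br /&gt;
&lt;br /&gt;
 CREATE SEQUENCE mysequence;&lt;br /&gt;
 CREATE TABLE mytable (&lt;br /&gt;
     id      bigint PRIMARY KEY DEFAULT nextval('mysequence'),   -- unique on this node only&lt;br /&gt;
     payload text&lt;br /&gt;
 );&lt;br /&gt;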
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include:&lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
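&lt;br /&gt;
A minimal sketch of the relevant settings on the downstream master (the one second value is only an example):&lt;br /&gt;
&lt;br /&gt;
* log_lock_waits = on                        # log any lock wait that lasts longer than deadlock_timeout&lt;br /&gt;
* deadlock_timeout = '1s'                    # threshold used by log_lock_waits&lt;br /&gt;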
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, consistent manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns will result in just one of those changes being made, the other discarded.&lt;br /&gt;
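&lt;br /&gt;
As an illustration with hypothetical values, consider a row with columns name and phone that is updated concurrently on two nodes:&lt;br /&gt;
&lt;br /&gt;
 10:00:00.100  NodeA:  UPDATE customer SET name  = 'Smith'    WHERE customer_id = 1;&lt;br /&gt;
 10:00:00.250  NodeB:  UPDATE customer SET phone = '555-0100' WHERE customer_id = 1;&lt;br /&gt;
&lt;br /&gt;
The NodeB update carries the later timestamp, so the NodeB version of the row wins on both nodes and the change to name made on NodeA is discarded, even though the two statements touched different columns.&lt;br /&gt;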
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T14:09:46Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Data Conflicts */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. There have been various additional development contributions from the wider 2ndQuadrant team, as well as reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, in a total of 16 separate commits on these and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop the code so that it can become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it has message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master - these are likely to run at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
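&lt;br /&gt;
As a minimal Python sketch of why UPDATEs and DELETEs need an identifying key on the downstream side, the following models a table as a dictionary keyed by its primary key. The change format (kind, old_key, new) and the name apply_change are assumptions made for illustration, not the BDR wire format, and the sketch assumes the table has a key; conflict handling is left out.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_change(rows, key_column, change):
    # rows: dict keyed by the identifying column, standing in for the
    # primary key index on the downstream table (hypothetical model).
    kind = change['kind']
    if kind == 'INSERT':
        new = change['new']
        rows[new[key_column]] = dict(new)
    elif kind == 'UPDATE':
        new = change['new']
        # Locate the existing row through its old key, then file the
        # new version under the (possibly changed) key.
        del rows[change['old_key']]
        rows[new[key_column]] = dict(new)
    elif kind == 'DELETE':
        del rows[change['old_key']]
    else:
        raise ValueError('unknown change kind: %r' % kind)


if __name__ == '__main__':
    table = {}
    apply_change(table, 'id', {'kind': 'INSERT', 'new': {'id': 1, 'n': 5}})
    apply_change(table, 'id', {'kind': 'UPDATE', 'old_key': 1,
                               'new': {'id': 2, 'n': 6}})
    print(table)
```
&lt;br /&gt;
An UPDATE that changes the key removes the row under its old key and files it under the new one, which is why a Primary Key change can still be replayed as long as the old key value travels with the change.&lt;br /&gt;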
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
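&lt;br /&gt;
In an all-to-all setup each node lists every other node in bdr.connections and needs one dsn entry per peer. As a rough Python sketch, the settings can be generated rather than typed by hand. The helper name bdr_settings is hypothetical; the node_PORT naming and dbname=postgres are taken from the example in this section.&lt;br /&gt;
&lt;br /&gt;
```python
def bdr_settings(ports, dbname='postgres'):
    # Hypothetical helper: returns {port: [config lines]} for an
    # all-to-all topology where each node is named 'node_PORT'.
    configs = {}
    for port in ports:
        peers = [p for p in ports if p != port]
        lines = ['port = %d' % port]
        names = ','.join('node_%d' % p for p in peers)
        lines.append("bdr.connections='%s'" % names)
        for p in peers:
            # One dsn entry is required for each name in bdr.connections.
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


if __name__ == '__main__':
    for port, lines in bdr_settings([5440, 5441, 5442]).items():
        print('# config for %d:' % port)
        print('\n'.join(lines))
```
&lt;br /&gt;
Generating the entries keeps the port, the node name and the dsn consistent with each other, which is easy to get wrong when the blocks are copied and edited by hand.&lt;br /&gt;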
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so practical limits are &lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
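&lt;br /&gt;
The connection arithmetic can be written out as a short Python sketch. It assumes one connection per replicated database per peer, as described for LLSR; the function names are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_master(n_masters, n_databases=1):
    # Each master connects to every other master, once per database.
    return (n_masters - 1) * n_databases


def connections_in_cluster(n_masters, n_databases=1):
    # Every master opens its own set of connections to its peers.
    return n_masters * connections_per_master(n_masters, n_databases)


if __name__ == '__main__':
    print(connections_per_master(19))      # one database, 19 masters
    print(connections_in_cluster(19))
    print(connections_per_master(3, 50))   # three masters, 50 databases
```
&lt;br /&gt;
The per-master count grows linearly with N but the cluster-wide total grows with N squared, and both are multiplied by the number of replicated databases.&lt;br /&gt;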
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
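&lt;br /&gt;
The idea of handing out ranges of values can be illustrated with a small Python sketch. This is not the SeqAM API: the class names, the fixed chunk size and the in-process allocator (standing in for communication between nodes) are all assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
class RangeAllocator:
    """Stands in for the cross-node agreement on who owns which range."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        # Hand out the next disjoint, inclusive range of values.
        start = self.next_start
        self.next_start += self.chunk_size
        return start, start + self.chunk_size - 1


class NodeSequence:
    """Per-node sequence that draws values from its current range."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.current = 0
        self.last = 0   # empty range, forces allocation on first nextval

    def nextval(self):
        if self.current == self.last:
            # Local range exhausted: fetch a new one before continuing.
            first, self.last = self.allocator.allocate()
            self.current = first - 1
        self.current += 1
        return self.current


if __name__ == '__main__':
    shared = RangeAllocator(chunk_size=3)
    node_a, node_b = NodeSequence(shared), NodeSequence(shared)
    print([node_a.nextval(), node_b.nextval(), node_a.nextval()])
```
&lt;br /&gt;
Because ranges never overlap, values generated on different nodes cannot collide, at the cost of values no longer being handed out in strictly increasing order across the cluster.&lt;br /&gt;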
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, consistent manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
All conflicts are resolved at row level. Concurrent updates that touch completely separate columns will result in just one of those changes being made, the other discarded.&lt;br /&gt;
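&lt;br /&gt;
The resolution rules above can be sketched in Python as follows. The tuple layout, the function names and the statistics counter are hypothetical, conflict detection itself is not modelled, and preferring the larger system identifier on a tie is an assumption; the text only says that the identifier breaks the tie.&lt;br /&gt;
&lt;br /&gt;
```python
def resolve_update_conflict(local, remote):
    # Each version is (commit_timestamp, system_identifier, row).
    # The later timestamp wins; on equal timestamps the comparison
    # falls through to the system identifier (larger wins: assumption).
    return max(local, remote, key=lambda v: (v[0], v[1]))


def apply_remote_update(table, key, remote, stats):
    if key not in table:
        # Row cannot be located: presumed Delete/Update conflict,
        # so the conflict is counted and the update is discarded.
        stats['update_delete_conflicts'] = (
            stats.get('update_delete_conflicts', 0) + 1)
        return
    # Row-level resolution: the winning version replaces the whole row.
    table[key] = resolve_update_conflict(table[key], remote)


if __name__ == '__main__':
    table = {1: (10, 111, {'a': 1, 'b': 0})}
    stats = {}
    apply_remote_update(table, 1, (11, 222, {'a': 0, 'b': 2}), stats)
    apply_remote_update(table, 99, (12, 222, {'a': 9, 'b': 9}), stats)
    print(table, stats)
```
&lt;br /&gt;
The result does not depend on which version a node saw first, which is what lets all servers end with identical rows. Because the winner replaces the whole row, the losing change to column a in the example is lost even though the two updates touched different columns.&lt;br /&gt;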
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T14:08:44Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Replication of DML changes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot;, i.e. physical log streaming replication (PLSR), from a user perspective - the main difference is that the receiving server is also a full master database that is not read-only and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead of physical WAL, but adds message header overhead of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to be applied at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
The values applied are the resulting values from the original UPDATE, including any modifications from before-row triggers, rules or functions. Any reflexive conditions, such as N = N + 1, are resolved to their final value, and volatile or stable functions carry their original values with them.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
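&lt;br /&gt;
The buffering behaviour described above can be sketched in Python as follows. The class and method names are hypothetical and spilling to disk is left out; the sketch only shows that changes are grouped per transaction, released in commit order, and dropped on abort.&lt;br /&gt;
&lt;br /&gt;
```python
class ChangeBuffer:
    def __init__(self):
        self.open_txns = {}   # xid mapped to the changes seen so far
        self.sent = []        # (xid, changes) in commit order

    def change(self, xid, row_change):
        # Changes of concurrent transactions arrive interleaved.
        self.open_txns.setdefault(xid, []).append(row_change)

    def abort(self, xid):
        # Aborted transactions are never sent downstream.
        self.open_txns.pop(xid, None)

    def commit(self, xid):
        # The whole transaction is released only at commit time.
        self.sent.append((xid, self.open_txns.pop(xid, [])))


if __name__ == '__main__':
    buf = ChangeBuffer()
    buf.change(100, 'insert r1')
    buf.change(101, 'insert r2')
    buf.change(100, 'update r1')
    buf.change(102, 'delete r3')
    buf.abort(102)
    buf.commit(101)
    buf.commit(100)
    print(buf.sent)
```
&lt;br /&gt;
Transaction 101 is sent before transaction 100 because it committed first, even though 100 started earlier; this is the commit-sequence ordering the downstream master applies.&lt;br /&gt;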
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. For example, changes from Public.MyTable will go to Public.MyTable and changes from MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified in the original SQL, since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node, otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections='name_of_upstream_master'      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections='beta'                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections='alpha'                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
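&lt;br /&gt;
The per-node settings for an all-to-all topology follow a regular pattern: each node lists every other node in bdr.connections and has one dsn entry per peer. A minimal Python sketch that generates those lines is given here, assuming the node_PORT naming convention used in this example. The helper mesh_config is hypothetical and is not part of BDR; it only prints text to paste into postgresql.conf.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_config(ports, dbname="postgres"):
    """Return {port: [config lines]} for an all-to-all topology.

    Hypothetical helper: node names follow the node_PORT convention
    from the example; nothing here talks to a running server.
    """
    configs = {}
    for port in ports:
        # Every other node is an upstream master of this node.
        peers = [p for p in ports if p != port]
        names = ",".join(f"node_{p}" for p in peers)
        lines = [f"port = {port}", f"bdr.connections='{names}'"]
        for p in peers:
            lines.append(f"bdr.node_{p}.dsn='port={p} dbname={dbname}'")
        configs[port] = lines
    return configs


if __name__ == "__main__":
    for port, lines in mesh_config([5440, 5441, 5442]).items():
        print(f"# config for {port}")
        print("\n".join(lines))
```
&lt;br /&gt;
Generating the lines rather than copying them by hand avoids the easy mistake of leaving a node connected to itself or missing one peer.&lt;br /&gt;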
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
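&lt;br /&gt;
The connection arithmetic for a symmetric cluster can be sketched as follows. This assumes one connection per replicated database per direction, as described for LLSR, and assumes that each inbound logical connection needs one WAL sender. The function names are illustrative only.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_master(n_masters, n_databases=1):
    """Connections one master opens to its peers: (N - 1) per database."""
    return max(n_masters - 1, 0) * n_databases


def total_connections(n_masters, n_databases=1):
    """Connections across the whole symmetric cluster."""
    return n_masters * connections_per_master(n_masters, n_databases)


def min_wal_senders(n_masters, n_databases=1, physical_standbys=0):
    """Lower bound for max_wal_senders on each master.

    Assumption: one WAL sender per inbound logical connection, plus one
    per physical standby fed from this master.
    """
    return connections_per_master(n_masters, n_databases) + physical_standbys
```
&lt;br /&gt;
For the tested case of N=19 with a single database, each master holds 18 connections and the cluster as a whole holds 342, which shows why the number of databases quickly dominates the practical limit.&lt;br /&gt;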
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options are possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
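&lt;br /&gt;
The idea of handing out ranges of values per node can be illustrated with a small Python sketch. This is not the SeqAM API and not the BDR plugin: RangeAllocator is a hypothetical stand-in for whatever agreement the nodes reach between themselves, and the chunk size is arbitrary. It only shows why disjoint ranges give unique values without contacting other nodes on every nextval call.&lt;br /&gt;
&lt;br /&gt;
```python
class RangeAllocator:
    """Hands out disjoint, increasing ranges.

    Hypothetical stand-in for the inter-node communication; a real
    implementation would have to agree on ranges across servers.
    """

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size
        self._next_start = 1

    def allocate(self):
        start = self._next_start
        self._next_start += self.chunk_size
        return range(start, start + self.chunk_size)


class NodeSequence:
    """Per-node view of a global sequence, drawing from its own range."""

    def __init__(self, allocator):
        self._allocator = allocator
        self._values = iter(())

    def nextval(self):
        value = next(self._values, None)
        if value is None:
            # Local range exhausted: fetch a new one, then continue locally.
            self._values = iter(self._allocator.allocate())
            value = next(self._values)
        return value
```
&lt;br /&gt;
Values drawn on different nodes never collide, though they are not globally ordered and gaps appear whenever a node does not use up its range.&lt;br /&gt;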
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved idempotently and in the same manner on every server, so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
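&lt;br /&gt;
A minimal Python sketch of last-update-wins with a system identifier tie-break is given here. The RowVersion type and the resolve function are illustrative names, and the direction of the tie-break (higher identifier wins) is an assumption, since only the use of the identifier is stated. The point is that every node reaches the same answer whichever version it sees first.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowVersion:
    commit_ts: float   # commit timestamp of the update
    system_id: int     # system identifier of the originating node
    payload: dict      # column values of this version


def resolve(local, remote):
    """Pick the winner of two conflicting versions of the same row.

    Later timestamp wins; on equal timestamps the higher system
    identifier wins (assumed direction). The result does not depend on
    argument order, so all nodes converge on the same version.
    """
    return max(local, remote, key=lambda v: (v.commit_ts, v.system_id))
```
&lt;br /&gt;
Because the comparison key is the same on every server, applying the two updates in either order leaves the same row contents.&lt;br /&gt;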
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T14:04:28Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Data Conflicts */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. There have been various additional development contributions from the wider 2ndQuadrant team, as well as reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
Exact mechanism for dissemination is not yet announced; the key objective is to develop code suitable for inclusion as core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main difference is that the receiving server is also a full master database that is not read-only and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, which is why we call this &amp;quot;logical log replication&amp;quot; and which makes change capture on the upstream master much more efficient than trigger based replication. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
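&lt;br /&gt;
The accumulate-then-send behaviour can be sketched in Python as a buffer keyed by transaction id. This is an illustration only, not the BDR code: the class name ReorderBuffer and its methods are assumptions, and spilling to disk is left out. It shows that changes leave the upstream in commit order, and that aborted transactions never leave at all.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ReorderBuffer:
    """Collects changes per transaction and releases them at commit."""

    def __init__(self):
        self._pending = defaultdict(list)   # xid -> list of changes so far
        self.sent = []                      # (xid, changes) in commit order

    def change(self, xid, change):
        # Changes from concurrent transactions arrive interleaved.
        self._pending[xid].append(change)

    def abort(self, xid):
        # Aborted transactions are dropped and never sent downstream.
        self._pending.pop(xid, None)

    def commit(self, xid):
        # The whole transaction is sent as one unit, at commit time.
        self.sent.append((xid, self._pending.pop(xid, [])))
```
&lt;br /&gt;
A transaction that starts first but commits last is therefore applied last on the downstream master, which is also why a single large transaction adds latency: nothing is sent until its commit arrives.&lt;br /&gt;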
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so the replication failures that can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0) are not possible. Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later.)&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. For example, changes from Public.MyTable will go to Public.MyTable and changes from MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified in the original SQL, since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node, otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections='name_of_upstream_master'      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
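&lt;br /&gt;
In an "all to all" layout each node lists every other node in bdr.connections and has one dsn entry per peer. As an illustrative sketch only (the function all_to_all_config is a hypothetical helper, not something shipped with BDR), the per-node settings could be generated like this in Python:&lt;br /&gt;
&lt;br /&gt;
```python
def all_to_all_config(ports, dbname="postgres"):
    """Return {port: [config lines]} where every node connects to every other node."""
    configs = {}
    for port in ports:
        peers = [p for p in ports if p != port]
        lines = ["port = %d" % port]
        # One comma-separated list naming every peer node.
        lines.append("bdr.connections='%s'" % ",".join("node_%d" % p for p in peers))
        # One dsn entry per peer named above.
        for p in peers:
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


for port, lines in all_to_all_config([5440, 5441, 5442]).items():
    print("config for %d:" % port)
    for line in lines:
        print("  " + line)
```
&lt;br /&gt;
Generating the settings keeps each node's own name out of its bdr.connections list and guarantees a dsn entry for every peer, which is easy to get wrong when copying blocks by hand.&lt;br /&gt;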
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or less if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
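&lt;br /&gt;
The connection arithmetic can be made concrete. The Python sketch below assumes one connection per peer per replicated database; the function names are illustrative only.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_node(n_masters, n_databases=1):
    """Connections one master makes: one per peer per replicated database."""
    return (n_masters - 1) * n_databases


def connections_in_cluster(n_masters, n_databases=1):
    """Total directed connections across the whole symmetric cluster."""
    return n_masters * connections_per_node(n_masters, n_databases)


print(connections_per_node(19))        # 18
print(connections_in_cluster(19))      # 342
print(connections_per_node(19, 50))    # 900
```
&lt;br /&gt;
For the tested N=19 case this gives 18 connections per master and 342 across the cluster for a single database; with 50 databases each master would need 900, which is why many separate databases lower the practical limit.&lt;br /&gt;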
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
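&lt;br /&gt;
To illustrate the idea of handing out ranges of values, here is a toy Python sketch. It is not the SeqAM API: the classes RangeAllocator and NodeSequence are hypothetical, and a single in-process allocator stands in for whatever cross-node communication the real plugin uses.&lt;br /&gt;
&lt;br /&gt;
```python
class RangeAllocator:
    """Stand-in for cross-node agreement: hands out disjoint chunks of values."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.next_start = 1

    def reserve(self):
        start = self.next_start
        self.next_start += self.chunk_size
        return start, start + self.chunk_size - 1


class NodeSequence:
    """Per-node sequence that draws values only from its reserved chunk."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.current, self.last = allocator.reserve()

    def nextval(self):
        if self.current > self.last:
            # Local chunk exhausted: reserve a fresh one before continuing.
            self.current, self.last = self.allocator.reserve()
        value = self.current
        self.current += 1
        return value


allocator = RangeAllocator(chunk_size=3)
node_a, node_b = NodeSequence(allocator), NodeSequence(allocator)
values_a = [node_a.nextval() for _ in range(4)]
values_b = [node_b.nextval() for _ in range(4)]
print(values_a, values_b)
```
&lt;br /&gt;
Because each node only ever hands out values from chunks it has reserved, the two nodes never produce the same value, at the cost of gaps and values that are not globally ordered.&lt;br /&gt;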
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes:&lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved idempotently and in the same way on every server, so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
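&lt;br /&gt;
A last-update-wins rule with a deterministic tie-break can be sketched in a few lines of Python. The RowVersion type and resolve_update_conflict function are hypothetical, and the choice that the higher system identifier wins a tie is an assumption; the text above only says that the identifier breaks the tie.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# Hypothetical record of one version of a row: commit timestamp, the
# originating server's system identifier, and the row contents.
RowVersion = namedtuple("RowVersion", ["commit_ts", "sysid", "data"])


def resolve_update_conflict(local, remote):
    """Pick the winner: latest timestamp, then (assumed) higher system identifier."""
    return max(local, remote, key=lambda v: (v.commit_ts, v.sysid))


a = RowVersion(commit_ts=1000, sysid=42, data="from alpha")
b = RowVersion(commit_ts=1000, sysid=77, data="from beta")
print(resolve_update_conflict(a, b).data)
print(resolve_update_conflict(b, a).data)
```
&lt;br /&gt;
Both calls pick the same version regardless of argument order, which is the property that lets every server reach the same result independently.&lt;br /&gt;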
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
Conflicts are logged if we specify bdr.log_conflicts = on&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T14:02:11Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Data Conflicts */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code intended to become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
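&lt;br /&gt;
The accumulate-then-send-at-commit behaviour described above can be pictured with a small Python sketch. The ChangeBuffer class is purely illustrative and makes no claim about how the real implementation is structured; spilling to disk is left out.&lt;br /&gt;
&lt;br /&gt;
```python
class ChangeBuffer:
    """Accumulate changes per transaction; release them only at commit."""

    def __init__(self):
        self.pending = {}   # xid -> list of changes seen so far
        self.outbox = []    # (xid, changes) in commit order, ready to send

    def change(self, xid, description):
        self.pending.setdefault(xid, []).append(description)

    def commit(self, xid):
        # A committed transaction is queued as one unit, in commit order.
        self.outbox.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are discarded and never reach the downstream.
        self.pending.pop(xid, None)


buf = ChangeBuffer()
buf.change(100, "INSERT t1")
buf.change(101, "UPDATE t2")
buf.change(102, "DELETE t3")
buf.change(100, "UPDATE t1")
buf.commit(101)
buf.abort(102)
buf.commit(100)
print(buf.outbox)
```
&lt;br /&gt;
Transaction 101 is queued before 100 because it committed first, even though 100 started earlier, and the aborted transaction 102 never appears in the outbox.&lt;br /&gt;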
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
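&lt;br /&gt;
The name-based mapping can be sketched as follows in Python. The dictionaries stand in for catalog lookups, the OID values are made up, and map_change_to_downstream is a hypothetical name used only for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def map_change_to_downstream(upstream_oid, upstream_catalog, downstream_catalog):
    """Translate an upstream table OID into the downstream OID of the same-named table."""
    qualified_name = upstream_catalog[upstream_oid]      # OID -> (schema, table)
    for oid, name in downstream_catalog.items():
        if name == qualified_name:
            return oid
    raise LookupError("no downstream table named %s.%s" % qualified_name)


# Made-up OIDs: the same tables have different OIDs on each node.
upstream = {16384: ("public", "mytable"), 16390: ("myschema", "mytable")}
downstream = {24601: ("myschema", "mytable"), 24577: ("public", "mytable")}
print(map_change_to_downstream(16384, upstream, downstream))
print(map_change_to_downstream(16390, upstream, downstream))
```
&lt;br /&gt;
The lookup never consults search_path: the schema comes from the upstream catalog entry for the changed table, so the change lands in the table with the same qualified name downstream.&lt;br /&gt;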
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. Similar in many ways to setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so the practical limit is &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects, which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes:&lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution. It is important that these conflicts are resolved in an idempotent, consistent manner so that all servers end with identical results.&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using a last-update-wins strategy based on timestamps. Should timestamps be identical, the tie is broken using the system identifier from pg_control (currently).&lt;br /&gt;
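&lt;br /&gt;
A minimal sketch of that rule, for illustration only. The RowVersion structure and the function name are hypothetical, and preferring the larger system identifier on a tie is an assumption; the essential point is that every node applies the same deterministic rule.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowVersion:
    """One candidate version of a row (hypothetical structure)."""
    commit_timestamp: float   # commit time of the change on its origin node
    system_identifier: int    # origin node's identifier, used only as tie-break
    values: tuple             # the row contents


def resolve_update_conflict(local, remote):
    """Return the version that every node should keep.

    The latest timestamp wins. Identical timestamps fall back to the
    system identifier; choosing the larger one is an assumption made
    for this sketch.
    """
    return max(local, remote,
               key=lambda v: (v.commit_timestamp, v.system_identifier))
```
&lt;br /&gt;
Because the comparison key is the same on every node, each node picks the same winner regardless of which change it saw first.&lt;br /&gt;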
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
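&lt;br /&gt;
The handling of a missing row can be sketched as follows. This is an illustration under stated assumptions, not the apply code itself: the table is modelled as a dict keyed by primary key, and the function and counter names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_update(table, key, new_values, stats):
    """Apply a replicated UPDATE to `table`, a dict keyed by primary key.

    Returns True if the row was updated, False if the update was discarded.
    """
    if key not in table:
        # Row not found: presume it was deleted locally, count the
        # conflict and discard the update.
        stats["delete_update_conflicts"] = stats.get("delete_update_conflicts", 0) + 1
        return False
    table[key] = new_values
    return True
```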
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T13:54:21Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Data Conflicts */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; the implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. The wider 2ndQuadrant team has made various additional development contributions, with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, in a total of 16 separate commits covering these and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. From a user perspective this is similar in many ways to &amp;quot;streaming replication&amp;quot;, i.e. physical log streaming replication (PLSR). The main difference is that the receiving server is also a full master database that is not read-only and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger-based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to the downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead of physical WAL, though it has message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master - these are likely to run at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
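&lt;br /&gt;
The buffering behaviour described above (accumulate per transaction, send at commit, never send aborted work) can be sketched roughly as follows. The class and method names are hypothetical and the sketch ignores spilling to disk; it only illustrates why the downstream side sees whole transactions in commit order.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ChangeBuffer:
    """Collects changes per transaction and releases them only on commit."""

    def __init__(self):
        self._pending = defaultdict(list)  # maps xid to the changes seen so far
        self.outbox = []                   # (xid, changes) pairs in commit order

    def add_change(self, xid, change):
        self._pending[xid].append(change)

    def commit(self, xid):
        # A committed transaction is queued as one unit, in commit order.
        self.outbox.append((xid, self._pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are dropped and never sent downstream.
        self._pending.pop(xid, None)
```
&lt;br /&gt;
Changes from interleaved transactions end up grouped by transaction, ordered by when each transaction committed rather than when it started.&lt;br /&gt;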
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised.&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
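&lt;br /&gt;
In an all-to-all layout each node lists every other node in bdr.connections and carries one dsn entry per peer, so the settings follow a regular pattern. As a rough sketch, a small helper can generate them for any list of local ports. The helper name is hypothetical, and the node_PORT naming simply follows the convention used in this example.&lt;br /&gt;
&lt;br /&gt;
```python
def bdr_settings(ports, dbname="postgres"):
    """Return a dict mapping each port to its config lines (all-to-all)."""
    settings = {}
    for port in ports:
        peers = [p for p in ports if p != port]
        names = ",".join(f"node_{p}" for p in peers)
        lines = [f"port = {port}", f"bdr.connections='{names}'"]
        for p in peers:
            # One dsn entry is needed for every name in bdr.connections.
            lines.append(f"bdr.node_{p}.dsn='port={p} dbname={dbname}'")
        settings[port] = lines
    return settings


if __name__ == "__main__":
    for port, lines in bdr_settings([5440, 5441, 5442]).items():
        print(f"# config for {port}")
        print("\n".join(lines))
```
&lt;br /&gt;
Generating the settings this way avoids copy-and-paste slips when nodes are added, since no node ever lists itself as a peer.&lt;br /&gt;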
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so the practical limit is &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
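&lt;br /&gt;
As a back-of-envelope illustration of how connection counts grow, the following sketch assumes one connection per replicated database per upstream peer, as described earlier for LLSR. The function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_node(n_masters, n_databases=1):
    """Connections one master opens to its peers in a symmetric cluster."""
    return (n_masters - 1) * n_databases


def total_connections(n_masters, n_databases=1):
    """Connections across the whole cluster, counting each direction once."""
    return n_masters * connections_per_node(n_masters, n_databases)
```
&lt;br /&gt;
For the tested N=19 case with a single database this gives 18 connections per node and 342 in total. With 50 databases per server, each node would need 900.&lt;br /&gt;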
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
Concurrent updates are resolved using last-update-wins strategy using timestamps. Should timestamps be identical, the tie is broken using node priority. &lt;br /&gt;
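&lt;br /&gt;
The rule above can be illustrated with a minimal Python sketch. It is not the BDR implementation: the RowVersion class, its field names, and the convention that a larger priority number wins a tie are all assumptions made for illustration only.&lt;br /&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RowVersion:
    """Hypothetical record of one node's update to a row."""
    node: str
    committed_at: datetime
    priority: int  # assumption: a larger number means a higher node priority
    payload: str


def resolve_last_update_wins(local, remote):
    """Pick the version with the later timestamp; break ties by node priority."""
    return max((local, remote), key=lambda v: (v.committed_at, v.priority))


same_time = datetime(2013, 3, 8, 13, 48, 34, tzinfo=timezone.utc)
alpha = RowVersion("alpha", same_time, 1, "name=Smith")
beta = RowVersion("beta", same_time, 2, "name=Smyth")
winner = resolve_last_update_wins(alpha, beta)
print(winner.node)
```

With identical timestamps, as in the usage lines, the choice falls to the priority field alone.&lt;br /&gt;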
&lt;br /&gt;
Updates and Inserts may cause uniqueness violation errors when changes are applied at remote nodes. These are not easily resolvable and represent severe application errors that cause the database contents of multiple servers to diverge from each other. Hence these are known as &amp;quot;divergent conflicts&amp;quot;. Currently, replication stops should a divergent conflict occur.&lt;br /&gt;
&lt;br /&gt;
Updates which cannot locate a row are presumed to be Delete/Update conflicts. These are counted but the Update is discarded.&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T13:48:34Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Global Sequences */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. There have been various additional development contributions from the wider 2ndQuadrant team, as well as reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that is intended to become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
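&lt;br /&gt;
The accumulate-then-send-at-commit behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ReorderBuffer class and its method names are hypothetical, and spilling to disk is left out.&lt;br /&gt;

```python
from collections import defaultdict


class ReorderBuffer:
    """Hypothetical sketch: collect changes per transaction, emit on commit."""

    def __init__(self):
        self.pending = defaultdict(list)  # xid mapped to its buffered changes
        self.sent = []                    # transactions in commit order

    def change(self, xid, change):
        # Changes are only buffered; nothing is sent downstream yet.
        self.pending[xid].append(change)

    def commit(self, xid):
        # The whole transaction is sent at commit time, in commit sequence.
        self.sent.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are dropped and never sent.
        self.pending.pop(xid, None)


buf = ReorderBuffer()
buf.change(10, "INSERT a")
buf.change(11, "UPDATE b")
buf.change(12, "DELETE c")
buf.change(10, "UPDATE a")
buf.abort(12)
buf.commit(11)
buf.commit(10)
print(buf.sent)
```

Transaction 10 started first but committed last, so it is sent after transaction 11; transaction 12 never appears.&lt;br /&gt;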
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results on upstream and downstream (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
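Writing these per-node settings by hand is repetitive, so a small script can generate them. The Python sketch below follows the node_PORT naming used in this section; the function name mesh_config is hypothetical and the output is only the bdr.* and port lines, not a complete postgresql.conf.&lt;br /&gt;

```python
def mesh_config(ports, dbname="postgres"):
    """Return, for each port, the settings connecting it to every other node."""
    configs = {}
    for port in ports:
        peers = [p for p in ports if p != port]
        lines = ["port = %d" % port]
        names = ",".join("node_%d" % p for p in peers)
        lines.append("bdr.connections='%s'" % names)
        for p in peers:
            # One dsn entry is needed for each name listed in bdr.connections.
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


for port, lines in mesh_config([5440, 5441, 5442]).items():
    print("# config for %d" % port)
    print("\n".join(lines))
```

Each node lists every other node as an upstream master, which gives the all-to-all topology.&lt;br /&gt;
&lt;br /&gt;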
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so the practical limit is &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
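&lt;br /&gt;
As a rough worked example of the connection arithmetic, assuming one connection per replicated database in each direction (as described under Logical Log Streaming Replication), the counts can be computed as follows. The function names are hypothetical.&lt;br /&gt;

```python
def connections_per_node(n_masters, n_databases=1):
    """Connections one master opens to its N-1 upstream peers."""
    return (n_masters - 1) * n_databases


def connections_in_cluster(n_masters, n_databases=1):
    """Total connections across a symmetric cluster of N masters."""
    return n_masters * connections_per_node(n_masters, n_databases)


print(connections_per_node(19))        # 18
print(connections_in_cluster(19))      # 342
print(connections_in_cluster(19, 50))  # 17100
```

The total grows with the square of N and linearly with the number of databases, which is why many separate databases lower the practical limit.&lt;br /&gt;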
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. This is important with optimistic conflict resolution schemes because uniqueness violations are &amp;quot;divergent errors&amp;quot; and are not easily resolvable.&lt;br /&gt;
&lt;br /&gt;
The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
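&lt;br /&gt;
The idea of handing out ranges of values to each node can be illustrated with a small Python sketch. It does not use the SeqAM API: RangeAllocator stands in for whatever cross-node communication agrees on ranges, and all class and method names here are hypothetical.&lt;br /&gt;

```python
class RangeAllocator:
    """Hypothetical stand-in for the cross-node agreement on value ranges."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        # Hand out the next non-overlapping range of values.
        start = self.next_start
        self.next_start += self.chunk_size
        return range(start, start + self.chunk_size)


class NodeSequence:
    """A node-local sequence that draws values from its allocated range."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.values = iter(())

    def nextval(self):
        for value in self.values:
            return value
        # Local range exhausted: obtain a new range, then continue.
        self.values = iter(self.allocator.allocate())
        return next(self.values)


allocator = RangeAllocator(chunk_size=3)
node_a = NodeSequence(allocator)
node_b = NodeSequence(allocator)
seen = [node_a.nextval(), node_b.nextval(), node_a.nextval(),
        node_a.nextval(), node_a.nextval(), node_b.nextval()]
print(seen)
```

Values are unique across nodes but not ordered by time between nodes, because each node consumes its own range independently.&lt;br /&gt;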
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T13:46:29Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Global Sequences */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. There have been various additional development contributions from the wider 2ndQuadrant team, as well as reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that is intended to become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
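&lt;br /&gt;
The buffering and commit-order behaviour described above can be pictured with a small sketch. This is not BDR code: the class and method names (ReorderBuffer, change, commit, abort) are hypothetical, transaction ids and change strings are made up, and spilling to disk is left out.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict


class ReorderBuffer:
    """Toy model: buffer changes per transaction, release them at commit."""

    def __init__(self):
        self.pending = defaultdict(list)  # xid: changes seen so far
        self.sent = []                    # (xid, changes) in commit order

    def change(self, xid, row_change):
        # Changes from concurrent transactions arrive interleaved.
        self.pending[xid].append(row_change)

    def commit(self, xid):
        # Only at commit is the whole transaction handed downstream.
        self.sent.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are discarded and never sent.
        self.pending.pop(xid, None)


buf = ReorderBuffer()
buf.change(101, 'INSERT a')
buf.change(102, 'UPDATE b')
buf.change(103, 'DELETE c')
buf.change(101, 'INSERT d')
buf.commit(102)  # 102 commits first, so it is sent first
buf.abort(103)
buf.commit(101)
print(buf.sent)
```
&lt;/pre&gt;
In this sketch transaction 102 reaches the downstream side before 101 even though 101 started first, and 103 never appears at all.&lt;br /&gt;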
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
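&lt;br /&gt;
A retry of this kind might look like the following minimal sketch. The exception class, the function name apply_with_retry and the attempt limit are all assumptions made for illustration; the text above does not say whether BDR bounds the number of retries.&lt;br /&gt;
&lt;pre&gt;
```python
class DeadlockDetected(Exception):
    """Stand-in for the error raised when apply is the deadlock victim."""


def apply_with_retry(apply_txn, max_attempts=5):
    # Replay the same transaction until it applies or attempts run out.
    for attempt in range(1, max_attempts + 1):
        try:
            apply_txn()
            return attempt
        except DeadlockDetected:
            if attempt == max_attempts:
                raise


calls = []


def flaky_apply():
    # Hypothetical transaction that deadlocks twice, then succeeds.
    calls.append(1)
    if len(calls) != 3:
        raise DeadlockDetected()


attempts_used = apply_with_retry(flaky_apply)
print(attempts_used)
```
&lt;/pre&gt;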
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential for autovacuum to keep operating normally. If there are no local writes, these two views should show matching results on upstream and downstream (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
==== Configuration ====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
==== Configuration ====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
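&lt;br /&gt;
Because every node lists every other node, settings like these are easy to get wrong by hand. The following sketch generates them for an all-to-all layout. The helper name bdr_config is hypothetical; the parameter names and the node_PORT naming convention are taken from the listing above, and all nodes are assumed to run on one host and differ only by port.&lt;br /&gt;
&lt;pre&gt;
```python
def bdr_config(port, all_ports, dbname='postgres'):
    """Return postgresql.conf lines for one node of an all-to-all cluster."""
    # Every other node is an upstream master of this node.
    peers = [p for p in all_ports if p != port]
    names = ','.join(f'node_{p}' for p in peers)
    lines = [f'port = {port}', f"bdr.connections='{names}'"]
    for p in peers:
        lines.append(f"bdr.node_{p}.dsn='port={p} dbname={dbname}'")
    return lines


ports = [5440, 5441, 5442]
for port in ports:
    print(f'# config for {port}')
    print('\n'.join(bdr_config(port, ports)))
```
&lt;/pre&gt;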
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
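&lt;br /&gt;
The connection counts can be worked through with a short sketch. It assumes, as stated earlier, one connection per replicated database in each direction, and that every master is a downstream master of every other; the function names are illustrative only.&lt;br /&gt;
&lt;pre&gt;
```python
def connections_per_master(n_masters, n_databases=1):
    # Each master receives from every other master, once per database.
    return (n_masters - 1) * n_databases


def total_connections(n_masters, n_databases=1):
    # Cluster-wide total across all masters.
    return n_masters * connections_per_master(n_masters, n_databases)


print(connections_per_master(19))     # the tested N=19 case, one database
print(total_connections(19))
print(connections_per_master(2, 50))  # two servers with 50 databases each
print(total_connections(3, 50))
```
&lt;/pre&gt;
The per-master figure grows linearly with N, but the cluster-wide total grows with N squared and is multiplied by the number of databases, which is why many separate databases lower the practical limit.&lt;br /&gt;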
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Conflict Avoidance ==&lt;br /&gt;
&lt;br /&gt;
=== Distributed Locking ===&lt;br /&gt;
&lt;br /&gt;
Some clustering systems use distributed lock mechanisms to prevent concurrent access to data. These can perform reasonably when servers are very close but cannot support geographically distributed applications. Distributed locking is essentially a pessimistic approach, whereas BDR advocates an optimistic approach: avoid conflicts where possible though allow some types of conflict to occur and then resolve them when that happens.&lt;br /&gt;
&lt;br /&gt;
=== Global Sequences ===&lt;br /&gt;
&lt;br /&gt;
Many applications require unique values be assigned to database entries. Some applications use GUIDs generated by external programs, some use database-supplied values. The SQL Standard requires Sequence objects which provide unique values, though these are isolated to a single node. These can then be used to supply default values using DEFAULT nextval('mysequence'), as with the SERIAL datatype.&lt;br /&gt;
&lt;br /&gt;
BDR requires Sequences to work together across multiple nodes. This is implemented as a new SequenceAccessMethod API (SeqAM), which allows plugins that provide get/set functions for sequences. Global Sequences are then implemented as a plugin which implements the SeqAM API and communicates across nodes to allow new ranges of values to be stored for each sequence.&lt;br /&gt;
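&lt;br /&gt;
The idea of handing out ranges of values to each node can be illustrated with a toy model. This is a sketch only: RangeAllocator and NodeSequence are hypothetical names, the chunk size is arbitrary, and a single in-process allocator stands in for whatever inter-node communication the real plugin uses, which is not described here.&lt;br /&gt;
&lt;pre&gt;
```python
import itertools


class RangeAllocator:
    """Hypothetical cluster-wide source of non-overlapping value ranges."""

    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size
        self._starts = itertools.count(1, chunk_size)

    def next_range(self):
        start = next(self._starts)
        return iter(range(start, start + self.chunk_size))


class NodeSequence:
    """Per-node sequence that draws values from its current range."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.values = allocator.next_range()

    def nextval(self):
        value = next(self.values, None)
        if value is None:
            # Local range exhausted: obtain a fresh one from the cluster.
            self.values = self.allocator.next_range()
            value = next(self.values)
        return value


allocator = RangeAllocator(chunk_size=3)
node_a = NodeSequence(allocator)
node_b = NodeSequence(allocator)
a_vals = [node_a.nextval() for _ in range(4)]
b_vals = [node_b.nextval() for _ in range(2)]
print(a_vals, b_vals)
```
&lt;/pre&gt;
Values stay unique across nodes but are not globally ordered, and a node only needs to contact the rest of the cluster when its current range runs out.&lt;br /&gt;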
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
Concurrent activity on a row includes: &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. &lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T13:31:28Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Conflict Detection &amp;amp; Resolution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that is intended to become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
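&lt;br /&gt;
The need for an identifying key on UPDATE and DELETE can be pictured with a toy model. This is only a sketch under stated assumptions: a Python dict stands in for the downstream table and its identifying index, and the change fields (kind, old_key, new_key, new_row) are hypothetical names, not the format used by the BDR apply process.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_change(table, change):
    # table: dict mapping a key value to a row (hypothetical stand-in for
    # a downstream table plus its PRIMARY KEY / UNIQUE NOT NULL index).
    kind = change['kind']
    if kind == 'INSERT':
        # No existing row has to be located, so no key lookup is needed.
        table[change['new_key']] = change['new_row']
    elif kind == 'UPDATE':
        # The old key locates the row; the key itself may change.
        del table[change['old_key']]
        table[change['new_key']] = change['new_row']
    elif kind == 'DELETE':
        del table[change['old_key']]
    else:
        raise ValueError('unknown change kind: %r' % kind)
```
&lt;br /&gt;
In this model an UPDATE that changes the key is just a removal under the old key followed by an insertion under the new one, which is why the old key value has to travel with the change.&lt;br /&gt;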
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
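&lt;br /&gt;
The buffering and commit-order behaviour described above can be sketched as a small model. It is illustrative only: the class and method names are assumptions made for this example, and spilling to disk is left out.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ChangeBuffer:
    # Hypothetical model: accumulate changes per transaction id and
    # release them only when that transaction commits.

    def __init__(self):
        self._pending = defaultdict(list)
        self.sent = []  # (xid, changes) in commit order

    def change(self, xid, row_change):
        self._pending[xid].append(row_change)

    def commit(self, xid):
        # The whole transaction is sent at commit time.
        self.sent.append((xid, self._pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are dropped and never sent.
        self._pending.pop(xid, None)
```
&lt;br /&gt;
If transaction 2 commits before transaction 1, its changes appear first in sent, regardless of which transaction made its first change earlier.&lt;br /&gt;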
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
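&lt;br /&gt;
The name-based mapping can be sketched as two lookups. This is a simplified illustration: the dictionaries stand in for catalog lookups on each node, and the function name and OID values are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def map_relation(upstream_oid, upstream_names, downstream_oids):
    # upstream_names: upstream OID to (schema, table)
    # downstream_oids: (schema, table) to downstream OID
    qualified_name = upstream_names[upstream_oid]
    if qualified_name not in downstream_oids:
        # Models the ERROR raised when table definitions are not in sync.
        raise LookupError('no matching table %s.%s on downstream' % qualified_name)
    return downstream_oids[qualified_name]
```
&lt;br /&gt;
The OIDs on the two nodes need not be equal; only the schema-qualified name has to match.&lt;br /&gt;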
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
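&lt;br /&gt;
The sizing rule in the comment above is simple arithmetic. A minimal sketch, where the function name and the physical_standbys argument are assumptions made for illustration:&lt;br /&gt;
&lt;br /&gt;
```python
def required_wal_senders(max_logical_slots, physical_standbys=0):
    # max_wal_senders must cover every logical slot plus any
    # physical streaming connections served by the same node.
    return max_logical_slots + physical_standbys
```
&lt;br /&gt;
For example, three logical slots plus one physical standby call for max_wal_senders = 4.&lt;br /&gt;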
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections='name_of_upstream_master'      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
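&lt;br /&gt;
The automatic retry of apply deadlocks can be pictured as a loop around the apply step. This is a sketch only: the exception class, the function name and the max_attempts limit are assumptions made for this example, not part of BDR.&lt;br /&gt;
&lt;br /&gt;
```python
class ApplyDeadlock(Exception):
    # Hypothetical marker for a deadlock detected while applying.
    pass


def apply_with_retry(apply_transaction, max_attempts=5):
    # Re-run the whole replayed transaction after a deadlock;
    # any other error propagates immediately as a serious problem.
    for attempt in range(1, max_attempts + 1):
        try:
            return apply_transaction()
        except ApplyDeadlock:
            if attempt == max_attempts:
                raise
```
&lt;br /&gt;
Only the deadlock case is retried here, matching the statement that other ERRORs are likely to indicate serious problems.&lt;br /&gt;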
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, allowing a chain in which the upstream master sends to a physical standby, which in turn sends to a downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections='beta'                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections='alpha'                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
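&lt;br /&gt;
Because every node lists all the other nodes, an all-to-all configuration of this kind can be generated rather than written by hand. A minimal sketch, assuming the node naming scheme node_PORT used in this example; the helper name mesh_config is hypothetical:&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_config(ports, dbname='postgres'):
    # Returns a dict mapping each port to its list of config lines
    # for an all-to-all topology on a single host.
    configs = {}
    for port in ports:
        peers = [p for p in ports if p != port]
        lines = ['port = %d' % port]
        names = ','.join('node_%d' % p for p in peers)
        lines.append("bdr.connections='%s'" % names)
        for p in peers:
            # One dsn entry is required per name in bdr.connections.
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs
```
&lt;br /&gt;
Calling mesh_config([5440, 5441, 5442]) yields one block per node, each naming the other two ports as upstream masters.&lt;br /&gt;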
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
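&lt;br /&gt;
The connection count grows quickly with both the number of masters and the number of replicated databases. A rough sketch of the arithmetic, assuming one connection per replicated database per upstream master; the function names are hypothetical:&lt;br /&gt;
&lt;br /&gt;
```python
def outbound_connections_per_node(n_masters, n_databases=1):
    # Each master connects to every other master, once per database.
    return (n_masters - 1) * n_databases


def total_connections(n_masters, n_databases=1):
    # Sum over all masters in a symmetric cluster.
    return n_masters * outbound_connections_per_node(n_masters, n_databases)
```
&lt;br /&gt;
For the tested N=19 case with a single database this gives 18 connections per node and 342 in total; a simple pair with one database needs 2.&lt;br /&gt;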
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Global Sequences ==&lt;br /&gt;
&lt;br /&gt;
MORE DOCS REQUIRED HERE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T13:23:13Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Lock Conflicts */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, as well as reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, which makes change capture on the upstream master much more efficient than trigger based replication; this is why we call it &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases, then 50 connections are required to send changes in one direction, from upstream to downstream. Each database connection must be specified, so unwanted databases can be filtered out simply by not configuring replication for them.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to the downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections='name_of_upstream_master'      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, the following two views should show matching results on upstream and downstream (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
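&lt;br /&gt;
The per-node settings for an all-to-all topology follow a regular pattern: each node lists every other node in bdr.connections and has one dsn entry per peer. A minimal sketch that generates settings of that shape is given here. The helper mesh_config is hypothetical and not part of BDR; it only assumes the node_PORT naming and the parameter names used in this section.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical helper, not part of BDR: builds the settings for an
# all-to-all topology where node names follow the node_PORT pattern.
def mesh_config(ports, dbname='postgres'):
    configs = {}
    for port in ports:
        peers = [p for p in ports if p != port]
        lines = ['port = %d' % port]
        # Every node lists all other nodes as its upstream masters.
        lines.append("bdr.connections='%s'" % ','.join('node_%d' % p for p in peers))
        for p in peers:
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


if __name__ == '__main__':
    for port, lines in mesh_config([5440, 5441, 5442]).items():
        print('# config for %d' % port)
        print('\n'.join(lines))
```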
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of N masters requires N-1 connections to the other masters, so practical limits are &amp;lt;100 servers, or less if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
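&lt;br /&gt;
To make the connection arithmetic concrete, here is a small sketch. It assumes one connection per peer per replicated database, as described for LLSR in this guide; the function names are illustrative only.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_node(n_masters, n_databases=1):
    # Each master connects to every other master once per replicated database.
    return (n_masters - 1) * n_databases


def connections_total(n_masters, n_databases=1):
    # Every master holds its own set of connections, so the total is N*(N-1)
    # per database.
    return n_masters * connections_per_node(n_masters, n_databases)


if __name__ == '__main__':
    for n in (2, 3, 19, 100):
        print(n, connections_per_node(n), connections_total(n))
```
&lt;br /&gt;
For the tested case of N=19 with a single database this gives 18 connections per master and 342 across the whole cluster.&lt;br /&gt;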
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Global Sequences ==&lt;br /&gt;
&lt;br /&gt;
MORE DOCS REQUIRED HERE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the table being changed and be able to write-lock the tuple(s) being changed. Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include &lt;br /&gt;
* explicit row level locking&lt;br /&gt;
* locking from foreign keys&lt;br /&gt;
* implicit locking because of row updates or deletes, from local activity or apply from other servers&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T13:18:26Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Operational Issues and Debugging */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
Exact mechanism for dissemination is not yet announced; the key objective is to develop code that is intended to become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing/planning/executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
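&lt;br /&gt;
The behaviour described above (changes buffered per transaction, sent only at commit, aborted transactions dropped, apply in commit sequence) can be pictured with a toy model. This is a sketch for illustration only, not the BDR implementation; the class name and its methods are assumptions, and spilling to disk is left out.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class CommitOrderBuffer:
    # Toy model: buffer changes per transaction, release them at commit.

    def __init__(self):
        self.pending = defaultdict(list)  # maps xid to the changes seen so far
        self.outbox = []                  # (xid, changes) pairs in commit order

    def change(self, xid, row_change):
        self.pending[xid].append(row_change)

    def commit(self, xid):
        # Only committed transactions reach the outbox, as whole units.
        self.outbox.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are discarded and never sent.
        self.pending.pop(xid, None)


if __name__ == '__main__':
    buf = CommitOrderBuffer()
    buf.change(1, 'INSERT a')
    buf.change(2, 'INSERT b')
    buf.change(3, 'UPDATE c')
    buf.abort(3)
    buf.commit(2)
    buf.change(1, 'DELETE a')
    buf.commit(1)
    print(buf.outbox)
```
&lt;br /&gt;
In this run transaction 2 is released before transaction 1 because it committed first, even though transaction 1 started earlier, and transaction 3 never appears.&lt;br /&gt;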
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. Future enhancement will be to stream changes to downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
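&lt;br /&gt;
The name-based mapping can be sketched as a two-step lookup: the upstream OID is resolved to its schema and table name, and that name is then resolved to the local identifier on the downstream node. The dictionaries below are stand-ins for the system catalogs, and all OID values and table names are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy catalogs: OID values and table names are made up for illustration.
def map_to_downstream(upstream_oid, upstream_catalog, downstream_catalog):
    # Step 1: resolve the upstream OID to its (schema, table) name.
    qualified_name = upstream_catalog[upstream_oid]
    # Step 2: find the local table with the same qualified name.
    for oid, name in downstream_catalog.items():
        if name == qualified_name:
            return oid
    raise LookupError('no downstream table named %s.%s' % qualified_name)


if __name__ == '__main__':
    upstream = {16384: ('public', 'mytable'), 16390: ('myschema', 'mytable')}
    downstream = {24576: ('myschema', 'mytable'), 24580: ('public', 'mytable')}
    print(map_to_downstream(16384, upstream, downstream))
```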
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which are automatically re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, the following two views should show matching results on upstream and downstream (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so the practical limit is fewer than 100 servers, or fewer still if you have many separate databases, since each replicated database needs its own connection.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. Filtering of rows/tables for replication, a planned future option, becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options are possible.&lt;br /&gt;
&lt;br /&gt;
== Global Sequences ==&lt;br /&gt;
&lt;br /&gt;
MORE DOCS REQUIRED HERE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include explicit row level locking, locking from foreign keys, and implicit locking because of row updates or deletes.&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T09:29:57Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. The wider 2ndQuadrant team has made various additional development contributions, and other community developers have provided reviews and input.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
Exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to &amp;quot;streaming replication&amp;quot; i.e. physical log streaming replication (PLSR) from a user perspective - the main and big difference is that the receiving server is also a full master database that is non-readonly and can make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parse/plan/executing SQL.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though does have message header overheads also, so bandwidth can be reduced in some, but not all cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
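&lt;br /&gt;
The accumulate-then-send-at-commit behaviour described above can be pictured with a short sketch. This is illustrative Python only, not BDR code: the class name ChangeBuffer and its methods are hypothetical, and spilling to disk is left out.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ChangeBuffer:
    """Toy model: collect row changes per transaction, release them at commit."""

    def __init__(self):
        self.pending = defaultdict(list)  # xid: row changes seen so far
        self.outbox = []                  # (xid, changes) in commit order

    def change(self, xid, row_change):
        # Changes are only accumulated; nothing is sent before commit.
        self.pending[xid].append(row_change)

    def commit(self, xid):
        # A committed transaction is queued for sending, in commit sequence.
        self.outbox.append((xid, self.pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are discarded and never sent downstream.
        self.pending.pop(xid, None)


buf = ChangeBuffer()
buf.change(101, "INSERT a")
buf.change(102, "UPDATE b")
buf.change(103, "DELETE c")
buf.commit(102)   # 102 commits before 101, so it is queued first
buf.abort(103)
buf.commit(101)
print([xid for xid, _ in buf.outbox])   # prints [102, 101]
```
&lt;br /&gt;
Transaction 103 never reaches the outbox, and the downstream side sees 102 before 101 because that is the order in which they committed, not the order in which they started.&lt;br /&gt;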
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
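&lt;br /&gt;
The name-based mapping can be sketched as follows. The two dictionaries stand in for the system catalogs of each node, and the OID values, table names and the function map_relation are all made up for illustration; the real lookup happens inside the server.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical catalogs: the same table has a different OID on each node.
upstream_catalog = {
    16384: ("public", "mytable"),
    16390: ("myschema", "mytable"),
}
downstream_catalog = {
    ("public", "mytable"): 24601,
    ("myschema", "mytable"): 24617,
}


def map_relation(upstream_oid):
    """Resolve an upstream OID to the downstream OID via schema.table name."""
    qualified_name = upstream_catalog[upstream_oid]
    if qualified_name not in downstream_catalog:
        # A table with no counterpart downstream is reported as an error.
        raise LookupError("no matching table %s.%s downstream" % qualified_name)
    return downstream_catalog[qualified_name]


print(map_relation(16384))   # prints 24601
```
&lt;br /&gt;
Because the lookup key is the schema-qualified name rather than anything from the SQL text, the setting of search_path on either node does not affect which table receives the change.&lt;br /&gt;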
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
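&lt;br /&gt;
The rule that every name in bdr.connections needs a matching bdr.&amp;lt;nodename&amp;gt;.dsn entry is easy to check before restarting a server. The following is a minimal sketch, assuming the settings have already been read into a Python dict; the function check_bdr_settings is hypothetical and is not part of BDR or PostgreSQL.&lt;br /&gt;
&lt;br /&gt;
```python
def check_bdr_settings(settings):
    """Return a list of problems found in a dict of downstream settings."""
    problems = []
    raw = settings.get("bdr.connections", "")
    names = [n.strip() for n in raw.split(",") if n.strip()]
    for name in names:
        # Every node named in bdr.connections needs its own dsn entry.
        if "bdr.%s.dsn" % name not in settings:
            problems.append("missing bdr.%s.dsn" % name)
    if names and "bdr" not in settings.get("shared_preload_libraries", ""):
        # The downstream master must preload the bdr library.
        problems.append("shared_preload_libraries does not load bdr")
    return problems


settings = {
    "shared_preload_libraries": "bdr",
    "bdr.connections": "alpha,beta",
    "bdr.alpha.dsn": "dbname=postgres",
}
print(check_bdr_settings(settings))   # prints ['missing bdr.beta.dsn']
```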
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. Similar in many ways to setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which may be re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to maintain autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing a chain in which the upstream master sends to a physical standby, which in turn sends to a downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
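&lt;br /&gt;
In an all-to-all setup each node lists every node except itself, so the settings can be generated rather than typed by hand. The following Python sketch assumes the node_PORT naming convention used above and a single database called postgres; the helper bdr_settings is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def bdr_settings(port, all_ports, dbname="postgres"):
    """Build the settings for one node that connects to every other node."""
    peers = [p for p in all_ports if p != port]
    lines = ["port = %d" % port]
    lines.append("bdr.connections='%s'" % ",".join("node_%d" % p for p in peers))
    for p in peers:
        # One dsn entry per peer named in bdr.connections.
        lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
    return lines


ports = [5440, 5441, 5442]
for port in ports:
    print("config for %d:" % port)
    for line in bdr_settings(port, ports):
        print("  " + line)
```
&lt;br /&gt;
Adding a fourth node only requires extending the ports list; every node then gains one extra entry in bdr.connections and one extra dsn line.&lt;br /&gt;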
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so the practical limit is fewer than 100 servers, or fewer still if you have many separate databases, since each replicated database needs its own connection.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. Filtering of rows/tables for replication, a planned future option, becomes essential with larger or more heavily updated databases.&lt;br /&gt;
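&lt;br /&gt;
The connection arithmetic for a symmetric cluster can be worked through with a few lines of Python. The function connection_counts is a hypothetical helper; it assumes every database is replicated between every pair of masters, with one connection per database in each direction.&lt;br /&gt;
&lt;br /&gt;
```python
def connection_counts(n_masters, n_databases=1):
    """Connections in a symmetric cluster, one per replicated database."""
    per_node = (n_masters - 1) * n_databases   # to every other master
    total = n_masters * per_node               # across the whole cluster
    return per_node, total


for n in (2, 3, 19):
    print(n, connection_counts(n))
# prints:
# 2 (1, 2)
# 3 (2, 6)
# 19 (18, 342)

print(connection_counts(2, n_databases=50))   # prints (50, 100)
```
&lt;br /&gt;
The per-node figure grows linearly with N, but the cluster-wide total grows as N*(N-1), which is one reason the practical limit sits well below the theoretical one.&lt;br /&gt;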
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options are possible.&lt;br /&gt;
&lt;br /&gt;
== Global Sequences ==&lt;br /&gt;
&lt;br /&gt;
MORE DOCS REQUIRED HERE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write-lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include explicit row level locking, locking from foreign keys, and implicit locking because of row updates or deletes.&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T08:35:13Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Data Conflicts */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. The wider 2ndQuadrant team has made various additional development contributions, and other community developers have provided reviews and input.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
Exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. From a user perspective this is similar in many ways to &amp;quot;streaming replication&amp;quot;, i.e. physical log streaming replication (PLSR). The main difference is that the receiving server is also a full master database that is not read-only and can make its own changes. The master that sends data is known as the upstream master and the master that receives data is known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger-based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases, then 50 connections are required to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by not configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parse/plan/executing SQL.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead of physical WAL, though it has message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to run at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
The current plugin is binary only, requiring upstream and downstream masters to use the same CPU architecture and word length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
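&lt;br /&gt;
As a rough illustration of the behaviour described above (not the actual BDR code), the following Python sketch buffers changes per transaction, discards aborted transactions and releases each transaction's changes only at commit, in commit order. The class and method names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class CommitOrderBuffer:
    """Toy model: accumulate changes per transaction, emit them at commit."""

    def __init__(self):
        self.pending = defaultdict(list)  # xid to list of buffered changes
        self.stream = []                  # what the downstream would receive

    def change(self, xid, row_change):
        # Changes are only buffered here; nothing is sent before commit.
        self.pending[xid].append(row_change)

    def abort(self, xid):
        # Aborted transactions are dropped and never sent downstream.
        self.pending.pop(xid, None)

    def commit(self, xid):
        # The whole transaction is appended in commit sequence.
        self.stream.append((xid, self.pending.pop(xid, [])))


buf = CommitOrderBuffer()
buf.change(100, "INSERT a")
buf.change(101, "UPDATE b")
buf.change(102, "DELETE c")
buf.change(100, "INSERT d")
buf.abort(102)
buf.commit(101)  # 101 commits first although 100 started first
buf.commit(100)
print(buf.stream)
```
&lt;br /&gt;
Transaction 101 appears in the stream before transaction 100 because it committed first, and nothing from the aborted transaction 102 is sent.&lt;br /&gt;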
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters, e.g. changes from Public.MyTable will go to Public.MyTable and changes from MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified in the original SQL, since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
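&lt;br /&gt;
The name-based mapping can be pictured with a small Python sketch. The OID values and catalog dictionaries below are invented for illustration, and the real lookup happens inside the server rather than in Python.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical catalogs: internal OID to (schema, table) on the upstream,
# and (schema, table) to internal OID on the downstream.
upstream_catalog = {
    16385: ("public", "mytable"),
    16390: ("myschema", "mytable"),
}
downstream_catalog = {
    ("public", "mytable"): 24576,
    ("myschema", "mytable"): 24581,
}


def map_relation(upstream_oid):
    """Resolve an upstream OID to the downstream OID via the qualified name."""
    qualified_name = upstream_catalog[upstream_oid]
    if qualified_name not in downstream_catalog:
        # A table missing on the downstream is an error, not a silent skip.
        raise LookupError("no downstream table named %s.%s" % qualified_name)
    return downstream_catalog[qualified_name]


print(map_relation(16385))
print(map_relation(16390))
```
&lt;br /&gt;
The OIDs differ between the two nodes; only the schema-qualified name has to match.&lt;br /&gt;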
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
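&lt;br /&gt;
The rule that every name in bdr.connections needs a matching bdr.&amp;lt;nodename&amp;gt;.dsn entry can be checked mechanically before a restart. The following Python sketch is only an illustration; the function name and the dictionary representation of the settings are assumptions, not part of BDR.&lt;br /&gt;
&lt;br /&gt;
```python
def missing_dsn_entries(settings):
    """Return node names listed in bdr.connections that lack a dsn setting."""
    names = [n.strip() for n in settings.get("bdr.connections", "").split(",")]
    return [
        name
        for name in names
        if name and "bdr.%s.dsn" % name not in settings
    ]


# Hypothetical settings for a downstream master with one entry forgotten.
settings = {
    "shared_preload_libraries": "bdr",
    "bdr.connections": "alpha, gamma",
    "bdr.alpha.dsn": "dbname=postgres",
}
print(missing_dsn_entries(settings))
```
&lt;br /&gt;
Here gamma is reported because it is listed in bdr.connections but has no bdr.gamma.dsn entry.&lt;br /&gt;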
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which may be re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, the following two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication also works alongside cascading replication, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
==== Configuration ====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
==== Configuration ====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
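&lt;br /&gt;
For an all-to-all layout the per-node settings follow a regular pattern, so they can be generated. The Python sketch below reproduces the node_PORT naming used above for any list of ports; the function name is hypothetical and the output covers only the port and bdr.* settings shown in this example.&lt;br /&gt;
&lt;br /&gt;
```python
def all_to_all_config(ports, dbname="postgres"):
    """Build per-node setting lines where each node connects to all others."""
    configs = {}
    for port in ports:
        others = [p for p in ports if p != port]
        lines = ["port = %d" % port]
        lines.append(
            "bdr.connections='%s'" % ",".join("node_%d" % p for p in others)
        )
        for p in others:
            lines.append(
                "bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname)
            )
        configs[port] = lines
    return configs


for port, lines in all_to_all_config([5440, 5441, 5442]).items():
    print("# config for %d" % port)
    print("\n".join(lines))
```
&lt;br /&gt;
For ports 5440, 5441 and 5442 each node ends up with bdr.connections naming the other two nodes and one dsn entry per name.&lt;br /&gt;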
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters, so the practical limit is fewer than 100 servers, or fewer still if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
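&lt;br /&gt;
The connection counts can be worked through with simple arithmetic. This Python sketch assumes, as stated earlier, one connection per replicated database per direction; the function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def connections_per_master(n_masters, n_databases=1):
    """Connections one master opens, as a downstream, to all other masters."""
    return (n_masters - 1) * n_databases


def connections_in_cluster(n_masters, n_databases=1):
    """Total one-directional connections in a fully connected cluster."""
    return n_masters * connections_per_master(n_masters, n_databases)


print(connections_per_master(19))
print(connections_in_cluster(19))
print(connections_in_cluster(2, n_databases=50))
```
&lt;br /&gt;
The total grows with the square of N and linearly with the number of databases, which is why many separate databases lower the practical limit.&lt;br /&gt;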
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options are possible.&lt;br /&gt;
&lt;br /&gt;
== Global Sequences ==&lt;br /&gt;
&lt;br /&gt;
MORE DOCS REQUIRED HERE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include explicit row level locking, locking from foreign keys, and implicit locking because of row updates or deletes.&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row.&lt;br /&gt;
&lt;br /&gt;
* We update a row on Customer table on NodeA. The change from NodeA is applied to NodeB just as we are inserting an activity on NodeB. The inserted activity causes a FK check.... &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T08:24:38Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Logical Log Streaming Replication */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that builds upon and enhances the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; the implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, in a total of 16 separate commits covering the following and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists outside core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code intended to become core/contrib modules. There is no long-term plan for code to exist outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. From a user perspective this is similar in many ways to &amp;quot;streaming replication&amp;quot;, i.e. physical log streaming replication (PLSR). The main difference is that the receiving server is also a full master database that is not read-only and can make its own changes. The master that sends data is known as the upstream master and the master that receives data is known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger-based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases, then 50 connections are required to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by not configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parse/plan/executing SQL.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead of physical WAL, though it has message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to run at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
The current plugin is binary only, requiring upstream and downstream masters to use the same CPU architecture and word length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on the downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or at least could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. Similar in many ways to setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which may be re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to maintain autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, allowing upstream master sending to physical standby sending to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
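&lt;br /&gt;
The per-node settings above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal Python sketch, assuming the node_PORT naming used in this example and the bdr.connections and bdr.&amp;lt;nodename&amp;gt;.dsn parameters described earlier. The helper name mesh_config is hypothetical and not part of BDR.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_config(ports, dbname="postgres"):
    """Return {port: [config lines]} for an all-to-all mesh.

    Hypothetical helper, not part of BDR: it only formats the settings
    shown in the example above, naming each node node_PORT.
    """
    configs = {}
    for port in ports:
        # Every other node in the mesh is an upstream master of this one.
        peers = [p for p in ports if p != port]
        lines = ["port = %d" % port]
        names = ",".join("node_%d" % p for p in peers)
        lines.append("bdr.connections='%s'" % names)
        for p in peers:
            lines.append("bdr.node_%d.dsn='port=%d dbname=%s'" % (p, p, dbname))
        configs[port] = lines
    return configs


if __name__ == "__main__":
    for port, lines in mesh_config([5440, 5441, 5442]).items():
        print("# config for %d" % port)
        print("\n".join(lines))
```
&lt;br /&gt;
Each node lists every other node in bdr.connections and has exactly one dsn entry per listed name, which is the rule stated in the parameter reference.&lt;br /&gt;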
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites is minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to the other masters for each replicated database, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
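&lt;br /&gt;
The connection arithmetic for a symmetric cluster can be sketched in a few lines of Python. This assumes one connection per replicated database per direction, as described for LLSR earlier in this guide. The function name mesh_connections is illustrative only.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_connections(n_masters, n_databases=1):
    """Count connections in a symmetric (all-to-all) cluster.

    Assumes one connection per replicated database per direction.
    Returns (connections opened by one master, total for the cluster).
    """
    per_master = (n_masters - 1) * n_databases
    total = n_masters * per_master
    return per_master, total


if __name__ == "__main__":
    print(mesh_connections(19))       # the tested N=19 case, one database
    print(mesh_connections(2, 50))    # two servers with 50 databases each
```
&lt;br /&gt;
The per-master figure grows linearly with N while the cluster total grows quadratically, which is why the number of databases lowers the practical server limit.&lt;br /&gt;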
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options is possible.&lt;br /&gt;
&lt;br /&gt;
== Global Sequences ==&lt;br /&gt;
&lt;br /&gt;
MORE DOCS REQUIRED HERE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include explicit row level locking, locking from foreign keys, implicit locking because of row updates or deletes.&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. We update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T08:23:13Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* New/Changed Parameter Reference */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication.&lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. There have been various additional development contributions from the wider 2ndQuadrant team, as well as reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
Exact mechanism for dissemination is not yet announced; key objective is to develop code with the objective of being core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to physical log streaming replication from a user perspective - the main and big difference is that the receiving server is also a full master database that can also make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parse/plan/executing SQL.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it does have message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
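&lt;br /&gt;
The accumulate-then-send behaviour described above can be illustrated with a toy model. This is a minimal Python sketch and not the BDR implementation: the class name CommitOrderBuffer and its methods are hypothetical, and spilling to disk is left out.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class CommitOrderBuffer:
    """Toy model of change accumulation; not the BDR implementation."""

    def __init__(self):
        self._pending = defaultdict(list)   # xid to changes seen so far
        self.sent = []                      # (xid, changes) in commit order

    def change(self, xid, row_change):
        # Changes are only buffered while the transaction is in progress.
        self._pending[xid].append(row_change)

    def commit(self, xid):
        # On commit the whole transaction is released downstream at once.
        self.sent.append((xid, self._pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are discarded and never sent.
        self._pending.pop(xid, None)


if __name__ == "__main__":
    buf = CommitOrderBuffer()
    buf.change(100, "INSERT a")
    buf.change(101, "UPDATE b")
    buf.change(102, "DELETE c")
    buf.abort(102)
    buf.commit(101)      # 101 commits before 100, so it is sent first
    buf.commit(100)
    print(buf.sent)
```
&lt;br /&gt;
Transaction 101 started after 100 but committed first, so it reaches the downstream side first, and the aborted transaction 102 never appears in the output.&lt;br /&gt;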
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
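&lt;br /&gt;
The name-based mapping described above can be sketched as two lookups. This is a minimal Python illustration in which plain dictionaries stand in for the system catalogs. The function name, the OID values and the error raised for a missing table are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def resolve_downstream(upstream_oid, upstream_catalog, downstream_catalog):
    """Map an upstream table OID to the downstream table OID by name.

    upstream_catalog maps OID to (schema, table); downstream_catalog
    maps (schema, table) to OID. Both are illustrative stand-ins.
    """
    qualified_name = upstream_catalog[upstream_oid]
    try:
        return downstream_catalog[qualified_name]
    except KeyError:
        # Assumed behaviour: a missing table is reported as an error.
        raise LookupError(
            "no table %s.%s on downstream node" % qualified_name
        ) from None


if __name__ == "__main__":
    upstream = {16384: ("public", "mytable")}
    downstream = {("public", "mytable"): 24576}
    print(resolve_downstream(16384, upstream, downstream))
```
&lt;br /&gt;
Because the lookup starts from the OID recorded for the change, the search_path in effect when the original SQL ran plays no part in choosing the target table.&lt;br /&gt;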
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or at least could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master. Security for LLSR is identical to physical log streaming replication&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. Similar in many ways to setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which may be re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to maintain autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options are possible.&lt;br /&gt;
&lt;br /&gt;
== Global Sequences ==&lt;br /&gt;
&lt;br /&gt;
MORE DOCS REQUIRED HERE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include explicit row level locking, locking from foreign keys, and implicit locking because of row updates or deletes.&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. We update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T08:21:32Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* User Guide */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i-'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. There have been various additional development contributions from the wider 2ndQuadrant team, as well as reviews and input from other community devs.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR; total of 16 separate commits on these and other smaller aspects&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
Fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of code exists out of core.&lt;br /&gt;
&lt;br /&gt;
Exact mechanism for dissemination is not yet announced; the key objective is to develop code that can become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
Objective to implement main BDR features into core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator, etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to physical log streaming replication from a user perspective - the main and big difference is that the receiving server is also a full master database that can also make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing that SQL.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead of physical WAL, though it has message header overheads of its own, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to go at same speed as on origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
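&lt;br /&gt;
As a rough sizing sketch (assuming the default 16 MB WAL segment size, which is not stated above), the value used in the example configurations later in this guide works out as follows:&lt;br /&gt;
&lt;br /&gt;
* wal_keep_segments = 5000                  # 5000 segments x 16 MB = 80000 MB, roughly 78 GB of WAL retained on the upstream master&lt;br /&gt;
&lt;br /&gt;
A smaller value saves disk space but shortens the outage a downstream master can ride out. How much WAL is generated per hour depends on the workload, so it is worth measuring that before choosing a figure.&lt;br /&gt;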
&lt;br /&gt;
==== New/Changed Parameter Reference ====&lt;br /&gt;
&lt;br /&gt;
bdr.connections - list of nodes that this server will connect to. For each name listed here there must be one bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
bdr.&amp;lt;nodename&amp;gt;.dsn - &amp;quot;data source name&amp;quot; - connection info for connecting to upstream master&lt;br /&gt;
&lt;br /&gt;
max_logical_slots - LLSR uses persistent slots in memory which are reserved for each node at server start&lt;br /&gt;
&lt;br /&gt;
wal_level - allows a new setting of 'logical' which produces mildly enhanced WAL contents to allow decoding of the WAL back into a logical change stream.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. This is similar in many ways to the setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which may be re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.alpha.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
=== HA and Logical Standby ===&lt;br /&gt;
Downstream masters allow users to create temporary tables, so they can be used as reporting servers.&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Current Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - unused, apart from as failover target for Alpha - potentially specified in synchronous_standby_names&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - &amp;quot;Logical Standby&amp;quot; - downstream master&lt;br /&gt;
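&lt;br /&gt;
A configuration for this layout might look like the sketch below. It reuses the parameters from the Configuration section above; the host name in the dsn and the slot and sender counts are assumptions made for illustration, and Beta's physical standby settings are omitted.&lt;br /&gt;
&lt;br /&gt;
* Alpha (upstream master)&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 1                     # assumption: one slot for the single downstream master, Gamma&lt;br /&gt;
** max_wal_senders = 2                       # max_logical_slots plus one physical sender for Beta&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
&lt;br /&gt;
* Gamma (downstream master)&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;alpha&amp;quot;                   # Gamma pulls changes from Alpha only&lt;br /&gt;
** bdr.alpha.dsn = 'host=alpha dbname=postgres'   # 'host=alpha' is a hypothetical host name&lt;br /&gt;
&lt;br /&gt;
Because Gamma lists Alpha as an upstream but Alpha has no bdr.connections entry, changes flow in one direction only, which is what makes Gamma a &amp;quot;Logical Standby&amp;quot; rather than a multi-master peer.&lt;br /&gt;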
&lt;br /&gt;
=== Very High Availability Multi-Master ===&lt;br /&gt;
A typical configuration for remote multi-master would then be:&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth between Site 1 and Site 2 is minimised&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site simple Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
BDR supports &amp;quot;all to all&amp;quot; connections, so the latency for any change being applied on other masters is minimised. (Note that early designs of multi-master were arranged for circular replication, which has latency issues with larger numbers of nodes)&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
Using node names that match port numbers, for clarity&lt;br /&gt;
&lt;br /&gt;
* config for 5440:&lt;br /&gt;
** port = 5440&lt;br /&gt;
** bdr.connections='node_5441,node_5442'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5441:&lt;br /&gt;
** port = 5441&lt;br /&gt;
** bdr.connections='node_5440,node_5442'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5442.dsn='port=5442 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
* config for 5442:&lt;br /&gt;
** port = 5442&lt;br /&gt;
** bdr.connections='node_5440,node_5441'&lt;br /&gt;
** bdr.node_5440.dsn='port=5440 dbname=postgres'&lt;br /&gt;
** bdr.node_5441.dsn='port=5441 dbname=postgres'&lt;br /&gt;
&lt;br /&gt;
=== 3-remote site Max Availability Multi-Master ===&lt;br /&gt;
&lt;br /&gt;
* Site 1&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master - feeds changes to Beta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Physical Standby - feeds changes to Gamma, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 2&lt;br /&gt;
** Server &amp;quot;Gamma&amp;quot; - Master - feeds changes to Delta using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Delta&amp;quot; - Physical Standby - feeds changes to Alpha, Echo using logical streaming&lt;br /&gt;
&lt;br /&gt;
* Site 3&lt;br /&gt;
** Server &amp;quot;Echo&amp;quot; - Master - feeds changes to Foxtrot using physical streaming with sync replication&lt;br /&gt;
** Server &amp;quot;Foxtrot&amp;quot; - Physical Standby - feeds changes to Alpha, Gamma using logical streaming&lt;br /&gt;
&lt;br /&gt;
Bandwidth and latency between sites are minimised.&lt;br /&gt;
&lt;br /&gt;
=== N-site symmetric cluster replication ===&lt;br /&gt;
&lt;br /&gt;
Symmetric cluster is where all masters are connected to each other.&lt;br /&gt;
&lt;br /&gt;
N=19 has been tested and works fine.&lt;br /&gt;
&lt;br /&gt;
Each of the N masters requires N-1 connections to other masters, so practical limits are &amp;lt;100 servers, or fewer if you have many separate databases.&lt;br /&gt;
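&lt;br /&gt;
As a worked example under stated assumptions (one logical slot and one WAL sender per downstream master per replicated database, which the parameter reference above suggests but does not state outright), the N=19 case with a single replicated database would need roughly the following on each master:&lt;br /&gt;
&lt;br /&gt;
* max_logical_slots = 18                    # N-1 = 19-1 downstream masters, one database each&lt;br /&gt;
* max_wal_senders = 18                      # max_logical_slots plus any physical streaming requirements&lt;br /&gt;
* bdr.connections lists the 18 other node names, each with its own bdr.&amp;lt;nodename&amp;gt;.dsn entry&lt;br /&gt;
&lt;br /&gt;
Across the whole cluster that is 19 x 18 = 342 one-way connections. With several replicated databases the per-master figures multiply by the number of databases, e.g. 18 x 5 = 90 for five databases, which is why the practical limit on N falls as the number of databases grows.&lt;br /&gt;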
&lt;br /&gt;
The amount of work caused by each change is O(N), so there is a much lower practical limit based upon resource limits. A planned future option to filter rows/tables for replication becomes essential with larger or more heavily updated databases.&lt;br /&gt;
&lt;br /&gt;
=== Complex/Asymmetric Replication ===&lt;br /&gt;
&lt;br /&gt;
A variety of options are possible.&lt;br /&gt;
&lt;br /&gt;
== Global Sequences ==&lt;br /&gt;
&lt;br /&gt;
MORE DOCS REQUIRED HERE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conflict Detection &amp;amp; Resolution ==&lt;br /&gt;
&lt;br /&gt;
=== Lock Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Changes from the upstream master are applied on the downstream master by a single apply process. That process needs to take a RowExclusiveLock on the changing table and be able to write lock the changing tuple(s). Concurrent activity will prevent those changes from being immediately applied because of lock waits. Use the log_lock_waits facility to look for issues there.&lt;br /&gt;
&lt;br /&gt;
By concurrent activity on a row, we include explicit row level locking, locking from foreign keys, and implicit locking because of row updates or deletes.&lt;br /&gt;
&lt;br /&gt;
=== Data Conflicts ===&lt;br /&gt;
&lt;br /&gt;
Concurrent updates and deletes may also cause data-level conflicts to occur, which then require conflict resolution (see below).&lt;br /&gt;
&lt;br /&gt;
As an example, let's say we have two tables, Activity and Customer. There is a Foreign Key from Activity to Customer, constraining us to only record activity rows that have a matching customer row. We update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Replication]]&lt;/div&gt;</summary>
		<author><name>Simon</name></author>	</entry>

	<entry>
		<id>http://wiki.postgresql.org/wiki/BDR_User_Guide</id>
		<title>BDR User Guide</title>
		<link rel="alternate" type="text/html" href="http://wiki.postgresql.org/wiki/BDR_User_Guide"/>
				<updated>2013-03-08T08:07:35Z</updated>
		
		<summary type="html">&lt;p&gt;Simon:&amp;#32;/* Conflict Detection &amp;amp; Resolution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BDR stands for '''B'''i'''D'''irectional '''R'''eplication. &lt;br /&gt;
&lt;br /&gt;
Design work began in late 2011 to look at ways of adding new features to PostgreSQL core to support a flexible new infrastructure for replication that built upon and enhanced the existing streaming replication features added in 9.1-9.2. Initial design and project planning was by Simon Riggs; implementation lead is now Andres Freund, both from [http://www.2ndQuadrant.com/ 2ndQuadrant]. Various additional development contributions have come from the wider 2ndQuadrant team, along with reviews and input from other community developers.&lt;br /&gt;
&lt;br /&gt;
At the [[PgCon2012CanadaInCoreReplicationMeeting]] an initial version of the design was presented. A presentation containing reasons leading to the current design and a prototype of it, including preliminary performance results, is [[:File:BDR_Presentation_PGCon2012.pdf|available here]].&lt;br /&gt;
&lt;br /&gt;
= Project Overview and Plans =&lt;br /&gt;
== Project Aims ==&lt;br /&gt;
* in core&lt;br /&gt;
* fast&lt;br /&gt;
* reusable individual parts (see below), usable by other projects (slony, ...)&lt;br /&gt;
* basis for easier sharding/write scalability&lt;br /&gt;
* wide geographic distribution of replicated nodes&lt;br /&gt;
&lt;br /&gt;
== High Level Planning ==&lt;br /&gt;
=== 9.3 ===&lt;br /&gt;
Fundamental changes have been made as part of 9.3 to support BDR, with a total of 16 separate commits on these and other smaller aspects:&lt;br /&gt;
&lt;br /&gt;
* background workers&lt;br /&gt;
* xlogreader implementation&lt;br /&gt;
* pg_xlogdump&lt;br /&gt;
&lt;br /&gt;
A fully working implementation will be available for production use in 2013. At this stage, probably more than 50% of the code exists out of core.&lt;br /&gt;
&lt;br /&gt;
The exact mechanism for dissemination is not yet announced; the key objective is to develop code intended to become core/contrib modules. There is no long term plan for existence of code outside of core.&lt;br /&gt;
&lt;br /&gt;
=== 9.4 ===&lt;br /&gt;
The objective is to implement the main BDR features in core Postgres.&lt;br /&gt;
&lt;br /&gt;
=== 9.5 ===&lt;br /&gt;
Additional features based upon feedback&lt;br /&gt;
&lt;br /&gt;
== Aspects of BDR ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication consists of a number of related features&lt;br /&gt;
&lt;br /&gt;
* Logical Log Streaming Replication - getting data from one master to another.&lt;br /&gt;
* Global Sequences - ability to support sequences that work globally across a set of nodes&lt;br /&gt;
* Conflict Detection &amp;amp; Resolution (options)&lt;br /&gt;
* DDL Replication via Event Triggers&lt;br /&gt;
&lt;br /&gt;
Taken together these features will allow replication in both directions for any pair of servers. We could call this &amp;quot;multi-master replication&amp;quot;, but the possibilities for constructing complex networks of servers allow much more than that, so we use the more general term bi-directional replication.&lt;br /&gt;
&lt;br /&gt;
Note that these features aren't &amp;quot;clustering&amp;quot; in the sense that Oracle RAC uses the term. There is no distributed lock manager, global transaction coordinator etc. The vision here is interconnected yet still separate servers, allowing each server to have radically different workloads and yet still work together, even across global scale and large geographic separation.&lt;br /&gt;
&lt;br /&gt;
= User Guide =&lt;br /&gt;
&lt;br /&gt;
== Logical Log Streaming Replication ==&lt;br /&gt;
&lt;br /&gt;
Logical log streaming replication (LLSR) allows us to send changes from one master server to another master server. This is similar in many ways to physical log streaming replication from a user perspective; the main difference is that the receiving server is also a full master database that can also make changes. The master that sends data is also known as the upstream master and the master that receives data is also known as the downstream master. Data is sent in one direction only; setting up a configuration with data passing in both directions is called Bi-Directional Replication, discussed later.&lt;br /&gt;
&lt;br /&gt;
The data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, which is why we call this &amp;quot;logical log replication&amp;quot;. Changes are passed from upstream to downstream using the libpq protocol, just as with physical log streaming replication.&lt;br /&gt;
&lt;br /&gt;
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases then it would require 50 connections to send changes in one direction, from upstream to downstream. Each database connection must be specified, so it is possible to filter out unwanted databases simply by avoiding configuring replication for those databases.&lt;br /&gt;
&lt;br /&gt;
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE and this also requires a server restart. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated.&lt;br /&gt;
&lt;br /&gt;
Changes are handled by means of a BDR plugin, allowing multiple options. Current options are:&lt;br /&gt;
&lt;br /&gt;
* pg_xlogdump - examines physical WAL records and produces textual debugging output (server program included in 9.3)&lt;br /&gt;
* textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)&lt;br /&gt;
* BDR apply process - applies logical changes to downstream master, making changes directly rather than generating SQL text and then parsing, planning and executing it.&lt;br /&gt;
&lt;br /&gt;
=== Replication of DML changes ===&lt;br /&gt;
&lt;br /&gt;
All changes are replicated: INSERT, UPDATE, DELETE, TRUNCATE. Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it also has message header overheads, so bandwidth can be reduced in some, but not all, cases.&lt;br /&gt;
&lt;br /&gt;
(TRUNCATE is not yet implemented.)&lt;br /&gt;
&lt;br /&gt;
LOCK statements are not replicated (possible future feature).&lt;br /&gt;
&lt;br /&gt;
Temporary and Unlogged tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables.&lt;br /&gt;
&lt;br /&gt;
DELETE and UPDATE statements that affect multiple rows on upstream master will cause a series of row changes on downstream master - these are likely to run at the same speed as on the origin, as long as an index is defined on the Primary Key of the table on the downstream master. INSERTs on upstream master do not require a unique constraint in order to replicate correctly. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL.&lt;br /&gt;
&lt;br /&gt;
UPDATEs that change the Primary Key of a table will be replicated correctly.&lt;br /&gt;
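&lt;br /&gt;
A minimal Python sketch of why apply needs an identifying key for UPDATEs and DELETEs but not for INSERTs. The dictionary standing in for a table, the id column and the shape of the change tuples are all illustrative assumptions, not BDR's actual change format.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_change(table, change):
    """Apply one row change to a dict that maps primary key to row."""
    kind = change[0]
    if kind == "insert":
        # No lookup of an existing row is needed for an insert.
        _, new_row = change
        table[new_row["id"]] = new_row
    elif kind == "update":
        # The old key locates the row, so a changed primary key still works.
        _, old_key, new_row = change
        del table[old_key]
        table[new_row["id"]] = new_row
    elif kind == "delete":
        _, old_key = change
        del table[old_key]
    else:
        raise ValueError("unknown change kind: " + kind)


customers = {}
apply_change(customers, ("insert", {"id": 1, "name": "Ann"}))
apply_change(customers, ("insert", {"id": 2, "name": "Bob"}))
apply_change(customers, ("update", 1, {"id": 7, "name": "Ann"}))
apply_change(customers, ("delete", 2))
print(customers)
```
&lt;br /&gt;
In this toy model each UPDATE or DELETE is a single keyed lookup, which mirrors the role of the Primary Key index on the downstream master.&lt;br /&gt;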
&lt;br /&gt;
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.&lt;br /&gt;
&lt;br /&gt;
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master.&lt;br /&gt;
&lt;br /&gt;
Current plugin is binary only, requiring upstream and downstream master to use same CPU architecture and word-length, i.e. &amp;quot;identical servers&amp;quot;, as with physical replication.&lt;br /&gt;
&lt;br /&gt;
A textual output option will be available for passing data between non-identical servers, e.g. laptops communicating with a central server.&lt;br /&gt;
&lt;br /&gt;
Changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.&lt;br /&gt;
&lt;br /&gt;
Changes are applied to the downstream master in commit sequence. This is a known-good serialization ordering of changes, so no replication failures are possible, as can happen with statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).&lt;br /&gt;
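&lt;br /&gt;
The accumulate-then-send behaviour described above can be pictured with a small Python model. The class and method names are hypothetical and the model ignores spilling to disk; it only shows that changes are grouped per transaction, released in commit order, and dropped on abort.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


class ChangeBuffer:
    """Toy model: buffer changes per transaction, release them at commit."""

    def __init__(self):
        self._pending = defaultdict(list)  # xid mapped to buffered changes
        self.sent = []                     # (xid, changes) in commit order

    def change(self, xid, description):
        self._pending[xid].append(description)

    def commit(self, xid):
        # Only now does the whole transaction go downstream.
        self.sent.append((xid, self._pending.pop(xid, [])))

    def abort(self, xid):
        # Aborted transactions are discarded and never sent.
        self._pending.pop(xid, None)


buf = ChangeBuffer()
buf.change(1, "INSERT a")
buf.change(2, "UPDATE b")
buf.change(3, "DELETE c")
buf.change(1, "INSERT d")
buf.commit(2)
buf.abort(3)
buf.commit(1)
print(buf.sent)
```
&lt;br /&gt;
Transaction 2 is sent first because it committed first, even though transaction 1 started earlier; transaction 3 never appears downstream.&lt;br /&gt;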
&lt;br /&gt;
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will be to stream changes to the downstream master once they fill the upstream memory buffer, though this is likely to be implemented in 9.5.&lt;br /&gt;
&lt;br /&gt;
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. This means that we always update the correct tables, whatever the setting of search_path.&lt;br /&gt;
&lt;br /&gt;
In some cases, additional deadlocks can occur on apply. This causes a retry of the apply of the replaying transaction.&lt;br /&gt;
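&lt;br /&gt;
A hedged Python sketch of the retry behaviour: the exception class, the function names and the max_attempts limit are assumptions made for illustration, since the text does not say whether or how retries are bounded.&lt;br /&gt;
&lt;br /&gt;
```python
class DeadlockDetected(Exception):
    """Stand-in for a deadlock error raised while applying a transaction."""


def apply_with_retry(apply_txn, max_attempts=5):
    """Call apply_txn until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            apply_txn()
            return attempt
        except DeadlockDetected:
            if attempt == max_attempts:
                raise  # hypothetical limit reached, surface the error


calls = []


def flaky_apply():
    # Simulated apply: deadlocks on the first two attempts, then succeeds.
    calls.append(1)
    if len(calls) in (1, 2):
        raise DeadlockDetected()


attempts_used = apply_with_retry(flaky_apply)
print(attempts_used)
```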
&lt;br /&gt;
Lock waits would cause latency problems/apply delays. This only applies to LOCK statements and DDL.&lt;br /&gt;
&lt;br /&gt;
=== Table definitions and DDL replication ===&lt;br /&gt;
&lt;br /&gt;
DML changes are replicated between tables with matching Schemaname.Tablename on both upstream and downstream masters. e.g. changes from Public.MyTable will go to Public.MyTable and MySchema.MyTable will go to MySchema.MyTable. This works even when no schema is specified on the original SQL since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.&lt;br /&gt;
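&lt;br /&gt;
The name-based mapping can be sketched in Python as two lookups. The OID values and the dictionaries standing in for each node's catalog are invented for illustration only.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical catalogs: the same tables have different OIDs on each node.
upstream_catalog = {
    16384: ("public", "mytable"),
    16390: ("myschema", "mytable"),
}
downstream_catalog = {
    ("public", "mytable"): 24576,
    ("myschema", "mytable"): 24601,
}


def map_relation(upstream_oid):
    """Resolve an upstream table OID to the matching downstream OID by name."""
    qualified_name = upstream_catalog[upstream_oid]
    # A KeyError here corresponds to a table missing on the downstream node.
    return downstream_catalog[qualified_name]


print(map_relation(16384))
print(map_relation(16390))
```
&lt;br /&gt;
Because the lookup goes through the schema-qualified name, the setting of search_path on either node plays no part in choosing the target table.&lt;br /&gt;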
&lt;br /&gt;
This requires careful and exact synchronisation of table definitions on each node otherwise ERRORs will be generated. There are no plans to implement working replication between dissimilar table definitions.&lt;br /&gt;
&lt;br /&gt;
In general, &amp;quot;exact match&amp;quot; is the best guide. Current details (subject to change) are&lt;br /&gt;
* Secondary indexes may differ between nodes&lt;br /&gt;
* Constraints must match for BDR.&lt;br /&gt;
* Storage parameters must match.&lt;br /&gt;
* Table-level parameters, e.g. fillfactor, autovacuum may differ&lt;br /&gt;
* Inheritance must be the same&lt;br /&gt;
&lt;br /&gt;
Triggers and Rules are NOT executed by apply on downstream side, equivalent to an enforced setting of session_replication_role = replica.&lt;br /&gt;
&lt;br /&gt;
Replication of DDL changes between nodes will be possible using event triggers, but is not yet integrated with LLSR.&lt;br /&gt;
&lt;br /&gt;
=== Selective Replication (Table/Row-level filtering) ===&lt;br /&gt;
&lt;br /&gt;
LLSR doesn't yet support selection of data at table or row level, only at database level. It is a design goal to be able to support this in the future.&lt;br /&gt;
&lt;br /&gt;
=== Other Terminology ===&lt;br /&gt;
&lt;br /&gt;
(Physical) Streaming replication talks about Master and Standby, so we could also talk about Master and Physical Standby, and then use Master and Logical Standby to describe LLSR. That terminology doesn't work when we consider that replication might be bi-directional, or could be reconfigured that way in the future.&lt;br /&gt;
&lt;br /&gt;
Similarly, the terms Origin, Provider and Subscriber only work with one Origin.&lt;br /&gt;
&lt;br /&gt;
=== Configuration ===&lt;br /&gt;
&lt;br /&gt;
Upstream master&lt;br /&gt;
&lt;br /&gt;
* wal_level = 'logical'&lt;br /&gt;
* max_logical_slots = X&lt;br /&gt;
* max_wal_senders = Y                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
&lt;br /&gt;
Downstream master&lt;br /&gt;
&lt;br /&gt;
* shared_preload_libraries = 'bdr'&lt;br /&gt;
&lt;br /&gt;
* bdr.connections=&amp;quot;name_of_upstream_master&amp;quot;      # list of upstream master nodenames&lt;br /&gt;
* bdr.&amp;lt;nodename&amp;gt;.dsn = 'dbname=postgres'         # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* (Also need a parameter like bdr.&amp;lt;nodename&amp;gt;.local_dbname = 'xxx' to cover the case where the databasename on upstream and downstream master differ. Not yet implemented)&lt;br /&gt;
&lt;br /&gt;
wal_keep_segments should be set to a value that allows for some downtime of server/network.&lt;br /&gt;
&lt;br /&gt;
=== Tuning ===&lt;br /&gt;
&lt;br /&gt;
As a result of the architecture there are few physical tuning parameters. That may grow as the implementation matures, but not significantly.&lt;br /&gt;
&lt;br /&gt;
There are no parameters for tuning transfer latency.&lt;br /&gt;
&lt;br /&gt;
The only likely tunable is the amount of memory used to accumulate changes before we send them downstream. Similar in many ways to setting of shared_buffers and should be increased on larger machines.&lt;br /&gt;
&lt;br /&gt;
A variant of hot_standby_feedback could be implemented also, though would likely need renaming.&lt;br /&gt;
&lt;br /&gt;
The CRC check while reading WAL is not useful in this context and there will likely be an option to skip that for logical decoding since it can be a CPU bottleneck.&lt;br /&gt;
&lt;br /&gt;
=== Operational Issues and Debugging ===&lt;br /&gt;
&lt;br /&gt;
In LLSR there are no user-level ERRORs that have special meaning. Any ERRORs generated are likely to be serious problems of some kind, apart from apply deadlocks, which may be re-tried.&lt;br /&gt;
&lt;br /&gt;
=== Monitoring ===&lt;br /&gt;
&lt;br /&gt;
Some new/changed views are available for monitoring activity&lt;br /&gt;
&lt;br /&gt;
* pg_stat_replication&lt;br /&gt;
* pg_stat_logical_decoding&lt;br /&gt;
* pg_stat_logical_replication&lt;br /&gt;
&lt;br /&gt;
Object statistics are updated normally on the downstream side, which is essential to keep autovacuum operating normally. If there are no local writes, these two views should show matching results (unless stats have been reset).&lt;br /&gt;
* pg_stat_user_tables&lt;br /&gt;
* pg_statio_user_tables&lt;br /&gt;
&lt;br /&gt;
Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs by non-identifying indexes.&lt;br /&gt;
* pg_stat_user_indexes&lt;br /&gt;
* pg_statio_user_indexes&lt;br /&gt;
&lt;br /&gt;
== Bi-Directional Replication Use Cases ==&lt;br /&gt;
&lt;br /&gt;
Bi-Directional Replication is designed to allow a very wide range of server connection topologies. The simplest to understand would be two servers each sending their changes to the other, which would be produced by making each server the downstream master of the other and so using two connections for each database.&lt;br /&gt;
&lt;br /&gt;
Logical and physical streaming replication are designed to work side-by-side. This means that a master can be replicating using physical streaming replication to a local standby server, while at the same time replicating logical changes to a remote downstream master. Logical replication works alongside cascading replication also, so a physical standby can feed changes to a downstream master, giving a chain from upstream master to physical standby to downstream master.&lt;br /&gt;
&lt;br /&gt;
=== Simple multi-master pair ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;HA Cluster&amp;quot;&lt;br /&gt;
** Server &amp;quot;Alpha&amp;quot; - Master&lt;br /&gt;
** Server &amp;quot;Beta&amp;quot; - Master&lt;br /&gt;
&lt;br /&gt;
===== Configuration =====&lt;br /&gt;
&lt;br /&gt;
* Alpha&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;
** max_logical_slots = 3&lt;br /&gt;
** max_wal_senders = 4                       # Y = max_logical_slots plus any physical streaming requirements&lt;br /&gt;
** wal_keep_segments = 5000&lt;br /&gt;
** shared_preload_libraries = 'bdr'&lt;br /&gt;
** bdr.connections=&amp;quot;beta&amp;quot;                    # list of upstream master nodenames&lt;br /&gt;
** bdr.beta.dsn = 'dbname=postgres'   # connection string for connection from downstream to upstream master&lt;br /&gt;
&lt;br /&gt;
* Beta&lt;br /&gt;
** wal_level = 'logical'&lt;br /&gt;