Logical Log Streaming Replication
From PostgreSQL wiki
Logical log streaming replication (LLSR) allows one PostgreSQL master (the "upstream master") to stream a sequence of changes to another read/write PostgreSQL server (the "downstream master"). Data is sent in one direction only over a normal libpq connection.
Multiple LLSR connections can be used to set up bi-directional replication using the BDR extension.
Overview of logical replication
In some ways LLSR is similar to "streaming replication" i.e. physical log streaming replication (PLSR) from a user perspective; both replicate changes from one server to another. However, in LLSR the receiving server is also a full master database that can make changes, unlike the read-only replicas offered by PLSR hot standby. Additionally, LLSR is per-database, whereas PLSR is per-cluster and replicates all databases at once. There are many more differences discussed in the relevant sections of this document.
In LLSR the data that is replicated is change data in a special format that allows the changes to be logically reconstructed on the downstream master. The changes are generated by reading transaction log (WAL) data, making change capture on the upstream master much more efficient than trigger based replication, hence why we call this "logical log replication". Changes are passed from upstream to downstream using the libpq protocol, just like with physical log streaming replication.
One connection is required for each PostgreSQL database that is replicated. If two servers are connected, each of which has 50 databases, then 50 connections are required to send changes in one direction, from upstream to downstream. Each database to replicate must be explicitly specified, so it is possible to filter out unwanted databases simply by not configuring replication for them.
Setting up replication for new databases is not (yet?) automatic, so additional configuration steps are required after CREATE DATABASE. A restart of the downstream master is also required. The upstream master only needs restarting if the max_replication_slots parameter is too low to allow a new replica to be added. Adding replication for databases that do not exist yet will cause an ERROR, as will dropping a database that is being replicated. Setup is discussed in more detail below.
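Before adding replication for a new database it can be worth checking whether the upstream master has a free replication slot, to know in advance whether a restart will be needed. A rough check, using the standard catalog view and GUC (names as in stock PostgreSQL, not BDR-specific):

```sql
-- Compare slots in use against the configured maximum on the upstream master.
-- If slots_in_use equals max_replication_slots, the parameter must be raised
-- and the server restarted before a new replica can be added.
SELECT count(*) AS slots_in_use FROM pg_replication_slots;
SHOW max_replication_slots;
```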
Changes are processed by the downstream master using BDR plugins. This allows flexible handling of replication input, including:
- BDR apply process - applies logical changes to the downstream master. The apply process makes changes directly rather than generating SQL text and then parse/plan/executing SQL.
- Textual output plugin - a demo plugin that generates SQL text (but doesn't apply changes)
- pg_xlogdump - examines physical WAL records and produces textual debugging output. This server program is included in PostgreSQL 9.3.
Replication of DML changes
All changes are replicated: INSERT, UPDATE, DELETE and TRUNCATE.
(TRUNCATE is not yet implemented, but will be implemented before the feature goes to final release).
Actions that generate WAL data but don't represent logical changes do not result in data transfer, e.g. full page writes, VACUUMs, hint bit setting. LLSR avoids much of the overhead from physical WAL, though it has overheads that mean that it doesn't always use less bandwidth than PLSR.
Locks taken by LOCK and SELECT ... FOR UPDATE/SHARE on the upstream master are not replicated to downstream masters. Locks taken automatically by INSERT, UPDATE, DELETE or TRUNCATE *are* taken on the downstream master and may delay replication apply or concurrent transactions - see Lock Conflicts.
TEMPORARY and UNLOGGED tables are not replicated. In contrast to physical standby servers, downstream masters can use temporary and unlogged tables. However, temporary tables remain specific to a particular session so creating a temporary table on the upstream master does not create a similar table on the downstream master.
DELETE and UPDATE statements that affect multiple rows on the upstream master will cause a series of row changes on the downstream master. These are likely to be applied at the same speed as on the origin, as long as an index is defined on the primary key of the table on the downstream master. UPDATEs and DELETEs require some form of unique constraint, either PRIMARY KEY or UNIQUE NOT NULL. A warning is issued in the downstream master's logs if the expected constraint is absent. INSERTs on the upstream master do not require a unique constraint in order to replicate correctly, though such usage would prevent conflict detection between multiple masters, if that is considered important.
UPDATEs that change the value of the Primary Key of a table will be replicated correctly.
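For example, a table declared like the following (a hypothetical table, for illustration) replicates UPDATEs and DELETEs without warnings, because the primary key gives the apply process a unique way to find each target row:

```sql
-- A PRIMARY KEY (or UNIQUE NOT NULL constraint) lets the downstream
-- apply process locate the target row for each UPDATE and DELETE.
CREATE TABLE orders (
    order_id bigint PRIMARY KEY,
    customer text NOT NULL,
    total    numeric(12,2)
);
-- Without such a constraint, INSERTs still replicate, but UPDATE/DELETE
-- apply is logged with a warning and multi-master conflict detection
-- is not possible for this table.
```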
The values applied are the final values from the UPDATE on the upstream master, including any modifications from before-row triggers, rules or functions. Any self-referential expressions, such as N = N + 1, are resolved to their final value. Volatile or stable functions are evaluated on the master side and the resulting values are replicated. Consequently any function side-effects (writing files, network socket activity, updating internal PostgreSQL variables, etc) will not occur on the replicas, as the functions are not run again on the replica.
All columns are replicated on each table. Large column values that would be placed in TOAST tables are replicated without problem, avoiding de-compression and re-compression. If we update a row but do not change a TOASTed column value, then that data is not sent downstream.
All data types are handled, not just the built-in datatypes of PostgreSQL core. The only requirement is that user-defined types are installed identically in both upstream and downstream master (see "Limitations").
The LLSR plugin uses the binary libpq protocol where the upstream and downstream masters are binary-compatible, i.e. they have the same PostgreSQL major version, same processor architecture and compatible compilation options. Where the upstream and downstream masters are not binary compatible, replication will fall back to the text protocol normally used for PostgreSQL client/server communication. In case of version differences it may also be necessary to upgrade the bdr extension on the older server to match the newer server.
Sets of changes are accumulated in memory (spilling to disk where required) and then sent to the downstream server at commit time. Aborted transactions are never sent. Application of changes on downstream master is currently single-threaded, though this process is efficiently implemented. Parallel apply is a possible future feature, especially for changes made while holding AccessExclusiveLock.
Changes are applied to the downstream master in the sequence in which they were committed on the upstream master. This is a known-good serialized ordering of changes, so replication serialization failures are not theoretically possible. Such failures are common in systems that use statement based replication (e.g. MySQL) or trigger based replication (e.g. Slony version 2.0). Users should note that this means the original order of locking of tables is not maintained. Although lock order is provably not an issue for the set of locks held on upstream master, additional locking on downstream side could cause lock waits or deadlocking in some cases. (Discussed in further detail later).
Larger transactions spill to disk on the upstream master once they reach a certain size. Currently, large transactions can cause increased latency. A future enhancement will stream changes to the downstream master once they fill the upstream memory buffer; this is likely to be implemented in 9.5.
SET statements and parameter settings are not replicated. This has no effect on replication since we only replicate actual changes, not anything at SQL statement level. We always update the correct tables, whatever the setting of search_path. Values are replicated correctly irrespective of the values of bytea_output, TimeZone, DateStyle, etc.
NOTIFY is not supported across log based replication, either physical or logical. NOTIFY and LISTEN will work fine on the upstream master but an upstream NOTIFY will not trigger a downstream LISTENer.
In some cases, additional deadlocks can occur during apply. These cause the replaying transaction to be retried automatically, and are only a problem if the deadlock recurs repeatedly, delaying replication.
From a performance and concurrency perspective the BDR apply process is similar to a normal backend. Frequent conflicts with locks from other transactions when replaying changes can slow things down and thus increase replication delay, so reducing the frequency of such conflicts can be a good way to speed things up. Any lock held by another transaction on the downstream master - LOCK statements, SELECT ... FOR UPDATE/FOR SHARE, or INSERT/UPDATE/DELETE row locks - can delay replication if the replication apply process needs to change the locked table/row.
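To see whether the apply process is currently held up by locks on the downstream master, the standard pg_locks view can be queried; a generic sketch (not BDR-specific) is:

```sql
-- List lock requests that are waiting, with the relation involved,
-- so blocking transactions on the downstream master can be identified.
SELECT l.pid, l.locktype, l.mode, c.relname
FROM pg_locks l
LEFT JOIN pg_class c ON c.oid = l.relation
WHERE NOT l.granted;
```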
Table definitions and DDL replication
DML changes are replicated between tables with matching "Schemaname"."Tablename" on both upstream and downstream masters. e.g. changes from the upstream's public.mytable will go to the downstream's public.mytable, while changes to the upstream myschema.mytable will go to the downstream myschema.mytable. This works even when no schema is specified in the original SQL, since we identify the changed table from its internal OIDs in WAL records and then map that to whatever internal identifier is used on the downstream node.
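Because the changed table is identified by OID and re-mapped by schema-qualified name, the following two statements on the upstream master (against a hypothetical myschema.mytable) produce the same change stream downstream:

```sql
SET search_path = myschema;
UPDATE mytable SET val = 1 WHERE id = 42;          -- resolved to myschema.mytable

UPDATE myschema.mytable SET val = 1 WHERE id = 42; -- identical change downstream
```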
This requires careful synchronization of table definitions on each node otherwise ERRORs will be generated by the replication apply process. In general, tables must be an exact match between upstream and downstream masters.
There are no plans to implement working replication between dissimilar table definitions.
Tables must meet the following requirements to be compatible for purposes of LLSR. The best way to ensure an exact match is to define the table on one node and allow DDL replication to copy its definition to the other nodes, or to use init_replica to copy the definitions when bringing up a new BDR node. If you don't define a table, type, etc. manually on multiple nodes then you won't have to worry; BDR takes care of ensuring compatibility for you.
The requirements for compatibility are:
- The downstream master must only have constraints (CHECK, UNIQUE, EXCLUSION, FOREIGN KEY, etc) that are also present on the upstream master. Replication may initially work with mismatched constraints but is likely to fail as soon as the downstream master rejects a row the upstream master accepted.
- The table referenced by a FOREIGN KEY on a downstream master must have all the keys present in the upstream master version of the same table.
- Storage parameters must match, except as allowed below
- Inheritance must be the same
- Dropped columns on master must be present on replicas
- Custom types and enum definitions must match exactly
- Composite types and enums must have the same oids on master and replication target
- Extensions defining types used in replicated tables must be of the same version or fully SQL-level compatible and the oids of the types they define must match.
The following differences are permissible between tables on different nodes:
- The table's pg_class oid, the oid of its associated TOAST table, and the oid of the table's rowtype in pg_type may differ;
- Extra or missing non-UNIQUE indexes
- Extra keys in downstream lookup tables for FOREIGN KEY references that are not present on the upstream master
- The table-level storage parameters for fillfactor and autovacuum
- Triggers and rules may differ (they are not executed by replication apply)
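A rough way to verify that composite type and enum oids match between nodes is to run the same catalog query on each node and diff the output; a sketch using the standard catalogs:

```sql
-- Run on each node and compare: user-defined enum and composite types
-- with their oids. (Composite types 'c' include table rowtypes, whose
-- oids are permitted to differ, per the list above.)
SELECT t.oid, n.nspname, t.typname, t.typtype
FROM pg_type t
JOIN pg_namespace n ON n.oid = t.typnamespace
WHERE t.typtype IN ('e', 'c')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY n.nspname, t.typname;
```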
Replication of DDL changes between nodes is performed using event triggers, with partial support integrated in bdr-next (see LLSR limitations).
CREATE TRIGGER will be replicated to other nodes. Once created, triggers and rules are executed only on the origin side; they are NOT executed by apply on the downstream side, equivalent to an enforced setting of session_replication_role = origin.
In future it is expected that composite types and enums with non-identical oids will be converted using text output and input functions. This feature is not yet implemented.
LLSR limitations
The current LLSR implementation is subject to some limitations, which are being progressively removed as work progresses.
Data definition compatibility
Table definitions, types, extensions, etc must be near identical between upstream and downstream masters. See Table definitions and DDL replication.
Most DDL is replicated between nodes automatically. Unsupported DDL is detected by the replication extension (part of BDR, not logical changeset extraction) and rejected with an error.
TRUNCATE is not replicated
TRUNCATE is not yet supported.
The safest option is to define a user-level BEFORE trigger on each table that RAISEs an ERROR when TRUNCATE is attempted.
A simple truncate-blocking trigger is:
CREATE OR REPLACE FUNCTION deny_truncate() RETURNS trigger AS $$
BEGIN
    IF tg_op = 'TRUNCATE' THEN
        RAISE EXCEPTION 'TRUNCATE is not supported on this table. Please use DELETE FROM.';
    ELSE
        RAISE EXCEPTION 'This trigger only supports TRUNCATE';
    END IF;
END;
$$ LANGUAGE plpgsql;
It can be applied to a table with:
CREATE TRIGGER deny_truncate_on_<tablename>
BEFORE TRUNCATE ON <tablename>
FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate();
A PL/PgSQL DO block that queries pg_class and loops over it to EXECUTE a dynamic SQL CREATE TRIGGER command for each table that does not already have the trigger can be used to apply the trigger to all tables.
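Such a DO block might look like the following sketch, which skips system schemas and tables that already carry the trigger:

```sql
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN
        SELECT c.oid::regclass AS tbl, c.relname
        FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE c.relkind = 'r'
          AND n.nspname NOT IN ('pg_catalog', 'information_schema')
          AND NOT EXISTS (SELECT 1 FROM pg_trigger tg
                          WHERE tg.tgrelid = c.oid
                            AND tg.tgname = 'deny_truncate_on_' || c.relname)
    LOOP
        -- Dynamically create the statement-level trigger on each table
        EXECUTE format('CREATE TRIGGER %I BEFORE TRUNCATE ON %s
                        FOR EACH STATEMENT EXECUTE PROCEDURE deny_truncate()',
                       'deny_truncate_on_' || r.relname, r.tbl);
    END LOOP;
END;
$$;
```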
Alternately, there will be a ProcessUtility_hook available in the BDR extension to automatically prevent unsupported operations like TRUNCATE.
Configuration
The following configuration is an example of a simple one-way LLSR replication setup - a single upstream master to a single downstream master.
(This example was valid as of BDR 0.5. BDR 0.6 has not been tested with this configuration and may not work due to the addition of DDL replication, inter-node messaging, etc. The unidirectional variant of BDR, when available, will support this kind of configuration.)
The upstream master (sender)'s postgresql.conf should contain settings like:
wal_level = 'logical'        # Include enough info for logical replication
max_replication_slots = X    # Number of LLSR senders + any receivers
max_wal_senders = Y          # Y = max_replication_slots plus any physical
                             # streaming requirements
track_commit_timestamp = on  # Not strictly required for LLSR, only for BDR
                             # conflict resolution.
Downstream (receiver) postgresql.conf:
shared_preload_libraries = 'bdr'
bdr.connections = "name_of_upstream_master"  # list of upstream master nodenames
bdr.<nodename>_dsn = 'dbname=postgres'       # connection string for connection
                                             # from downstream to upstream master
bdr.<nodename>_local_dbname = 'xxx'          # optional parameter to cover the case
                                             # where the database name on upstream
                                             # and downstream master differ
                                             # (not yet implemented)
bdr.<nodename>_apply_delay                   # optional parameter to delay apply of
                                             # transactions, time in milliseconds
bdr.synchronous_commit = off                 # optional parameter to set the
                                             # synchronous_commit parameter the
                                             # apply processes will be using.
                                             # Safe to set to 'off' unless you're
                                             # doing synchronous replication.
max_replication_slots = X                    # set to the number of remotes
track_commit_timestamp = on                  # not strictly required for LLSR,
                                             # only for BDR conflict resolution
Note that a server can be both sender and receiver, whether as two servers replicating to each other or in more complex configurations like replication chains or trees.
The upstream (sender) pg_hba.conf must be configured to allow the downstream master to connect for replication. Otherwise you'll see errors like the following on the downstream master:
FATAL: could not connect to the primary server: FATAL: no pg_hba.conf entry for replication connection from host "[local]", user "postgres"
A suitable pg_hba.conf entry for a replication connection from the replica server 10.1.4.8 might be:
host replication postgres 10.1.4.8/32 trust
(The user name should match the user name configured in the downstream master's dsn; md5 password authentication is also supported.)
For more details on these parameters, see Parameter Reference.
Starting a new LLSR connection
Logical replication is started automatically when a database is configured as a downstream master in postgresql.conf (see Configuration) and the postmaster is started. No explicit action is required to start replication, but replication will not actually work unless the upstream and downstream databases are identical within the requirements set by LLSR in the Table definitions and DDL replication section.
pg_dump and pg_restore may be used to set up the new replica's database.
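For instance (hypothetical host and database names; the restored schema must satisfy the compatibility requirements in Table definitions and DDL replication):

```shell
# Dump the upstream database in custom format, then restore it
# into the pre-created database on the downstream master.
pg_dump -Fc -h upstream-host -d mydb -f mydb.dump
pg_restore -h downstream-host -d mydb mydb.dump
```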
The development version in the bdr-next branch will automatically dump the upstream database and populate the local database if the bdr.<nodename>.init_replica setting is configured; see the parameter reference above.
Pausing and resuming logical replication
You can execute the bdr_apply_pause() function to temporarily pause logical replication. Changes will once again be applied once you execute bdr_apply_resume().
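These are ordinary SQL functions, invoked from a session on the downstream master:

```sql
SELECT bdr_apply_pause();   -- temporarily stop applying incoming changes
-- ... perform maintenance work ...
SELECT bdr_apply_resume();  -- changes are applied again from where apply left off
```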
Temporarily stopping an LLSR replica
LLSR replicas can be temporarily stopped by shutting down the downstream master's postmaster.
A stopped replica will still cause the upstream master to retain WAL for it, eventually causing the upstream master to run out of disk space in pg_xlog. Do not leave a replica shut down for too long - if it's going to be out of service for an extended period, consider dropping the upstream slot and retiring the replica, then creating a new one later.
Once you remove an upstream slot you cannot simply rejoin a replica and have it catch up. It must be rebuilt.
Removing an LLSR replica permanently
To remove a replication connection permanently, remove its entries from the downstream master's postgresql.conf, restart the downstream master, then remove its slot from the upstream master with SELECT pg_drop_replication_slot('slotname') as a superuser. See the main PostgreSQL documentation.
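For example, as a superuser on the upstream master, first list the slots to find the right name, then drop it (the slot must no longer be active, i.e. the downstream master must already be disconnected):

```sql
-- Identify the slot belonging to the retired replica...
SELECT slot_name, database, active FROM pg_replication_slots;

-- ...then drop it, releasing the WAL the upstream master was retaining.
SELECT pg_drop_replication_slot('slotname');
```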
Alternately, you can use pg_receivellog:
pg_receivellog -p 5434 -h master-hostname -d dbname \
    --slot='bdr: 16384:5873181566046043070-1-16384:' --stop
It is important to remove the replication slot from the upstream master(s) to prevent xid wrap-around problems and issues with table bloat caused by delayed vacuum, and to prevent the upstream master from retaining WAL for the dead replica until it runs out of pg_xlog space.
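To monitor how much WAL the upstream master is retaining for each slot, compare each slot's restart_lsn with the current WAL insert position (a sketch using the 9.4-era catalog and function names):

```sql
-- Amount of WAL each slot is forcing the upstream master to keep in pg_xlog.
SELECT slot_name, active,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),
                                            restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```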