= Synchronous Replication =
 
Synchronous replication is available starting in PostgreSQL 9.1 by enabling the [http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#GUC-SYNCHRONOUS-STANDBY-NAMES synchronous_standby_names] parameter.  It includes user-controlled durability specified on the master using the [http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT synchronous_commit] parameter.  The design also provides high throughput by allowing concurrent processes to handle the WAL stream.  
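
For example, a minimal master-side configuration in 9.1 might look like the following sketch (the standby name ''standby1'' is illustrative; it must match the application_name with which the standby connects):

<pre>
# postgresql.conf on the master (PostgreSQL 9.1)
# 'standby1' must match the application_name in the standby's
# primary_conninfo; with synchronous_commit = on, each commit waits
# until that standby confirms the WAL has been flushed to its disk.
synchronous_standby_names = 'standby1'
synchronous_commit = on            # also settable per session/transaction
</pre>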
 
  
== Design Notes ==
 
See also [[Synchronous Replication 9/2010 Proposal]], though those notes pertain to a patch different from the one that was committed.

The implementation in 9.1 includes several innovations, beyond [http://wiki.postgresql.org/wiki/Streaming_Replication Fujii Masao's work] providing an earlier synchronous replication implementation for PostgreSQL 9.0:
* Low complexity of code on the standby
* User control: all decisions to wait take place on the master, allowing fine-grained control of synchronous replication. The maximum replication level can also be set on the standby.
* Low bandwidth: a very small response packet, with no increase in the number of responses when the system is under high load, means very little additional bandwidth is required.
* Performance: standby processes work concurrently to give good overall throughput on the standby and minimal latency in all modes. The four performance options don't interfere with each other, so they offer different levels of performance/durability alongside each other.
 
These are major wins for the PostgreSQL project over and above the basic sync rep feature.

=SYNCHRONOUS REPLICATION OVERVIEW=

Synchronous replication offers the guarantee that all changes made by a transaction have been transferred to remote standby nodes. This is an extension to the standard level of durability offered by a transaction commit.

When synchronous replication is requested, the transaction will wait after it commits until it receives confirmation that the transfer has been successful. Waiting for confirmation increases the user's certainty that the transfer has taken place, but it also necessarily increases the response time for the requesting transaction. Synchronous replication usually requires carefully planned and placed standby servers to ensure applications perform acceptably. Waiting doesn't utilise system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will lead to reduced performance for database applications.

It may seem that there is a simple choice between durability and performance. However, there is often a close relationship between the importance of data and how busy the database needs to be, so this is seldom a simple choice. With this patch, PostgreSQL now provides a range of features designed to allow application architects to design a system that has both good overall performance and good durability of the most important data assets.

PostgreSQL allows the application designer to specify the durability level required via replication. This can be specified for the system overall, though it can also be specified for individual transactions. This allows the highest levels of protection to be provided selectively for critical data.

For example, an application might consist of two types of work:
* 10% of changes are changes to important customer details
* 90% of changes are less important data that the business can more easily survive losing, such as chat messages between users.

With sync replication options specified at the application level (on the master) we can offer sync rep for the most important changes, without slowing down the bulk of the total workload. Application level options are an important and practical tool for allowing the benefits of synchronous replication for high performance applications.

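In the form committed in 9.1, this application-level control is exercised through the USERSET parameter synchronous_commit; a sketch (the table name is illustrative):

<pre>
-- Low-value work: skip the synchronous-replication wait for this
-- transaction only; other sessions keep the synchronous default.
BEGIN;
SET LOCAL synchronous_commit = local;
INSERT INTO chat_messages (sender, body) VALUES ('alice', 'hi');
COMMIT;

-- Important changes simply run with the default synchronous_commit = on
-- and wait for the configured synchronous standby to confirm the flush.
</pre>
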
Without sync rep options specified at the application level, we would have to choose between slowing down 90% of the workload because 10% of it is important, giving up our durability goals for the sake of performance, or splitting those two functions onto separate database servers so that options can be set differently on each. None of those three options is truly attractive.

PostgreSQL also allows the system administrator to specify the service levels offered by standby servers. This allows multiple standby servers to work together in various roles within a server farm.

''Note: the information about the parameters used here reflects an earlier version of this feature, and needs to be updated to reflect the form in which it was committed to 9.1.''

Control of this feature relies on just three parameters.

On the master we can set:
* synchronous_replication
* synchronous_replication_timeout

On the standby we can set:
* synchronous_replication_service

These are explained in more detail in the following sections.

=USER'S OVERVIEW=

Two new USERSET parameters on the master control this:
* synchronous_replication = async (default) | recv | fsync | apply
* synchronous_replication_timeout = 0+ (0 means never time out; the default timeout is 10sec)
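
A sketch of how this might have looked in postgresql.conf, using the superseded proposal's parameter names (the committed 9.1 feature uses synchronous_standby_names and synchronous_commit instead):

<pre>
# postgresql.conf on the master -- proposal syntax, never committed in
# this form; shown only to illustrate the parameters described here.
synchronous_replication = fsync           # async | recv | fsync | apply
synchronous_replication_timeout = 10s     # 0 means wait forever
</pre>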
 
synchronous_replication = async is the default and means that no synchronisation is requested and so the commit will not wait. This is the fastest setting. The word async is short for "asynchronous" and you may see the term asynchronous replication discussed.

Other settings refer to progressively higher levels of durability. The higher the level of durability requested, the longer the wait for that level of durability to be achieved.

The precise meaning of the synchronous_replication settings is:
* async - commit does not wait for a standby before replying to the user
* recv - commit waits until the standby has received the WAL
* fsync - commit waits until the standby has received and fsynced the WAL
* apply - commit waits until the standby has received, fsynced and applied the WAL
This provides a simple, easily understood mechanism - and one that in its default form is very similar to other RDBMSs (e.g. Oracle).

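Because synchronous_replication was proposed as a USERSET parameter, an individual transaction could have requested a higher level; a hypothetical sketch in the proposal's (superseded) syntax, with an illustrative table:

<pre>
-- Hypothetical proposal syntax, never committed in this form:
-- ask this commit to wait until the standby has applied the WAL.
BEGIN;
SET LOCAL synchronous_replication = apply;
UPDATE customers SET email = 'new@example.com' WHERE id = 42;
COMMIT;  -- returns once a standby reports the change as applied
</pre>
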
Note that in apply mode it is possible that the changes could be accessible on the standby before the transaction that made the change has been notified that the change is complete. This is a minor issue.

Network delays may occur and the standby may also crash. If no reply is received within the timeout, we raise a NOTICE and then return a successful commit (no other action is possible). Note that it is possible to request that we never time out, so if no standby is available we wait for one to appear.

When a user commits, if the master does not have a currently connected standby offering the required level of replication, it will pick the next best available level of replication. It is up to the sysadmin to provide a sufficient range of standby nodes to ensure at least one is available to meet the requested service levels.

If multiple standbys exist, the first standby to reply that the desired level of durability has been achieved will release the waiting commit on the master. Other options are also available via a plugin.

==ADMINISTRATOR'S OVERVIEW==

On the standby we specify the highest type of replication service offered by this standby server. This information is passed to the master server when the standby connects for replication.

This allows sysadmins to designate preferred standbys. It also allows sysadmins to completely refuse to offer a synchronous replication service, allowing a master to explicitly avoid synchronisation across low bandwidth or high latency links.

An additional parameter can be set in recovery.conf on the standby:

* synchronous_replication_service = async (default) | recv | fsync | apply

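A sketch of the standby side, again in the superseded proposal's syntax (recovery.conf was the standby configuration file of this era; the host and application_name are illustrative):

<pre>
# recovery.conf on the standby -- proposal syntax, never committed
standby_mode = 'on'
primary_conninfo = 'host=master port=5432 application_name=standby1'
# Highest service level this standby is willing to offer the master:
synchronous_replication_service = fsync   # async | recv | fsync | apply
</pre>
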
= IMPLEMENTATION =

Some aspects can be changed without significantly altering the basic proposal; for example, master-specified standby registration wouldn't really alter this very much.

== STANDBY ==

Master-controlled sync rep means that all user wait logic is centred on the master. The details of sync rep requests on the master are not sent to the standby, so there is no additional master-to-standby traffic and no standby-side bookkeeping overhead. It also reduces the complexity of the standby code.

On the standby side the WAL Writer now operates during recovery. This frees the WALReceiver to spend more time sending and receiving messages, thereby minimising latency for users choosing the "recv" option. We now have 3 processes handling WAL in an asynchronous pipeline: the WAL Receiver reads WAL data from the libpq connection then writes it to the WAL file, the WAL Writer then fsyncs the WAL file, and then the Startup process replays the WAL. These processes act independently, so the WAL pointers (LSNs) are ordered as WALReceiverLSN >= WALWriterLSN >= StartupLSN.

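In PostgreSQL 9.1 as released (where the standby-side WAL writer described here was not adopted), two of these positions can be observed with the standard monitoring functions, and should always respect the ordering above:

<pre>
-- On the standby: the receive LSN should always be >= the replay LSN,
-- matching the pipeline ordering described above.
SELECT pg_last_xlog_receive_location() AS receive_lsn,
       pg_last_xlog_replay_location()  AS replay_lsn;
</pre>
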
For each new message the WALReceiver gets from the master, we issue a reply. Each reply sends the current state of the 3 LSNs, so the reply message size is only 28 bytes. Replies are sent half-duplex, i.e. we don't reply while a new message is arriving.

Note that there is absolutely not one reply per transaction on the master. The standby knows nothing about what has been requested on the master - replies always refer to the latest standby state and effectively batch the responses.

We act according to the requested synchronous_replication_service:
* async - no replies are sent
* recv - replies are sent upon receipt only
* fsync - replies are sent upon receipt and following fsync only
* apply - replies are sent following receipt, fsync and apply

Replies are sent at the next available opportunity.

In apply mode, when the WALReceiver is completely quiet this means we send 3 reply messages - one at recv, one at fsync and one at apply. When the WALreceiver is busy the volume of messages does *not* increase, since the reply can't be sent until the current incoming message has been received, after which we were going to reply anyway, so it is not an additional message. This means we piggyback an "apply" response onto a later "recv" reply. As a result we get minimum response times in *all* modes and maximum throughput is not impaired at all.

When each new message arrives from the master, the WALreceiver writes the new data to the WAL file, wakes the WALwriter and then replies. Each new message from the master receives a reply. If no further WAL data has been received, the WALreceiver waits on the latch. If the WALReceiver is woken by the WALWriter or Startup then it will reply to the master with a message, even if no new WAL has been received.

So in the recv, fsync and apply cases alike, a reply message is sent to the master as soon as possible, so in all cases the wait time is minimised.

When the WALwriter is woken it checks whether there is outstanding WAL data and, if so, fsyncs it and wakes both the WALreceiver and Startup. When no WAL remains it waits on the latch.

The Startup process will wake the WALreceiver when it has reached the end of the latest chunk of WAL. If no further WAL is available then it waits on its latch.

== MASTER ==

When user backends request sync rep they wait in a queue ordered by requested LSN. A separate queue exists for each request mode.

The WALSender receives the 3 LSNs from the standby. It then wakes backends in sequence from each queue.

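In the feature as committed in 9.1, the positions each standby reports back are visible on the master through the pg_stat_replication view:

<pre>
-- On the master: per-standby WAL positions reported by each WALSender,
-- and whether that standby is currently serving as the synchronous one.
SELECT application_name, state,
       sent_location, write_location, flush_location, replay_location,
       sync_state
FROM pg_stat_replication;
</pre>
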
We provide a single wakeup rule: the first WALSender to reply with the requested XLogRecPtr will wake the backend. This guarantees that the WAL data for the commit is transferred as requested to at least one standby. That is sufficient for the use cases we have discussed.

More complex wakeup rules would be possible via a plugin.

Wait timeout would be set by individual backends with a timer, just as we do for statement_timeout.

= CODE =

The total code to implement this is low. It breaks down into five areas:
* Zoltan's libpq changes, included almost verbatim; fairly modular, so easy to replace with something we like better
* A new module, syncrep.c and syncrep.h, handling the backend wait/wakeup
* Light changes to allow streaming rep to make the appropriate calls
* A small amount of code to allow the WALWriter to be active in recovery
* Parameter code

No docs yet.

The patch works on top of latches, though does not rely upon them for its bulk performance characteristics. Latches only improve response time for very low transaction rates; latches provide no additional throughput for medium to high transaction rates.

= PERFORMANCE ANALYSIS =

Since we reply to each new chunk sent from the master, "recv" mode has absolutely minimal latency, especially since the WALreceiver no longer performs the majority of fsyncs, as it did in the 9.0 code. The WALreceiver does not wait for fsync or apply actions to complete before we reply, so fsync and apply modes will always wait for at least 2 standby->master messages, which is appropriate because those actions will typically occur much later.

This response mechanism offers the highest response performance achievable in "recv" mode and very good throughput under load. Note that the different modes do not interfere with each other and can co-exist happily while providing the highest performance.

Starting the WALWriter is helpful, no matter what synchronous_replication_service is specified.

Can we optimise the sending of reply messages so that only chunks that contain a commit deserve a reply? We could, but then we'd need to do extra bookkeeping work on the master. It would need to be demonstrated that there is a performance issue big enough to be worth the overhead on the master and the extra code.

Is there an optimisation from reducing the number of options the standby provides? The architecture on the standby side doesn't rely heavily on the service level specified, nor does it rely in any way on the actual sync rep mode specified on the master. No further simplification is possible.

= NOT YET IMPLEMENTED =

* Timeout code & NOTICE
* Code and test plugin
* Loops in walsender, walwriter and receiver treat shutdown incorrectly

I haven't yet looked at Fujii's code for this, and am not even sure where it is, though I hope to do so in the future. Zoltan's libpq code is the only part of that patch used.

So far I have spent 3.5 days on this and expect to complete tomorrow. I think that throws out the argument that this proposal is too complex to develop in this release.

= OTHER ISSUES =

* How should the master behave when we shut it down?
* How should the standby behave when we shut it down?

[[Category:Replication]]