Synchronous Replication 9/2010 Proposal
This page serves as documentation for a Synchronous Replication patch posted in September 2010 by Simon Riggs. Note, this proposal is somewhat different than the version which ended up being committed: see Synchronous replication for more details.
WHAT'S DIFFERENT ABOUT THIS PATCH?
The implementation in 9.1 includes several innovations, beyond Fujii Masao's work providing an earlier synchronous replication implementation for PostgreSQL 9.0:
- Low complexity of code on Standby
- User control: All decisions to wait take place on master, allowing fine-grained control of synchronous replication. Max replication level can also be set on the standby.
- Low bandwidth: Very small response packet size with no increase in number of responses when system is under high load means very little additional bandwidth required
- Performance: Standby processes work concurrently to give good overall throughput on standby and minimal latency in all modes. 4 performance options don't interfere with each other, so offer different levels of performance/durability alongside each other.
These are major wins for PostgreSQL project over and above the basic sync rep feature.
SYNCHRONOUS REPLICATION OVERVIEW
Synchronous replication offers the guarantee that all changes made by a transaction have been transferred to remote standby nodes. This is an extension to the standard level of durability offered by a transaction commit.
When synchronous replication is requested the transaction will wait after it commits until it receives confirmation that the transfer has been successful. Waiting for confirmation increases the user's certainty that the transfer has taken place but it also necessarily increases the response time for the requesting transaction. Synchronous replication usually requires carefully planned and placed standby servers to ensure applications perform acceptably. Waiting doesn't utilise system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will lead to reduced performance for database applications.
It may seem that there is a simple choice between durability and performance. However, there is often a close relationship between the importance of data and how busy the database needs to be, so this is seldom a simple choice. With this patch, PostgreSQL now provides a range of features designed to allow application architects to design a system that has both good overall performance and yet good durability of the most important data assets.
PostgreSQL allows the application designer to specify the durability level required via replication. This can be specified for the system overall, though it can also be specified for individual transactions. This allows to selectively provide highest levels of protection for critical data.
For example we, an application might consist of two types of work:
- 10% of changes are changes to important customer details
- 90% of changes are less important data that the business can more easily survive if it is lost, such as chat messages between users.
With sync replication options specified at the application level (on the master) we can offer sync rep for the most important changes, without slowing down the bulk of the total workload. Application level options are an important and practical tool for allowing the benefits of synchronous replication for high performance applications.
Without sync rep options specified at app level, we would have a choice of either slowing down 90% of the workload because 10% of it is important. Or giving up our durability goals because of performance. Or splitting those two functions onto separate database servers so that we can set options differently on each. None of those 3 options is truly attractive.
PostgreSQL also allows the system administrator the ability to specify the service levels offered by standby servers. This allows multiple standby servers to work together in various roles within a server farm.
Note: the information about the parameters used here reflects and earlier version of this feature, and needs to be updated to reflect the form it was committed into 9.1 as
Control of this feature relies on just 3 parameters: On the master we can set
On the standby we can set
These are explained in more detail in the following sections.
Two new USERSET parameters on the master control this
- synchronous_replication = async (default) | recv | fsync | apply
- synchronous_replication_timeout = 0+ (0 means never timeout)
(default timeout 10sec)
synchronous_replication = async is the default and means that no synchronisaton is requested and so the commit will not wait. This is the fastest setting. The word async is short for "asynchronous" and you may see the term asynchronous replication discussed.
Other settings refer to progressively higher levels of durability. The higher the level of durability requested, the longer the wait for that level of durability to be achieved.
The precise meaning of the synchronous_replication settings is
- async - commit does not wait for a standby before replying to user
- recv - commit waits until standby has received WAL
- fsync - commit waits until standby has received and fsynced WAL
- apply - commit waits until standby has received, fsynced and applied
This provides a simple, easily understood mechanism - and one that in its default form is very similar to other RDBMS (e.g. Oracle).
Note that in apply mode it is possible that the changes could be accessible on the standby before the transaction that made the change has been notified that the change is complete. Minor issue.
Network delays may occur and the standby may also crash. If no reply is received within the timeout we raise a NOTICE and then return successful commit (no other action is possible). Note that it is possible to request that we never timeout, so if no standby is available we wait for it one to appear.
When user commits, if the master does not have a currently connected standby offering the required level of replication it will pick the next best available level of replication. It is up to the sysadmin to provide sufficient range of standby nodes to ensure at least one is available to meet the requested service levels.
If multiple standbys exist, the first standby to reply that the desired level of durability has been achieved will release the waiting commit on the master. Other options are available also via a plugin.
On the standby we specify the highest type of replication service offered by this standby server. This information is passed to the master server when the standby connects for replication.
This allows sysadmins to designate preferred standbys. It also allows sysadmins to completely refuse to offer a synchronous replication service, allowing a master to explicitly avoid synchronisation across low bandwidth or high latency links.
An additional parameter can be set in recovery.conf on the standby
- synchronous_replication_service = async (def) | recv | fsync | apply
Some aspects can be changed without significantly altering basic proposal, for example master-specified standby registration wouldn't really alter this very much.
Master-controlled sync rep means that all user wait logic is centred on the master. The details of sync rep requests on the master are not sent to the standby, so there is no additional master to standby traffic nor standby-side bookkeeping overheads. It also reduces complexity of standby code.
On the standby side the WAL Writer now operates during recovery. This frees the WALReceiver to spend more time sending and receiving messages, thereby minimising latency for users choosing the "recv" option. We now have 3 processes handling WAL in an asynchronous pipeline: WAL Receiver reads WAL data from the libpq connection then writes it to the WAL file, the WAL Writer then fsyncs the WAL file and then the Startup process replays the WAL. These processes act independently, so WAL pointers (LSNs) are defined as WALReceiverLSN >= WALWriterLSN >= StartupLSN
For each new message WALReceiver gets from master we issue a reply. Each reply sends the current state of the 3 LSNs, so the reply message size is only 28 bytes. Replies are sent half-duplex, i.e. we don't reply while a new message is arriving.
Note that there is absolutely not one reply per transaction on the master. The standby knows nothing about what has been requested on the master - replies always refer to the latest standby state and effectively batch the responses.
We act according to the requested synchronous_replication_service
- async - no replies are sent
- recv - replies are sent upon receipt only
- fsync - replies are sent upon receipt and following fsync only
- apply - replies are sent following receipt, fsync and apply.
Replies are sent at the next available opportunity.
In apply mode, when the WALReceiver is completely quiet this means we send 3 reply messages - one at recv, one at fsync and one at apply. When WALreceiver is busy the volume of messages does *not* increase since the reply can't be sent until the current incoming message has been received, after which we were going to reply anyway so it is not an additional message. This means we piggyback an "apply" response onto a later "recv" reply. As a result we get minimum response times in *all* modes and maximum throughput is not impaired at all.
When each new messages arrives from master the WALreceiver will write the new data to the WAL file, wake the WALwriter and then reply. Each new message from master receives a reply. If no further WAL data has been received the WALreceiver waits on the latch. If the WALReceiver is woken by WALWriter or Startup then it will reply to master with a message, even if no new WAL has been received.
So in both recv, fsync and apply cases a message as soon as possible to master, so in all cases the wait time is minimised.
When WALwriter is woken it sees if there is outstanding WAL data and if so fsyncs it and wakes both WALreceiver and Startup. When no WAL remains it waits on the latch.
Startup process will wake WALreceiver when it has got to the end of the latest chunk of WAL. If no further WAL is available then it waits on its latch.
When user backends request sync rep they wait in a queue ordered by requested LSN. A separate queue exists for each request mode.
WALSender receives the 3 LSNs from the standby. It then wakes backends in sequence from each queue.
We provide a single wakeup rule: first WALSender to reply with the requested XLogRecPtr will wake the backend. This guarantees that the WAL data for the commit is transferred as requested to at least one standby. That is sufficient for the use cases we have discussed.
More complex wakeup rules would be possible via a plugin.
Wait timeout would be set by individual backends with a timer, just as we do for statement_timeout.
Total code to implement this is low. Breaks down into 5 areas
- Zoltan's libpq changes, included almost verbatim; fairly modular, so easy to replace with something we like better
- A new module syncrep.c and syncrep.h handle the backend wait/wakeup
- Light changes to allow streaming rep to make appropriate calls
- Small amount of code to allow WALWriter to be active in recovery
- Parameter code
No docs yet.
The patch works on top of latches, though does not rely upon them for its bulk performance characteristics. Latches only improve response time for very low transaction rates; latches provide no additional throughput for medium to high transaction rates.
Since we reply to each new chunk sent from master, "recv" mode has absolutely minimal latency, especially since WALreceiver no longer performs majority of fsyncs, as in 9.0 code. WALreceiver does not wait for fsync or apply actions to complete before we reply, so fsync and apply modes will always wait at least 2 standby->master messages which is appropriate because those actions will typically occur much later.
This response mechanism offers highest responsive performance achievable in "recv" mode and very good throughput under load. Note that the different modes do not interfere with each other and can co-exist happily while providing highest performance.
Starting WALWriter is helpful, no matter what the synchronous_replication_service specified.
Can we optimise the sending of reply messages so that only chunks that contain a commit deserve a reply? We could, but then we'd need to do extra work on the master to do bookkeeping of that. It would need to be demonstrated that there is a performance issue big enough to be worth the overhead on master and extra code.
Is there an optimisation from reducing the number of options the standby provides? The architecture on the standby side doesn't rely heavily on the service level specified, nor does it rely in any way on the actual sync rep mode specified on master. No further simplification is possible.
NOT YET IMPLEMENTED
- Timeout code & NOTICE
- Code and test plugin
- Loops in walsender, walwriter and receiver treat shutdown incorrectly
I haven't yet looked at Fujii's code for this, not even sure where it is, though hope to do so in the future. Zoltan's libpq code is the only part of that patch used.
So far I have spent 3.5 days on this and expect to complete tomorrow. I think that throws out the argument that this proposal is too complex to develop in this release.
- How should master behave when we shut it down?
- How should standby behave when we shut it down?