m |
|
| Line 1: |
Line 1: |
| | + | = Synchronous Replication = |
| | Synchronous replication is available starting in PostgreSQL 9.1 by enabling the [http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#GUC-SYNCHRONOUS-STANDBY-NAMES synchronous_standby_names] parameter. It includes user-controlled durability specified on the master using the [http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT synchronous_commit] parameter. The design also provides high throughput by allowing concurrent processes to handle the WAL stream. | | Synchronous replication is available starting in PostgreSQL 9.1 by enabling the [http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#GUC-SYNCHRONOUS-STANDBY-NAMES synchronous_standby_names] parameter. It includes user-controlled durability specified on the master using the [http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT synchronous_commit] parameter. The design also provides high throughput by allowing concurrent processes to handle the WAL stream. |
| | | | |
| - | =WHAT'S DIFFERENT ABOUT THIS PATCH?= | + | == Design Notes == |
| - | | + | See also [[Synchronous Replication 9/2010 Proposal]], though those notes pertain to a different patch from the one that was committed. |
| - | The implementation in 9.1 includes several innovations, beyond [http://wiki.postgresql.org/wiki/Streaming_Replication Fujii Masao's work] providing an earlier synchronous replication implementation for PostgreSQL 9.0:
| + | |
| - | | + | |
| - | * Low complexity of code on Standby
| + | |
| - | * User control: All decisions to wait take place on master, allowing fine-grained control of synchronous replication. Max replication level can also be set on the standby.
| + | |
| - | * Low bandwidth: Very small response packet size with no increase in number of responses when system is under high load means very little additional bandwidth required
| + | |
| - | * Performance: Standby processes work concurrently to give good overall throughput on standby and minimal latency in all modes. 4 performance options don't interfere with each other, so offer different levels of performance/durability alongside each other.
| + | |
| - | | + | |
| - | These are major wins for the PostgreSQL project over and above the basic sync rep feature.
| + | |
| - | | + | |
| - | =SYNCHRONOUS REPLICATION OVERVIEW= | + | |
| - | | + | |
| - | Synchronous replication offers the guarantee that all changes made by a
| + | |
| - | transaction have been transferred to remote standby nodes. This is an
| + | |
| - | extension to the standard level of durability offered by a transaction
| + | |
| - | commit.
| + | |
| - | | + | |
| - | When synchronous replication is requested the transaction will wait
| + | |
| - | after it commits until it receives confirmation that the transfer has
| + | |
| - | been successful. Waiting for confirmation increases the user's certainty
| + | |
| - | that the transfer has taken place but it also necessarily increases the
| + | |
| - | response time for the requesting transaction. Synchronous replication
| + | |
| - | usually requires carefully planned and placed standby servers to ensure
| + | |
| - | applications perform acceptably. Waiting doesn't utilise system
| + | |
| - | resources, but transaction locks continue to be held until the transfer
| + | |
| - | is confirmed. As a result, incautious use of synchronous replication
| + | |
| - | will lead to reduced performance for database applications.
| + | |
| - | | + | |
| - | It may seem that there is a simple choice between durability and
| + | |
| - | performance. However, there is often a close relationship between the
| + | |
| - | importance of data and how busy the database needs to be, so this is
| + | |
| - | seldom a simple choice. With this patch, PostgreSQL now provides a range
| + | |
| - | of features designed to allow application architects to design a system
| + | |
| - | that has both good overall performance and yet good durability of the
| + | |
| - | most important data assets.
| + | |
| - | | + | |
| - | PostgreSQL allows the application designer to specify the durability
| + | |
| - | level required via replication. This can be specified for the system
| + | |
| - | overall, though it can also be specified for individual transactions.
| + | |
| - | This makes it possible to selectively provide the highest levels of protection for
| + | |
| - | critical data.
| + | |
| - | | + | |
| - | For example, an application might consist of two types of work:
| + | |
| - | * 10% of changes are changes to important customer details
| + | |
| - | * 90% of changes are less important data that the business can more easily survive if it is lost, such as chat messages between users.
| + | |
| - | | + | |
| - | With sync replication options specified at the application level (on the
| + | |
| - | master) we can offer sync rep for the most important changes, without
| + | |
| - | slowing down the bulk of the total workload. Application level options
| + | |
| - | are an important and practical tool for allowing the benefits of
| + | |
| - | synchronous replication for high performance applications.
| + | |
| - | | + | |
| - | Without sync rep options specified at app level, we would have a choice
| + | |
| - | of either slowing down 90% of the workload because 10% of it is
| + | |
| - | important. Or giving up our durability goals because of performance. Or
| + | |
| - | splitting those two functions onto separate database servers so that we
| + | |
| - | can set options differently on each. None of those 3 options is truly
| + | |
| - | attractive.
| + | |
| - | | + | |
| - | PostgreSQL also allows the system administrator the ability to specify
| + | |
| - | the service levels offered by standby servers. This allows multiple
| + | |
| - | standby servers to work together in various roles within a server farm.
| + | |
| - | | + | |
| - | ''Note: the information about the parameters used here reflects an earlier version of this feature, and needs to be updated to reflect the form in which it was committed into 9.1.''
| + | |
| - | | + | |
| - | Control of this feature relies on just 3 parameters:
| + | |
| - | On the master we can set
| + | |
| - | | + | |
| - | * synchronous_replication
| + | |
| - | * synchronous_replication_timeout
| + | |
| - | | + | |
| - | On the standby we can set
| + | |
| - | | + | |
| - | * synchronous_replication_service
| + | |
| - | | + | |
| - | These are explained in more detail in the following sections.
| + | |
| - | | + | |
| - | =USER'S OVERVIEW=
| + | |
| - | | + | |
| - | Two new USERSET parameters on the master control this
| + | |
| - | * synchronous_replication = async (default) | recv | fsync | apply
| + | |
| - | * synchronous_replication_timeout = 0+ (0 means never timeout)
| + | |
| - | (default timeout 10sec)
| + | |
| - | | + | |
| - | synchronous_replication = async is the default and means that no
| + | |
| - | synchronisation is requested and so the commit will not wait. This is the
| + | |
| - | fastest setting. The word async is short for "asynchronous" and you may
| + | |
| - | see the term asynchronous replication discussed.
| + | |
| - | | + | |
| - | Other settings refer to progressively higher levels of durability. The
| + | |
| - | higher the level of durability requested, the longer the wait for that
| + | |
| - | level of durability to be achieved.
| + | |
| - | | + | |
| - | The precise meaning of the synchronous_replication settings is
| + | |
| - | * async - commit does not wait for a standby before replying to user
| + | |
| - | * recv - commit waits until standby has received WAL
| + | |
| - | * fsync - commit waits until standby has received and fsynced WAL
| + | |
| - | * apply - commit waits until standby has received, fsynced and applied
| + | |
| - | This provides a simple, easily understood mechanism - and one that in
| + | |
| - | its default form is very similar to other RDBMS (e.g. Oracle).
| + | |
| - | | + | |
| - | Note that in apply mode it is possible that the changes could be
| + | |
| - | accessible on the standby before the transaction that made the change
| + | |
| - | has been notified that the change is complete. Minor issue.
| + | |
| - | | + | |
| - | Network delays may occur and the standby may also crash. If no reply is
| + | |
| - | received within the timeout we raise a NOTICE and then return successful
| + | |
| - | commit (no other action is possible). Note that it is possible to
| + | |
| - | request that we never timeout, so if no standby is available we wait for
| + | |
| - | one to appear.
| + | |
| - | | + | |
| - | When user commits, if the master does not have a currently connected
| + | |
| - | standby offering the required level of replication it will pick the next
| + | |
| - | best available level of replication. It is up to the sysadmin to provide
| + | |
| - | sufficient range of standby nodes to ensure at least one is available to
| + | |
| - | meet the requested service levels.
| + | |
| - | | + | |
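The fallback described above can be sketched in a few lines of Python. This is only an illustration, not code from the patch: the function name <code>effective_level</code> is hypothetical, and it assumes that "next best available level" means the highest level that does not exceed the request and that at least one connected standby offers.

```python
from typing import Iterable

# Durability levels from the proposal, ordered weakest to strongest.
LEVELS = ["async", "recv", "fsync", "apply"]


def effective_level(requested: str, offered: Iterable[str]) -> str:
    """Return the level a commit would actually wait for (hypothetical helper).

    `offered` holds the highest service level of each connected standby.
    Assumption: a standby offering level X can also serve every weaker level.
    """
    best_offered = max((LEVELS.index(level) for level in offered), default=0)
    return LEVELS[min(LEVELS.index(requested), best_offered)]


print(effective_level("apply", ["recv", "fsync"]))  # fsync
print(effective_level("recv", ["apply"]))           # recv
print(effective_level("fsync", []))                 # async
```

With no standby connected, the sketch degrades to async, which matches the idea that the commit does not wait for a service nobody offers.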
| - | If multiple standbys exist, the first standby to reply that the desired
| + | |
| - | level of durability has been achieved will release the waiting commit on
| + | |
| - | the master. Other options are available also via a plugin.
| + | |
| - | | + | |
| - | ==ADMINISTRATOR'S OVERVIEW==
| + | |
| - | | + | |
| - | On the standby we specify the highest type of replication service
| + | |
| - | offered by this standby server. This information is passed to the master
| + | |
| - | server when the standby connects for replication.
| + | |
| - | | + | |
| - | This allows sysadmins to designate preferred standbys. It also allows
| + | |
| - | sysadmins to completely refuse to offer a synchronous replication
| + | |
| - | service, allowing a master to explicitly avoid synchronisation across
| + | |
| - | low bandwidth or high latency links.
| + | |
| - | | + | |
| - | An additional parameter can be set in recovery.conf on the standby
| + | |
| - | | + | |
| - | * synchronous_replication_service = async (def) | recv | fsync | apply
| + | |
| - | | + | |
| - | | + | |
| - | = IMPLEMENTATION =
| + | |
| - | | + | |
| - | Some aspects can be changed without significantly altering basic
| + | |
| - | proposal, for example master-specified standby registration wouldn't
| + | |
| - | really alter this very much.
| + | |
| - | | + | |
| - | == STANDBY ==
| + | |
| - | | + | |
| - | Master-controlled sync rep means that all user wait logic is centred on
| + | |
| - | the master. The details of sync rep requests on the master are not sent
| + | |
| - | to the standby, so there is no additional master to standby traffic nor
| + | |
| - | standby-side bookkeeping overheads. It also reduces complexity of
| + | |
| - | standby code.
| + | |
| - | | + | |
| - | On the standby side the WAL Writer now operates during recovery. This
| + | |
| - | frees the WALReceiver to spend more time sending and receiving messages,
| + | |
| - | thereby minimising latency for users choosing the "recv" option. We now
| + | |
| - | have 3 processes handling WAL in an asynchronous pipeline: WAL Receiver
| + | |
| - | reads WAL data from the libpq connection then writes it to the WAL file,
| + | |
| - | the WAL Writer then fsyncs the WAL file and then the Startup process
| + | |
| - | replays the WAL. These processes act independently, so WAL pointers
| + | |
| - | (LSNs) are defined as WALReceiverLSN >= WALWriterLSN >= StartupLSN
| + | |
| - | | + | |
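The ordering of the three pointers can be illustrated with a small Python model. It is a sketch under stated assumptions, not the patch's C code: the class <code>StandbyState</code> and its method names are hypothetical, and LSNs are modelled as plain integers.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StandbyState:
    """Hypothetical snapshot of the three WAL positions on a standby."""
    received_lsn: int = 0  # written to the WAL file by the WAL Receiver
    fsynced_lsn: int = 0   # flushed to disk by the WAL Writer
    applied_lsn: int = 0   # replayed by the Startup process

    def receive(self, upto: int) -> None:
        self.received_lsn = max(self.received_lsn, upto)

    def fsync(self) -> None:
        # The WAL Writer can only flush what has already been received.
        self.fsynced_lsn = self.received_lsn

    def apply(self) -> None:
        # The Startup process replays only what has been flushed.
        self.applied_lsn = self.fsynced_lsn

    def positions(self) -> Tuple[int, int, int]:
        """Return all three positions, checking the pipeline invariant."""
        assert self.received_lsn >= self.fsynced_lsn >= self.applied_lsn
        return (self.received_lsn, self.fsynced_lsn, self.applied_lsn)


state = StandbyState()
state.receive(100)
print(state.positions())  # (100, 0, 0)
state.fsync()
state.receive(250)        # the receiver runs ahead of the other two stages
state.apply()
print(state.positions())  # (250, 100, 100)
```

Because each stage only ever copies from the stage ahead of it, the invariant holds regardless of how the three steps interleave.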
| - | For each new message WALReceiver gets from master we issue a reply. Each
| + | |
| - | reply sends the current state of the 3 LSNs, so the reply message size
| + | |
| - | is only 28 bytes. Replies are sent half-duplex, i.e. we don't reply
| + | |
| - | while a new message is arriving.
| + | |
| - | | + | |
| - | Note that there is absolutely not one reply per transaction on the
| + | |
| - | master. The standby knows nothing about what has been requested on the
| + | |
| - | master - replies always refer to the latest standby state and
| + | |
| - | effectively batch the responses.
| + | |
| - | | + | |
| - | We act according to the requested synchronous_replication_service
| + | |
| - | * async - no replies are sent
| + | |
| - | * recv - replies are sent upon receipt only
| + | |
| - | * fsync - replies are sent upon receipt and following fsync only
| + | |
| - | * apply - replies are sent following receipt, fsync and apply.
| + | |
| - | | + | |
| - | Replies are sent at the next available opportunity.
| + | |
| - | | + | |
| - | In apply mode, when the WALReceiver is completely quiet this means we
| + | |
| - | send 3 reply messages - one at recv, one at fsync and one at apply. When
| + | |
| - | WALreceiver is busy the volume of messages does *not* increase since the
| + | |
| - | reply can't be sent until the current incoming message has been
| + | |
| - | received, after which we were going to reply anyway so it is not an
| + | |
| - | additional message. This means we piggyback an "apply" response onto a
| + | |
| - | later "recv" reply. As a result we get minimum response times in *all*
| + | |
| - | modes and maximum throughput is not impaired at all.
| + | |
| - | | + | |
| - | When each new message arrives from master the WALreceiver will write
| + | |
| - | the new data to the WAL file, wake the WALwriter and then reply. Each
| + | |
| - | new message from master receives a reply. If no further WAL data has
| + | |
| - | been received the WALreceiver waits on the latch. If the WALReceiver is
| + | |
| - | woken by WALWriter or Startup then it will reply to master with a
| + | |
| - | message, even if no new WAL has been received.
| + | |
| - | | + | |
| - | So in the recv, fsync and apply cases a message is sent as soon as possible to
| + | |
| - | master, so in all cases the wait time is minimised.
| + | |
| - | | + | |
| - | When WALwriter is woken it sees if there is outstanding WAL data and if
| + | |
| - | so fsyncs it and wakes both WALreceiver and Startup. When no WAL remains
| + | |
| - | it waits on the latch.
| + | |
| - | | + | |
| - | Startup process will wake WALreceiver when it has got to the end of the
| + | |
| - | latest chunk of WAL. If no further WAL is available then it waits on its
| + | |
| - | latch.
| + | |
| - | | + | |
| - | == MASTER ==
| + | |
| - | | + | |
| - | When user backends request sync rep they wait in a queue ordered by
| + | |
| - | requested LSN. A separate queue exists for each request mode.
| + | |
| - | | + | |
| - | WALSender receives the 3 LSNs from the standby. It then wakes backends
| + | |
| - | in sequence from each queue.
| + | |
| - | | + | |
| - | We provide a single wakeup rule: first WALSender to reply with the
| + | |
| - | requested XLogRecPtr will wake the backend. This guarantees that the WAL
| + | |
| - | data for the commit is transferred as requested to at least one standby.
| + | |
| - | That is sufficient for the use cases we have discussed.
| + | |
| - | | + | |
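The queueing and wakeup rule can be sketched as follows. This is a minimal Python model under stated assumptions, not the syncrep.c implementation: the class <code>SyncRepQueues</code> and its methods are hypothetical, backends are identified by name, and LSNs are plain integers.

```python
import heapq
from typing import Dict, List, Tuple

MODES = ("recv", "fsync", "apply")


class SyncRepQueues:
    """Hypothetical model: one LSN-ordered wait queue per request mode."""

    def __init__(self) -> None:
        self.queues: Dict[str, List[Tuple[int, str]]] = {m: [] for m in MODES}

    def wait(self, mode: str, lsn: int, backend: str) -> None:
        # A committing backend queues up until `lsn` is confirmed for `mode`.
        heapq.heappush(self.queues[mode], (lsn, backend))

    def on_reply(self, recv_lsn: int, fsync_lsn: int, apply_lsn: int) -> List[str]:
        """Handle one standby reply carrying the 3 LSNs; return woken backends."""
        reached = {"recv": recv_lsn, "fsync": fsync_lsn, "apply": apply_lsn}
        woken = []
        for mode in MODES:
            queue = self.queues[mode]
            # Wake in LSN order every backend whose requested LSN is covered.
            while queue and queue[0][0] <= reached[mode]:
                woken.append(heapq.heappop(queue)[1])
        return woken


queues = SyncRepQueues()
queues.wait("recv", 100, "backend A")
queues.wait("apply", 100, "backend B")
queues.wait("recv", 300, "backend C")
print(queues.on_reply(250, 100, 0))    # ['backend A']
print(queues.on_reply(300, 250, 100))  # ['backend C', 'backend B']
```

Calling <code>on_reply</code> for every reply from any standby gives the "first to reply wakes the backend" behaviour, since a backend leaves its queue as soon as one reply covers its LSN.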
| - | More complex wakeup rules would be possible via a plugin.
| + | |
| - | | + | |
| - | Wait timeout would be set by individual backends with a timer, just as
| + | |
| - | we do for statement_timeout.
| + | |
| - | | + | |
| - | = CODE =
| + | |
| - | | + | |
| - | The total amount of code needed to implement this is small. It breaks down into 5 areas:
| + | |
| - | * Zoltan's libpq changes, included almost verbatim; fairly modular, so easy to replace with something we like better
| + | |
| - | * A new module syncrep.c and syncrep.h handle the backend wait/wakeup
| + | |
| - | * Light changes to allow streaming rep to make appropriate calls
| + | |
| - | * Small amount of code to allow WALWriter to be active in recovery
| + | |
| - | * Parameter code
| + | |
| - | No docs yet.
| + | |
| - | | + | |
| - | The patch works on top of latches, though does not rely upon them for
| + | |
| - | its bulk performance characteristics. Latches only improve response time
| + | |
| - | for very low transaction rates; latches provide no additional throughput
| + | |
| - | for medium to high transaction rates.
| + | |
| - | | + | |
| - | = PERFORMANCE ANALYSIS =
| + | |
| - | | + | |
| - | Since we reply to each new chunk sent from master, "recv" mode has
| + | |
| - | absolutely minimal latency, especially since WALreceiver no longer
| + | |
| - | performs the majority of fsyncs, as in 9.0 code. WALreceiver does not wait
| + | |
| - | for fsync or apply actions to complete before we reply, so fsync and
| + | |
| - | apply modes will always wait at least 2 standby->master messages which
| + | |
| - | is appropriate because those actions will typically occur much later.
| + | |
| - | | + | |
| - | This response mechanism offers highest responsive performance achievable
| + | |
| - | in "recv" mode and very good throughput under load. Note that the
| + | |
| - | different modes do not interfere with each other and can co-exist
| + | |
| - | happily while providing highest performance.
| + | |
| - | | + | |
| - | Starting WALWriter is helpful, no matter what the
| + | |
| - | synchronous_replication_service setting is.
| + | |
| - | | + | |
| - | Can we optimise the sending of reply messages so that only chunks that
| + | |
| - | contain a commit deserve a reply? We could, but then we'd need to do
| + | |
| - | extra work on the master to do bookkeeping of that. It would need to be
| + | |
| - | demonstrated that there is a performance issue big enough to be worth
| + | |
| - | the overhead on master and extra code.
| + | |
| - | | + | |
| - | Is there an optimisation from reducing the number of options the standby
| + | |
| - | provides? The architecture on the standby side doesn't rely heavily on
| + | |
| - | the service level specified, nor does it rely in any way on the actual
| + | |
| - | sync rep mode specified on master. No further simplification is
| + | |
| - | possible.
| + | |
| - | | + | |
| - | | + | |
| - | = NOT YET IMPLEMENTED =
| + | |
| - | | + | |
| - | * Timeout code & NOTICE
| + | |
| - | * Code and test plugin
| + | |
| - | * Loops in walsender, walwriter and receiver treat shutdown incorrectly
| + | |
| - | | + | |
| - | I haven't yet looked at Fujii's code for this, not even sure where it
| + | |
| - | is, though hope to do so in the future. Zoltan's libpq code is the only
| + | |
| - | part of that patch used.
| + | |
| - | | + | |
| - | So far I have spent 3.5 days on this and expect to complete tomorrow. I
| + | |
| - | think that throws out the argument that this proposal is too complex to
| + | |
| - | develop in this release.
| + | |
| - | | + | |
| - | = OTHER ISSUES =
| + | |
| - | | + | |
| - | * How should master behave when we shut it down?
| + | |
| - | * How should standby behave when we shut it down?
| + | |
| | | | |
| | [[Category:Replication]] | | [[Category:Replication]] |