Hot Standby

From PostgreSQL wiki

Revision as of 16:03, 17 May 2009 by Robe (Talk | contribs)

Jump to: navigation, search

Hot Standby is the name for the capability to run queries on a database that is currently performing archive recovery. Log Shipping replication allows you to create one or more standby nodes that are replicas of the primary node (or master node). Standby nodes can then be used for read-only query access.

Hot Standby is being developed for inclusion in PostgreSQL 8.5 by 2ndQuadrant via public sponsorship.

This feature will augment the capabilities of streaming replication that have also been discussed for inclusion in r8.4, though does not rely on it and has minimal interaction with it.

Changes will appear on this page, as and when design changes or clarifications are required. Please reply to posts on pgsql-hackers, or email


Project Status Tracking

Version History

Latest version is always at top of this section

Remaining Work Items for Nov 2008 CommitFest

  • Latest commit that is being tested: 4fe66...

Reported problems (Identified by) (Target release)

  • (none)

Infrastructure changes

  • (none)

Main work/required additional items (Proposed by) (Target release)

  • (none)

Performance tuning opportunities (Proposed by) (Target release)

  • (none)

Usability enhancements (Proposed by)

  • max_standby_delay set as a normal GUC, so it can be changed during normal execution (Simon)
  • more accurately determine users to disconnect if we drop a tablespace that contains temp files currently in use
  • more work possible on btree deletes to reduce impact
  • input datatypes for functions (Simon)
  • main docs required (put Wiki into SGML) (Simon)

Items deferred until next release

  • make pg_start_backup() work on standby - synchronisation is already possible using pg_latest_recovered_lsn()
  • connection from standby to master to prevent xmin horizon moving forwards on master
  • buffer manager optimization targeted to reduce I/O for replay of btree vacuum when scanning uncleaned blocks (Simon)
  • Canceled queries when idle in transaction cause client-side mismatch [slightly unrelated) (Simon) This only effects transactions that are idle-in-transaction AND either hold locks, are serializable or have open cursors. Current solution is upgrade to a FATAL error. The same problem occurs if we have an error while processing invalidation messages.

Tests needed

  • Is there a performance regression on normal running from any overheads introduced? Patch v HEAD comparison
  • Is there a performance regression on recovery? Patch v HEAD comparison (4% degradation on simple test at v4, need retest for later versions)
  • Usability tests: How does query conflict handling work in practice? User experience needed

Code has been reviewed in various parts by Pavan, Heikki and Tom. Further detailed review/ers are still required.

Resolved Items/Issues

Reported problems (Reported by/Requested by) (Resolved in version...) (Most recent at top)

Bugs present on Nov 1

  • minSafeStart point reset to earlier in some cases (Heikki) (Infra)
  • signal postmaster to refresh password cache if auth file changes (Simon) (v9g)
  • fix query cancel because of locks so it includes AbortOutOfAnyTransaction(); (Simon) (v9e)
  • fix memory leak in GetRunningXactData() (Simon) (v9)
  • conflict resolution for redo of database drop/move/rename while users are connected to that database (Simon) (v9)
  • conflict resolution for redo of drop tablespace while users are using temp files in that tablespace - correct but needs work (Simon) (v9)
  • Handle out of procs state gracefully, prevent snapshots and throw user query errors (Heikki) (v8d)
  • AccessExclusiveLock must be held by subtransaction, not top-level transaction (Heikki/Simon) (v8)
  • Block unlocking required during VACUUM scan (Heikki) (v6b)
  • XLOG_BTREE_DELETE records handled correctly (Heikki) (v6)
  • btree VACUUM code - must scan every block of index (Heikki) (v6)
  • Must ignore_killed_tuples and never kill_prior_tuple during index scans in recovery (Greg) (v6) same fix, subtlely different scope (v8)
  • uint unsupported on Windows (Jaime) (v6b)
  • SIGINT cancels idle-in-transaction sessions (Pavan) (v6b)
  • WALInsertLock still held at failover (Pavan) (v5)
  • flat file updates for DDL incorrect (Mark) (v5)

Refactoring, bugs, tuning, usability tweaks

Functionality changes since Nov 1

  • Complete Prepared Transaction support (Simon)
  • fix bugs with recovery_starts_paused(), allow shutdown while paused (Simon)
  • Compile error with WAL_DEBUG set (Bernd)
  • removed reliance on pg_subtrans marking optimisation (Heikki/Simon)
  • merged changes with Heikki's most recent review of recoveryinfra changes (Simon)
  • tuning of RecordKnownAssignedTransactionIds() to avoid taking ProcArrayLock in most common case (Simon) (v9p)
  • tuning of RecordKnownAssignedTransactionIds() using hash table (Heikki) (v9m)
  • deferred buffer recovery conflict cache (Simon) (v9k)
  • rename parameter used to turn on/off hot standby (Merlin) (v9i)
  • Allow option to start recovery in paused mode (Kevin) (v9i)
  • startup process waits, even when issuing deferred query cancellation (Heikki/Gianni) (v9i)
  • Workaround: Canceled queries when idle in transaction cause FATAL errors because of client-side mismatch bug (Simon) (v9i)
  • XidCacheRemoveXid (Simon) (v9i)
  • rare assertion failure in pg_subtrans - caused by invalid assumption about when to mark subtrans (Gianni)
  • snapshot bug (v9h) (Gianni & Mark)
  • using wrong LSN in conflict handling causing more cancels than required (Gianni) (v9f)
  • add parameter to disable standby query capability ("hot_standby") (Heikki) (v9e)
  • fix race condition for SetBufferRecoveryConflictLSN(), recheck MyProc->xmin (Simon) (v9e)
  • streamline commit/abort processing to bare minimum (Simon) (v9d)
  • improved debug error messages for unobserved xids (Gianni) (v9c)
  • missed some calls to RecordKnownAssignedTransactions() during last refactoring (Simon) (v9c)
  • pg_recovery_advance() regression - fixed (Simon) (v9b)
  • alter location of debug messages for AssignTransactionId (Simon) (v9a)
  • remove docs for pg_recovery_pause_cleanup() (Simon) (v9)
  • reduce use of ProcArrayLock to once per record, plus one at commit/abort (Simon) (v9)
  • RecoveryConflictLSN, deferred cancellation of xmin horizon conflicts and skip cancellation if idle-in-xact (Andreas) (v9)
  • make btree delete watertight (can't use RecentGlobalXmin) - correct but needs work (Heikki) (v9)
  • must not do conflict processing when snapshots disabled - need to refactor some more, groan (Simon) (v9)
  • rearrange call to UnobservedTransactionsPruneXids() to speed up calls to UnobservedTransactionsRemoveXid() (Simon) (v9)
  • remove pg_recovery_pause_cleanup() which stopped working in v8e because of refactoring (Simon) (v9)
  • downgrade "WARNING: proc array entry was missing for transaction" to DEBUG2 (Happens because of the race condition between XidGenLock and WALInsertLock. Occurred 4 times in 10 hour bash test on dual CPU server, but all cases have been normal and explainable since the xid was the latestRunningXid in all cases.) (Gianni) (v9)
  • refactor rmgr "isCleanupRecord" code (Heikki) (v8e)
  • remove databaseId, roleId and vacFlags from RunningXactData (Simon) (v8d)
  • complete refactoring required change of ProcArrayUpdateRecoveryTransactions() (Heikki), plus fix one stupid and one very subtle bug while doing so (Simon) (v8c)
  • not updating pg_auth when appropriate (Gianni) (v8b) -- trivial logic error
  • optimize UnobservedTransactionsPruneXids() and fix corner case bug (Gianni) (v8b)
  • split recovery procs into different pools, allowing error messages when overflowed (Simon) (v8a)
  • refactor pre-allocation of Recovery Procs, need to trace dependency that prevents removal (Simon) (v8a)
  • refactor XactCompletionHasUnmarkedSubxids() (Simon) (v8a)
  • fix bug caused by earlier refactoring in ProcArrayClearRecoveryProcs() (Simon) (v8a)
  • UnobservedXids leak warning - trivial fix (Gianni) (v8a)
  • avoid starting autovac launcher for wraparound avoidance during recovery (Simon) (v8a)
  • clarify meaning of last_recovered... functions during/after recovery (Simon) (v8)
  • add function last_recovered_xlog_location() to allow synchronisation between master and standby based on an LSN (Simon) (v8)
  • Document that index locking is correct and why (Heikki) (v8)
  • add README comments to access/transam/README (Simon) (v8)
  • document need to set max_connections correctly on standby (Simon) (v8)
  • refactor/remove special case logic in GetRunningTransactionData() (Heikki) (v8)
  • pg_subtrans could not open file (Heikki) (v8)
  • added pg_current_recovery_target() (Gianni) (v8)
  • allow startup process to delay at start to allow debugger attach where pg_ctl -o "-W n" sets a delay of n secs (Simon) (v8)
  • adjust all XXX comments related to Hot Standby to be XXXHS (Simon) (v8)
  • renamed last_completed... functions to last_recovered... (Simon) (v8)
  • slotid refactoring, comments changed (Heikki) (v7)
  • Cannot Assert(pmState == PM_STARTUP) when postmaster catches PMSIGNAL_RECOVERY_START (regression from earlier report) (v6b)
  • Use of #define may mask bugs in 1-2 locations in procarray.c (Pavan) (v6b)

Do we need/want? (Current answer)

  • recovery_safe_start_location; I think we do (Not implemented)
  • to automatically reset default_transaction_read_only following startup (No)
  • Creating temp tables in recovery mode. Not sure if that is spec compliant? We may want to relax that restriction. If we do, we already have a design in place for changes to make this work: (Not in 8,4)

Resolved Issues/Questions

  • No need to send invalidations when index metapages change - already handled at xact commit
  • Added further user/admin documentation via this Wiki

Excluded items

  • hash index handling (or any index type that doesn't write WAL)
  • Cancelled queries to show errcode(ERRCODE_T_R_SERIALIZATION_FAILURE) (cannot do, unable to identify source of cancellation)


Users Overview

Users will be able to connect to the database while the postmaster is in recovery state and perform read-only queries. Read-only access to catalogs and views will also occur as normal.

When you connect in this mode, default_transaction_read_only will be forced to be true, whatever its setting in postgresql.conf. As a result, all transactions started during this time will be limited to read-only actions only. In all other ways, connected sessions will appear identical to sessions initiated during normal processing mode. There are no special commands required to initiate a connection at this time, so all interfaces will work normally without change.

Read-only here means "no writes to the permanent database tables". So temporary sort and work files will be used, just as with normal SQL execution. Temporary tables cannot be created and therefore cannot be used at all in recovery mode.

Actions that are allowed

  • Transaction management commands: BEGIN, END, ABORT, START TRANSACTION
  • Cursor commands: DECLARE, FETCH, CLOSE,
  • Parameters: SHOW, SET, RESET, DISCARD
  • LOAD
  • LOCK (see later)
  • EXCEPTION blocks in internal subtransactions
  • Activation of Select RULEs

Actions not allowed on Standby are:

  • Transaction management commands which explicitly set non-read only state: SET TRANSACTION, SET SESSION CHARACTERISTICS, BEGIN, START TRANSACTION, direct SET
  • Two-phase commit commands (PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK PREPARED) because even read-only transactions need to write WAL in the prepare (first phase of two phase commit).
  • SELECT ... FOR SHARE | UPDATE which cause row locks to be written
  • DML - Insert, Update, Delete, COPY FROM, Truncate which all write data
  • DDL - Create, Drop, Alter, Comment (even for temporary tables because currently these cause writes to catalog tables)
  • Access to nextval on sequences because nextval causes the value of the sequence to be increment
  • LISTEN, UNLISTEN, NOTIFY since they currently write to system tables
  • Remember: Triggers will never be executed!

All of these are allowed on primary, just not on standby in read-only mode. When failover, switchover occurs these commands will then be possible.

It is possible that the restrictions on LISTEN, UNLISTEN, NOTIFY and temporary tables may be lifted in a future release, if their internal implementation allows it.

Note that current behaviour of read only transactions is to allow these

  • nextval on sequences

So there is a subtle difference in behaviour between standby read-only transactions and non-recovery read only transactions.

If failover or switchover occurs the database will switch to normal processing mode. Sessions will remain connected while the server changes mode. Current transactions will continue, though will remain read-only. After this, it will be possible to initiate read-write transactions, though users must *manually* reset their default_transaction_read_only setting first.

Users will be able to tell whether their session is read-only by issuing SHOW default_transaction_read_only, or by running an information function pg_is_in_recovery() returns bool. There is also be a function that allows you to read the last known transaction end timestamp (pg_last_completed_xact_timestamp) and the transactionid (pg_last_completed_xid).

In recovery, transactions will not be permitted to take any lock higher than AccessShareLock nor assign a TransactionId. This mostly prevents writing WAL, but we protect against that specifically as well, just in case some user code attempts it.

The LOCK TABLE command by default applies an AccessExclusiveLock. This will generate WAL messages when executed on the primary node. When executed on the standby node the default will be to issue an AccessShareLock. Any LOCK TABLE command that runs on the standby and requests a specific lock type other than AccessShareLock will be rejected.

During recovery database writes will only be performed by startup process as it processes WAL records and replays them (also known as WAL redo or simply "apply"). In general this means that queries will not experience lock conflicts with writes, just like normal Postgres concurrency control (MVCC).

Query Conflicts

However, there is some potential for conflict between standby queries and WAL redo from the primary node. The user is provided with a number of optional ways to handle these conflicts.

Conflicts can occur for four reasons

  • Access Exclusive Locks from primary node (including both locks and various kinds of DDL action)
  • Early cleanup of data still visible to the current query's snapshot
  • Dropping tablespaces on the master while standby queries are using those tablespace for temporary work files (work_mem overflow)
  • Dropping users, databases or tablespaces on the master while those objects are connected on standby
  • Waiting to acquire buffer cleanup locks (for which there is no time out)

discussed in more detail below.

Some WAL redo actions will be for DDL actions. These DDL actions are repeating actions that have already committed on the primary node, so they must not fail on the standby node. These DDL locks take priority and will automatically *cancel* any read-only transactions that get in their way, after a grace period. This is similar to the possibility of being canceled by the deadlock detector, but in this case the standby process always wins, since the repeated actions must not fail. This also ensures that replication doesn't fall behind while we wait for a query to complete. Again, we assume that the standby is there for high availability purposes primarily.

An example of the above would be:

  • User on Standby running a query against table X
  • Administrator on Primary runs DROP TABLE

Clearly the query cannot continue if we let the DROP TABLE proceed. If this situation occurred on the primary, the DROP TABLE would wait until the query has finished. When the query is on the standby and the DROP TABLE is on the primary, the primary doesn't have information about what the standby is running and so cannot wait. In most cases this is desirable, so the workload on the standby has little or no effect on the primary and so we would say the two nodes are only loosely coupled.

The second reason for conflict between standby queries and WAL redo is "early cleanup". Normally, PostgreSQL allows cleanup of old row versions when there are no users who may need to see them to ensure correct visibility of data (known as MVCC). If there is a standby query that has been running for longer than any query on the primary then it is possible for old row versions to be removed by either VACUUM or HOT. This will then generate WAL records that, if applied, would remove data on the standby that might *potentially* be required by the standby query. In more technical language, the Primary's xmin hozrizon is later than the Standby's xmin horizon, allowing dead rows to be removed.

We have a number of choices for resolving query conflicts. Choice (1) is automatic, options (2) and (3) should be taken as remedial actions if an annoying number of cancelations occur.

1. We wait and hope the query completes. If the recovery is not paused, then we will wait automatically until the server is max_standby_delay behind. Once that grace period expires, we then take one of the following actions:

a) If the conflict is caused by a lock, we cancel the standby transaction immediately, even if it is idle-in-transaction.

b) If the conflict is not caused by a lock, we tell the standby query that a conflict has occurred and that it must cancel itself if there is a risk that it attempts to read data that has been cleaned up already. Specifically, if the standby query reads a data block that has been recently modified it will cancel itself. Note it only needs one modified data block to cause a cancel and that the data block does *not* need to have been modified by a cleanup record - any change is sufficient to create this risk. (This is very similar to the much feared Oracle error message "snapshot too old"). Note also that this means that idle-in-transaction sessions are never canceled except by locks. Users should be clear that tables that are regularly and heavily updated on primary server will quickly cause cancellation of any longer running queries made against those tables.

If cancellation does occur, the query and/or transaction is repeatable - i.e. the error is dynamic and will not necessarily occur the same way if the query is executed again.

2. We connect to Primary server from Standby server and keep a transaction open. This guarantees that a cleanup record is never generated and we don't ever get into situation 1(b). This option can be performed fairly simply using contrib/dblink functions.

3. We pause recovery by issuing a pg_recovery_pause() function. This halts application of WAL records completely and allows queries to proceed to completion without problem. We can issue pg_recovery_continue() at any time, so the pause can be held for long or short periods, as the administrator allows. This method of conflict resolution may mean that there is a build up of WAL records waiting to be applied and this will progressively increase the failover delay. If there is regular arrival of WAL records this would quickly prevent the use of the standby as a high availability failover target. Some users may wish to use multiple standby servers for various purposes. Pauses in recovery stay until explicitly released, i.e. max_standby_delay is overridden by the use of pauses.

An example of the above would be:

  • User on Standby running a long query involving tables X and Y
  • User on Primary repeatedly updates table Z

The user on the primary will be generating HOT updates on table Z and these will generate cleanup WAL records. In this case the rows cleaned up are very recent and on the standby node we are able to detect that removing those row versions could *potentially* prevent the Standby query from executing correctly and so there is a conflict. Initially, we will wait for max_standby_delay, but then afterwards we will give the standby query notice to quit. In this case, the standby query never does read data that has been cleaned, since it is looking at different tables, so the standby query runs to completion normally.

We keep track of the first 8 relations (that is, indexes and/or tables) that cause conflicts. When we have conflicts with more than 8 tables we apply the conflict handling to all relations. So in the above example if we had already recorded conflicts with eight other tables (plus Z means 9 tables, more than our cache) then we *would* cancel the query against tables X and Y, even though those tables have not been touched. Conflict processing increases quickly with number of tables in the conflict cache, so raising the number above 8 is not practical.

Note that max_standby_delay is set in recovery.conf. It applies to the server as a whole, so once used it may not be available for other users. They will have to wait for the server to catch up again before the grace period is available again. So max_standby_delay is a configuration parameter set by the administrator which controls the maximum acceptable failover delay, rather than a user parameter to specify how long their query needs to run in.

Waits for buffer cleanup locks do not currently result in query cancellation. Long waits are uncommon, though can happen in some cases with long running nested loop joins.

These two items are discussed in the administrator's section since they are not typical user situations

  • Dropping tablespaces on the master while standby queries are using those tablespace for temporary work files (work_mem overflow)
  • Dropping users, databases or tablespaces on the master while those objects are connected on standby

Administrators Overview

Hot Standby is enabled by default, though can be disabled by setting "recovery_connections = off" in recovery.conf.

A second parameter, "max_standby_delay" should also be set by the administrator in recovery.conf. The default is 30 seconds. This should be set according to business priorities. For example if the server is primarily tasked as a High Availability server, then you may wish to lower max_standby_delay or even set it to zero. If the standby server is tasked as an additional server for decision support then it may be acceptable to set this to a value of many hours, e.g. max_standby_delay = 43200 (12 hours). It is also possible to set max_standby_delay to -1 which means "always wait" if there are conflicts, though if unmonitored this may not be a useful mode.

It is strongly recommended that the setting of max_connections on the standby should be equal to or greater than the setting of max_connections on the primary. This is to ensure that standby has sufficient resources to manage incoming transactions. If the standby is asked to manage too many concurrent *write* transactions it will run out of recovery session resources. Read only transactions have no effect on the standby. If the standby runs out of recovery session resources then recovery will continue normally, but it will prevent MVCC snapshots from being taken and all standby queries will fail with an ERROR. Snapshots will be re-enabled after the next checkpoint on the primary for which the number of write transactions has fallen back below the cutoff.

Users will be able to write large sort temp files and re-generate relcache info files, so there is no part of the database that is truly read-only during hot standby mode. There is no restriction on use of set returning functions, or other users of tuplestore/tuplesort code.

Statspack functions should work OK, so tools such as pgAdminIII should work. pgAgent will not, though the next version will allow execution of jobs on a read only node.

Failover can be initiated at any time by allowing the startup process to reach the end of WAL, or by issuing the function pg_recovery_stop() as superuser. (see next section).

Stats collector is active during recovery. All scans, reads, blocks, index usage etc will all be recorded normally on the standby. Replayed actions will not duplicate their effects on primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at start of recovery, so stats from master and standby will differ.

pg_cancel_backend() will work on user backends, but not Startup process. pg_locks will show locks held by backends and by startup process. pg_stat_activity will not show an entry for startup process.

check_pgsql will work, but its very simple. check_postgres will also work, though many some actions could give different or confusing results. e.g. last vacuum time will not be maintained for example, since no vacuum occurs on the standby.

Dynamically loadable modules work, including the new pg_stat_statements.

Advisory locks work normally in recovery, including deadlock detection. No other form of deadlock is possible in recovery because the only only lock type that can be taken on database objects is AccessShareLock. Please note that advisory locks are never WAL logged, so it is not possible for an advisory lock to conflict with WAL replay.

slony won't run on the standby either! The Standby is a Clone and not a Slave. A Slave is a separate database that is forced to duplicate the actions of the Master. The Standby is an exact copy, in every detail that matters.

New oids cannot be assigned, though various UUID generators may still work if required.

Currently, creating temp tables is not allowed during read only transactions, so in some cases existing scripts will not run correctly. It is possible we may relax that restriction in a later release. This is both a SQL Standard compliance issue and a technical issue, so will not be resolved in this release.

Running DROP TABLESPACE on master has no effect on users on standby, though can only succeed if the tablespace is empty. Some standby users may be actively using the tablespace via their temp_tablespaces parameter. If so, those users will be have their current queries cancelled, so that they remove their temp files and allow WAL replay to proceed.

Running DROP DATABASE, ALTER DATABASE SET TABLESPACE, or ALTER DATABASE RENAME on master will cause all users connected to that database on the standby to be forcibly disconnected, once max_standby_delay has been reached. Similarly, running DROP USER or DROP ROLE for a role with login capability will cause all users connected using that role to be forcibly disconnected, once max_standby_delay has been reached.

WAL file control commands will not work during recovery e.g. pg_start_backup(), pg_switch_xlog() etc..

The following administrator commands will not be accepted during recovery mode


Note again that some of these commands are actually allowed during "read only" mode transactions on the master.

So you cannot create additional indexes on the standby node, nor can you collect statistics via ANALYZE in new/different ways. (EXPLAIN ANALYZE still works, it is a different command).

If you wish to perform maintenance on the standby, run commands on the primary and the standby will then also mimic those changes. e.g. if you need to ANALYZE a table, do so on the primary.

Autovacuum is *not* active during recovery, though will start normally if recovery ends. Bgwriter is active during recovery and will perform restartpoints (similar to checkpoints on primary) and normal block cleaning activities.

Please be reminded that hash indexes are unusable on hot standby, due to already existing and documented limitations for that index type.

Dynamic Control of Recovery

The functions shown in Table 9-56 assist in archive recovery. Except for the first three functions, these are restricted to superusers. All of these functions can only be executed during recovery.

Table 9-56. Recovery Control Functions

Name Return Type Description
pg_is_in_recovery bool True if recovery is still in progress.
pg_last_recovered_xact_timestamp() timestamp with time zone Returns the original completion timestamp with timezone of the last completed transaction in the current recovery.
pg_last_recovered_xid() integer Returns the transaction id (32-bit) of last completed transaction in the current recovery. Later numbered transaction ids may already have completed. This is unrelated to transactions on the source server.
pg_recovery_pause() void Pause recovery processing, unconditionally.
pg_recovery_continue() void If recovery is paused, continue processing.
pg_recovery_stop() void End recovery and begin normal processing.
pg_recovery_pause_xid(int) void Continue recovery until specified xid completes, if it is ever seen, then pause recovery.
pg_recovery_pause_time(timestamp) void Continue recovery until a transaction with specified timestamp completes, if one is ever seen, then pause recovery.
pg_recovery_advance(int) void Advance recovery specified number of records then pause.

pg_recovery_pause and pg_recovery_continue allow a superuser to control the progress of recovery of the database server. This allows queries to be executed to determine how far recovery should progress. If the superuser wishes recovery to complete and normal processing mode to start, execute pg_recovery_stop.

Variations of the pause function exist, mainly to allow PITR to dynamically control where it should progress to. pg_recovery_pause_xid and pg_recovery_pause_time allow the specification of a trial recovery target. Recovery will then progress to that point, and then the database can be inspected to see if this is the correct stopping point.

pg_recovery_advance allows recovery to progress record by record, for very careful analysis or debugging. Step size can be 1 or more records. If recovery is not yet paused then pg_recovery_advance will process the specified number of records then pause. If recovery is already paused, recovery will continue for another N records before pausing again.

If you pause recovery while the server is waiting for a WAL file when operating in standby mode it will have apparently no effect until the file arrives. Once the server begins processing WAL records again it will notice the pause request and will act upon it. This is not a bug.

Once recovery is paused it will stay paused until you release it, even after the server falls further behind than max_standby_delay.

You can specify "recovery_starts_paused = true" in recovery.conf, though the default is not-paused.

Diagnosing problems

First, check that your are connected to the correct server and that the server is in recovery. You can check this by looking at "ps" output and looking for a "startup process".

There are sometimes reasons why you cannot connect. Check the log for messages.

If you think you should be able to see an object or some data, check that the standby has processed all oustanding WAL. It may just be the standby is lagging behind.

Look at the current snapshot using "select txid_current_snapshot();". Look at pg_locks and pg_stat_activity.

Design Overview


Accuracy: Some changes since this was written, refer to main patch for current details

What already works

The most difficult part of running queries during recovery is creating a valid MVCC snapshot. The route we have chosen is to have WAL records emulate transaction entries in the procarray, thus allowing users to see entries for TransactionIds that have started writing WAL but not yet had their commit records applied.

The following items for prerequisites for emulating transactions during recovery.

  • TransactionIds are densely allocated in increasing order. This means that there are no gaps and the assignment is monotonic. Read-only transactions do not create

TransactionIds so we can run read-only transactions without disrupting duplicating already used ids.

  • Subtransaction commit does not write WAL, so it is important that it also does not write clog entries until commit. This then allows us to perform all the actions of subtransaction commit at the same time as we process the commit record.

As a result, we have the infrastructure on which to build to make Hot Standby work.


In 8.3, recovery maintains the commit log (or clog) as transactions commit or abort. This alone is insufficient to allow user queries to have full visibility of MVCC data.

For Hot Standby we will maintain 3 data structures:

  • ProcArray - records emulated "running" transactions
  • Clog - required to identify visibility of transactions
  • Subtrans - required to identify visibility of subtransactions

We do not need to maintain Multitrans during recovery. We will not be running HeapTupleSatisfiesVacuum() during recovery.

The events that maintain these structures are:

  • new transaction - add a recovery proc with this top-level xid
  • new subtransaction - new subxids are marked with top-level xid, so add new subxid onto recovery proc's subxidcache. We may also add entry to subtrans.
  • commit/abort - remove recovery proc

To allow the above, each WAL record has both the xid and the top-level xid

We expect new xids to come in dense order, i.e. each new xid is 1 later than previous (allowing for wraparound). However, there is a race condition between xid assignment (XidGenLock) and xid use (WALInsertLock), so the xids can arrive with gaps in the sequence. We will generally see the first WAL records for those intermediate xids appearing shortly afterwards, but even so we must keep track of these as yet UnobservedXids.

Row Removal

Various type of block-level actions were performed on the primary with Super-Exclusive/BufferCleanupLocks held. This same level of lock must be held otherwise queries may attempt to re-access data blocks while they are being cleaned. We will need to change many redo actions so that they take BufferCleanup locks, using a new XLogReadBufferForCleanup(). Changes would be required to HEAP2 actions (freeze, clean and cleanmove) as well as to all index vacuuming actions.

We can't cancel a query that holds a pin on a particular block since we don't keep track of who holds a pin. We just know a backend has a pin and that the startup process must wait.

This may become a problem, or it may not. In most cases, backends hold 5-15 pins concurrently and pins are held for relatively short times. Startup process will provide debug information for when pin times are extended, and for monitoring total delay from pin waits.

Some strategies, if this becomes a problem:

  • minimise pin hold times wherever this occurs
  • deferring application of WAL for pinned blocks (requiring us to use conditional locking)
  • making VACUUM FREEZE hold stronger locks on standby to prevent query access (but doesn't help HOT)
  • ensuring that we perform read ahead I/O for WAL records that have not yet reached head of apply queue
  • change some of the ways index AMs work with respect to row removal (see later)

Hot Standby may work best with full_page_writes = off.

In general, we must do one of these three things: 1) pass OldestXmin for read-only snapshots to the master 2) make WAL replay halt while queries complete 3) cancel queries that could see the data about to be removed

We will implement a combination of strategies 2 and 3, while allowing 1 to exist also. (It might be trivial to make (1) work using dblink).

Each WAL record that cleans tuples has a latestRemovedTransactionId on it. If latestRemovedTransactionId is later than a running query on standby then we wait. When we stop waiting we tell all at-risk queries the LSN of the first WAL record that has potentially removed tuples they can see. If they see a block with a higher LSN they cancel *themselves*.

This works OK, since SeqScans never read blocks at end of file that didn't exist when they started, so long queries need not be cancelled when they access growing tables.

The wait is controlled by a parameter called max_standby_delay. The standby delay is the time taken to complete all outstanding WAL activity. max_standby_delay is thus the additional time we think acceptable should we choose to perform failover from the primary to the standby. max_standby_delay = 0..1000000s with typical values being 0s for a very High Availability server, or 3600s for a typical reporting server.

When Startup waits, as described above, this causes the standby_delay for the whole server to increase. Probably more discussion required about how this would work.

Sounds like we need some a balancing system that has a target or optimal standby delay and an absolute max, so that if we are between optimal and max we try to return to optimal. Maybe this would be max_query_standby_delay and max_server_standby_delay. Individual queries can force the server to wait for up to max_query_standby_delay, though no wait occurs that would make the whole server get more behind than max_server_standby_delay. If max_query_standby_delay >= max_server_standby_delay the server delay takes precedence.

We would perform a planned switchover by first setting max_standby_delay=0 on standby, so it catches up. Then shutoff primary users, wait for all WAL to drain across to standby and let her start up.

We will maintain OldestXmin in shared memory to allow it to be accessed quickly.

Issues: Many different use cases here. Doubt we can provide for all of them in first release.

Transaction Snapshots

All queries will use MVCC Snapshot Isolation in either Read Committed or Serializable modes. These require snapshots, plus visibility information as described above.

Snapshots will be derived directly from shared memory on stanbdy, since data is maintained as transactions arrive/complete. We will also track latest_completed_xid.


All read-only transactions will use and maintain a relcache, allowing them to plan and execute queries using a cached copy of the current database catalog tables. The relcache will be maintained on each backend normally, re-reading catalog tables when invalidation messages are received.

Invalidation messages will be sent by the Startup process, though the Startip process will not maintain its own copy of the relcache. Invalidation messages will be issued when new WAL messages are received. XLOG_RELCACHE_INVAL messages will be sent every time we make a change to a cached heap tuple within CacheInvalidateHeapTuple(). These always occur after a change and will include all required information to send a full invalidation message.

This means that Startup process will continue to use XLogReadBuffer(), minimising the changes required in current ResourceManager's _redo functions - there are still some, see later. It also means that read-only backends will use ReadBuffer() calls normally, so again, no changes required throughout the normal executor code.

Relation Locking

In recovery, transactions will not be permitted to take any lock higher than AccessShareLock nor assign a TransactionId. This should also prevent us from writing WAL, but we protect against that specifically as well, just in case. (Maybe we can relax that to Assert sometime later). We can dirty data blocks but only to set hint bits. (That's another reason to differentiate between those two cases anyway). Note that in recovery, we will always be allowed to set hint bits - no need to check for asynchronous commits. All other actions which cause dirty data blocks should not be allowed, though this will be just an Assert. Specifically, HOT pruning will not be allowed in recovery mode.

Since read-only backends will only be allowed to take AccessShareLocks there will be no lock conflicts apart from with AccessExclusiveLocks. (If we allowed higher levels of lock we would then need to maintain Multitrans to examine lock details, which we would also rather avoid). So Startup process will not take, hold or release relation locks for any purpose, *apart* from when AccessExclusiveLocks are required.

The Startup process will emulate locking behaviour for transactions that require AccessExclusiveLocks (AELs). AELs will be held by first inserting a dummy TransactionLock entry into the lock table with the TransactionId of the transaction that requests the lock. Then the lock entry will be made. Locks will be released when processing a transaction commit, abort or shutdown checkpoint message, and the lock table entry for the transaction will be removed.

Locks will be taken by the Startup process when it receives a new WAL message. XLOG_LOCK_ACCESSEXCLUSIVE messages will be sent each time a backend *successfully* acquires an AccessExclusiveLock.

Any AEL request that conflicts with an existing lock will cause some action: if it conflicts with an existing AEL then an error is raised; if the AEL request conflicts with a read-only backend then we wait for a while (see Row Removal above) then the read-only backend will receive a cancel message to make it go away.

If Startup process crashes it is a PANIC anyway, so there is no difficulties in cleanup for the lock manager with this approach.

The LOCK TABLE command by default applies an AccessExclusiveLock. This will generate WAL messages when executed on the primary node. When executed on the standby node the default will be to issue an AccessShareLock. Any LOCK TABLE command that runs on the standby and requests a specific lock type will be rejected.

Issues: How will advisory locks work?

Index Locking

Recovery queries will also wish to use indexes, so all index structures are assumed to be uncorrupted and able to be used during recovery. (This is the same asumption we make immediately after recovery completes, but it is worth noting this explicitly).

An index type may not be usable during recovery if it's WAL _redo() function does not take and hold buffer locks in the same (or at least, an equally correct) manner during recovery. For example, btree indexes have strict rules on buffer locking that must be followed during block splits.

  • btree - initial analysis shows locking is correct for btrees (other opinions welcome)
  • GIST, GIN need some changes, but no problems foreseen (says Teodor)
  • hash - doesn't generate WAL yet, so unusable during recovery
  • bitmap - locking protocol works without needing cleanup locks, just exclusive (looking at patch)

Changes to index metapages must also generate cache invalidations by using XLOG_RELCACHE_INVAL messages in WAL. Metapage changes will require cleanup locks, though these are almost certain to be granted because of cacheing.

B-tree index tuples are removed by

  • inserts when the page is full - require only an exclusive lock
  • VACUUMs - require a cleanup lock

The same lock requirements apply to Hot Standby, so we will need to differentiate the two uses of XLOG_BTREE_DELETE so that we can re-apply the appropriate lock during btree_redo()

HOT imposes some tricky conditions as to when it is safe to use a new index. These conditions also apply to Hot Standby, so there is nothing we need do. (Note also that Hot Standby has got nothing to do with the HOT feature).

Summary of changes to WAL records

  • New WAL record for successfully acquired AccessExclusiveLocks
  • New WAL record for cache invalidation messages
  • Two new fields on each XLogRecord - xid2, a TransactionId and info2, 2 bytes of flags/data

The last change uses up the 6 bytes that were being wasted on 64-bit servers, and adds 4 bytes on to the XLogRecord length for 32-bit servers.

I would like to use 3 of the flag bits for Hot Standby:

  • flag1 to indicate the first WAL record for an xid. This tells us Hot

Standby to efficiently manage a recovery snapshot.

  • flag2 to indicate the first WAL record for a subxid. That tells us to

maintain subtrans during Hot Standby.

  • flag3 to indicate that this is a cleanup record. This tells us to use latestRemovedTransactionId to inform user queries of cleanup having taken place.

If flag2 is set then xid2 contains the xid of xl_xid's parent. else if flag3 is set then xid2 contains the latestRemovedTransactionId

Note that flag2 and flag3 are mutually exclusive. cleanup records do not make transactional changes, they just cleanup what can no longer be seen. The first change made by a subtransaction is always on another WAL record from any cleanup associated with the block (HOT etc).

Design Notes

Additional thinking behind the technical designs above, not necessarily fully coherent sets of thoughts.

Transaction Snapshots

Deriving transaction snapshots is probably the most difficult problem for Hot Standby to resolve. We *must* have a transaction snapshot to access table data in a consistent way. (See discussion later on Transaction Isolation).

We can derive transaction snapshots

  • remotely from primary node
  • locally on the standby node

If we derive a snapshot locally, then we will end up with a situation where the xmin of the local snapshot precedes the xmin of the primary node. When this occurs it will then be possible for WAL records to arrive on the standby that request removal of rows that a transaction might wish to see. Preventing that situation can be done by either deferring WAL apply or by cancelling queries.

We can defer WAL apply for particular tables only, but this could significantly complicate the apply process and is not a suggested option for the first release of this feature.

We might control query cancellation by tracking which tables have had rows removed that would have been visible to particular queries. Again, possible but suffers from the problem that tables on which HOT is frequently active would be almost unusable.

Simply ignoring WAL removal has been discussed and rejected (so far). Explicitly defining the tables a transaction wishes to see has also been discussed and rejected (so far).

So the most generally applicable and simplest initial solution is to take snapshots from the remote primary node. The proposal for this follows:

Transaction snapshots will be managed by a new process, Snapshot. Snapshot will start only in recovery mode and will exit when normal processing mode begins. Snapshot process will connect to the remote primary node and export snapshot data and copy this to shared memory on the standby node. When all standby backends have run UnregisterSnapshot() then the snapshot will then be unregistered on the remote primary node also.

The standby must not think a transaction is visible until all changes made by it are have been applied. So snapshots from the primary cannot be used on the standby until the LSN at which they were taken has been reached by WAL apply on the standby. Snapshots don't normally have an LSN, so we must derive that information as well when we take a Snapshot. With asynchronous commits a transaction can be committed and yet not yet written to disk, so we cannot rely on the current WAL write pointer. Taking the WAL insert pointer is the safest thing to do, though most heavily contended. We don't want to hold ProcArrayLock while requesting WALInsertLock, so we will discover the LSN of the WAL insert pointer *after* the Snapshot has been derived.

So snapshots from the primary take time to before they can be used. The delay is exactly the current processing delay from primary to standby, so another reason why we do not wish to fall behind.

Taking snapshots from primary has a few disadvantages

  • snapshots take time before they are usable
  • requesting new snapshots is via remote request
  • snapshots on primary prevent row removal (but this was also an advantage of this technique!)

If primary and standby node are connected by private ethernet then the snapshot request time will be ~10ms, though that is probably 100 times slower than current snapshot access. If primary and standby are opposite sides of an ocean then times could be as high as 500ms. Taking snapshots directly could be annoyingly slow for small read-only statements, so we need to buffer this process in various ways. For larger queries, this may not be a problem at all, but we do not wish to limit or assume the uses of the standby node.

First of all, each backend will have the opportunity to reuse previous snapshots both within transactions and across them. A userset parameter snapshot_reuse_window = 0..60000ms will define the time window in which any request for a new snapshot will simply result in being fed the last snapshot again. When the window on a snapshot has expired a newer snapshot will be presented instead. This idea is similar to serializable transactions, which continually reuse the same snapshot. This is a useful parameter for normal processing as well, since it will reduce contention on the ProcArrayLock for many applications.

Snapshots can be reused across transactions in recovery mode, since they are held in shared memory. Amount of shared memory dedicated to storing snapshots will be max_connections * size of snapshots.

Since there is a delay between obtaining a new snapshot and it becoming usable the Snapshot process will buffer them until they become "mature", like a good Whiskey. Snapshot process will take regular snapshots and pre-age them so that when a backend requests a snapshot it will be given the most recently matured snapshot. Time between snapshots is set by snapshot_preread_timeout = 0..60000ms. If a request for a snapshot arrives and there are no snapshots waiting to mature then this will trigger snapshot process to request a new primary snapshot. (This parameter could be automatically set based upon the arrival rate of snapshot requests, but this is a something to consider later).

If snapshot_reuse_window = 0 then a backend will be presented with a freshly obtained snapshot and will then wait until the exact first moment it can be used before returning.

We can continue to reuse snapshots from the primary even if the primary crashes, becomes disconnected or is shutdown/restarted. New snapshots are obviously not possible until it appears again. It's not that common for us to lose contact with the primary *and* for it to *not* be a failover, so this seems like an acceptable restriction.

txid_current_snapshot() is sufficient for our needs. We only need to know about top-level xids. This means we will need a working subtrans data structure, which means we will need some way to update it. Proposal is for the first WAL record in any subtransaction to note the parentxid. WAL apply loop will update subtrans during recovery when a WAL record arrives with an augmented parentxid.


Query performance may slow down WAL apply if too many queries execute concurrently and use up CPU and I/O resources that would have been used by the Startup process. Queries that hold pins for long periods on tables with frequent updates will slow down WAL apply.

Performance is not expected to be impaired by these new WAL message types since they apply only to AccessExclusiveLocks, which are reasonably rare even in applications that issue many DDL requests.


Two Phase Commit

Will need to maintain locking for prepared transactions. So will need to read twophase state and re-create any AELs therein. No other actions foreseen.

Tuple Locking

Tuple level locks are WAL-logged, so WAL replay will make the changes on blocks to emulate locking. There is no required access to lock manager.

From correctness perspective, no changes to tuples that are locked should appear in WAL, so we ignore the possibility.

No changes required here. No users can wait on locks, since they all use AccessShareLocks. So they just ignore tuple locks.

Data structure Check

  • Heap - discussed above
  • Index - discussed above
  • FSM - checked with Heikki and new FSM causes no issues for Hot Standby
  • twophase - just files, so structure is OK
  • pg_log - writable
  • pg_xlog - not writable by WAL apply or users
  • clog - data structure valid and usable
  • subtrans - data structure is non transactional
  • multixact - is not required


Florian Pflug's contribution to making Hot Standby was an important one. The single hardest problem in 2007 was how we would get access to a TransactionId. Florian's careful analysis resulted in the idea that is now critical to the rest of this project: VirtualTransactionIds. In addition, various internal changes made by a range of developers for 8.3 have made this significantly easier than it was previously.

Florian also identified the problems of relcache handling and block cleanup locks. Unfortunately, none of Florian's earlier code can/has been reused.

This design includes technical feedback from many PostgreSQL hackers, all of whom I thank and acknowledge.

Project Plans


  • 1 Correctness
  • 2 Performance


  • Design (2008!)
  • Connection during recovery (mid-July)
  • Infrastructure changes (1 Sept)
  • Queries work when no WAL changes arrive (snapshots)
  • Queries work while WAL changes arrive (block locking)
  • Queries work while DDL changes arrive (locking, relcache)
  • Performance
  • Queries work with two-phase commit

Features not planned for this release

(Or at least not by us in this release.)

  • Automatic offloading - run a query on primary node and have it offload processing of larger queries onto standby nodes.
  • Different mechanisms for deriving standby snapshots
  • LISTEN, NOTIFY - should be possible when this happens via shared memory rather than pg_listener
Personal tools