Hot Standby TODO

See also: Hot Standby


Must-fix issues

All of these items will be handled prior to Beta:

  1. No issues remaining as of 13 Feb 2010

Serious Issues

  1. Improved conflict resolution OR conflict avoidance
    • When replaying b-tree deletions, we currently wait out/cancel all running (read-only) transactions. We take the ultra-conservative stance because we don't know how recent the tuples being deleted are. If we could store a better estimate for latestRemovedXid in the WAL record, we could make that less conservative.
    • Simon says: getting a better estimate is only helpful if xmin horizon is longer on master than on standby
    • Added a new filter based around use of latestCompleteXid - reduces problem somewhat
    • New deferred conflict resolution model designed and discussed on hackers
  2. Standby delay on idle system
    • The "standby delay" is measured as current timestamp - timestamp of last replayed commit record. If there's little activity on the master, that can lead to surprising results. For example, imagine that max_standby_delay is set to 8 hours. The standby is fully up-to-date with the master, and there's no write activity on the master. After 10 hours, a long reporting query is started on the standby. Ten minutes later, a small transaction is executed on the master that conflicts with the reporting query. I would expect the reporting query to be canceled 8 hours after the conflicting transaction began, but it is in fact canceled immediately, because it's over 8 hours since the last commit record was replayed.
    • Simon says: changed to allow checkpoints to update recoveryLastXTime (Simon DONE)
    • Heikki: we skip checkpoints when master is completely idle, so we still have the same problem in that case
    • Requires keep-alives with timestamps to be added to sync rep feature
  3. Statement cancel on idle session issues
    • When an idle-in-transaction transaction is killed because of a conflict with recovery, we use FATAL and kill the whole connection. We should find a way to cancel just the current transaction.
    • Simon says: Joachim has now resolved this; it just needs rework and commit. Tom found issues that need further work.
  4. Performance
    • Profiling
    • Look at whether we need to have an option to cancel a query at end of recovery. If we do that we don't need to continually check whether we are still in recovery while running, so that may buy us some performance and scalability.
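
The arithmetic behind the "Standby delay on idle system" example above can be checked with a small sketch. This is only a toy model of the formula quoted in that item (current timestamp minus timestamp of last replayed commit record), not PostgreSQL code; the function name, variable names and the starting date are arbitrary assumptions made for illustration.

```python
from datetime import datetime, timedelta

def standby_delay(now, last_replayed_commit):
    """Delay as described in the item: now minus the last replayed commit time."""
    return now - last_replayed_commit

max_standby_delay = timedelta(hours=8)

# Arbitrary starting point: the last commit record is replayed, then the master goes idle.
last_replayed_commit = datetime(2010, 1, 1, 0, 0)

# A long reporting query starts on the standby after 10 idle hours.
query_start = last_replayed_commit + timedelta(hours=10)

# Ten minutes later a conflicting change arrives from the master.
conflict_arrives = query_start + timedelta(minutes=10)

# The conflicting record has not been replayed yet, so the delay is still
# measured against the old commit timestamp.
measured_delay = standby_delay(conflict_arrives, last_replayed_commit)
cancelled_immediately = measured_delay > max_standby_delay

# What the item's author expected instead: a grace period counted from the
# conflict itself (approximated here by its arrival time on the standby).
expected_cancel_time = conflict_arrives + max_standby_delay

print("measured delay:", measured_delay)
print("cancelled immediately:", cancelled_immediately)
print("expected cancel time:", expected_cancel_time)
```

In this model the measured delay already exceeds max_standby_delay when the conflict arrives, so the query gets no grace period at all, which is the surprising behaviour the item describes.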


  • Docs
    • Add reference from vacuum_defer_cleanup_age to max_standby_delay
    • Fix xref from vacuum_defer_cleanup_age to Hot Standby chapter
  • When switching from standby mode to normal operation, we momentarily hold all AccessExclusiveLocks held by prepared xacts twice, needing twice the lock space. You can run out of lock space at that point, causing failover to fail. Simon: significantly reduced issue.
  • There's the optimization to replay of b-tree vacuum records that we discussed earlier: Replay has to touch all leaf pages because of the interlock between heap scans, to ensure that we don't vacuum away a heap tuple that a concurrent index scan is about to visit. Instead of actually reading in and pinning all pages, during replay we could just check that the pages that don't need any other work to be done are not currently pinned in the buffer cache.
    • Simon for post-commit optimization.
  • ResolveRecoveryConflictWithVirtualXIDs polls until the victim transactions have ended. It would be much nicer to sleep. We'd need a version of LockAcquire with a timeout. Hmm, IIRC someone submitted a patch for lock timeouts recently. Maybe we could borrow code from that?
  • All but one caller of ResolveRecoveryConflictWithVirtualXIDs gets the XID list from GetConflictingVirtualXIDs(). How about providing a shorthand function to do both steps in one call? Would make the call sites a bit less verbose.
  • Starting recovery connections from a shutdown checkpoint has been requested. This causes problems if recovery_connections is disabled at startup on the primary and recovery then continues, so it has not been added yet.
  • Master->Standby NOTIFY - Use a specially loggable NOTIFY command to enable LISTENing on standby for those events
  • If we use SIGUSR1 multiplexing, we can introduce a race condition that allows a false positive signal, resulting in cancelation by mistake. This could happen if the backend being signaled exits shortly after being identified as a conflict target; a new backend that reuses the same backend slot *and* has the same pid could then be mistaken for the old backend.
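
The slot-and-pid reuse race in the last item can be illustrated with a toy model. This is a sketch only: the Backend class, the slots table and both functions are hypothetical names invented for the illustration, and nothing here reflects the actual PostgreSQL signaling code.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    pid: int
    name: str
    cancelled: bool = False

# Hypothetical shared table: backend slot index -> backend occupying it.
slots = {}

def identify_conflict(slot):
    """Step 1: remember the (slot, pid) pair of the backend to cancel."""
    return (slot, slots[slot].pid)

def deliver_cancel(target):
    """Step 2, some time later: cancel whoever matches the remembered pair."""
    slot, pid = target
    backend = slots.get(slot)
    if backend is not None and backend.pid == pid:
        backend.cancelled = True
        return backend
    return None

# The old backend is identified as a conflict target.
old = Backend(pid=4242, name="old")
slots[3] = old
target = identify_conflict(3)

# Before the cancel is delivered, the old backend exits and a new backend
# happens to reuse both the same slot and the same pid.
del slots[3]
new = Backend(pid=4242, name="new")
slots[3] = new

# The check on (slot, pid) passes, so the wrong backend is cancelled.
victim = deliver_cancel(target)
print("cancelled backend:", victim.name)
```

The check in deliver_cancel cannot tell the two backends apart because the remembered pair carries nothing that distinguishes them, which is why the item calls the result a false positive.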

Resolved Issues

These issues have been resolved by CVS commits

  1. Relcache init file invalidation
  2. Statement cancel on idle session issues
    • Jurka says: should not give an ERROR message (this part only)
  3. Drop database doesn't cancel standby sessions that are completely idle
  4. Statement cancel on idle session issues
    • Tom says: Statement cancel signals should use SIGUSR1 multiplexing
  5. Startup process waits behind some long running processes
    • Endless wait causes deadlock with LockBufferForCleanup() in some cases
    • Workaround applied for the deadlock, though it needs more precise targeting of potential deadlocks. Fix designed; it now needs to be applied only when max_standby_delay = -1
    • The best resolution appears to be to resolve the lock wait, after which the deadlock stops being an issue, so add SIGALRM handling to the Startup process. Resolution: instruct all backends to ERROR if they hold the buffer pin that recovery requires.
  6. Reconstruct latestRemovedXid for btree delete records via heap pages accessed during recovery
  7. Starting recovery connections from a shutdown checkpoint has been requested.