Logical replication and physical standby failover

See also Failover slots for some historically relevant information.

Problem statement

As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.

This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely", especially if what's downstream is a continuous ETL process or other stream consumer that you can't simply drop and re-seed with a fresh copy of the upstream's base state.

These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss. It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.

I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.

Specific issues and proposed solutions

Ensuring catalog_xmin safety on physical replicas

Problems

  • When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.
  • A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period has passed during which slots exist on the standby but their catalog_xmin isn't actually guaranteed to be safe. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and the moment the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback; in between, the primary slot could advance and the catalog_xmin of the copy becomes invalid.
  • We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().
  • We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin(). It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point.)

Tooling workarounds

pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary, but the reserved catalog_xmin could be stale if the upstream slot advanced in the meantime. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the downstream to replay past the upstream lsn at which the slot copy was taken. The slot is then persisted so it becomes visible and usable. That's a lot of hoop jumping.
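
For illustration, the observable parts of that dance look roughly like this in SQL (slot and standby names are hypothetical, pg_copy_logical_replication_slot only exists in PostgreSQL 12 and later, and the step that actually installs the copied state on the replica is the part that needs pglogical's C code):

 -- On the primary: take a temporary copy of the logical slot so its
 -- restart_lsn/catalog_xmin can't advance while the sync is in progress.
 SELECT pg_copy_logical_replication_slot('sub_slot', 'sub_slot_copy', true);

 -- On the primary: wait until hot_standby_feedback has propagated a
 -- catalog_xmin onto the standby's physical slot, i.e. until this is
 -- non-null and old enough to cover the copy:
 SELECT catalog_xmin FROM pg_replication_slots WHERE slot_name = 'standby1_slot';

 -- Re-read the live slot's (possibly advanced) state; this is what gets
 -- installed on the replica by pglogical's C-level code:
 SELECT restart_lsn, confirmed_flush_lsn, catalog_xmin
 FROM pg_replication_slots WHERE slot_name = 'sub_slot';

 -- Record the primary's insert position at sync time ...
 SELECT pg_current_wal_insert_lsn();

 -- ... and on the standby, only persist/expose the synced slot once replay
 -- has passed that position (placeholder lsn shown):
 SELECT pg_last_wal_replay_lsn() >= '0/1500060'::pg_lsn AS caught_up;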

It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.

Proposed solution(s)

  • Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.
  • Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values.
  • If no catalog_xmin is defined and a hot_standby_feedback message tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.
  • Report the active replication slot's effective xmin and catalog_xmin in the walsender's keepalives (WalSndKeepalive()) so the downstream can tell if the upstream didn't honour its hot_standby_feedback reservations in full. The standby can already force the walsender to send a keepalive reply, so no change is needed there.
  • ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe (a hypothetical example follows below).

Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.
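
To illustrate the proposed ERROR-on-unsafe-catalog_xmin behaviour, a decoding attempt from a copied slot whose catalog_xmin isn't known to be protected locally would fail up front instead of risking silently wrong output. A hypothetical session (the error text is invented for illustration):

 -- Hypothetical behaviour only; today this can silently produce wrong
 -- results instead of erroring.
 SELECT * FROM pg_logical_slot_get_changes('copied_slot', NULL, NULL);
 ERROR:  catalog_xmin 1234 of slot "copied_slot" precedes the oldest safe catalog_xmin 1300 on this node
 HINT:  Re-sync the slot from the primary before decoding from it.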

Syncing logical replication slot state to physical replicas

Problems

  • Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.
  • Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.
  • A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.


Tooling workarounds

pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.

pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.
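
For example, the same connection string the walreceiver uses can be reused with replication=database, and that one connection accepts both replication commands and plain SQL (PostgreSQL 10 and later; slot name hypothetical):

 -- e.g.  psql "host=primary dbname=appdb user=replicator replication=database"

 -- Replication-protocol command on that connection:
 IDENTIFY_SYSTEM;

 -- Ordinary SQL on the same connection, reading the upstream slot state to sync:
 SELECT slot_name, restart_lsn, confirmed_flush_lsn, catalog_xmin
 FROM pg_replication_slots
 WHERE slot_name = 'sub_slot';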

pglogical provides functions the user can call to check whether their physical replica is ready to use as a promotion candidate.

Proposed solution(s)

  • pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, slotdata bytea, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) has diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or under a different slot name. A usage sketch follows after the alternatives below.

-- or --

  • Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling

and

  • A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)

Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.
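
A usage sketch of the first option; both functions are the proposed interfaces described above, not existing ones:

 -- On the primary:
 SELECT pg_replication_slot_export('sub_slot');          -- returns bytea

 -- Ship the bytea to the standby out of band, then on the standby:
 SELECT pg_replication_slot_import('sub_slot', '\x...'::bytea, false);
 -- ERRORs if the sysid doesn't match, the (timeline, lsn) has diverged, or
 -- the xmin/catalog_xmin can no longer be guaranteed; by default it blocks
 -- until the standby has replayed past the exported (timeline, lsn).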

Logical slot exported snapshots are not persistent or crash safe

Problems

  • Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.


Tooling workarounds

  • Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That preserves it even if the slot connection closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't preserve the exported snapshot across a crash/restart.
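
The attach step itself is plain SQL. A minimal sketch, assuming the slot's exported snapshot id was 00000003-0000001B-1 (a made-up value):

 -- In the keepalive backend (loopback connection or bgworker):
 BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
 SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
 -- Keep this transaction open. This backend retains the consistent view
 -- (and holds the xmin) even if the slot connection later closes, but new
 -- imports of the snapshot id fail once the exporting connection is gone,
 -- and nothing here survives a crash/restart.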

Proposed solution(s)

  • Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If backwards compatibility is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT.
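
On the replication protocol that might look like the following, where PERSISTENT_SNAPSHOT is the hypothetical new option proposed above:

 -- On a replication=database connection:
 CREATE_REPLICATION_SLOT "sub_slot" LOGICAL pgoutput PERSISTENT_SNAPSHOT;
 -- The exported snapshot would then stay protected (importable, xmin held)
 -- until the slot is dropped or replication from it begins, regardless of
 -- whether this connection stays open.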

Logical slot exported snapshots cannot be used to offload consistent reads to replicas

Problems

  • There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistent with an existing exported snapshot.) This prevents logical rep tools from offloading the initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to query consistently across multiple replicas - ETL and analytics tools, for example.


Tooling workarounds

Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.

Proposed solution(s)

  • A pg_export_snapshot_data(text) => bytea function to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track the safe catalog_xmin. A usage sketch follows after the alternatives below.

-or-

  • Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.
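
A usage sketch of the first option; pg_export_snapshot_data and SNAPSHOT_DATA are the proposed interfaces, not existing ones:

 -- On the primary, given an already-exported snapshot id:
 SELECT pg_export_snapshot_data('00000003-0000001B-1');  -- returns bytea

 -- On a replica, once it has replayed past the insert lsn embedded in the bytea:
 BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
 SET TRANSACTION SNAPSHOT_DATA '\x0123abcd';
 -- Queries in this transaction would see exactly what the exported snapshot
 -- saw on the primary; the import would ERROR if the xmin is too old or
 -- replay hasn't reached the required lsn yet.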

Logical slots can fill pg_wal and can't benefit from archiving

Problems

  • The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.


Tooling workarounds

No sensible ones.

Proposed solution(s)

  • Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segments temporarily if they're not found in pg_wal, now that restore_command is integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched we can open and unlink it, so we don't have to worry about leaking segments except on Windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.
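
The configuration side already exists; only the logical decoding read path would change. For example (archive path hypothetical):

 -- restore_command is an ordinary GUC (settable via postgresql.conf or
 -- ALTER SYSTEM) since PostgreSQL 12:
 ALTER SYSTEM SET restore_command = 'cp /mnt/wal_archive/%f "%p"';
 -- (a reload or restart is needed for it to take effect, depending on version)

 -- Today only recovery uses it. Under this proposal, a logical walsender
 -- whose needed segment is gone from pg_wal would fetch it temporarily via
 -- restore_command instead of failing.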

Consistency between logical subscribers and physical replicas of a publisher (upstream)

Problems

  • Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.

    Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts while a standby is disconnected, even if the standby has a slot and the slot is safely flushed past the lsn being sent, because synchronous_standby_names relies on application_name and pg_stat_replication rather than pg_replication_slots; and (c) if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.

Tooling workarounds

Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.
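
The condition such a commit callback ends up waiting on is essentially the following (slot names and the commit lsn are placeholders):

 -- "Have all failover-candidate physical slots confirmed this commit?"
 SELECT bool_and(restart_lsn >= '2/AB001230'::pg_lsn)
 FROM pg_replication_slots
 WHERE slot_name IN ('standby1_slot', 'standby2_slot');
 -- (A listed slot that doesn't exist simply isn't counted, and with no rows
 -- at all the result is NULL, so tools must check for missing slots too.)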

Proposed solution(s)

  • Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.
  • Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.
  • If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsns and logical slots' confirmed_flush_lsns have advanced past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.
  • Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.
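
Usage might look like this; failover_replica_slot_names is the proposed GUC, not an existing one:

 -- Cluster-wide, reusing the n-safe syntax of synchronous_standby_names:
 ALTER SYSTEM SET failover_replica_slot_names = 'ANY 1 (standby1_slot, standby2_slot)';

 -- Or per backend, e.g. in a logical replication client's connection string:
 --   options='-c failover_replica_slot_names=standby1_slot'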


Consistency between primary and physical replicas of a logical standby

Problems

  • Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.

    This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.

    The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced), but when the replica connects to the primary and asks for replay to start at that remote_lsn, the publisher silently starts at the maximum of the downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.

Tooling workarounds

  • Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.
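
On the subscriber, the value such a tool needs is essentially the newest local lsn that is flushed both locally and on all of its failover-candidate physical standbys (application names are placeholders):

 SELECT least(pg_current_wal_flush_lsn(), min(flush_lsn)) AS locally_safe_lsn
 FROM pg_stat_replication
 WHERE application_name IN ('standby1', 'standby2');
 -- The tool then reports to the publisher only upstream LSNs whose local
 -- apply is at or before locally_safe_lsn, via its replication origin
 -- tracking. Note that pg_stat_replication only lists connected standbys,
 -- so a disconnected candidate silently drops out of the calculation -
 -- part of why an in-core API is proposed below.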

Proposed solution(s)

  • Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.
  • Use that function in the pglogical replication worker's feedback reporting

Why can't we just re-create slots after failover?

All this would be much simpler if we could just use the subscriber's pg_catalog.pg_replication_origin_status state to re-create logical replication slots on the publisher. But this isn't possible to do safely.

Re-creating slot with original state

The replication origin only tracks remote_lsn, which corresponds to the upstream slot's confirmed_flush_lsn. It doesn't track the upstream catalog_xmin or restart_lsn. These are necessary to create a logical replication slot and cannot simply be derived from the confirmed_flush_lsn.

Even if we could track the full slot state, a replication slot's catalog_xmin must remain in effect at all times in order to prevent (auto)vacuum from removing catalog and user-catalog tuples that would've been visible to any transaction that is still pending processing on a logical slot. If these tuples might've been vacuumed away we could ERROR out during logical decoding, produce incorrect results, or crash.

While we could in theory re-create the slot on the promoted standby just before promoting it, assuming we had a way to record the restart_lsn and catalog_xmin, this won't guarantee that the catalog_xmin is valid. We don't track the oldest safe catalog_xmin in the catalogs or control file (see the "logical decoding on standby" thread on -hackers), so we can't know for sure whether it's safe. And the publisher might've advanced the slot and vacuumed some changes away, which the standby might then replay before promotion.

Additionally, the standby might've already removed needed WAL from pg_wal, and we don't support logical decoding of WAL fetched via restore_command.

Creating a new slot instead

Instead of trying to restore the slot state after failover, you might try to make a new logical slot with the same name, and resume replaying from it.

This won't work either. The new slot will have a confirmed_flush_lsn at some point after the point of promotion. If the subscriber requests replay from some LSN prior to that, the publisher will silently start sending changes from the slot's confirmed_flush_lsn instead. See the details given in the discussion earlier in this article.
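
Concretely (slot and plugin names are placeholders):

 -- On the promoted publisher, after failover:
 SELECT pg_create_logical_replication_slot('sub_slot', 'pgoutput');

 -- The new slot's confirmed_flush_lsn is at or after the creation point:
 SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'sub_slot';

 -- When the subscriber asks to START_REPLICATION from its recorded
 -- remote_lsn, which is older than that, the walsender silently begins at
 -- confirmed_flush_lsn instead, so every transaction committed between the
 -- two points is skipped without any error.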

All-logical-replication HA

There's an argument in the postgres community that we shouldn't invest time and effort in making the mature (but limited) physical replication support interoperate well with logical replication for HA and failover. Instead we should spend that effort on improving logical replication enough to make it an equivalent and transparent replacement for physical replication and physical replica promotion.

Logical replication HA missing pieces

The following points are some of the issues that would need to be addressed if we want logical replication to fully replace physical replication for failover and HA.

slots on failover-candidate replicas of publisher

If a logical subscriber is "promoted" to replace its publisher, all other subscribers of the old publisher are broken. They have no way to consistently replay any transactions committed on the old-publisher before the promotion event because the old-publisher LSNs make no sense on the new-publisher, and the old-publisher slots won't exist on the new-publisher.

It's not possible to just create a new logical slot on the new-publisher after promotion because slots cannot replay changes from the past or be rewound. They're forward-only.

maintaining slots

We'd need to have failover-candidate subscribers keep track of slots on the publisher: create new slots when they're created on the publisher, drop them when they're dropped on the publisher, retain resources on the subscriber until all publisher slots no longer require those resources, and advance them when they're advanced on the publisher.

This can't really be done just by having the subscriber poll the publisher's state, because then there's an unpredictable window during which slots newly created on the publisher don't yet exist on the subscriber at failover. So we'd need some replication slot hooks and the ability for the publisher to be aware of its failover-candidate subscribers.

(node,lsn) mapping

We need failover-candidate subscribers to be able to advance their local slots for peers of the publisher in response to publisher slot advances, so that resources get released, and so that subscribers of the old-publisher can resume replay consistently from the promoted subscriber (the new publisher) at the correct start-point.

So we could issue something like START_REPLICATION LOGICAL SLOT "foo" PUBLISHER "pub_id" LSN "XX/YY", and the promoted subscriber could map lsn XX/YY on node pub_id to its local LSNs.

We'd need something like a persistent lsn mapping of publisher to subscriber txns and some kind of node-id scheme.

Or some other means of consistent replication progress tracking that's tolerant of the publisher and subscriber having totally different LSNs, like the timeline IDs we have for physical replication but preferably without their limitations and hazards.

Flush confirmations and ordering

Same issue as for physical-replica primary failover: physical-before-logical ordering.

It's necessary to ensure that if a subscriber is promoted to replace a failed publisher, the furthest-ahead failover-candidate subscriber is promoted. Any other subscribers that are further ahead would otherwise have txns that the promoted subscriber wouldn't have, resulting in divergence.

If both failover-candidate subscribers and other subscribers / logical replication consumers exist, failover-candidate subscribers must confirm flush of new txns on the publisher before a commit can be sent to any other consumers. Otherwise, on failover, the other consumers would have txns from the old-publisher that the promoted replacement publisher would not have, resulting in divergence.

replication origins on failover-candidates of subscribers

There are challenges for promoting a cascaded subscriber to replace a failed subscriber node too.

Maintaining replication origins

When a subscriber advances its replication origin on a publisher, that information needs to be able to be reported to cascaded subscribers so that they can keep track of their effective replay position on the publisher. That way if the cascaded subscriber is promoted to replay directly from the publisher after the old subscriber fails, the promoted new subscriber knows which LSN to request from the publisher when starting replay.

Maintaining replication origins for the publisher on the subscriber's replicas at the right value shouldn't be too hard. We already report the true-origin upstream lsns in the logical protocol. This breaks down in cascades though. If we have:

 P -> A -> B -> C

and want to promote C to replace a failed B, resulting in

 P -> A -> C
 
    [x] B

we need to be able to keep track of the intermediate-upstream lsn of A on C, not just the true-origin-of-commit lsn of P.

This is not an issue for physical rep because there's only one LSN sequence shared by all nodes.

Flush confirmations and ordering

Same issue as for physical-replica standby failover: physical-before-logical ordering.

Much like for physical replication, the subscriber must hold down the flush lsn it reports to the publisher to the oldest value confirmed as flushed by all failover-candidate cascaded subscribers. Otherwise if a failover-candidate for a subscriber is promoted, the publisher might've advanced the slot's confirmed_flush_lsn and will then fail to (re)send some txns to the promoted subscriber.

Alternately, each failover-candidate subscriber must maintain its own slot on the publisher, or have the active subscriber or the publisher maintain those slots on behalf of the failover-candidates. The slots must only advance once the failover-candidate subscriber replays changes from the active subscriber.

Sequences

In-core logical replication doesn't replicate sequence advances in a consistent manner right now. We'd have to decode sequence advance records from WAL and ensure the replicas' sequences are advanced too. It's OK if they jump ahead, like they do after a crash of the primary, so long as they're never behind.

Large transaction lag and synchronous replication

Logical replication only starts applying a txn on the subscriber once the provider side commits, so large txns can cause latency spikes in apply. This can result in much longer waits for synchronous commits in logical replication based HA.

Logical replication transparent drop-in replacement missing pieces

These aren't HA-specific but are present limitations in logical rep that would stop some or many users from switching easily from physical rep (which they presently use for HA) to logical rep.

DDL replication

To allow users of physical replication to seamlessly switch to logical replication we need a comprehensive solution to transparently replicating schema changes, including graceful handling of global objects (roles, etc).

Large objects

Logical replication does not support replication of large objects (pg_largeobject, lo_create, etc.), so users of large objects cannot benefit from logical replication and could not use logical-replication-based failover.

Performance

In some cases logical replication performs a lot better than physical replication, especially where network bandwidth is a major constraint and/or the database is very b-tree index heavy. Physical replication is bulky on the wire and applying index updates can be quite expensive in blocking read I/O for the startup process executing redo.

In other cases logical replication is a lot slower and won't be a suitable replacement for physical replication for failover purposes. Particularly where there is high concurrency on the provider. Any large increase in replication latency is very important for failover viability. There's ongoing work on streaming logical decoding, parallelized logical decoding, and parallel logical apply that will eventually help with this, but it's complex and it's hard to avoid deadlock-related performance issues.