Failover slots

Failover slots (unsuccessful feature proposal)

Failover slots were a proposed feature for PostgreSQL 9.6. The feature proposal has been dropped. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.

The wiki page Logical replication and physical standby failover discusses the current state of physical failover support for logical replication upstream and downstream postgres instances, and the various tooling-based strategies that can make it possible.

Partially committed

Some of the functionality underlying the failover slots and logical decoding on standby patch sets did get committed, including:

  • catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10) (list discussion) This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.
  • timeline following for logical slots (PostgreSQL 10) (CF entry, prior CF entry). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.
  • Cleanup slots during drop database (PostgreSQL 10). Required to allow a standby to replay `DROP DATABASE` for a database where logical slots exist for that database on the standby.

This is enough to allow external tools to roll their own "failover slots" by syncing slot state from the primary to standby(s), though it's a bit delicate to do correctly. It's necessary to write a C extension that creates replication slots using the low-level slot management APIs, since no SQL-visible functions exist to do so. An example of the use of those APIs can be found in the original failover slots patch, in src/test/modules.

Implementing replication slot failover with tooling

With the above patches in PostgreSQL 10 it's now possible to implement failover management for PostgreSQL logical replication slots in external tooling.

Standbys must be configured with the following (an example appears after this list):

  • hot_standby_feedback = on
  • A primary_slot_name to use a physical replication slot on the primary
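
For PostgreSQL 10 and 11, for example, that configuration would look roughly like the following. Connection details and the slot name are placeholders; on these versions primary_slot_name and standby_mode live in recovery.conf rather than postgresql.conf.

```
# postgresql.conf on the failover-candidate standby
hot_standby = on
hot_standby_feedback = on

# recovery.conf on the failover-candidate standby
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'
primary_slot_name = 'standby1_slot'
```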

The tool will need to provide an extension on each failover-candidate standby that provides a means of managing low-level replication slot state, since there is no SQL interface for this in PostgreSQL at the time of writing. Exactly how this is done, whether it uses a push or pull model, and so on, is up to the tool. A very simplistic and minimal example can be found in the patch attached to this mail, in src/test/modules/test_slot_timelines. (A tool should not simply copy `pg_replslot/*/state` files from primary to standby: they will not be re-read by the standby if updated while the server is running, and they could be overwritten with stale contents from shared memory.)

To manage failover, the tool should periodically scan the primary's slots. For each logical replication slot the tool wishes to preserve for failover, it should create or update an identical logical replication slot on every failover-candidate standby. The tool must check that the standby has replayed up to the confirmed_flush_lsn of a slot, and delay syncing that slot if it has not. When syncing slots, the restart_lsn, confirmed_flush_lsn and catalog_xmin of the standby's copy of a slot must all be updated and persisted together. The tool should also delete slots from the standby when they cease to exist on the primary.
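
A rough sketch of one pass of such a sync loop is shown below, in Python with psycopg2. The failover_slot_sync() function is a hypothetical SQL wrapper that the standby-side extension would need to provide around the low-level slot APIs; no such function exists in core PostgreSQL. Error handling and removal of vanished slots are omitted.

```python
# One sync pass: copy logical slot state from primary to a standby.
# Assumes psycopg2 and a hypothetical failover_slot_sync() extension
# function on the standby.
import psycopg2

def sync_slots_once(primary_dsn, standby_dsn):
    pri = psycopg2.connect(primary_dsn)
    stb = psycopg2.connect(standby_dsn)
    pri.autocommit = True
    stb.autocommit = True
    try:
        # Snapshot the logical slots on the primary for this database.
        with pri.cursor() as cur:
            cur.execute("""
                SELECT slot_name, plugin, restart_lsn,
                       confirmed_flush_lsn, catalog_xmin
                FROM pg_replication_slots
                WHERE slot_type = 'logical'
                  AND database = current_database()
            """)
            slots = cur.fetchall()

        with stb.cursor() as cur:
            # Only sync a slot once the standby has replayed past its
            # confirmed_flush_lsn; otherwise defer it to a later pass.
            cur.execute("SELECT pg_last_wal_replay_lsn()")
            replayed = cur.fetchone()[0]

            for name, plugin, restart_lsn, confirmed_flush, catalog_xmin in slots:
                cur.execute("SELECT %s::pg_lsn <= %s::pg_lsn",
                            (confirmed_flush, replayed))
                if not cur.fetchone()[0]:
                    continue
                # Hypothetical wrapper around the low-level slot APIs; it must
                # update restart_lsn, confirmed_flush_lsn and catalog_xmin
                # together and persist the slot in one call.
                cur.execute(
                    "SELECT failover_slot_sync(%s, %s, %s, %s, %s)",
                    (name, plugin, restart_lsn, confirmed_flush, catalog_xmin))
    finally:
        pri.close()
        stb.close()
```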

Limitations and caveats

WARNING: See Logical replication and physical standby failover for the significant challenges surrounding this approach. It's not easy to get right.

It's only safe to use any given logical replication slot on a standby after promotion once the catalog_xmin of the standby's physical slot on the primary is <= the catalog_xmin of the logical slot. Until that point, any such slots are unsafe to use; they may appear to work, but can produce incomplete or incorrect output or crash the walsender. I recommend creating them with a temporary name like "_sync_temp1", then renaming them (create a new slot and drop the temporary one) once the catalog_xmin is known to be safe. You can use the txid_status() function to help with this, or just watch the physical slot's catalog_xmin on the primary.
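
As an illustration, a check along these lines could be scripted as follows, assuming psycopg2 connections to the primary and the standby. The slot names are placeholders, and the naive bigint comparison ignores xid wraparound; it is only meant to show the idea.

```python
# Returns True once the standby's physical slot on the primary protects
# catalogs at least as far back as the logical slot on the standby needs.
def logical_slot_is_safe(primary_conn, standby_conn,
                         physical_slot_name, logical_slot_name):
    with primary_conn.cursor() as cur:
        cur.execute(
            "SELECT catalog_xmin::text::bigint FROM pg_replication_slots "
            "WHERE slot_name = %s", (physical_slot_name,))
        row = cur.fetchone()
    physical_xmin = row[0] if row else None

    with standby_conn.cursor() as cur:
        cur.execute(
            "SELECT catalog_xmin::text::bigint FROM pg_replication_slots "
            "WHERE slot_name = %s", (logical_slot_name,))
        row = cur.fetchone()
    logical_xmin = row[0] if row else None

    # Naive comparison; a real tool must account for xid wraparound.
    return (physical_xmin is not None and logical_xmin is not None
            and physical_xmin <= logical_xmin)
```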

Even with this approach, a logical subscriber may receive and apply a transaction from the primary before the physical replica does. A failover may then cause the physical replica to be promoted without that transaction, so the provider and subscriber now differ. Addressing this would require a core code change to teach the walsender to delay sending logical commits until they have been confirmed by all failover-candidate physical replicas; a patch for this would be welcomed. Individual output plugins can work around it in the meantime by sleeping in their commit callback until all slots configured as replicas have flushed past the LSN of the commit being processed. The output plugin has to provide its own means of configuring which slots/connections represent replicas - it does not make sense to overload synchronous_standby_names for this, and you want to use slot names rather than standby connection names anyway.

The primary must preserve the physical replication slot for the standby. If the standby's slot is dropped and re-created, it becomes unsafe to fail over to the standby and use any logical slots on it until they are resynced. There is no simple way for tooling to detect whether the standby's slot on the primary was dropped and re-created.

Unfortunately there are no C-level hook functions in the replication slot management code for tools to use to trigger wakeups, syncs or checks. Polling is required.

Information on original failover slots proposal

The following is older content preserved to aid in understanding the context of the topic.

Rationale

We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from an HA Pg install into a message queue / ETL system / whatever using logical decoding, and you can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover, but you have no way to get from your before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.

This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.

Limitations

Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).

Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.

We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".

Patch notes

Additional explanation to accompany the patch submission.

Timeline following for logical decoding

This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover, because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series, this was the most complex.

I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)

This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.

It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.

BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.

Failover slots

Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.

The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.

For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.

To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.


As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure they copy and archive enough WAL to keep failover slots functional, and it's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead, since that's what the server uses as the start of redo when you begin archive recovery, so it must be a checkpoint's REDO pointer. On the other hand, pg_basebackup gets the information it needs about minimum WAL from the return value of the BASE_BACKUP command, which I changed to the minimum of the WAL needed for redo and the WAL needed by failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases, if you restore from a backup you'll do archive replay anyway, so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup. I'd rather solve this 100% though, not ask users to hope they don't hit this window - especially since slots might be used for bursty or delayed replication that widens the window.

Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way, ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there may be a risk that the replica could remove WAL the slot still needs before the slot is made persistent and replicated. If you promote during this window, your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen, but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies.


User interface for failover slots

Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.

Does not add a --failover option to pg_recvlogical's or pg_receivexlog's --create-slot option. Do reviewers think one is needed?