Failover slots

From PostgreSQL wiki


Failover slots are a proposed feature for PostgreSQL 9.6.

We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without them you can't reliably stream data from an HA Pg install into a message queue, ETL system or similar using logical decoding, and you can't have physical HA for the nodes in a logical replication deployment, multi-master mesh cluster, etc. Currently you have to create a new slot after failover, but you have no way to get from your before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.

This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.


Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).

Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.

We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".

Patch notes

Additional explanation to accompany the patch submission.

Timeline following for logical decoding

This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While this was the smallest part of the patch series, it was the most complex.

I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)

This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 9.4/9.5/9.6 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be done eagerly during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.
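To make the last point concrete, here is a minimal sketch of looking up the timeline that owns a given LSN from timeline history alone. It is an illustration in Python, not code from the patch: the function names and the sample history contents are made up, and it assumes the usual history file layout of one "parent timeline, switchpoint, reason" line per ancestor timeline.

```python
def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history(current_tli, history_text):
    """Return [(tli, begin, end)], oldest first; end is None for the current timeline."""
    entries = []
    begin = 0
    for line in history_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 2)
        tli = int(fields[0])
        switchpoint = parse_lsn(fields[1])
        # Each ancestor timeline is valid from the previous switchpoint
        # up to (but not including) its own switchpoint.
        entries.append((tli, begin, switchpoint))
        begin = switchpoint
    entries.append((current_tli, begin, None))
    return entries


def timeline_of_lsn(entries, lsn):
    """Find the timeline whose validity range contains lsn."""
    for tli, begin, end in entries:
        if begin <= lsn and (end is None or lsn < end):
            return tli
    raise ValueError("LSN is not covered by this timeline history")


# Hypothetical history for a server currently on timeline 3.
HISTORY_FOR_TLI_3 = (
    "1\t0/3000000\tno recovery target specified\n"
    "2\t0/5000000\tno recovery target specified\n"
)

entries = parse_history(3, HISTORY_FOR_TLI_3)
restart_lsn = parse_lsn("0/4000028")
print(timeline_of_lsn(entries, restart_lsn))  # prints 2
```

Because the history already maps every LSN to exactly one timeline, a separately stored restart timeline adds no information that restart_lsn doesn't already give.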

It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.

BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.

Failover slots

Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.

The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies, so slots are effectively dropped on backup, but only for the backup copy. For other backup methods like rsync, disk snapshots, etc, pg_replslot will tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots, depending on the method used. If slots were retained they'll hold the minimum LSN down and prevent WAL removal on the replica, since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.

For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.

To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.
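As a rough illustration of the effect, the toy model below filters a copied pg_replslot down to failover slots and shows how that changes the oldest WAL position the replica has to keep. It is written in Python for brevity; the Slot class, the function names and the LSN values are all invented for the example and do not correspond to server code.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    failover: bool
    restart_lsn: int  # oldest WAL position the slot still needs


def drop_non_failover_slots(slots):
    """Keep only failover slots, as archive recovery startup would."""
    return [slot for slot in slots if slot.failover]


def wal_retention_floor(slots):
    """Oldest LSN any slot still pins, or None if there are no slots."""
    return min((slot.restart_lsn for slot in slots), default=None)


# Slots as they might arrive in a backup taken with rsync or a snapshot.
copied_slots = [
    Slot("stale_from_rsync", failover=False, restart_lsn=0x1000000),
    Slot("mq_feed", failover=True, restart_lsn=0x4000000),
]

kept_slots = drop_non_failover_slots(copied_slots)
print([slot.name for slot in kept_slots])    # prints ['mq_feed']
print(hex(wal_retention_floor(kept_slots)))  # prints 0x4000000
```

Without the drop, the stale copy would pin WAL from 0x1000000 onward forever, since nothing on the replica ever advances it.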

As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead, since that's what the server uses as the start of redo when you begin archive recovery, so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value of the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo and the WAL needed for failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases, if you restore from a backup you'll do archive replay anyway, so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.
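The min() calculation can be sketched as follows. This is only an illustration in Python with made-up names and LSN values; it is not the BASE_BACKUP implementation, just the arithmetic described above.

```python
def format_lsn(lsn):
    """Render a 64-bit LSN in the usual 'hi/lo' hex notation."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def minimum_wal_for_backup(redo_lsn, failover_restart_lsns):
    """Oldest WAL position the backup must include.

    redo_lsn is the backup checkpoint's REDO pointer; failover_restart_lsns
    holds the restart_lsn of every failover slot copied into the backup.
    """
    return min([redo_lsn, *failover_restart_lsns])


redo_lsn = 0x5000028                     # hypothetical checkpoint REDO pointer
slot_restart_lsns = [0x3000060, 0x6000000]  # hypothetical failover slots

print(format_lsn(minimum_wal_for_backup(redo_lsn, slot_restart_lsns)))  # prints 0/3000060
print(format_lsn(minimum_wal_for_backup(redo_lsn, [])))                 # prints 0/5000028
```

With no failover slots the result is just the REDO pointer, so backups of servers that don't use the feature are unaffected.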

Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies.
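The intended behaviour can be modelled with a small toy in Python. The class, the record names ("slot_create", "slot_drop") and the slot names are hypothetical and chosen only for illustration; they are not the WAL record types used by the patch.

```python
class Master:
    """Toy master that logs slot changes only once a slot is persistent."""

    def __init__(self):
        self.slots = {}  # name -> "ephemeral" or "persistent"
        self.wal = []    # slot-related records a replica would replay

    def create_slot(self, name):
        # Ephemeral slots exist only on the master; nothing is logged yet.
        self.slots[name] = "ephemeral"

    def persist_slot(self, name):
        # Creation becomes visible to replicas at this point.
        self.slots[name] = "persistent"
        self.wal.append(("slot_create", name))

    def drop_slot(self, name):
        state = self.slots.pop(name)
        if state == "persistent":
            self.wal.append(("slot_drop", name))


def replay(wal):
    """Apply slot records on a replica and return the slots it ends up with."""
    slots = set()
    for kind, name in wal:
        if kind == "slot_create":
            slots.add(name)
        elif kind == "slot_drop":
            slots.discard(name)
    return slots


master = Master()
master.create_slot("failed_setup")  # decoding setup fails before persisting
master.drop_slot("failed_setup")
master.create_slot("mq_feed")
master.persist_slot("mq_feed")

print(replay(master.wal))  # prints {'mq_feed'}
```

The window of concern is the time between create_slot and persist_slot: during it the replica knows nothing about the slot and so has no reason to hold on to the WAL the slot will need.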

User interface for failover slots

Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.

Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?
