https://wiki.postgresql.org/api.php?action=feedcontributions&user=Ringerc&feedformat=atomPostgreSQL wiki - User contributions [en]2024-03-29T08:51:17ZUser contributionsMediaWiki 1.35.13https://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35571Logical replication and physical standby failover2020-12-09T01:46:24Z<p>Ringerc: /* Why can't we just re-create slots after failover? */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
'''Problems'''<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
''' Problems '''<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
''' Problems '''<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
''' Problems '''<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
''' Problems '''<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
No sensible ones.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
''' Problems '''<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
''' Tooling workarounds '''<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
''' Problems '''<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br /><br /> This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br /><br /> The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
= Why can't we just re-create slots after failover? =<br />
<br />
All this would be much simpler if we could just use the subscriber's <code>pg_catalog.pg_replication_origin_status</code> state to re-create logical replication slots on the publisher. But this isn't possible to do safely.<br />
<br />
== Re-creating slot with original state ==<br />
<br />
But that's not possible to do safely. The replication origin only tracks the <code>remote_lsn</code> which corresponds to the upstream slot's <code>confirmed_flush_lsn</code>. It doesn't track the upstream <code>catalog_xmin</code> or <code>restart_lsn</code>. These are necessary to create a logical replication slot and cannot be simply derived from the <code>confirmed_flush_lsn</code>.<br />
<br />
Even if we could track the full slot state, a replication slot's <code>catalog_xmin</code> must remain in effect at all times in order to prevent (auto)vacuum from removing catalog and user-catalog tuples that would've been visible to any transaction that is still pending processing on a logical slot. If these tuples might've been vacuumed away we could ERROR out during logical decoding, produce incorrect results, or crash.<br />
<br />
While we could in theory re-create the slot on the promoted standby just before promoting it, assuming we had a way to record the restart_lsn and catalog_xmin, this won't guarantee that the catalog_xmin is valid. We don't track the oldest safe catalog_xmin in the catalogs or control file (see the logical decoding on standby -hackers thread) so we can't know for sure if it's safe. And the publisher might've advanced the slot and vacuumed some changes away, which the standby might then replay before promotion.<br />
<br />
Additionally the standby might've removed needed pg_wal, and we don't support logical decoding of WAL via restore_command.<br />
<br />
== Creating a new slot instead ==<br />
<br />
Instead of trying to restore the slot state after failover, you might try to make a new logical slot with the same name, and resume replaying from it.<br />
<br />
This won't work either. The new slot will have a <code>confirmed_flush_lsn</code> at some point after the point of promotion. If the subscriber requests replay from some LSN prior to that, the publisher will silently start sending changes from the slot's <code>confirmed_flush_lsn</code> instead. See the details given in the discussion earlier in this article.<br />
<br />
= All-logical-replication HA =<br />
<br />
There's an argument in the postgres community that we shouldn't invest time and effort in making the mature (but limited) physical replication support interoperate well with logical replication for HA and failover. Instead we should spend that effort on improving logical replication enough to make it an equivalent and transparent replacement for physical replication and physical replica promotion.<br />
<br />
== Logical replication HA missing pieces ==<br />
<br />
The following points are some of the issues that would need to be addressed if we want logical replication to fully replace physical replication for failover and HA.<br />
<br />
=== slots on failover-candidate replicas of publisher ===<br />
<br />
If a logical subscriber is "promoted" to replace its publisher, all other subscribers of the old publisher are broken. They have no way to consistently replay any transactions committed on the old-publisher before the promotion event because the old-publisher LSNs make no sense on the new-publisher, and the old-publisher slots won't exist on the new-publisher.<br />
<br />
It's not possible to just create a new logical slot on the new-publisher after promotion because slots cannot replay changes from the past or be rewound. They're forward-only.<br />
<br />
==== maintaining slots ====<br />
<br />
We'd need to have failover-candidate subscribers keep track of slots on the publisher: create new slots when they're created on the publisher, drop them when they're dropped on the publisher, retain resources on the subscriber until all publisher slots no longer require those resources, and advance them when they're advanced on the publisher.<br />
<br />
This can't really be done just by pulling state from the subscriber because then there's an unpredictable window where new slots on the publisher won't exist on failover to the subscriber. So we'd need some replication slot hooks and the ability for the publisher to be aware of its failover-candidate subscribers.<br />
<br />
==== (node,lsn) mapping ====<br />
<br />
We need failover-candidate subscribers can to be able to advance their local slots for peers of the provider in response to publisher slot advances to release resources, and so that subscribers of the old-publisher can replay changes consistently from the old-subscriber from the new-subscriber at the correct start-point.<br />
<br />
So we could <code>START_REPLICATION LOGICAL SLOT "foo" PUBLISHER "pub_id" LSN "XX/YY"</code> and the promoted subscriber could map lsn <code>XX/YY</code> on node <code>pub_id</code> to its local LSNs.<br />
<br />
We'd need something like a persistent lsn mapping of publisher to subscriber txns and some kind of node-id scheme. <br />
<br />
Or some other means of consistent replication progress tracking that's tolerant of the publisher and subscriber having totally different LSNs, like the timeline IDs we have for physical replication but preferably without their limitations and hazards.<br />
<br />
==== Flush confirmations and ordering ====<br />
<br />
Same issue as for physical replica primary failover, physical-before-logical-ordering.<br />
<br />
It's necessary to ensure that if a standby is promoted to replace a failed publisher, the furthest-ahead standby is promoted. Any other subscribers that are further ahead would otherwise have txns that the promoted subscriber wouldn't have, resulting in divergence.<br />
<br />
If both failover-candidate subscribers and other subscribers / logical replication consumers exist, failover-candidate subscribers must confirm flush of new txns on the publisher before a commit can be sent to any other consumers. Otherwise, on failover, the other consumers would have txns from the old-publisher that the promoted replacement publisher would not have, resulting in divergence.<br />
<br />
=== replication origins on failover-candidates of subscribers ===<br />
<br />
There are challenges for promoting a cascaded subscriber to replace a failed subscriber node too.<br />
<br />
==== Maintaining replication origins ====<br />
<br />
When a subscriber advances its replication origin on a publisher, that information needs to be able to be reported to cascaded subscribers so that they can keep track of their effective replay position on the publisher. That way if the cascaded subscriber is promoted to replay directly from the publisher after the old subscriber fails, the promoted new subscriber knows which LSN to request from the publisher when starting replay.<br />
<br />
Maintaining replication origins for the publisher on the subscriber's replicas at the right value shouldn't be too hard. We already report the true-origin upstream lsns in the logical protocol. This breaks down in cascades though. If we have:<br />
<br />
P -> A -> B -> C<br />
<br />
and want to promote C to replace a failed B, resulting in<br />
<br />
P -> A -> C<br />
<br />
[x] B<br />
<br />
we need to be able to keep track of the intermediate-upstream lsn of A on C, not just the true-origin-of-commit lsn of P.<br />
<br />
This is not an issue for physical rep because there's only one LSN sequence shared by all nodes.<br />
<br />
=== Flush confirmations and ordering ===<br />
<br />
Same issue as for physical replica standby failover, physical-before-logical-ordering.<br />
<br />
Much like for physical replication, the subscriber must hold down the flush lsn it reports to the publisher to the oldest value confirmed as flushed by all failover-candidate cascaded subscribers. Otherwise if a failover-candidate for a subscriber is promoted, the publisher might've advanced the slot's confirmed_flush_lsn and will then fail to (re)send some txns to the promoted subscriber.<br />
<br />
Alternately, each failover-candidate subscriber must maintain its own slot on the publisher, or have the active subscriber or the publisher maintain those slots on behalf of the failover-candidates. The slots must only advance once the failover-candidate subscriber replays changes from the active subscriber.<br />
<br />
=== Sequences ===<br />
<br />
In-core logical replication doesn't replicate sequence advances in a consistent manner right now. We'd have to decode sequence advance records from WAL and ensure the replicas' sequences are advanced too. It's OK if they jump ahead, like they do after a crash of the primary, so long as they're never behind.<br />
<br />
=== Large transaction lag and synchronous replication ===<br />
<br />
Logical replication only starts applying a txn on the subscriber once the provider side commits, so large txns can cause latency spikes in apply. This can result in much longer waits for synchronous commits in logical replication based HA.<br />
<br />
== Logical replication transparent drop-in replacement missing pieces ==<br />
<br />
These aren't HA-specific but are present limitations in logical rep that would stop some or many users from switching easily from physical rep (which they presently use for HA) to logical rep.<br />
<br />
=== DDL replication ===<br />
<br />
To allow users of physical replication to seamlessly switch to logical replication we need a comprehensive solution to transparently replicating schema changes, including graceful handling of global objects (roles, etc).<br />
<br />
=== Large objects ===<br />
<br />
Logical replication does not support replication of large objects (`pg_largeobject`, `lo_create`, etc) so users of this feature cannot benefit from logical replication and could not use logical replication based failover.<br />
<br />
=== Performance ===<br />
<br />
In some cases logical replication performs a lot better than physical replication, especially where network bandwidth is a major constraint and/or the database is very b-tree index heavy. Physical replication is bulky on the wire and applying index updates can be quite expensive in blocking read I/O for the startup process executing redo.<br />
<br />
In other cases logical replication is a lot slower and won't be a suitable replacement for physical replication for failover purposes. Particularly where there is high concurrency on the provider. Any large increase in replication latency is very important for failover viability. There's ongoing work on streaming logical decoding, parallelized logical decoding, and parallel logical apply that will eventually help with this, but it's complex and it's hard to avoid deadlock-related performance issues.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35570Logical replication and physical standby failover2020-12-09T01:46:00Z<p>Ringerc: /* Problem statement */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
'''Problems'''<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
''' Problems '''<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
''' Problems '''<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
''' Problems '''<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
''' Problems '''<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
No sensible ones.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
''' Problems '''<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
''' Tooling workarounds '''<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
''' Problems '''<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br /><br /> This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br /><br /> The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
= Why can't we just re-create slots after failover? =<br />
<br />
All this would be much simpler if we could just use the subscriber's <code>pg_catalog.pg_replication_origin_status</code> state to re-create logical replication slots on the publisher. But this isn't possible to do safely.<br />
<br />
== Re-creating slot with original state<br />
<br />
But that's not possible to do safely. The replication origin only tracks the <code>remote_lsn</code> which corresponds to the upstream slot's <code>confirmed_flush_lsn</code>. It doesn't track the upstream <code>catalog_xmin</code> or <code>restart_lsn</code>. These are necessary to create a logical replication slot and cannot be simply derived from the <code>confirmed_flush_lsn</code>.<br />
<br />
Even if we could track the full slot state, a replication slot's <code>catalog_xmin</code> must remain in effect at all times in order to prevent (auto)vacuum from removing catalog and user-catalog tuples that would've been visible to any transaction that is still pending processing on a logical slot. If these tuples might've been vacuumed away we could ERROR out during logical decoding, produce incorrect results, or crash.<br />
<br />
While we could in theory re-create the slot on the promoted standby just before promoting it, assuming we had a way to record the restart_lsn and catalog_xmin, this won't guarantee that the catalog_xmin is valid. We don't track the oldest safe catalog_xmin in the catalogs or control file (see the logical decoding on standby -hackers thread) so we can't know for sure if it's safe. And the publisher might've advanced the slot and vacuumed some changes away, which the standby might then replay before promotion.<br />
<br />
Additionally the standby might've removed needed pg_wal, and we don't support logical decoding of WAL via restore_command.<br />
<br />
== Creating a new slot instead ==<br />
<br />
Instead of trying to restore the slot state after failover, you might try to make a new logical slot with the same name, and resume replaying from it.<br />
<br />
This won't work either. The new slot will have a <code>confirmed_flush_lsn</code> at some point after the point of promotion. If the subscriber requests replay from some LSN prior to that, the publisher will silently start sending changes from the slot's <code>confirmed_flush_lsn</code> instead. See the details given in the discussion earlier in this article.<br />
<br />
= All-logical-replication HA =<br />
<br />
There's an argument in the postgres community that we shouldn't invest time and effort in making the mature (but limited) physical replication support interoperate well with logical replication for HA and failover. Instead we should spend that effort on improving logical replication enough to make it an equivalent and transparent replacement for physical replication and physical replica promotion.<br />
<br />
== Logical replication HA missing pieces ==<br />
<br />
The following points are some of the issues that would need to be addressed if we want logical replication to fully replace physical replication for failover and HA.<br />
<br />
=== slots on failover-candidate replicas of publisher ===<br />
<br />
If a logical subscriber is "promoted" to replace its publisher, all other subscribers of the old publisher are broken. They have no way to consistently replay any transactions committed on the old-publisher before the promotion event because the old-publisher LSNs make no sense on the new-publisher, and the old-publisher slots won't exist on the new-publisher.<br />
<br />
It's not possible to just create a new logical slot on the new-publisher after promotion because slots cannot replay changes from the past or be rewound. They're forward-only.<br />
<br />
==== maintaining slots ====<br />
<br />
We'd need to have failover-candidate subscribers keep track of slots on the publisher: create new slots when they're created on the publisher, drop them when they're dropped on the publisher, retain resources on the subscriber until all publisher slots no longer require those resources, and advance them when they're advanced on the publisher.<br />
<br />
This can't really be done just by pulling state from the subscriber because then there's an unpredictable window where new slots on the publisher won't exist on failover to the subscriber. So we'd need some replication slot hooks and the ability for the publisher to be aware of its failover-candidate subscribers.<br />
<br />
==== (node,lsn) mapping ====<br />
<br />
We need failover-candidate subscribers can to be able to advance their local slots for peers of the provider in response to publisher slot advances to release resources, and so that subscribers of the old-publisher can replay changes consistently from the old-subscriber from the new-subscriber at the correct start-point.<br />
<br />
So we could <code>START_REPLICATION LOGICAL SLOT "foo" PUBLISHER "pub_id" LSN "XX/YY"</code> and the promoted subscriber could map lsn <code>XX/YY</code> on node <code>pub_id</code> to its local LSNs.<br />
<br />
We'd need something like a persistent lsn mapping of publisher to subscriber txns and some kind of node-id scheme. <br />
<br />
Or some other means of consistent replication progress tracking that's tolerant of the publisher and subscriber having totally different LSNs, like the timeline IDs we have for physical replication but preferably without their limitations and hazards.<br />
<br />
==== Flush confirmations and ordering ====<br />
<br />
Same issue as for physical replica primary failover, physical-before-logical-ordering.<br />
<br />
It's necessary to ensure that if a standby is promoted to replace a failed publisher, the furthest-ahead standby is promoted. Any other subscribers that are further ahead would otherwise have txns that the promoted subscriber wouldn't have, resulting in divergence.<br />
<br />
If both failover-candidate subscribers and other subscribers / logical replication consumers exist, failover-candidate subscribers must confirm flush of new txns on the publisher before a commit can be sent to any other consumers. Otherwise, on failover, the other consumers would have txns from the old-publisher that the promoted replacement publisher would not have, resulting in divergence.<br />
<br />
=== replication origins on failover-candidates of subscribers ===<br />
<br />
There are challenges for promoting a cascaded subscriber to replace a failed subscriber node too.<br />
<br />
==== Maintaining replication origins ====<br />
<br />
When a subscriber advances its replication origin on a publisher, that information needs to be able to be reported to cascaded subscribers so that they can keep track of their effective replay position on the publisher. That way if the cascaded subscriber is promoted to replay directly from the publisher after the old subscriber fails, the promoted new subscriber knows which LSN to request from the publisher when starting replay.<br />
<br />
Maintaining replication origins for the publisher on the subscriber's replicas at the right value shouldn't be too hard. We already report the true-origin upstream lsns in the logical protocol. This breaks down in cascades though. If we have:<br />
<br />
P -> A -> B -> C<br />
<br />
and want to promote C to replace a failed B, resulting in<br />
<br />
P -> A -> C<br />
<br />
[x] B<br />
<br />
we need to be able to keep track of the intermediate-upstream lsn of A on C, not just the true-origin-of-commit lsn of P.<br />
<br />
This is not an issue for physical rep because there's only one LSN sequence shared by all nodes.<br />
<br />
=== Flush confirmations and ordering ===<br />
<br />
Same issue as for physical replica standby failover, physical-before-logical-ordering.<br />
<br />
Much like for physical replication, the subscriber must hold down the flush lsn it reports to the publisher to the oldest value confirmed as flushed by all failover-candidate cascaded subscribers. Otherwise if a failover-candidate for a subscriber is promoted, the publisher might've advanced the slot's confirmed_flush_lsn and will then fail to (re)send some txns to the promoted subscriber.<br />
<br />
Alternately, each failover-candidate subscriber must maintain its own slot on the publisher, or have the active subscriber or the publisher maintain those slots on behalf of the failover-candidates. The slots must only advance once the failover-candidate subscriber replays changes from the active subscriber.<br />
<br />
=== Sequences ===<br />
<br />
In-core logical replication doesn't replicate sequence advances in a consistent manner right now. We'd have to decode sequence advance records from WAL and ensure the replicas' sequences are advanced too. It's OK if they jump ahead, like they do after a crash of the primary, so long as they're never behind.<br />
<br />
=== Large transaction lag and synchronous replication ===<br />
<br />
Logical replication only starts applying a txn on the subscriber once the provider side commits, so large txns can cause latency spikes in apply. This can result in much longer waits for synchronous commits in logical replication based HA.<br />
<br />
== Logical replication transparent drop-in replacement missing pieces ==<br />
<br />
These aren't HA-specific but are present limitations in logical rep that would stop some or many users from switching easily from physical rep (which they presently use for HA) to logical rep.<br />
<br />
=== DDL replication ===<br />
<br />
To allow users of physical replication to seamlessly switch to logical replication we need a comprehensive solution to transparently replicating schema changes, including graceful handling of global objects (roles, etc).<br />
<br />
=== Large objects ===<br />
<br />
Logical replication does not support replication of large objects (`pg_largeobject`, `lo_create`, etc) so users of this feature cannot benefit from logical replication and could not use logical replication based failover.<br />
<br />
=== Performance ===<br />
<br />
In some cases logical replication performs a lot better than physical replication, especially where network bandwidth is a major constraint and/or the database is very b-tree index heavy. Physical replication is bulky on the wire and applying index updates can be quite expensive in blocking read I/O for the startup process executing redo.<br />
<br />
In other cases logical replication is a lot slower and won't be a suitable replacement for physical replication for failover purposes. Particularly where there is high concurrency on the provider. Any large increase in replication latency is very important for failover viability. There's ongoing work on streaming logical decoding, parallelized logical decoding, and parallel logical apply that will eventually help with this, but it's complex and it's hard to avoid deadlock-related performance issues.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35557Logical replication and physical standby failover2020-11-30T04:41:28Z<p>Ringerc: /* replication origins on failover-candidates of subscribers */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
'''Problems'''<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
''' Problems '''<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
''' Problems '''<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
''' Problems '''<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
''' Problems '''<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
No sensible ones.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
''' Problems '''<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
''' Tooling workarounds '''<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
''' Problems '''<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br /><br /> This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br /><br /> The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
= All-logical-replication HA =<br />
<br />
There's an argument in the postgres community that we shouldn't invest time and effort in making the mature (but limited) physical replication support interoperate well with logical replication for HA and failover. Instead we should spend that effort on improving logical replication enough to make it an equivalent and transparent replacement for physical replication and physical replica promotion.<br />
<br />
== Logical replication HA missing pieces ==<br />
<br />
The following points are some of the issues that would need to be addressed if we want logical replication to fully replace physical replication for failover and HA.<br />
<br />
=== slots on failover-candidate replicas of publisher ===<br />
<br />
If a logical subscriber is "promoted" to replace its publisher, all other subscribers of the old publisher are broken. They have no way to consistently replay any transactions committed on the old-publisher before the promotion event because the old-publisher LSNs make no sense on the new-publisher, and the old-publisher slots won't exist on the new-publisher.<br />
<br />
It's not possible to just create a new logical slot on the new-publisher after promotion because slots cannot replay changes from the past or be rewound. They're forward-only.<br />
<br />
==== maintaining slots ====<br />
<br />
We'd need to have failover-candidate subscribers keep track of slots on the publisher: create new slots when they're created on the publisher, drop them when they're dropped on the publisher, retain resources on the subscriber until all publisher slots no longer require those resources, and advance them when they're advanced on the publisher.<br />
<br />
This can't really be done just by pulling state from the subscriber because then there's an unpredictable window where new slots on the publisher won't exist on failover to the subscriber. So we'd need some replication slot hooks and the ability for the publisher to be aware of its failover-candidate subscribers.<br />
<br />
==== (node,lsn) mapping ====<br />
<br />
We need failover-candidate subscribers can to be able to advance their local slots for peers of the provider in response to publisher slot advances to release resources, and so that subscribers of the old-publisher can replay changes consistently from the old-subscriber from the new-subscriber at the correct start-point.<br />
<br />
So we could <code>START_REPLICATION LOGICAL SLOT "foo" PUBLISHER "pub_id" LSN "XX/YY"</code> and the promoted subscriber could map lsn <code>XX/YY</code> on node <code>pub_id</code> to its local LSNs.<br />
<br />
We'd need something like a persistent lsn mapping of publisher to subscriber txns and some kind of node-id scheme. <br />
<br />
Or some other means of consistent replication progress tracking that's tolerant of the publisher and subscriber having totally different LSNs, like the timeline IDs we have for physical replication but preferably without their limitations and hazards.<br />
<br />
==== Flush confirmations and ordering ====<br />
<br />
Same issue as for physical replica primary failover, physical-before-logical-ordering.<br />
<br />
It's necessary to ensure that if a standby is promoted to replace a failed publisher, the furthest-ahead standby is promoted. Any other subscribers that are further ahead would otherwise have txns that the promoted subscriber wouldn't have, resulting in divergence.<br />
<br />
If both failover-candidate subscribers and other subscribers / logical replication consumers exist, failover-candidate subscribers must confirm flush of new txns on the publisher before a commit can be sent to any other consumers. Otherwise, on failover, the other consumers would have txns from the old-publisher that the promoted replacement publisher would not have, resulting in divergence.<br />
<br />
=== replication origins on failover-candidates of subscribers ===<br />
<br />
There are challenges for promoting a cascaded subscriber to replace a failed subscriber node too.<br />
<br />
==== Maintaining replication origins ====<br />
<br />
When a subscriber advances its replication origin on a publisher, that information needs to be able to be reported to cascaded subscribers so that they can keep track of their effective replay position on the publisher. That way if the cascaded subscriber is promoted to replay directly from the publisher after the old subscriber fails, the promoted new subscriber knows which LSN to request from the publisher when starting replay.<br />
<br />
Maintaining replication origins for the publisher on the subscriber's replicas at the right value shouldn't be too hard. We already report the true-origin upstream lsns in the logical protocol. This breaks down in cascades though. If we have:<br />
<br />
P -> A -> B -> C<br />
<br />
and want to promote C to replace a failed B, resulting in<br />
<br />
P -> A -> C<br />
<br />
[x] B<br />
<br />
we need to be able to keep track of the intermediate-upstream lsn of A on C, not just the true-origin-of-commit lsn of P.<br />
<br />
This is not an issue for physical rep because there's only one LSN sequence shared by all nodes.<br />
<br />
=== Flush confirmations and ordering ===<br />
<br />
Same issue as for physical replica standby failover, physical-before-logical-ordering.<br />
<br />
Much like for physical replication, the subscriber must hold down the flush lsn it reports to the publisher to the oldest value confirmed as flushed by all failover-candidate cascaded subscribers. Otherwise if a failover-candidate for a subscriber is promoted, the publisher might've advanced the slot's confirmed_flush_lsn and will then fail to (re)send some txns to the promoted subscriber.<br />
<br />
Alternately, each failover-candidate subscriber must maintain its own slot on the publisher, or have the active subscriber or the publisher maintain those slots on behalf of the failover-candidates. The slots must only advance once the failover-candidate subscriber replays changes from the active subscriber.<br />
<br />
=== Sequences ===<br />
<br />
In-core logical replication doesn't replicate sequence advances in a consistent manner right now. We'd have to decode sequence advance records from WAL and ensure the replicas' sequences are advanced too. It's OK if they jump ahead, like they do after a crash of the primary, so long as they're never behind.<br />
<br />
=== Large transaction lag and synchronous replication ===<br />
<br />
Logical replication only starts applying a txn on the subscriber once the provider side commits, so large txns can cause latency spikes in apply. This can result in much longer waits for synchronous commits in logical replication based HA.<br />
<br />
== Logical replication transparent drop-in replacement missing pieces ==<br />
<br />
These aren't HA-specific but are present limitations in logical rep that would stop some or many users from switching easily from physical rep (which they presently use for HA) to logical rep.<br />
<br />
=== DDL replication ===<br />
<br />
To allow users of physical replication to seamlessly switch to logical replication we need a comprehensive solution to transparently replicating schema changes, including graceful handling of global objects (roles, etc).<br />
<br />
=== Large objects ===<br />
<br />
Logical replication does not support replication of large objects (`pg_largeobject`, `lo_create`, etc) so users of this feature cannot benefit from logical replication and could not use logical replication based failover.<br />
<br />
=== Performance ===<br />
<br />
In some cases logical replication performs a lot better than physical replication, especially where network bandwidth is a major constraint and/or the database is very b-tree index heavy. Physical replication is bulky on the wire and applying index updates can be quite expensive in blocking read I/O for the startup process executing redo.<br />
<br />
In other cases logical replication is a lot slower and won't be a suitable replacement for physical replication for failover purposes. Particularly where there is high concurrency on the provider. Any large increase in replication latency is very important for failover viability. There's ongoing work on streaming logical decoding, parallelized logical decoding, and parallel logical apply that will eventually help with this, but it's complex and it's hard to avoid deadlock-related performance issues.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35556Logical replication and physical standby failover2020-11-30T04:39:02Z<p>Ringerc: /* Flush confirmations and ordering */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
'''Problems'''<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
''' Problems '''<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
''' Problems '''<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
''' Problems '''<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
''' Problems '''<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
No sensible ones.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
''' Problems '''<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
''' Tooling workarounds '''<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
''' Problems '''<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br /><br /> This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br /><br /> The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
= All-logical-replication HA =<br />
<br />
There's an argument in the postgres community that we shouldn't invest time and effort in making the mature (but limited) physical replication support interoperate well with logical replication for HA and failover. Instead we should spend that effort on improving logical replication enough to make it an equivalent and transparent replacement for physical replication and physical replica promotion.<br />
<br />
== Logical replication HA missing pieces ==<br />
<br />
The following points are some of the issues that would need to be addressed if we want logical replication to fully replace physical replication for failover and HA.<br />
<br />
=== slots on failover-candidate replicas of publisher ===<br />
<br />
If a logical subscriber is "promoted" to replace its publisher, all other subscribers of the old publisher are broken. They have no way to consistently replay any transactions committed on the old-publisher before the promotion event because the old-publisher LSNs make no sense on the new-publisher, and the old-publisher slots won't exist on the new-publisher.<br />
<br />
It's not possible to just create a new logical slot on the new-publisher after promotion because slots cannot replay changes from the past or be rewound. They're forward-only.<br />
<br />
==== maintaining slots ====<br />
<br />
We'd need to have failover-candidate subscribers keep track of slots on the publisher: create new slots when they're created on the publisher, drop them when they're dropped on the publisher, retain resources on the subscriber until all publisher slots no longer require those resources, and advance them when they're advanced on the publisher.<br />
<br />
This can't really be done just by pulling state from the subscriber because then there's an unpredictable window where new slots on the publisher won't exist on failover to the subscriber. So we'd need some replication slot hooks and the ability for the publisher to be aware of its failover-candidate subscribers.<br />
<br />
==== (node,lsn) mapping ====<br />
<br />
We need failover-candidate subscribers can to be able to advance their local slots for peers of the provider in response to publisher slot advances to release resources, and so that subscribers of the old-publisher can replay changes consistently from the old-subscriber from the new-subscriber at the correct start-point.<br />
<br />
So we could <code>START_REPLICATION LOGICAL SLOT "foo" PUBLISHER "pub_id" LSN "XX/YY"</code> and the promoted subscriber could map lsn <code>XX/YY</code> on node <code>pub_id</code> to its local LSNs.<br />
<br />
We'd need something like a persistent lsn mapping of publisher to subscriber txns and some kind of node-id scheme. <br />
<br />
Or some other means of consistent replication progress tracking that's tolerant of the publisher and subscriber having totally different LSNs, like the timeline IDs we have for physical replication but preferably without their limitations and hazards.<br />
<br />
==== Flush confirmations and ordering ====<br />
<br />
Same issue as for physical replica primary failover, physical-before-logical-ordering.<br />
<br />
It's necessary to ensure that if a standby is promoted to replace a failed publisher, the furthest-ahead standby is promoted. Any other subscribers that are further ahead would otherwise have txns that the promoted subscriber wouldn't have, resulting in divergence.<br />
<br />
If both failover-candidate subscribers and other subscribers / logical replication consumers exist, failover-candidate subscribers must confirm flush of new txns on the publisher before a commit can be sent to any other consumers. Otherwise, on failover, the other consumers would have txns from the old-publisher that the promoted replacement publisher would not have, resulting in divergence.<br />
<br />
=== replication origins on failover-candidates of subscribers ===<br />
<br />
There are challenges for promoting a cascaded subscriber to replace a failed subscriber node too.<br />
<br />
==== Maintaining replication origins ====<br />
<br />
When a subscriber advances its replication origin on a publisher, that information needs to be able to be reported to cascaded subscribers so that they can keep track of their effective replay position on the publisher. That way if the cascaded subscriber is promoted to replay directly from the publisher after the old subscriber fails, the promoted new subscriber knows which LSN to request from the publisher when starting replay.<br />
<br />
=== Flush confirmations and ordering ===<br />
<br />
Same issue as for physical replica standby failover, physical-before-logical-ordering.<br />
<br />
Much like for physical replication, the subscriber must hold down the flush lsn it reports to the publisher to the oldest value confirmed as flushed by all failover-candidate cascaded subscribers. Otherwise if a failover-candidate for a subscriber is promoted, the publisher might've advanced the slot's confirmed_flush_lsn and will then fail to (re)send some txns to the promoted subscriber.<br />
<br />
Alternately, each failover-candidate subscriber must maintain its own slot on the publisher, or have the active subscriber or the publisher maintain those slots on behalf of the failover-candidates. The slots must only advance once the failover-candidate subscriber replays changes from the active subscriber.<br />
<br />
=== Sequences ===<br />
<br />
In-core logical replication doesn't replicate sequence advances in a consistent manner right now. We'd have to decode sequence advance records from WAL and ensure the replicas' sequences are advanced too. It's OK if they jump ahead, like they do after a crash of the primary, so long as they're never behind.<br />
<br />
=== Large transaction lag and synchronous replication ===<br />
<br />
Logical replication only starts applying a txn on the subscriber once the provider side commits, so large txns can cause latency spikes in apply. This can result in much longer waits for synchronous commits in logical replication based HA.<br />
<br />
== Logical replication transparent drop-in replacement missing pieces ==<br />
<br />
These aren't HA-specific but are present limitations in logical rep that would stop some or many users from switching easily from physical rep (which they presently use for HA) to logical rep.<br />
<br />
=== DDL replication ===<br />
<br />
To allow users of physical replication to seamlessly switch to logical replication we need a comprehensive solution to transparently replicating schema changes, including graceful handling of global objects (roles, etc).<br />
<br />
=== Large objects ===<br />
<br />
Logical replication does not support replication of large objects (`pg_largeobject`, `lo_create`, etc) so users of this feature cannot benefit from logical replication and could not use logical replication based failover.<br />
<br />
=== Performance ===<br />
<br />
In some cases logical replication performs a lot better than physical replication, especially where network bandwidth is a major constraint and/or the database is very b-tree index heavy. Physical replication is bulky on the wire and applying index updates can be quite expensive in blocking read I/O for the startup process executing redo.<br />
<br />
In other cases logical replication is a lot slower and won't be a suitable replacement for physical replication for failover purposes. Particularly where there is high concurrency on the provider. Any large increase in replication latency is very important for failover viability. There's ongoing work on streaming logical decoding, parallelized logical decoding, and parallel logical apply that will eventually help with this, but it's complex and it's hard to avoid deadlock-related performance issues.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35555Logical replication and physical standby failover2020-11-30T04:36:40Z<p>Ringerc: Explain reasons why logical replication based HA isn't a simple replacement for physical replication based HA</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
'''Problems'''<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
''' Problems '''<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
''' Problems '''<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
''' Problems '''<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
''' Problems '''<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
No sensible ones.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
''' Problems '''<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
''' Tooling workarounds '''<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
''' Problems '''<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br /><br /> This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br /><br /> The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
= All-logical-replication HA =<br />
<br />
There's an argument in the postgres community that we shouldn't invest time and effort in making the mature (but limited) physical replication support interoperate well with logical replication for HA and failover. Instead we should spend that effort on improving logical replication enough to make it an equivalent and transparent replacement for physical replication and physical replica promotion.<br />
<br />
== Logical replication HA missing pieces ==<br />
<br />
The following points are some of the issues that would need to be addressed if we want logical replication to fully replace physical replication for failover and HA.<br />
<br />
=== slots on failover-candidate replicas of publisher ===<br />
<br />
If a logical subscriber is "promoted" to replace its publisher, all other subscribers of the old publisher are broken. They have no way to consistently replay any transactions committed on the old-publisher before the promotion event because the old-publisher LSNs make no sense on the new-publisher, and the old-publisher slots won't exist on the new-publisher.<br />
<br />
It's not possible to just create a new logical slot on the new-publisher after promotion because slots cannot replay changes from the past or be rewound. They're forward-only.<br />
<br />
==== maintaining slots ====<br />
<br />
We'd need to have failover-candidate subscribers keep track of slots on the publisher: create new slots when they're created on the publisher, drop them when they're dropped on the publisher, retain resources on the subscriber until all publisher slots no longer require those resources, and advance them when they're advanced on the publisher.<br />
<br />
This can't really be done just by pulling state from the subscriber because then there's an unpredictable window where new slots on the publisher won't exist on failover to the subscriber. So we'd need some replication slot hooks and the ability for the publisher to be aware of its failover-candidate subscribers.<br />
<br />
==== (node,lsn) mapping ====<br />
<br />
We need failover-candidate subscribers can to be able to advance their local slots for peers of the provider in response to publisher slot advances to release resources, and so that subscribers of the old-publisher can replay changes consistently from the old-subscriber from the new-subscriber at the correct start-point.<br />
<br />
So we could <code>START_REPLICATION LOGICAL SLOT "foo" PUBLISHER "pub_id" LSN "XX/YY"</code> and the promoted subscriber could map lsn <code>XX/YY</code> on node <code>pub_id</code> to its local LSNs.<br />
<br />
We'd need something like a persistent lsn mapping of publisher to subscriber txns and some kind of node-id scheme. <br />
<br />
Or some other means of consistent replication progress tracking that's tolerant of the publisher and subscriber having totally different LSNs, like the timeline IDs we have for physical replication but preferably without their limitations and hazards.<br />
<br />
==== Flush confirmations and ordering ====<br />
<br />
Same issue as for physical replica primary failover, physical-before-logical-ordering.<br />
<br />
It's necessary to ensure that if a standby is promoted to replace a failed publisher, the furthest-ahead standby is promoted. Any other subscribers that are further ahead would otherwise have txns that the promoted subscriber wouldn't have, resulting in divergence.<br />
<br />
If both failover-candidate subscribers and other subscribers / logical replication consumers exist, failover-candidate subscribers must confirm flush of new txns on the publisher before a commit can be sent to any other consumers. Otherwise, on failover, the other consumers would have txns from the old-publisher that the promoted replacement publisher would not have, resulting in divergence.<br />
<br />
=== replication origins on failover-candidates of subscribers ===<br />
<br />
There are challenges for promoting a cascaded subscriber to replace a failed subscriber node too.<br />
<br />
==== Maintaining replication origins ====<br />
<br />
When a subscriber advances its replication origin on a publisher, that information needs to be able to be reported to cascaded subscribers so that they can keep track of their effective replay position on the publisher. That way if the cascaded subscriber is promoted to replay directly from the publisher after the old subscriber fails, the promoted new subscriber knows which LSN to request from the publisher when starting replay.<br />
<br />
==== Flush confirmations and ordering ====<br />
<br />
Same issue as for physical replica standby failover, physical-before-logical-ordering.<br />
<br />
Much like for physical replication, the subscriber must hold down the flush lsn it reports to the publisher to the oldest value confirmed as flushed by all failover-candidate cascaded subscribers. Otherwise if a failover-candidate for a subscriber is promoted, the publisher might've advanced the slot's confirmed_flush_lsn and will then fail to (re)send some txns to the promoted subscriber.<br />
<br />
Alternately, each failover-candidate subscriber must maintain its own slot on the publisher, or have the active subscriber or the publisher maintain those slots on behalf of the failover-candidates. The slots must only advance once the failover-candidate subscriber replays changes from the active subscriber.<br />
<br />
=== Sequences ===<br />
<br />
In-core logical replication doesn't replicate sequence advances in a consistent manner right now. We'd have to decode sequence advance records from WAL and ensure the replicas' sequences are advanced too. It's OK if they jump ahead, like they do after a crash of the primary, so long as they're never behind.<br />
<br />
=== Large transaction lag and synchronous replication ===<br />
<br />
Logical replication only starts applying a txn on the subscriber once the provider side commits, so large txns can cause latency spikes in apply. This can result in much longer waits for synchronous commits in logical replication based HA.<br />
<br />
== Logical replication transparent drop-in replacement missing pieces ==<br />
<br />
These aren't HA-specific but are present limitations in logical rep that would stop some or many users from switching easily from physical rep (which they presently use for HA) to logical rep.<br />
<br />
=== DDL replication ===<br />
<br />
To allow users of physical replication to seamlessly switch to logical replication we need a comprehensive solution to transparently replicating schema changes, including graceful handling of global objects (roles, etc).<br />
<br />
=== Large objects ===<br />
<br />
Logical replication does not support replication of large objects (`pg_largeobject`, `lo_create`, etc) so users of this feature cannot benefit from logical replication and could not use logical replication based failover.<br />
<br />
=== Performance ===<br />
<br />
In some cases logical replication performs a lot better than physical replication, especially where network bandwidth is a major constraint and/or the database is very b-tree index heavy. Physical replication is bulky on the wire and applying index updates can be quite expensive in blocking read I/O for the startup process executing redo.<br />
<br />
In other cases logical replication is a lot slower and won't be a suitable replacement for physical replication for failover purposes. Particularly where there is high concurrency on the provider. Any large increase in replication latency is very important for failover viability. There's ongoing work on streaming logical decoding, parallelized logical decoding, and parallel logical apply that will eventually help with this, but it's complex and it's hard to avoid deadlock-related performance issues.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35554Logical replication and physical standby failover2020-11-25T05:07:22Z<p>Ringerc: </p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
'''Problems'''<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
''' Problems '''<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
''' Problems '''<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
''' Problems '''<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
''' Problems '''<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
''' Tooling workarounds '''<br />
<br />
No sensible ones.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
''' Problems '''<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
''' Tooling workarounds '''<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
''' Problems '''<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br /><br /> This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br /><br /> The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
''' Tooling workarounds '''<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
''' Proposed solution(s) '''<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35553Logical replication and physical standby failover2020-11-25T05:05:10Z<p>Ringerc: /* Problems */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
==== Problems ====<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
==== Problems ====<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
==== Problems ====<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
No sensible ones.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
==== Problems ====<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
==== Tooling workarounds ====<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
==== Problems ====<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br /><br /> This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br /><br /> The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35552Logical replication and physical standby failover2020-11-25T05:04:53Z<p>Ringerc: /* Problems */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
==== Problems ====<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
==== Problems ====<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
==== Problems ====<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
No sensible ones.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
==== Problems ====<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
==== Tooling workarounds ====<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
==== Problems ====<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br /><br /><br />
This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br /><br /><br />
The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35551Logical replication and physical standby failover2020-11-25T05:04:34Z<p>Ringerc: /* Problems */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
==== Problems ====<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
==== Problems ====<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
==== Problems ====<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
No sensible ones.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
==== Problems ====<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br /><br /> Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
==== Tooling workarounds ====<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
==== Problems ====<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br />
<br />
This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br />
<br />
The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35550Logical replication and physical standby failover2020-11-25T05:04:16Z<p>Ringerc: /* Proposed solution(s) */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
==== Problems ====<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
==== Problems ====<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
-or-<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
==== Problems ====<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
No sensible ones.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
==== Problems ====<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br />
<br />
Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
==== Tooling workarounds ====<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
==== Problems ====<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br />
<br />
This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br />
<br />
The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35549Logical replication and physical standby failover2020-11-25T05:04:04Z<p>Ringerc: /* Proposed solution(s) */</p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
<br />
-- or --<br />
<br />
* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling<br />
<br />
and<br />
<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
==== Problems ====<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
==== Problems ====<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
- or -<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
==== Problems ====<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
No sensible ones.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
==== Problems ====<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br />
<br />
Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
==== Tooling workarounds ====<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
==== Problems ====<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br />
<br />
This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br />
<br />
The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35548Logical replication and physical standby failover2020-11-25T05:02:51Z<p>Ringerc: </p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
-- or --* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling; and<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
==== Problems ====<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
==== Problems ====<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
- or -<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
==== Problems ====<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
No sensible ones.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
==== Problems ====<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br />
<br />
Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
==== Tooling workarounds ====<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
==== Problems ====<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br />
<br />
This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br />
<br />
The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]][[Category:Feature request]]</div>Ringerchttps://wiki.postgresql.org/index.php?title=Failover_slots&diff=35547Failover slots2020-11-25T05:02:12Z<p>Ringerc: Categories</p>
<hr />
<div>= Failover slots (unsuccessful feature proposal) =<br />
<br />
Failover slots [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com were a proposed feature for PostgreSQL 9.6]. [https://commitfest.postgresql.org/9/488/ The feature proposal has been dropped]. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.<br />
<br />
The wiki page [[Logical replication and physical standby failover]] discusses the current state of physical failover support for logical replication upstream and downstream postgres instances, and the various tooling-based strategies that can make it possible.<br />
<br />
== Partially committed ==<br />
<br />
Some of the functionality underlying the failover slots and [https://commitfest.postgresql.org/11/788/ logical decoding on standby] patch sets did get committed, including:<br />
<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5737c12df0 catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10)] ([https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com list discussion]) This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1148e22a82 timeline following for logical slots (PostgreSQL 10)] ([https://commitfest.postgresql.org/9/568/ CF entry], [https://commitfest.postgresql.org/11/779/ prior CF entry]). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff539da316 Cleanup slots during drop database (PostgreSQL 10)]. Required to allow a standby to replay `DROP DATABASE` for a database where logical slots exist for that database on the standby.<br />
<br />
This is enough to allow external tools to roll their own "failover slots" by syncing slot state from primary to standby(s), though it's a bit delicate to do so correctly. It's necessary to write a C extension that creates replication slots using the low level slot management APIs, since no SQL-visible functions exist to do so. An example of the use of those APIs can be found in the original failover slots patch in src/test/modules.<br />
<br />
A few related patches are also relevant to failover and logical slots:<br />
<br />
* [https://www.postgresql.org/message-id/5c26ff40-8452-fb13-1bea-56a0338a809a@2ndquadrant.com Logical decoding fast-forward and slot advance (Petr Jelinek)] - adds <code>pg_replication_slot_advance()</code> SQL function.<br />
<br />
== Relevant mailing list discussion ==<br />
<br />
* [https://www.postgresql.org/message-id/CAMsr%2BYEVmBJ%3DdyLw%3D%2BkTihmUnGy5_EW4Mig5T0maieg_Zu%3DXCg%40mail.gmail.com Logical decoding on standby] - this proposed feature integrated with failover slots and had some of the same moving parts.<br />
* [https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com Send catalog_xmin separately in hot standby feedback]<br />
* [https://www.postgresql.org/message-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com Logical decoding timeline following take II]<br />
* [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com WIP: Failover slots]<br />
* [https://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com Timeline following for logical slots]<br />
* [https://www.postgresql.org/message-id/CAMsr+YG_1FU_-L8QWSk6oKFT4Jt8dpORy2RHXDyMy0B5ZfkpGA@mail.gmail.com Logical decoding timeline following fails to handle records split across segments]<br />
* [https://www.postgresql.org/message-id/20160503165812.GA29604@alvherre.pgsql What to revert]<br />
<br />
== Implementing replication slot failover with tooling ==<br />
<br />
With the above patches in PostgreSQL 10 it's now possible to implement failover management for PostgreSQL logical replication slots in external tooling.<br />
<br />
Standbys '''must''' be configured with:<br />
<br />
* <code>hot_standby_feedback = on</code><br />
* A <code>primary_slot_name</code> to use a physical replication slot on the primary<br />
<br />
The tool will need to provide an extension in each failover-candidate standby that provides a means of managing low-level replication slot state, since there is no SQL interface for this in PostgreSQL at time of writing. Exactly how this is done, and whether it's a push or pull model etc, is up to the tool. A very simplistic and minimal example can be found in [https://www.postgresql.org/message-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com the patch attached to this mail, in <code>src/test/modules/test_slot_timelines</code>]. (A tool should '''not''' copy `pg_replslot/*/state` files from primary to standby instead; these won't be re-read by the standby when updated while the server is running, and could get replaced by stale contents from shared memory).<br />
<br />
To manage failover, the tool should periodically scan the primary's slots. For each logical replication slot the tool wishes to preserve for failover to a standby, the tool should create/update an identical logical replication slot on any failover-candidate standby(s). The tool must check that that the standby has replayed up to the <code>confirmed_flush_lsn</code> of a slot and delay syncing that slot if needed. When syncing slots, the <code>restart_lsn</code>, <code>confirmed_flush_lsn</code> and <code>catalog_xmin</code> of the standby's copy of a slot must all be updated and persisted together. The tool should also delete slots from the standby when they cease to exist on the primary.<br />
<br />
=== Limitations and caveats ===<br />
<br />
WARNING: See [[Logical replication and physical standby failover]] for the significant challenges surrounding this approach. It's not easy to get right.<br />
<br />
It's only safe to use any given logical replication slot on a standby after promotion once the <code>catalog_xmin</code> for the standby's physical slot on the primary is &lt;= the <code>catalog_xmin</code> for the slot. Until that point, any such slots are unsafe to use; they may work, but produce incomplete or incorrect output or crash the walsender. I recommend that you create them with a different name like "_sync_temp1" or something, then rename them (create a new one and drop the temp one) once the <code>catalog_xmin</code> is known to be safe. You can use the <code>txid_status()</code> function to help with this, or just watch the physical slot's <code>catalog_xmin</code> on the primary.<br />
<br />
Even with this approach, a logical subscriber may receive and apply a transaction from the primary before the physical replica. A failover may then cause the physical replica to be promoted without having this transaction, so the provider and subscriber now differ. Addressing this would require a core code change to teach the walsender to delay sending logical commits until they've been confirmed by all failover-candidate physical replicas. A patch for this would be welcomed. Individual output plugins can work around this in the mean time by sleeping in their commit callback until all slots configured as replicas have flushed past the lsn of the commit being processed. The output plugin has to provide its own means of configuring which slots/connections represent replicas - it does not make sense to overload <code>synchronous_standby_names</code> for this, and you want to use slot names not standby connection names anyway.<br />
<br />
The primary '''must''' preserve the physical replication slot for the standby. If the standby slot is dropped and re-created, it becomes unsafe to fail over to the standby and use any logical slots on the standby until they are resynced again. There's no simple way for tooling to detect if the standby's slot on the primary was dropped and re-created.<br />
<br />
Unfortunately there are no C-level hook functions in the replication slot management code for tools to use to trigger wakeups, syncs or checks. Polling is required.<br />
<br />
= Information on original failover slots proposal =<br />
<br />
The following is older content preserved to aid in understanding the context of the topic.<br />
<br />
== Rationale ==<br />
<br />
We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from a HA Pg install into a message queue / ETL system / whatever using logical decoding. You can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover but you have no way to get from your current before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.<br />
<br />
This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.<br />
<br />
== Limitations ==<br />
<br />
Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).<br />
<br />
Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.<br />
<br />
We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".<br />
<br />
== Patch notes ==<br />
<br />
Additional explanation to accompany the patch submission.<br />
<br />
=== Timeline following for logical decoding ===<br />
<br />
This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series this was the most complex.<br />
<br />
I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)<br />
<br />
This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.<br />
<br />
It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.<br />
<br />
BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.<br />
<br />
=== Failover slots ===<br />
<br />
Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.<br />
<br />
The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.<br />
<br />
For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.<br />
<br />
To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.<br />
<br />
<br />
As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead since that's what the server uses as the start of redo when you begin archive recovery so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value to the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo or failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases if you restore from a backup you'll do archive replay anyway so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup anyway. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.<br />
<br />
Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies. <br />
<br />
<br />
=== User interface for failover slots ===<br />
<br />
Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.<br />
<br />
Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?<br />
<br />
[[Category:Clustering]][[Category:Replication]][[Category:Bi-Directional Replication]][[Category:High Availability]][[Category:Failover]][[Category:Logical Replication]]</div>Ringerchttps://wiki.postgresql.org/index.php?title=Talk:Main_Page&diff=35546Talk:Main Page2020-11-25T04:59:17Z<p>Ringerc: </p>
<hr />
<div><div style="border:1px dashed #028dC1; background-color:#f9f9f9; padding:1em 1em 1.5em 1em; text-align:center;"><br />
'''NOTE:''' This page is only for discussing edits to the [[Main Page|Main Page of the PostgreSQL Wiki]].<br />
<br />
Please refer to the [http://www.postgresql.org/support/ PostgreSQL Support] page for support. Software questions posted here will be deleted.</div><br />
<br />
== Add categories link ==<br />
<br />
Right now you're never going to find categories info unless you know to look for it. Can we link to https://wiki.postgresql.org/wiki/Special:Categories on the main page?<br />
<br />
== Renaming alternative language section link ==<br />
<br />
To be consistent, I suggest changing "日本" (Japan, the country) to "日本語" (Japanese, the language) in the "Alternate Languages" section. [[User:Nicolas.barbier|Nicolas.barbier]] 21:01, 22 May 2011 (UTC)<br />
yes, 日本 → 日本語, also I want to add 한국어 link.<br />
and also 中国 -> 中国语(Chinese).<br />
<br />
== BigreSQL additions ==<br />
Hi ALL<br />
<br />
I have added big para on '''BigreSQL''', But look like it is not saved hence writing once again.<br />
Bigresql = PostgreSQL engine + ProgresDB + BigData <br />
<br />
That is what I have visulazing and I am finding it is future for Postgres as opensource DB and it can take on any Appliance in that case and also support the Schema free Database.<br />
<br />
I have complete a though process on this. I am interested to development of same <br />
before that I like to get community view on this,<br />
<br />
Depend on response I will start giving detail design here. though I have already submiited my thought to Posgres Development team also.<br />
<br />
Cheers<br />
Jayant Dani<br />
Solution Archiect<br />
Head of CoE Technology (Big Data, Mobility, Portal)<br />
TCS<br />
Jayant.dani@tcs.com</div>Ringerchttps://wiki.postgresql.org/index.php?title=Failover_slots&diff=35545Failover slots2020-11-25T04:57:31Z<p>Ringerc: x-ref with new info on slot failover</p>
<hr />
<div>= Failover slots (unsuccessful feature proposal) =<br />
<br />
Failover slots [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com were a proposed feature for PostgreSQL 9.6]. [https://commitfest.postgresql.org/9/488/ The feature proposal has been dropped]. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.<br />
<br />
The wiki page [[Logical replication and physical standby failover]] discusses the current state of physical failover support for logical replication upstream and downstream postgres instances, and the various tooling-based strategies that can make it possible.<br />
<br />
== Partially committed ==<br />
<br />
Some of the functionality underlying the failover slots and [https://commitfest.postgresql.org/11/788/ logical decoding on standby] patch sets did get committed, including:<br />
<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5737c12df0 catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10)] ([https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com list discussion]) This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1148e22a82 timeline following for logical slots (PostgreSQL 10)] ([https://commitfest.postgresql.org/9/568/ CF entry], [https://commitfest.postgresql.org/11/779/ prior CF entry]). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff539da316 Cleanup slots during drop database (PostgreSQL 10)]. Required to allow a standby to replay `DROP DATABASE` for a database where logical slots exist for that database on the standby.<br />
<br />
This is enough to allow external tools to roll their own "failover slots" by syncing slot state from primary to standby(s), though it's a bit delicate to do so correctly. It's necessary to write a C extension that creates replication slots using the low level slot management APIs, since no SQL-visible functions exist to do so. An example of the use of those APIs can be found in the original failover slots patch in src/test/modules.<br />
<br />
A few related patches are also relevant to failover and logical slots:<br />
<br />
* [https://www.postgresql.org/message-id/5c26ff40-8452-fb13-1bea-56a0338a809a@2ndquadrant.com Logical decoding fast-forward and slot advance (Petr Jelinek)] - adds <code>pg_replication_slot_advance()</code> SQL function.<br />
<br />
== Relevant mailing list discussion ==<br />
<br />
* [https://www.postgresql.org/message-id/CAMsr%2BYEVmBJ%3DdyLw%3D%2BkTihmUnGy5_EW4Mig5T0maieg_Zu%3DXCg%40mail.gmail.com Logical decoding on standby] - this proposed feature integrated with failover slots and had some of the same moving parts.<br />
* [https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com Send catalog_xmin separately in hot standby feedback]<br />
* [https://www.postgresql.org/message-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com Logical decoding timeline following take II]<br />
* [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com WIP: Failover slots]<br />
* [https://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com Timeline following for logical slots]<br />
* [https://www.postgresql.org/message-id/CAMsr+YG_1FU_-L8QWSk6oKFT4Jt8dpORy2RHXDyMy0B5ZfkpGA@mail.gmail.com Logical decoding timeline following fails to handle records split across segments]<br />
* [https://www.postgresql.org/message-id/20160503165812.GA29604@alvherre.pgsql What to revert]<br />
<br />
== Implementing replication slot failover with tooling ==<br />
<br />
With the above patches in PostgreSQL 10 it's now possible to implement failover management for PostgreSQL logical replication slots in external tooling.<br />
<br />
Standbys '''must''' be configured with:<br />
<br />
* <code>hot_standby_feedback = on</code><br />
* A <code>primary_slot_name</code> to use a physical replication slot on the primary<br />
<br />
The tool will need to provide an extension in each failover-candidate standby that provides a means of managing low-level replication slot state, since there is no SQL interface for this in PostgreSQL at time of writing. Exactly how this is done, and whether it's a push or pull model etc, is up to the tool. A very simplistic and minimal example can be found in [https://www.postgresql.org/message-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com the patch attached to this mail, in <code>src/test/modules/test_slot_timelines</code>]. (A tool should '''not''' copy `pg_replslot/*/state` files from primary to standby instead; these won't be re-read by the standby when updated while the server is running, and could get replaced by stale contents from shared memory).<br />
<br />
To manage failover, the tool should periodically scan the primary's slots. For each logical replication slot the tool wishes to preserve for failover to a standby, the tool should create/update an identical logical replication slot on any failover-candidate standby(s). The tool must check that that the standby has replayed up to the <code>confirmed_flush_lsn</code> of a slot and delay syncing that slot if needed. When syncing slots, the <code>restart_lsn</code>, <code>confirmed_flush_lsn</code> and <code>catalog_xmin</code> of the standby's copy of a slot must all be updated and persisted together. The tool should also delete slots from the standby when they cease to exist on the primary.<br />
<br />
=== Limitations and caveats ===<br />
<br />
WARNING: See [[Logical replication and physical standby failover]] for the significant challenges surrounding this approach. It's not easy to get right.<br />
<br />
It's only safe to use any given logical replication slot on a standby after promotion once the <code>catalog_xmin</code> for the standby's physical slot on the primary is &lt;= the <code>catalog_xmin</code> for the slot. Until that point, any such slots are unsafe to use; they may work, but produce incomplete or incorrect output or crash the walsender. I recommend that you create them with a different name like "_sync_temp1" or something, then rename them (create a new one and drop the temp one) once the <code>catalog_xmin</code> is known to be safe. You can use the <code>txid_status()</code> function to help with this, or just watch the physical slot's <code>catalog_xmin</code> on the primary.<br />
<br />
Even with this approach, a logical subscriber may receive and apply a transaction from the primary before the physical replica. A failover may then cause the physical replica to be promoted without having this transaction, so the provider and subscriber now differ. Addressing this would require a core code change to teach the walsender to delay sending logical commits until they've been confirmed by all failover-candidate physical replicas. A patch for this would be welcomed. Individual output plugins can work around this in the mean time by sleeping in their commit callback until all slots configured as replicas have flushed past the lsn of the commit being processed. The output plugin has to provide its own means of configuring which slots/connections represent replicas - it does not make sense to overload <code>synchronous_standby_names</code> for this, and you want to use slot names not standby connection names anyway.<br />
<br />
The primary '''must''' preserve the physical replication slot for the standby. If the standby slot is dropped and re-created, it becomes unsafe to fail over to the standby and use any logical slots on the standby until they are resynced again. There's no simple way for tooling to detect if the standby's slot on the primary was dropped and re-created.<br />
<br />
Unfortunately there are no C-level hook functions in the replication slot management code for tools to use to trigger wakeups, syncs or checks. Polling is required.<br />
<br />
= Information on original failover slots proposal =<br />
<br />
The following is older content preserved to aid in understanding the context of the topic.<br />
<br />
== Rationale ==<br />
<br />
We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from a HA Pg install into a message queue / ETL system / whatever using logical decoding. You can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover but you have no way to get from your current before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.<br />
<br />
This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.<br />
<br />
== Limitations ==<br />
<br />
Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).<br />
<br />
Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.<br />
<br />
We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".<br />
<br />
== Patch notes ==<br />
<br />
Additional explanation to accompany the patch submission.<br />
<br />
=== Timeline following for logical decoding ===<br />
<br />
This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series this was the most complex.<br />
<br />
I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)<br />
<br />
This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.<br />
<br />
It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.<br />
<br />
BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.<br />
<br />
=== Failover slots ===<br />
<br />
Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.<br />
<br />
The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.<br />
<br />
For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.<br />
<br />
To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.<br />
<br />
<br />
As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead since that's what the server uses as the start of redo when you begin archive recovery so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value to the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo or failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases if you restore from a backup you'll do archive replay anyway so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup anyway. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.<br />
<br />
Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies. <br />
<br />
<br />
=== User interface for failover slots ===<br />
<br />
Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.<br />
<br />
Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35544Logical replication and physical standby failover2020-11-25T04:55:20Z<p>Ringerc: </p>
<hr />
<div>See also [[Failover slots]] for some historically relevant information.<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
-- or --* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling; and<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
==== Problems ====<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
==== Problems ====<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
- or -<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
==== Problems ====<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
No sensible ones.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
==== Problems ====<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br />
<br />
Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
==== Tooling workarounds ====<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
==== Problems ====<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br />
<br />
This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br />
<br />
The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting</div>Ringerchttps://wiki.postgresql.org/index.php?title=Logical_replication_and_physical_standby_failover&diff=35543Logical replication and physical standby failover2020-11-25T04:54:29Z<p>Ringerc: Discussion of logical replication as it relates to physical standbys and promotion and failover</p>
<hr />
<div>See also [[Failover Slots]]<br />
<br />
= Problem statement =<br />
<br />
As some of you may know, a number of external tools have come to rely on some hacks that manipulate internal replication slot data in order to support failover of a logical replication upstream to physical replicas of that upstream.<br />
<br />
This is pretty much vital for production deployments of logical replication in some scenarios; you can't really say "on failure of the upstream, rebuild everything downstream from it completely". Especially if what's downstream is a continuous ETL process or other stream-consumer that you can't simply drop and copy a new upstream base state to.<br />
<br />
These hacks are not documented, they're prone to a variety of subtle problems, and they're really not safe. The approach we take in pglogical generally works well, but it's complex and it's hard to make it as robust as I'd like without more help from the postgres core. I've seen a number of other people and products take approaches that seem to work, but can actually lead to silent inconsistencies, replication gaps, and data loss.<br />
It's a bit of a disincentive to invest effort in enhancing in-core logical rep, because right now it's rather impractical in HA environments.<br />
<br />
I'd like to improve the status quo. Failover slots never got anywhere because they wouldn't work on standbys (though we don't have logical decoding on standby anyway), but I'm not trying to resurrect that approach.<br />
<br />
== Specific issues and proposed solutions ==<br />
<br />
=== Ensuring catalog_xmin safety on physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* When copying slot state to a replica, there's no way to know if a logical slot's catalog_xmin is actually guaranteed safe on a replica or whether we could've already vacuumed those rows. Tooling has to take care of this and has no way to detect if it got it wrong.<br />
<br />
* A newly created standby is not a valid failover-promotion candidate for a logical replication primary until an initial period where slots exist but their catalog_xmin isn't actually guaranteed to be safe has past. Replaying from slots during this period can produce wrong results, possibly crashes. That's because there's a race between copying the slot state on the primary and when the catalog_xmin of the copy applied to the replica takes effect via the replica's hot_standby_feedback where the primary slot could advance and the catalog_xmin of the copy becomes invalid.<br />
<br />
* We don't check the xmin and catalog_xmin sent by the downstream in hot_standby_feedback messages and limit the newly set xmin or catalog_xmin on a physical slot to the oldest guaranteed-reserved value on the primary; we only guard against wraparound. So the physical slot a replica uses to hold down the primary's catalog_xmin for its slot copies can claim a catalog_xmin that's not actually protected on the primary. Changes could've been vacuumed away already. See ProcessStandbyHSFeedbackMessage().<br />
<br />
* We don't use the effective vs current catalog_xmin separation for physical slots. See PhysicalReplicationSlotNewXmin() . It assumes it doesn't need to care about effective_catalog_xmin because logical decoding is not involved. But when a physical slot protects logical slot resources on a replica that's a promotion candidate, logical decoding is involved, and that assumption isn't safe. We might advance a slot's catalog_xmin, advance the global catalog_xmin, vacuum away some changes, then crash before we flushed the dirty slot persistently to disk. (Not a big deal in practice, since a physical replica won't advance its reported catalog_xmin until it knows it no longer needs those changes because all its own slots' catalog_xmins are past that point).<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical protects catalog_xmin safety by creating temp slots when it does the initial slot copy. It waits for the catalog_xmin to become effective on the replica's physical slot via hot_standby_feedback. That stops catalog_xmin advancing on the primary but the reserved catalog_xmin could be stale if the upstream slot advanced in the mean time. So it syncs the new upstream slot's state (with a possibly advanced catalog_xmin) to the downstream and waits for the upstream lsn at which the slot copy was taken to be passed on the downstream. The slot is then persisted so it becomes visible to use. That's a lot of hoop jumping.<br />
<br />
It could protect the upstream reservation by making a temp slot on the upstream as a resource bookmark instead, but that is complex too.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Record safe catalog_xmin in the checkpoint state in pg_controldata. Advance it during checkpoints and by writing a new WAL record type when ReplicationSlotsComputeRequiredXmin() advances it. Clear it if all catalog_xmin reservations go away.<br />
* Use the tracked oldest safe xmin and catalog_xmin to limit the values applied from hot_standby_feedback to the known safe values<br />
* If no catalog_xmin is defined and a h_s feedback tries to set one, reserve one and cap the slot at the newly reserved catalog_xmin. Don't blindly accept the downstream's catalog_xmin.<br />
* Report the active replication slot's effective xmin and catalog_xmin in walsender's keepalives ( WalSndKeepalive() ) so downstream can tell if the upstream didn't honour its h_s_feedback reservations in full. Standby can already force walsender to send a keepalive reply so no change needed there.<br />
* ERROR if logical decoding is attempted from a slot that has a catalog_xmin not known to be safe<br />
<br />
Might also want to use the same candidate->effective separation when advancing physical slots' xmin and catalog_xmin as we do for logical slots, so we properly checkpoint them and can't go backwards on crash. Not sure if it is really required.<br />
<br />
=== Syncing logical replication slot state to physical replicas ===<br />
<br />
==== Problems ====<br />
<br />
* Each tool must provide its own C extension on physical replicas in order to copy replication slot state from the primary to the standbys, or use tricks like copying the slot state files while the server is stopped.<br />
<br />
* Slot state copying using WAL as a transport doesn't work well because there's (AFAIK) no way to fire a hook on redo of generic WAL, and there aren't any hooks in slot saving and persistence to help extensions capture slot advances anyway. So state copying needs an out-of-band channel or custom tables and polling on both sides. pglogical for example uses separate libpq connections from downstream to upstream for slot sync, which is cumbersome.<br />
<br />
* A newly created standby isn't immediately usable as a promotion candidate; slot state must be copied first, considering the caveats above about catalog_xmin safety too. The same issue applies when a new slot is created on the primary; it's not safe to promote until that new slot is synced to a replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
pglogical does its own C-code-level replication slot manipulation using a background worker on replicas.<br />
<br />
pglogical uses a separate libpq-protocol connection from replica to primary to handle slot state reading. Now that the walsender supports queries, this can use the same connstr used for the walreceiver to stream WAL from the primary.<br />
<br />
pglogical provides functions for the user to use to check whether their physical replica is ready to use as a promotion candidate.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* pg_replication_slot_export(slotname) => bytea and pg_replication_slot_import(slotname text, bytea slotdata, as_temp boolean) functions for slot sync. The slot data will contain the sysid of the primary as well as the current timeline and insert lsn. The import function will ERROR if the sysid doesn't match, the (timeline, lsn) is diverged from the downstream's history, or the slot's xmin or catalog_xmin cannot be guaranteed because they've been advanced past. It will block by default if the (timeline, lsn) is in the future. On import the slot will be created if it doesn't exist. Permit importing of a non-temp upstream slot's state as a temp downstream slot (for reservation) and/or a different slot name.<br />
-- or --* Hooks in slot.c's SaveSlotToPath() before and after the write that tools can use to be notified of and/or capture (maybe to generic WAL) persistent flushes of slot state without polling; and<br />
* A way to register for generic WAL redo callbacks, with a critical section to ensure the callback can't be missed if there's a failure after the generic WAL record is applied but before the followup actions are taken (I know this was discussed before, but extensions are now bigger and more complex than I think anyone really imagined when generic WAL went in)<br />
Both would benefit greatly from the catalog_xmin safety stuff though they'd be usable without it.<br />
<br />
=== Logical slot exported snapshots are not persistent or crash safe ===<br />
<br />
==== Problems ====<br />
<br />
* Exported snapshots from slots go away when the connection to the slot goes away. There's no way to make the snapshot crash-safe and persistent. The snapshot can be protected somewhat by attaching to it from other backends, but a server restart or network glitch can still destroy it. For big data copies during logical replication setup this is a nightmare.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools could make a loopback connection on the upstream side or launch a bgworker, and use that connection or worker to attach to the exported snapshot. That perserves it even if the slot conn closes, and isn't vulnerable to downstream network issues. However, it won't go away automatically even if the slot gets dropped. And it won't perserve the exported snapshot across a crash/restart.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Make a new logical replication slot's xmin persistent until explicitly cleared, and protect the associated snapshot data file by making it owned by the slot and only removing it when explicitly invalidated. The protected snapshot would be unprotected (and thus removed once all backends with it open have exited) once the slot is dropped or when replication from the slot begins. But NOT when the connection that created the slot drops. If you want that, use a temp slot. If BC is a concern here, add a new option to the walsender's CREATE_REPLICATION_SLOT like PERSISTENT_SNAPSHOT .<br />
<br />
=== Logical slot exported snapshots cannot be used to offload consistent reads to replicas ===<br />
<br />
==== Problems ====<br />
<br />
* There's no way to copy an exported snapshot from primary to replica in order to dump data from a replica that's consistent with a replication slot's exported snapshot on the primary. (Nor is there any way to create or move a logical slot to be exactly consistently with an existing exported snapshot). This prevents logical rep tools from offloading initial data copy to replicas without a lot of very complex hoop jumping. It's also relevant for any tool that wants to consistently query across multiple replicas - ETL and analytics tools for example.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
Few, and questionable if any are safe. There are complex issues with commit visibility ordering in WAL vs in the primary's PGXACT amongst other things.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* A pg_export_snapshot_data(text) => bytea to go along with pg_export_snapshot(). The exported snapshot bytea representation would include the xmin and current insert lsn. It would be accepted by a new SET TRANSACTION SNAPSHOT_DATA '\xF00' which would check the xmin and refuse to import the snapshot if the xmin was too old or the replica hadn't replayed past the insert lsn yet. Requires that we track safe catalog_xmin.<br />
<br />
- or -<br />
<br />
* Some way (handwave here) to write exported snapshot state to WAL, preserve it persistently on the primary until explicitly discarded, and attach to such persistent exported snapshots on the standby. A little like we do with 2pc prepared xacts.<br />
<br />
=== Logical slots can fill pg_wal and can't benefit from archiving ===<br />
<br />
==== Problems ====<br />
<br />
* The read_page callback for the logical decoding xlogreader does not know how to use the WAL archiver to fetch removed WAL. So restart_lsn means both "oldest lsn I have to process for correct reorder buffering" and "oldest lsn I must keep WAL segments in pg_wal for". This is an availability hazard and also makes it hard to keep pg_wal on high performance capacity-limited storage. We have max_slot_wal_keep_size now, but the slot just breaks if we cross that threshold, which is bad.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
No sensible ones.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal, now that it's integrated into postgresql.conf. Instead of renaming the segment into place, once the segment is fetched, we can open and unlink it so we don't have to worry about leaking them except on windows, where we can use proc exit callbacks for cleanup. Make caching and readahead the restore_command's problem.<br />
<br />
=== Consistency between logical subscribers and physical replicas of a publisher (upstream) ===<br />
<br />
==== Problems ====<br />
<br />
* Logical downstreams for a given upstream can receive changes before physical replicas of the upstream. If the upstream is replaced with a promoted physical replica, the promoted replica might not have recent txns committed on the old-upstream and already replicated to downstreams.<br />
<br />
Relying on synchronous_standby_names in the output plugin is undesirable because (a) it means clients can't request sync rep to logical downstreams without deadlocking logical replication; (b) output plugins can't send xacts if a standby is disconnected even if the standby has a slot and the slot is safely flushed past the lsn being sent, because s_s_n relies on application_name and pg_stat_replication not pg_replication_slots; if the primary crashes and restarts, sync rep won't wait for those LSNs even if standbys haven't received them yet.<br />
<br />
==== Tooling workarounds ====<br />
<br />
Output plugins can implement their own logic to ensure failover-candidate standbys replay past a given commit before sending it to logical downstreams, but each output plugin shouldn't need its own. The tool has to have configuration to keep track of which physical slots matter, has to have code to wait in the output plugin commit callback, etc.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Define a new failover_replica_slot_names with a list of physical replication slot names that are candidates for failover-promotion to replace the current node. Use the same logic and syntax for n-safe as we use in synchronous_standby_names.<br />
<br />
* Let failover_replica_slot_names be set per-backend via connstr, user or db GUCs, etc, unlike synchronous_standby_names, so output plugins can set it themselves, and so we can adapt to changes in cluster topology.<br />
<br />
* If failover_replica_slot_names is non-empty, wait until all listed physical slots' restart_lsn s and logical slot's confirmed_flush_lsn s have replayed past a given commit lsn before calling any output plugin's commit callback for that lsn. If the backend's current slot is listed, skip over it when checking.<br />
* Respect failover_replica_slot_names like we do synchronous_standby_names for synchronous commit purposes - don't confirm a commit to the client until failover_replica_slot_names has accepted it.<br />
<br />
<br />
=== Consistency between primary and physical replicas of a logical standby ===<br />
<br />
==== Problems ====<br />
<br />
* Logical slots can't go backwards, and they silently fast-forward past changes requested by the downstream if the upstream's confirmed_flush_lsn is greater. On failover there can be gaps in the data stream received by a promoted replica of a standby.<br />
<br />
This happens because the old subscriber confirmed flush of changes to the publisher when they were locally flushed, but before they were flushed to replicas, so they vanish when replicas get promoted.<br />
<br />
The replica's own pg_replication_origin_status.remote_lsn is correct (not advanced) but when the replica connects to the primary and asks for replay to start at the replica's pg_replication_origin_status.remote_lsn, the publisher silently starts at the max of thd downstream's requested pg_replication_origin_status.remote_lsn and the upstream slot's pg_replication_slots.confirmed_flush_lsn. The latter was advanced by the now-failed and replaced node, so some changes are skipped over and never seen by the replica.<br />
<br />
<br />
==== Tooling workarounds ====<br />
<br />
* Tools can keep track of failover-promotion-candidate physical replicas themselves, and can hold down the lsn they report as flushed to the upstream walsender until their failover-candidate downstream(s) have flushed that lsn too. This requires each tool to manage that separately.<br />
<br />
==== Proposed solution(s) ====<br />
<br />
* Provide a Pg API function that reports the newest lsn that's safely flushed by the local node and the failover-candidates in failover_replica_slot_names, if set.<br />
<br />
* Use that function in the pglogical replication worker's feedback reporting</div>Ringerchttps://wiki.postgresql.org/index.php?title=Failover_slots&diff=35508Failover slots2020-11-10T06:38:15Z<p>Ringerc: </p>
<hr />
<div>= Failover slots (unsuccessful feature proposal) =<br />
<br />
Failover slots [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com were a proposed feature for PostgreSQL 9.6]. [https://commitfest.postgresql.org/9/488/ The feature proposal has been dropped]. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.<br />
<br />
== Partially committed ==<br />
<br />
Some of the functionality underlying the failover slots and [https://commitfest.postgresql.org/11/788/ logical decoding on standby] patch sets did get committed, including:<br />
<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5737c12df0 catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10)] ([https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com list discussion]) This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1148e22a82 timeline following for logical slots (PostgreSQL 10)] ([https://commitfest.postgresql.org/9/568/ CF entry], [https://commitfest.postgresql.org/11/779/ prior CF entry]). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff539da316 Cleanup slots during drop database (PostgreSQL 10)]. Required to allow a standby to replay `DROP DATABASE` for a database where logical slots exist for that database on the standby.<br />
<br />
This is enough to allow external tools to roll their own "failover slots" by syncing slot state from primary to standby(s), though it's a bit delicate to do so correctly. It's necessary to write a C extension that creates replication slots using the low level slot management APIs, since no SQL-visible functions exist to do so. An example of the use of those APIs can be found in the original failover slots patch in src/test/modules.<br />
<br />
A few related patches are also relevant to failover and logical slots:<br />
<br />
* [https://www.postgresql.org/message-id/5c26ff40-8452-fb13-1bea-56a0338a809a@2ndquadrant.com Logical decoding fast-forward and slot advance (Petr Jelinek)] - adds <code>pg_replication_slot_advance()</code> SQL function.<br />
<br />
== Relevant mailing list discussion ==<br />
<br />
* [https://www.postgresql.org/message-id/CAMsr%2BYEVmBJ%3DdyLw%3D%2BkTihmUnGy5_EW4Mig5T0maieg_Zu%3DXCg%40mail.gmail.com Logical decoding on standby] - this proposed feature integrated with failover slots and had some of the same moving parts.<br />
* [https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com Send catalog_xmin separately in hot standby feedback]<br />
* [https://www.postgresql.org/message-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com Logical decoding timeline following take II]<br />
* [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com WIP: Failover slots]<br />
* [https://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com Timeline following for logical slots]<br />
* [https://www.postgresql.org/message-id/CAMsr+YG_1FU_-L8QWSk6oKFT4Jt8dpORy2RHXDyMy0B5ZfkpGA@mail.gmail.com Logical decoding timeline following fails to handle records split across segments]<br />
* [https://www.postgresql.org/message-id/20160503165812.GA29604@alvherre.pgsql What to revert]<br />
<br />
== Implementing replication slot failover with tooling ==<br />
<br />
With the above patches in PostgreSQL 10 it's now possible to implement failover management for PostgreSQL logical replication slots in external tooling.<br />
<br />
Standbys '''must''' be configured with:<br />
<br />
* <code>hot_standby_feedback = on</code><br />
* A <code>primary_slot_name</code> to use a physical replication slot on the primary<br />
<br />
The tool will need to provide an extension in each failover-candidate standby that provides a means of managing low-level replication slot state, since there is no SQL interface for this in PostgreSQL at time of writing. Exactly how this is done, and whether it's a push or pull model etc, is up to the tool. A very simplistic and minimal example can be found in [https://www.postgresql.org/message-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com the patch attached to this mail, in <code>src/test/modules/test_slot_timelines</code>]. (A tool should '''not''' copy `pg_replslot/*/state` files from primary to standby instead; these won't be re-read by the standby when updated while the server is running, and could get replaced by stale contents from shared memory).<br />
<br />
To manage failover, the tool should periodically scan the primary's slots. For each logical replication slot the tool wishes to preserve for failover to a standby, the tool should create/update an identical logical replication slot on any failover-candidate standby(s). The tool must check that that the standby has replayed up to the <code>confirmed_flush_lsn</code> of a slot and delay syncing that slot if needed. When syncing slots, the <code>restart_lsn</code>, <code>confirmed_flush_lsn</code> and <code>catalog_xmin</code> of the standby's copy of a slot must all be updated and persisted together. The tool should also delete slots from the standby when they cease to exist on the primary.<br />
<br />
=== Limitations ===<br />
<br />
It's only safe to use any given logical replication slot on a standby after promotion once the <code>catalog_xmin</code> for the standby's physical slot on the primary is &lt;= the <code>catalog_xmin</code> for the slot. Until that point, any such slots are unsafe to use; they may work, but produce incomplete or incorrect output or crash the walsender. I recommend that you create them with a different name like "_sync_temp1" or something, then rename them (create a new one and drop the temp one) once the <code>catalog_xmin</code> is known to be safe. You can use the <code>txid_status()</code> function to help with this, or just watch the physical slot's <code>catalog_xmin</code> on the primary.<br />
<br />
Even with this approach, a logical subscriber may receive and apply a transaction from the primary before the physical replica. A failover may then cause the physical replica to be promoted without having this transaction, so the provider and subscriber now differ. Addressing this would require a core code change to teach the walsender to delay sending logical commits until they've been confirmed by all failover-candidate physical replicas. A patch for this would be welcomed. Individual output plugins can work around this in the mean time by sleeping in their commit callback until all slots configured as replicas have flushed past the lsn of the commit being processed. The output plugin has to provide its own means of configuring which slots/connections represent replicas - it does not make sense to overload <code>synchronous_standby_names</code> for this, and you want to use slot names not standby connection names anyway.<br />
<br />
The primary '''must''' preserve the physical replication slot for the standby. If the standby slot is dropped and re-created, it becomes unsafe to fail over to the standby and use any logical slots on the standby until they are resynced again. There's no simple way for tooling to detect if the standby's slot on the primary was dropped and re-created.<br />
<br />
Unfortunately there are no C-level hook functions in the replication slot management code for tools to use to trigger wakeups, syncs or checks. Polling is required.<br />
<br />
= Information on original failover slots proposal =<br />
<br />
The following is older content preserved to aid in understanding the context of the topic.<br />
<br />
== Rationale ==<br />
<br />
We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from a HA Pg install into a message queue / ETL system / whatever using logical decoding. You can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover but you have no way to get from your current before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.<br />
<br />
This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.<br />
<br />
== Limitations ==<br />
<br />
Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).<br />
<br />
Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.<br />
<br />
We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".<br />
<br />
== Patch notes ==<br />
<br />
Additional explanation to accompany the patch submission.<br />
<br />
=== Timeline following for logical decoding ===<br />
<br />
This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series this was the most complex.<br />
<br />
I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)<br />
<br />
This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.<br />
<br />
It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.<br />
<br />
BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.<br />
<br />
=== Failover slots ===<br />
<br />
Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.<br />
<br />
The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.<br />
<br />
For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.<br />
<br />
To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.<br />
<br />
<br />
As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead since that's what the server uses as the start of redo when you begin archive recovery so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value to the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo or failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases if you restore from a backup you'll do archive replay anyway so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup anyway. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.<br />
<br />
Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies. <br />
<br />
<br />
=== User interface for failover slots ===<br />
<br />
Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.<br />
<br />
Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?</div>Ringerchttps://wiki.postgresql.org/index.php?title=Failover_slots&diff=35507Failover slots2020-11-10T02:46:17Z<p>Ringerc: Explain how to do failover slots in tooling</p>
<hr />
<div>= Failover slots (unsuccessful feature proposal) =<br />
<br />
Failover slots [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com were a proposed feature for PostgreSQL 9.6]. [https://commitfest.postgresql.org/9/488/ The feature proposal has been dropped]. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.<br />
<br />
== Partially committed ==<br />
<br />
Some of the functionality underlying the failover slots and [https://commitfest.postgresql.org/11/788/ logical decoding on standby] patch sets did get committed, including:<br />
<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5737c12df0 catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10)] ([https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com list discussion]) This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1148e22a82 timeline following for logical slots (PostgreSQL 10)] ([https://commitfest.postgresql.org/9/568/ CF entry], [https://commitfest.postgresql.org/11/779/ prior CF entry]). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff539da316 Cleanup slots during drop database (PostgreSQL 10)]. Required to allow a standby to replay `DROP DATABASE` for a database where logical slots exist for that database on the standby.<br />
<br />
This is enough to allow external tools to roll their own "failover slots" by syncing slot state from primary to standby(s), though it's a bit delicate to do so correctly. It's necessary to write a C extension that creates replication slots using the low level slot management APIs, since no SQL-visible functions exist to do so. An example of the use of those APIs can be found in the original failover slots patch in src/test/modules.<br />
<br />
A few related patches are also relevant to failover and logical slots:<br />
<br />
* [https://www.postgresql.org/message-id/5c26ff40-8452-fb13-1bea-56a0338a809a@2ndquadrant.com Logical decoding fast-forward and slot advance (Petr Jelinek)] - adds <code>pg_replication_slot_advance()</code> SQL function.<br />
<br />
== Relevant mailing list discussion ==<br />
<br />
* [https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com Send catalog_xmin separately in hot standby feedback]<br />
* [https://www.postgresql.org/message-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com Logical decoding timeline following take II]<br />
* [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com WIP: Failover slots]<br />
* [https://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com Timeline following for logical slots]<br />
* [https://www.postgresql.org/message-id/CAMsr+YG_1FU_-L8QWSk6oKFT4Jt8dpORy2RHXDyMy0B5ZfkpGA@mail.gmail.com Logical decoding timeline following fails to handle records split across segments]<br />
* [https://www.postgresql.org/message-id/20160503165812.GA29604@alvherre.pgsql What to revert]<br />
<br />
== Implementing replication slot failover with tooling ==<br />
<br />
With the above patches in PostgreSQL 10 it's now possible to implement failover management for PostgreSQL logical replication slots in external tooling.<br />
<br />
Standbys '''must''' be configured with:<br />
<br />
* <code>hot_standby_feedback = on</code><br />
* A <code>primary_slot_name</code> to use a physical replication slot on the primary<br />
<br />
The tool will need to provide an extension in each failover-candidate standby that provides a means of managing low-level replication slot state, since there is no SQL interface for this in PostgreSQL at time of writing. Exactly how this is done, and whether it's a push or pull model etc, is up to the tool. A very simplistic and minimal example can be found in [https://www.postgresql.org/message-id/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com the patch attached to this mail, in <code>src/test/modules/test_slot_timelines</code>]. (A tool should '''not''' copy `pg_replslot/*/state` files from primary to standby instead; these won't be re-read by the standby when updated while the server is running, and could get replaced by stale contents from shared memory).<br />
<br />
To manage failover, the tool should periodically scan the primary's slots. For each logical replication slot the tool wishes to preserve for failover to a standby, the tool should create/update an identical logical replication slot on any failover-candidate standby(s). The tool must check that that the standby has replayed up to the <code>confirmed_flush_lsn</code> of a slot and delay syncing that slot if needed. When syncing slots, the <code>restart_lsn</code>, <code>confirmed_flush_lsn</code> and <code>catalog_xmin</code> of the standby's copy of a slot must all be updated and persisted together. The tool should also delete slots from the standby when they cease to exist on the primary.<br />
<br />
=== Limitations ===<br />
<br />
It's only safe to use any given logical replication slot on a standby after promotion once the <code>catalog_xmin</code> for the standby's physical slot on the primary is &lt;= the <code>catalog_xmin</code> for the slot. Until that point, any such slots are unsafe to use; they may work, but produce incomplete or incorrect output or crash the walsender. I recommend that you create them with a different name like "_sync_temp1" or something, then rename them (create a new one and drop the temp one) once the <code>catalog_xmin</code> is known to be safe. You can use the <code>txid_status()</code> function to help with this, or just watch the physical slot's <code>catalog_xmin</code> on the primary.<br />
<br />
Even with this approach, a logical subscriber may receive and apply a transaction from the primary before the physical replica. A failover may then cause the physical replica to be promoted without having this transaction, so the provider and subscriber now differ. Addressing this would require a core code change to teach the walsender to delay sending logical commits until they've been confirmed by all failover-candidate physical replicas. A patch for this would be welcomed. Individual output plugins can work around this in the mean time by sleeping in their commit callback until all slots configured as replicas have flushed past the lsn of the commit being processed. The output plugin has to provide its own means of configuring which slots/connections represent replicas - it does not make sense to overload <code>synchronous_standby_names</code> for this, and you want to use slot names not standby connection names anyway.<br />
<br />
The primary '''must''' preserve the physical replication slot for the standby. If the standby slot is dropped and re-created, it becomes unsafe to fail over to the standby and use any logical slots on the standby until they are resynced again. There's no simple way for tooling to detect if the standby's slot on the primary was dropped and re-created.<br />
<br />
Unfortunately there are no C-level hook functions in the replication slot management code for tools to use to trigger wakeups, syncs or checks. Polling is required.<br />
<br />
= Information on original failover slots proposal =<br />
<br />
The following is older content preserved to aid in understanding the context of the topic.<br />
<br />
== Rationale ==<br />
<br />
We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from a HA Pg install into a message queue / ETL system / whatever using logical decoding. You can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover but you have no way to get from your current before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.<br />
<br />
This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.<br />
<br />
== Limitations ==<br />
<br />
Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).<br />
<br />
Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.<br />
<br />
We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".<br />
<br />
== Patch notes ==<br />
<br />
Additional explanation to accompany the patch submission.<br />
<br />
=== Timeline following for logical decoding ===<br />
<br />
This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series this was the most complex.<br />
<br />
I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)<br />
<br />
This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.<br />
<br />
It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.<br />
<br />
BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.<br />
<br />
=== Failover slots ===<br />
<br />
Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.<br />
<br />
The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.<br />
<br />
For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.<br />
<br />
To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.<br />
<br />
<br />
As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead since that's what the server uses as the start of redo when you begin archive recovery so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value to the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo or failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases if you restore from a backup you'll do archive replay anyway so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup anyway. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.<br />
<br />
Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies. <br />
<br />
<br />
=== User interface for failover slots ===<br />
<br />
Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.<br />
<br />
Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?</div>Ringerchttps://wiki.postgresql.org/index.php?title=Failover_slots&diff=35506Failover slots2020-11-10T02:09:59Z<p>Ringerc: </p>
<hr />
<div>= Failover slots (unsuccessful feature proposal) =<br />
<br />
Failover slots [https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com were a proposed feature for PostgreSQL 9.6]. [https://commitfest.postgresql.org/9/488/ The feature proposal has been dropped]. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.<br />
<br />
== Partially committed ==<br />
<br />
Some of the functionality underlying the failover slots and [https://commitfest.postgresql.org/11/788/ logical decoding on standby] patch sets did get committed, including:<br />
<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5737c12df0 catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10)] [https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com CF entry]. This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1148e22a82 timeline following for logical slots (PostgreSQL 10)] ([https://commitfest.postgresql.org/9/568/ CF entry], [https://commitfest.postgresql.org/11/779/ prior CF entry]). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff539da316 Cleanup slots during drop database (PostgreSQL 10)]. Required to allow a standby to replay `DROP DATABASE` for a database where logical slots exist for that database on the standby.<br />
<br />
This is enough to allow external tools to roll their own "failover slots" by syncing slot state from primary to standby(s), though it's a bit delicate to do so correctly. It's necessary to write a C extension that creates replication slots using the low level slot management APIs, since no SQL-visible functions exist to do so. An example of the use of those APIs can be found in the original failover slots patch in src/test/modules.<br />
<br />
A few related patches are also relevant to failover and logical slots:<br />
<br />
* [https://www.postgresql.org/message-id/5c26ff40-8452-fb13-1bea-56a0338a809a@2ndquadrant.com Logical decoding fast-forward and slot advance (Petr Jelinek)] - adds <code>pg_replication_slot_advance()</code> SQL function.<br />
<br />
= Discussion of original failover slots proposal =<br />
<br />
The following is older content preserved to aid in understanding the context of the topic.<br />
<br />
== Rationale ==<br />
<br />
We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from a HA Pg install into a message queue / ETL system / whatever using logical decoding. You can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover but you have no way to get from your current before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.<br />
<br />
This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.<br />
<br />
== Limitations ==<br />
<br />
Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).<br />
<br />
Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.<br />
<br />
We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".<br />
<br />
== Patch notes ==<br />
<br />
Additional explanation to accompany the patch submission.<br />
<br />
=== Timeline following for logical decoding ===<br />
<br />
This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series this was the most complex.<br />
<br />
I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)<br />
<br />
This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.<br />
<br />
It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.<br />
<br />
BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.<br />
<br />
=== Failover slots ===<br />
<br />
Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.<br />
<br />
The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.<br />
<br />
For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.<br />
<br />
To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.<br />
<br />
<br />
As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead since that's what the server uses as the start of redo when you begin archive recovery so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value to the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo or failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases if you restore from a backup you'll do archive replay anyway so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup anyway. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.<br />
<br />
Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies. <br />
<br />
<br />
=== User interface for failover slots ===<br />
<br />
Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.<br />
<br />
Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?</div>Ringerchttps://wiki.postgresql.org/index.php?title=Failover_slots&diff=35505Failover slots2020-11-10T01:25:44Z<p>Ringerc: Formatting fix</p>
<hr />
<div>= Failover slots (unsuccessful feature proposal) =<br />
<br />
Failover slots were are a proposed feature for PostgreSQL 9.6. [https://commitfest.postgresql.org/9/488/ The feature proposal has been dropped]. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.<br />
<br />
== Partially committed ==<br />
<br />
Some of the functionality underlying the failover slots and [https://commitfest.postgresql.org/11/788/ logical decoding on standby] patch sets did get committed, including:<br />
<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5737c12df0 catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10)] [https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com CF entry]. This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1148e22a82 timeline following for logical slots (PostgreSQL 10)] ([https://commitfest.postgresql.org/9/568/ CF entry], [https://commitfest.postgresql.org/11/779/ prior CF entry]). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff539da316 Cleanup slots during drop database (PostgreSQL 10)]. Required to allow a standby to replay `DROP DATABASE` for a database where logical slots exist for that database on the standby.<br />
<br />
This is enough to allow external tools to roll their own "failover slots" by syncing slot state from primary to standby(s), though it's a bit delicate to do so correctly. It's necessary to write a C extension that creates replication slots using the low level slot management APIs, since no SQL-visible functions exist to do so. An example of the use of those APIs can be found in the original failover slots patch in src/test/modules.<br />
<br />
A few related patches are also relevant to failover and logical slots:<br />
<br />
* [https://www.postgresql.org/message-id/5c26ff40-8452-fb13-1bea-56a0338a809a@2ndquadrant.com Logical decoding fast-forward and slot advance (Petr Jelinek)] - adds <code>pg_replication_slot_advance()</code> SQL function.<br />
<br />
= Discussion of original failover slots proposal =<br />
<br />
The following is older content preserved to aid in understanding the context of the topic.<br />
<br />
== Rationale ==<br />
<br />
We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from a HA Pg install into a message queue / ETL system / whatever using logical decoding. You can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover but you have no way to get from your current before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.<br />
<br />
This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.<br />
<br />
== Limitations ==<br />
<br />
Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).<br />
<br />
Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.<br />
<br />
We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".<br />
<br />
== Patch notes ==<br />
<br />
Additional explanation to accompany the patch submission.<br />
<br />
=== Timeline following for logical decoding ===<br />
<br />
This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series this was the most complex.<br />
<br />
I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)<br />
<br />
This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.<br />
<br />
It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.<br />
<br />
BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.<br />
<br />
=== Failover slots ===<br />
<br />
Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.<br />
<br />
The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.<br />
<br />
For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.<br />
<br />
To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.<br />
<br />
<br />
As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead since that's what the server uses as the start of redo when you begin archive recovery so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value to the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo or failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases if you restore from a backup you'll do archive replay anyway so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup anyway. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.<br />
<br />
Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies. <br />
<br />
<br />
=== User interface for failover slots ===<br />
<br />
Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.<br />
<br />
Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?</div>Ringerchttps://wiki.postgresql.org/index.php?title=Failover_slots&diff=35504Failover slots2020-11-10T01:24:41Z<p>Ringerc: </p>
<hr />
<div>= Failover slots (unsuccessful feature proposal) =<br />
<br />
Failover slots were are a proposed feature for PostgreSQL 9.6. [https://commitfest.postgresql.org/9/488/ The feature proposal has been dropped]. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.<br />
<br />
== Partially committed ==<br />
<br />
Some of the functionality underlying the failover slots and [https://commitfest.postgresql.org/11/788/ logical decoding on standby] patch sets did get committed, including:<br />
<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5737c12df0 catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10)] [https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com CF entry]. This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1148e22a82 timeline following for logical slots (PostgreSQL 10)] ([https://commitfest.postgresql.org/9/568/ CF entry], [https://commitfest.postgresql.org/11/779/ prior CF entry]). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.<br />
* [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff539da316 Cleanup slots during drop database (PostgreSQL 10)]. Required to allow a standby to replay `DROP DATABASE` for a database where logical slots exist for that database on the standby.<br />
<br />
This is enough to allow external tools to roll their own "failover slots" by syncing slot state from primary to standby(s), though it's a bit delicate to do so correctly. It's necessary to write a C extension that creates replication slots using the low level slot management APIs, since no SQL-visible functions exist to do so. An example of the use of those APIs can be found in the original failover slots patch in src/test/modules.<br />
<br />
A few related patches are also relevant to failover and logical slots:<br />
<br />
* [https://www.postgresql.org/message-id/5c26ff40-8452-fb13-1bea-56a0338a809a@2ndquadrant.com Logical decoding fast-forward and slot advance (Petr Jelinek)] - adds `pg_replication_slot_advance()` SQL function.<br />
<br />
= Discussion of original failover slots proposal =<br />
<br />
The following is older content preserved to aid in understanding the context of the topic.<br />
<br />
== Rationale ==<br />
<br />
We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from a HA Pg install into a message queue / ETL system / whatever using logical decoding. You can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover but you have no way to get from your current before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.<br />
<br />
This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.<br />
<br />
== Limitations ==<br />
<br />
Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).<br />
<br />
Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.<br />
<br />
We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".<br />
<br />
== Patch notes ==<br />
<br />
Additional explanation to accompany the patch submission.<br />
<br />
=== Timeline following for logical decoding ===<br />
<br />
This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series this was the most complex.<br />
<br />
I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)<br />
<br />
This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.<br />
<br />
It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.<br />
<br />
BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.<br />
<br />
=== Failover slots ===<br />
<br />
Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.<br />
<br />
The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.<br />
<br />
For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.<br />
<br />
To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.<br />
<br />
<br />
As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead since that's what the server uses as the start of redo when you begin archive recovery so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value to the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo or failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases if you restore from a backup you'll do archive replay anyway so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup anyway. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.<br />
<br />
Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies. <br />
<br />
<br />
=== User interface for failover slots ===<br />
<br />
Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.<br />
<br />
Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?</div>Ringerchttps://wiki.postgresql.org/index.php?title=Failover_slots&diff=35503Failover slots2020-11-10T01:11:33Z<p>Ringerc: Update current status of failover slots patch</p>
<hr />
<div>= Failover slots (unsuccessful feature proposal) =<br />
<br />
Failover slots were are a proposed feature for PostgreSQL 9.6. [https://commitfest.postgresql.org/9/488/ The feature proposal has been dropped]. Failover slots were not included in PostgreSQL and are unlikely to be included in the form proposed in this patch. However, parts of the functionality underlying the feature did get committed.<br />
<br />
== Partially committed<br />
<br />
Some of the functionality underlying the failover slots and [https://commitfest.postgresql.org/11/788/ logical decoding on standby] patch sets did get committed, including:<br />
<br />
* [https://www.postgresql.org/message-id/CAMsr%2BYFi-LV7S8ehnwUiZnb%3D1h_14PwQ25d-vyUNq-f5S5r%3DZg%40mail.gmail.com catalog_xmin in hot_standby_feedback messages from physical streaming replicas (PostgreSQL 10)]. This allows a streaming replica to forward the `catalog_xmin` of any logical slots that exist on the replica up to the primary where the `catalog_xmin` is then applied to the physical slot of the streaming replica. The primary will then guarantee the `catalog_xmin` of those logical slots on the standby even when the standby is disconnected, preventing vacuum from removing dead catalog tuples that might still be needed for logical slots on the standby. That makes it safe to use the logical slots on the standby for logical decoding after promoting the standby.<br />
* [https://commitfest.postgresql.org/9/568/ timeline following for logical slots (PostgreSQL 10)] (see also [https://commitfest.postgresql.org/11/779/ prior CF entry]). This patch teaches the xlogreader code how to follow timeline switches correctly when doing logical decoding, so the change stream from a logical slot does not terminate when the upstream timeline increments due to a failover/promotion event. Necessary to allow logical decoding to follow failover.<br />
<br />
== Rationale<br />
<br />
We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client. Without this you can't reliably stream data from a HA Pg install into a message queue / ETL system / whatever using logical decoding. You can't have physical HA for your nodes in a logical replication deployment, multi-master mesh cluster, etc. You currently have to create a new slot after failover but you have no way to get from your current before-failover state to a state consistent with the replay position of the standby at the moment of slot creation.<br />
<br />
This patch does not make all slots into failover slots. Because failover slots must write WAL on create/advance/drop they cannot be used on a standby in recovery. That means normal non-failover slots remain useful - you can use a physical slot to get WAL from a standby, and it should be possible to add support for logical decoding from standbys too.<br />
<br />
== Limitations ==<br />
<br />
Failover slots can only be created, replayed from and dropped on a master (original or promoted standby).<br />
<br />
Later work could allow all slots to become failover slots by introducing a slot logging mechanism that functions "beside" WAL, allowing standbys to have their own failover slots and replay changes on those slots to their own cascading standbys. This can be done without changing the UI or breaking clients that work with the first (9.6) version since they won't care that the transport between nodes has changed from WAL to ... something else. To be handwaved up later, and probably only capable of working using slots and streaming replication not archive recovery.<br />
<br />
We need failover slots now, to make logical decoding more useful in production environments. Supporting them on cascading standbys is a "nice to have later".<br />
<br />
== Patch notes ==<br />
<br />
Additional explanation to accompany the patch submission.<br />
<br />
=== Timeline following for logical decoding ===<br />
<br />
This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series this was the most complex.<br />
<br />
I originally intended to add timeline following support directly to the xlogreader but that wouldn't work well with redo, is more intrusive, and would cause problems with frontend code like pg_xlogdump and pg_rewind that use the xlogreader code. So I implemented it as a nearly-standalone helper that logical decoding callbacks can invoke to ask the xlogreader to track timeline state for them. It's in xlogutils.c not xlogreader.c because it can't be compiled for frontend code. (A later patch could make timeline.c compile for FRONTEND, move the helper to xlogreader.c and add support for timeline following in pg_xlogdump fairly easily... if anyone cares. I don't see the point.)<br />
<br />
This patch does NOT use the restart_tli member of ReplicationSlotPersistentData. That member is unused in the current 94/95/96 code as well. I think it should be removed. I didn't use it because it's no help; it's still necessary to read the timeline history files to determine the validity boundaries of a historical timeline, and you might as well just look up the timeline of a given point in history there too. Additionally, trying to use the slot's restart_tli doesn't make sense alongside the existing use of restart_lsn to lazily update the slot's restart position in response to feedback. Timeline following must be eagerly done during replay. In the case of a peek the timeline switch won't be persistent either. Trying to use restart_tli to track which timeline to read new records from is wrong and useless. There's no point having it at all since we can determine the restart timeline from the restart LSN using the timeline history.<br />
<br />
It should be possible to use the timeline following support in patch 1 alone by creating an extension that does a ReplicationSlotCreate on a standby and manually mirrors a slot on the master using a side-channel (application etc) to co-ordinate.<br />
<br />
BTW, I found the timeline logic in Pg complex enough that I intend to follow up with a README to describe it. I had to map it all out - and the different ways different parts of Pg handle timeline following and WAL reading - before I could make sense of it.<br />
<br />
=== Failover slots ===<br />
<br />
Adds failover slots, slots whose state updates are recorded in WAL and replayed on replica servers.<br />
<br />
The point "drop non-failover slots in archive recovery startup" needs more explanation. In 9.4 and 9.5 we just omit pg_replslot from pg_basebackup copies. So they're effectively dropped on backup, only for the backup copy. For other backup methods like rsync, disk snapshots, etc, it'll tend to get included and the server will read and retain the slots when started up from such a backup. This means that on restore we might or might not keep slots based on the method used. If slots were retained they'll holdthe minimum LSN down and prevent WAL retention on the replica since the replica makes its own independent decisions about WAL retention. Additionally logical slots will become increasingly *wrong*: their catalog xmin won't advance, but vacuum on the master server the standby is replaying from will remove dead rows from the catalogs based on the up-to-date catalog xmin of the live, advancing copy of the slot on the master. Nothing updates the copy of the slot from the basebackup.<br />
<br />
For failover slots I changed pg_basebackup to copy pg_replslot since I have to copy failover slot state. An alternative would be to make sure the backup checkpoint includes every slot whether or not it's dirty, but copying pg_replslot seemed to make more sense and is more consistent with normal recovery startup. This made the existing problem with stale slots more visible though.<br />
<br />
To solve that, as part of this patch the server now drops non-failover slots during archive recovery startup - which includes both true archive based restore/PITR and streaming replication.<br />
<br />
<br />
As for the backup label change: I don't feel too strongly about this. It'll allow backup tools to make sure to copy and archive enough WAL to keep failover slots functional. It's harmless - all the common tools fish things out of backup labels with regexps and won't care about a new line. It can *not* be achieved by changing START WAL LOCATION instead since that's what the server uses as the start of redo when you begin archive recovery so it must be a checkpoint's REDO pointer. On the other hand pg_basebackup gets the information it needs about minimum WAL from the return value to the BASE_BACKUP command, which I changed to the min() of the WAL needed for redo or failover slots, so it doesn't actually need the info in the label to do its job right. In the great majority of cases if you restore from a backup you'll do archive replay anyway so any failover slots would rapidly advance past needing the extra WAL that got retained in the backup anyway. I'd rather solve this 100% though, not ask users to hope they don't hit this window. Especially since slots might be used for bursty or delayed replication that widens the window.<br />
<br />
Notice that slot creation doesn't get replayed to a replica until the slot is made persistent. That way ephemeral slots from failed decoding setups don't have any effect on the replica. I'd like extra attention on this part though, as I'm concerned there might be a risk that as a result the replica could remove WAL still needed by the slot by the time it's made persistent and replicated. If you promote during this period your slot on the replica would be broken. I haven't been able to reproduce it and I'm not convinced it can happen but could use reviewer input there. The reason I don't just replicate ephemeral slot creation to solve this is that ephemeral slots are dropped silently during redo after crash recovery, where we can't write WAL to tell the replica it no longer needs to keep its copies. <br />
<br />
<br />
=== User interface for failover slots ===<br />
<br />
Adds the SQL function parameters, walsender keyword and view changes as well as the docs changes.<br />
<br />
Does not add a --failover option to pg_recvlogical or pg_receivexlog's --create-slot option. Think one is needed?</div>Ringerchttps://wiki.postgresql.org/index.php?title=Fsync_Errors&diff=35372Fsync Errors2020-09-25T06:40:13Z<p>Ringerc: Again, forgot that this wasn't markdown and forgot to preview.</p>
<hr />
<div>This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. It has sometimes been referred to as "fsyncgate 2018".<br />
<br />
== Current status ==<br />
<br />
As of [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1 this PostgreSQL 12 commit], PostgreSQL will now PANIC on fsync() failure. It was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.<br />
<br />
Linux kernel 4.13 improved <code>fsync()</code> error handling and the [https://linux.die.net/man/2/fsync man page for <code>fsync()</code> is somewhat improved] as well. See:<br />
<br />
* [https://kernelnewbies.org/Linux_4.13#Improved_block_layer_and_background_writes_error_handling Kernelnewbies for 4.13]<br />
* Particularly significant 4.13 commits include:<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5660e13d2fd6af1903d4b0b98020af95ca2d638a "fs: new infrastructure for writeback error handling and reporting"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6acec592c6bc9a4c3136e46430e14767b07f9f1a "ext4: use errseq_t based error handling for reporting data writeback errors"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=acbf3c3452c3729829fdb0e5a52fed3cce556eb2 "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed1e46aaf1bec6a12f4c89637f2c3ef4c70f18e "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"]<br />
<br />
Many thanks to Jeff Layton for work done in this area.<br />
<br />
Similar changes were made in [https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a#diff-28ed40e7ccec6683ebd46da2ca82d01c InnoDB/MySQL], [https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96 WiredTiger/MongoDB] and no doubt other software as a result of the PR around this.<br />
<br />
A proposed follow-up change to PostgreSQL was discussed in the thread [https://www.postgresql.org/message-id/flat/CAEepm%3D2gTANm%3De3ARnJT%3Dn0h8hf88wqmaZxk0JYkxw%2Bb21fNrw%40mail.gmail.com Refactoring the checkpointer's fsync request queue]]. The [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 patch that was committed] did not incorporate the file-descriptor passing changes proposed. There is still discussion open on some additional safeguards that may use file system error counters and/or filesystem-wide flushing.<br />
<br />
== Articles and news ==<br />
<br />
* [https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com The "fsyncgate 2018" mailing list thread] <br />
* [https://lwn.net/Articles/752063/ LWN.net article "PostgreSQL's fsync() surprise"]<br />
* [https://lwn.net/Articles/724307/ LWN.net article "Improved block-layer error handling"]<br />
* [https://www.usenix.org/system/files/atc20-rebello.pdf Can Applications Recover from fsync Failures?] - a USENIX 2020 paper discussing some of these topics<br />
<br />
== Research notes and OS differences ==<br />
<br />
Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? PostgreSQL believes that a successful call to fsync() means that *all* data for a file is on disk, as part of its checkpointing protocol. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss. .<br />
<br />
If you see a mistake or know something I don't, please update this document with supporting references, or ping thomas.munro@gmail.com!<br />
<br />
=== Open source kernels ===<br />
<br />
* Darwin/macOS: [https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695 buffers are invalidated], code similar to NetBSD<br />
* DragonflyBSD: not analysed -- the source of [https://github.com/DragonFlyBSD/DragonFlyBSD/blob/00639b5df9c853cf6136257b6c6db6739c3ba189/sys/kern/vfs_bio.c#L1278 brelse] might tell us<br />
* FreeBSD: [https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2639 buffers remain dirty] (and from version 11.1 on, [https://reviews.freebsd.org/rS316941 they are dropped on failure after the device goes away]) so future fsync() calls will try again and presumably fail; [https://www.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk recent testing report], [https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html 10 year old testing report] [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 commit from over 20 years ago fixing the issue]<br />
* Illumos: [https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441 writes are retried], at least in the case of asynchronous write-back. Not yet clear to me whether failure provoked by a synchronous fsync() call leaves buffers valid and dirty.<br />
* Linux < 4.13: [https://lwn.net/Articles/724307/ fsync() errors can be lost in various ways]; also buffers are marked clean after errors, so retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure<br />
* Linux 4.13 and 4.15: [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750 fsync() only reports writeback errors that occurred after you called open()] so our schemes for closing and opening files LRU-style and handing fsync() work off to the checkpointer process can hide write-back errors; also buffers are marked clean after errors so even if you opened the file before the failure, retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure.<br />
* Linux 4.14 and Linux >= 4.16 [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 write-back error counter is initialised differently] so that somebody gets the inode's first error even if the file was closed and opened in between, but you still only get the error once (so retrying fsync() is not OK) and the error can be forgotten if the inode falls out of the inode cache (unlikely since all file descriptors referencing the inode must be closed first, and close calls fsync); buffers are still thrown away (either immediately or on memory pressure, depending on choice of fs) so you might read back an older version of the page than you most recently wrote<br />
* NetBSD: [http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html buffers are invalidated] [https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059 here] so future fsync() calls may return success despite data loss; there may also be other problems according to a [http://gnats.netbsd.org/53152 netbsd.org bug report] that was triggered by our discussion<br />
* OpenBSD: [https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867 buffers are invalidated], code similar to NetBSD; [https://marc.info/?l=openbsd-tech&m=152357572903197&w=2 OpenBSD hackers pinged for comment] [https://marc.info/?l=openbsd-tech&m=154850737529764&w=2 new OpenBSD hackers thread]; UPDATE: [https://marc.info/?l=openbsd-cvs&m=155044187426287&w=2 a recent commit changed] the behaviour, analysis needed; [https://man.openbsd.org/fsync.2 man page updated] to say "To guard against potential inconsistency, future calls will continue failing until all references to the file are closed.", which is good as long as someone holds the file open, but that isn't guaranteed in PostgreSQL (it probably should be)<br />
<br />
=== Closed source kernels ===<br />
<br />
* AIX: unknown<br />
* HPUX: unknown<br />
* Solaris: maybe the same as Illumos, but there was apparently a [https://lists.freebsd.org/pipermail/freebsd-hackers/2016-July/049665.html great VM allocator rewrite] after Solaris reverted to closed source <br />
* Windows: unknown<br />
<br />
=== Special cases ===<br />
<br />
Note that ZFS is likely to be a special case even on Linux, because it doesn't use the regular page cache and has special handling for failures. More information needed.<br />
<br />
There is ongoing discussion regarding flushing and error handling in the Linux kernel, such as that occurring in the fsinfo patch sets.<br />
<br />
=== History and notes ===<br />
<br />
Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their [https://github.com/weiss/original-bsd/blob/master/sys/kern/vfs_bio.c#L377 common ancestor], but FreeBSD [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 changed it in 1999] and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX was [https://github.com/dspinellis/unix-history-repo/blob/Bell-32V-Snapshot-Development/usr/src/sys/sys/bio.c#L196 conceptually had the same problem] but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), developed around the same time and same place as POSTGRES (1986), and said in its man page it for making transactional facilities. Also fsync(1) appeared in FreeBSD 4.3 (2001), a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Fsync_Errors&diff=35371Fsync Errors2020-09-25T06:39:34Z<p>Ringerc: Mention fsync request queue, fsinfo patchsets</p>
<hr />
<div>This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. It has sometimes been referred to as "fsyncgate 2018".<br />
<br />
== Current status ==<br />
<br />
As of [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1 this PostgreSQL 12 commit], PostgreSQL will now PANIC on fsync() failure. It was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.<br />
<br />
Linux kernel 4.13 improved <code>fsync()</code> error handling and the [https://linux.die.net/man/2/fsync man page for <code>fsync()</code> is somewhat improved] as well. See:<br />
<br />
* [https://kernelnewbies.org/Linux_4.13#Improved_block_layer_and_background_writes_error_handling Kernelnewbies for 4.13]<br />
* Particularly significant 4.13 commits include:<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5660e13d2fd6af1903d4b0b98020af95ca2d638a "fs: new infrastructure for writeback error handling and reporting"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6acec592c6bc9a4c3136e46430e14767b07f9f1a "ext4: use errseq_t based error handling for reporting data writeback errors"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=acbf3c3452c3729829fdb0e5a52fed3cce556eb2 "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed1e46aaf1bec6a12f4c89637f2c3ef4c70f18e "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"]<br />
<br />
Many thanks to Jeff Layton for work done in this area.<br />
<br />
Similar changes were made in [https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a#diff-28ed40e7ccec6683ebd46da2ca82d01c InnoDB/MySQL], [https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96 WiredTiger/MongoDB] and no doubt other software as a result of the PR around this.<br />
<br />
A proposed follow-up change to PostgreSQL was discussed in the thread [Refactoring the checkpointer's fsync request queue](https://www.postgresql.org/message-id/flat/CAEepm%3D2gTANm%3De3ARnJT%3Dn0h8hf88wqmaZxk0JYkxw%2Bb21fNrw%40mail.gmail.com). The [patch that was committed](https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3eb77eba5a51780d5cf52cd66a9844cd4d26feb0) did not incorporate the file-descriptor passing changes proposed. There is still discussion open on some additional safeguards that may use file system error counters and/or filesystem-wide flushing.<br />
<br />
== Articles and news ==<br />
<br />
* [https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com The "fsyncgate 2018" mailing list thread] <br />
* [https://lwn.net/Articles/752063/ LWN.net article "PostgreSQL's fsync() surprise"]<br />
* [https://lwn.net/Articles/724307/ LWN.net article "Improved block-layer error handling"]<br />
* [https://www.usenix.org/system/files/atc20-rebello.pdf Can Applications Recover from fsync Failures?] - a USENIX 2020 paper discussing some of these topics<br />
<br />
== Research notes and OS differences ==<br />
<br />
Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? PostgreSQL believes that a successful call to fsync() means that *all* data for a file is on disk, as part of its checkpointing protocol. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss. .<br />
<br />
If you see a mistake or know something I don't, please update this document with supporting references, or ping thomas.munro@gmail.com!<br />
<br />
=== Open source kernels ===<br />
<br />
* Darwin/macOS: [https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695 buffers are invalidated], code similar to NetBSD<br />
* DragonflyBSD: not analysed -- the source of [https://github.com/DragonFlyBSD/DragonFlyBSD/blob/00639b5df9c853cf6136257b6c6db6739c3ba189/sys/kern/vfs_bio.c#L1278 brelse] might tell us<br />
* FreeBSD: [https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2639 buffers remain dirty] (and from version 11.1 on, [https://reviews.freebsd.org/rS316941 they are dropped on failure after the device goes away]) so future fsync() calls will try again and presumably fail; [https://www.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk recent testing report], [https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html 10 year old testing report] [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 commit from over 20 years ago fixing the issue]<br />
* Illumos: [https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441 writes are retried], at least in the case of asynchronous write-back. Not yet clear to me whether failure provoked by a synchronous fsync() call leaves buffers valid and dirty.<br />
* Linux < 4.13: [https://lwn.net/Articles/724307/ fsync() errors can be lost in various ways]; also buffers are marked clean after errors, so retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure<br />
* Linux 4.13 and 4.15: [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750 fsync() only reports writeback errors that occurred after you called open()] so our schemes for closing and opening files LRU-style and handing fsync() work off to the checkpointer process can hide write-back errors; also buffers are marked clean after errors so even if you opened the file before the failure, retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure.<br />
* Linux 4.14 and Linux >= 4.16 [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 write-back error counter is initialised differently] so that somebody gets the inode's first error even if the file was closed and opened in between, but you still only get the error once (so retrying fsync() is not OK) and the error can be forgotten if the inode falls out of the inode cache (unlikely since all file descriptors referencing the inode must be closed first, and close calls fsync); buffers are still thrown away (either immediately or on memory pressure, depending on choice of fs) so you might read back an older version of the page than you most recently wrote<br />
* NetBSD: [http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html buffers are invalidated] [https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059 here] so future fsync() calls may return success despite data loss; there may also be other problems according to a [http://gnats.netbsd.org/53152 netbsd.org bug report] that was triggered by our discussion<br />
* OpenBSD: [https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867 buffers are invalidated], code similar to NetBSD; [https://marc.info/?l=openbsd-tech&m=152357572903197&w=2 OpenBSD hackers pinged for comment] [https://marc.info/?l=openbsd-tech&m=154850737529764&w=2 new OpenBSD hackers thread]; UPDATE: [https://marc.info/?l=openbsd-cvs&m=155044187426287&w=2 a recent commit changed] the behaviour, analysis needed; [https://man.openbsd.org/fsync.2 man page updated] to say "To guard against potential inconsistency, future calls will continue failing until all references to the file are closed.", which is good as long as someone holds the file open, but that isn't guaranteed in PostgreSQL (it probably should be)<br />
<br />
=== Closed source kernels ===<br />
<br />
* AIX: unknown<br />
* HPUX: unknown<br />
* Solaris: maybe the same as Illumos, but there was apparently a [https://lists.freebsd.org/pipermail/freebsd-hackers/2016-July/049665.html great VM allocator rewrite] after Solaris reverted to closed source <br />
* Windows: unknown<br />
<br />
=== Special cases ===<br />
<br />
Note that ZFS is likely to be a special case even on Linux, because it doesn't use the regular page cache and has special handling for failures. More information needed.<br />
<br />
There is ongoing discussion regarding flushing and error handling in the Linux kernel, such as that occurring in the fsinfo patch sets.<br />
<br />
=== History and notes ===<br />
<br />
Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their [https://github.com/weiss/original-bsd/blob/master/sys/kern/vfs_bio.c#L377 common ancestor], but FreeBSD [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 changed it in 1999] and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX was [https://github.com/dspinellis/unix-history-repo/blob/Bell-32V-Snapshot-Development/usr/src/sys/sys/bio.c#L196 conceptually had the same problem] but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), developed around the same time and same place as POSTGRES (1986), and said in its man page it for making transactional facilities. Also fsync(1) appeared in FreeBSD 4.3 (2001), a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Fsync_Errors&diff=35370Fsync Errors2020-09-25T05:44:29Z<p>Ringerc: Formatting changes and heading fixes</p>
<hr />
<div>This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. It has sometimes been referred to as "fsyncgate 2018".<br />
<br />
== Current status ==<br />
<br />
As of [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1 this PostgreSQL 12 commit], PostgreSQL will now PANIC on fsync() failure. It was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.<br />
<br />
Linux kernel 4.13 improved <code>fsync()</code> error handling and the [https://linux.die.net/man/2/fsync man page for <code>fsync()</code> is somewhat improved] as well. See:<br />
<br />
* [https://kernelnewbies.org/Linux_4.13#Improved_block_layer_and_background_writes_error_handling Kernelnewbies for 4.13]<br />
* Particularly significant 4.13 commits include:<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5660e13d2fd6af1903d4b0b98020af95ca2d638a "fs: new infrastructure for writeback error handling and reporting"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6acec592c6bc9a4c3136e46430e14767b07f9f1a "ext4: use errseq_t based error handling for reporting data writeback errors"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=acbf3c3452c3729829fdb0e5a52fed3cce556eb2 "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed1e46aaf1bec6a12f4c89637f2c3ef4c70f18e "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"]<br />
<br />
Many thanks to Jeff Layton for work done in this area.<br />
<br />
Similar changes were made in [https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a#diff-28ed40e7ccec6683ebd46da2ca82d01c InnoDB/MySQL], [https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96 WiredTiger/MongoDB] and no doubt other software as a result of the PR around this.<br />
<br />
== Articles and news ==<br />
<br />
* [https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com The "fsyncgate 2018" mailing list thread] <br />
* [https://lwn.net/Articles/752063/ LWN.net article "PostgreSQL's fsync() surprise"]<br />
* [https://lwn.net/Articles/724307/ LWN.net article "Improved block-layer error handling"]<br />
* [https://www.usenix.org/system/files/atc20-rebello.pdf Can Applications Recover from fsync Failures?] - a USENIX 2020 paper discussing some of these topics<br />
<br />
== Research notes and OS differences ==<br />
<br />
Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? PostgreSQL believes that a successful call to fsync() means that *all* data for a file is on disk, as part of its checkpointing protocol. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss. .<br />
<br />
If you see a mistake or know something I don't, please update this document with supporting references, or ping thomas.munro@gmail.com!<br />
<br />
=== Open source kernels ===<br />
<br />
* Darwin/macOS: [https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695 buffers are invalidated], code similar to NetBSD<br />
* DragonflyBSD: not analysed -- the source of [https://github.com/DragonFlyBSD/DragonFlyBSD/blob/00639b5df9c853cf6136257b6c6db6739c3ba189/sys/kern/vfs_bio.c#L1278 brelse] might tell us<br />
* FreeBSD: [https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2639 buffers remain dirty] (and from version 11.1 on, [https://reviews.freebsd.org/rS316941 they are dropped on failure after the device goes away]) so future fsync() calls will try again and presumably fail; [https://www.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk recent testing report], [https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html 10 year old testing report] [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 commit from over 20 years ago fixing the issue]<br />
* Illumos: [https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441 writes are retried], at least in the case of asynchronous write-back. Not yet clear to me whether failure provoked by a synchronous fsync() call leaves buffers valid and dirty.<br />
* Linux < 4.13: [https://lwn.net/Articles/724307/ fsync() errors can be lost in various ways]; also buffers are marked clean after errors, so retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure<br />
* Linux 4.13 and 4.15: [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750 fsync() only reports writeback errors that occurred after you called open()] so our schemes for closing and opening files LRU-style and handing fsync() work off to the checkpointer process can hide write-back errors; also buffers are marked clean after errors so even if you opened the file before the failure, retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure.<br />
* Linux 4.14 and Linux >= 4.16 [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 write-back error counter is initialised differently] so that somebody gets the inode's first error even if the file was closed and opened in between, but you still only get the error once (so retrying fsync() is not OK) and the error can be forgotten if the inode falls out of the inode cache (unlikely since all file descriptors referencing the inode must be closed first, and close calls fsync); buffers are still thrown away (either immediately or on memory pressure, depending on choice of fs) so you might read back an older version of the page than you most recently wrote<br />
* NetBSD: [http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html buffers are invalidated] [https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059 here] so future fsync() calls may return success despite data loss; there may also be other problems according to a [http://gnats.netbsd.org/53152 netbsd.org bug report] that was triggered by our discussion<br />
* OpenBSD: [https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867 buffers are invalidated], code similar to NetBSD; [https://marc.info/?l=openbsd-tech&m=152357572903197&w=2 OpenBSD hackers pinged for comment] [https://marc.info/?l=openbsd-tech&m=154850737529764&w=2 new OpenBSD hackers thread]; UPDATE: [https://marc.info/?l=openbsd-cvs&m=155044187426287&w=2 a recent commit changed] the behaviour, analysis needed; [https://man.openbsd.org/fsync.2 man page updated] to say "To guard against potential inconsistency, future calls will continue failing until all references to the file are closed.", which is good as long as someone holds the file open, but that isn't guaranteed in PostgreSQL (it probably should be)<br />
<br />
=== Closed source kernels ===<br />
<br />
* AIX: unknown<br />
* HPUX: unknown<br />
* Solaris: maybe the same as Illumos, but there was apparently a [https://lists.freebsd.org/pipermail/freebsd-hackers/2016-July/049665.html great VM allocator rewrite] after Solaris reverted to closed source <br />
* Windows: unknown<br />
<br />
=== Special cases ===<br />
<br />
Note that ZFS is likely to be a special case even on Linux, because it doesn't use the regular page cache and has special handling for failures. More information needed.<br />
<br />
=== History and notes ===<br />
<br />
Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their [https://github.com/weiss/original-bsd/blob/master/sys/kern/vfs_bio.c#L377 common ancestor], but FreeBSD [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 changed it in 1999] and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX was [https://github.com/dspinellis/unix-history-repo/blob/Bell-32V-Snapshot-Development/usr/src/sys/sys/bio.c#L196 conceptually had the same problem] but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), developed around the same time and same place as POSTGRES (1986), and said in its man page it for making transactional facilities. Also fsync(1) appeared in FreeBSD 4.3 (2001), a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Fsync_Errors&diff=35355Fsync Errors2020-09-22T02:31:45Z<p>Ringerc: It's not markdown. Fix list formatting.</p>
<hr />
<div>This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. It has sometimes been referred to as "fsyncgate 2018".<br />
<br />
== Current status ==<br />
<br />
As of [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1 this PostgreSQL 12 commit], PostgreSQL will now PANIC on fsync() failure. It was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.<br />
<br />
Linux kernel 4.13 improved <code>fsync()</code> error handling and the [https://linux.die.net/man/2/fsync man page for <code>fsync()</code> is somewhat improved] as well. See:<br />
<br />
* [https://kernelnewbies.org/Linux_4.13#Improved_block_layer_and_background_writes_error_handling Kernelnewbies for 4.13]<br />
* Particularly significant 4.13 commits include:<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5660e13d2fd6af1903d4b0b98020af95ca2d638a "fs: new infrastructure for writeback error handling and reporting"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6acec592c6bc9a4c3136e46430e14767b07f9f1a "ext4: use errseq_t based error handling for reporting data writeback errors"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=acbf3c3452c3729829fdb0e5a52fed3cce556eb2 "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"]<br />
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed1e46aaf1bec6a12f4c89637f2c3ef4c70f18e "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"]<br />
<br />
Many thanks to Jeff Layton for work done in this area.<br />
<br />
Similar changes were made in [https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a#diff-28ed40e7ccec6683ebd46da2ca82d01c InnoDB/MySQL], [https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96 WiredTiger/MongoDB] and no doubt other software as a result of the PR around this.<br />
<br />
== Articles and news ==<br />
<br />
* [https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com The "fsyncgate 2018" mailing list thread] <br />
* [https://lwn.net/Articles/752063/ LWN.net article "PostgreSQL's fsync() surprise"]<br />
* [https://lwn.net/Articles/724307/ LWN.net article "Improved block-layer error handling"]<br />
* [https://www.usenix.org/system/files/atc20-rebello.pdf Can Applications Recover from fsync Failures?] - a USENIX 2020 paper discussing some of these topics<br />
<br />
== Research notes and OS differences ==Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? PostgreSQL believes that a successful call to fsync() means that *all* data for a file is on disk, as part of its checkpointing protocol. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss. .<br />
<br />
If you see a mistake or know something I don't, please update this document with supporting references, or ping thomas.munro@gmail.com!<br />
<br />
Open source kernels:<br />
<br />
* Darwin/macOS: [https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695 buffers are invalidated], code similar to NetBSD<br />
* DragonflyBSD: not analysed -- the source of [https://github.com/DragonFlyBSD/DragonFlyBSD/blob/00639b5df9c853cf6136257b6c6db6739c3ba189/sys/kern/vfs_bio.c#L1278 brelse] might tell us<br />
* FreeBSD: [https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2639 buffers remain dirty] (and from version 11.1 on, [https://reviews.freebsd.org/rS316941 they are dropped on failure after the device goes away]) so future fsync() calls will try again and presumably fail; [https://www.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk recent testing report], [https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html 10 year old testing report] [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 commit from over 20 years ago fixing the issue]<br />
* Illumos: [https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441 writes are retried], at least in the case of asynchronous write-back. Not yet clear to me whether failure provoked by a synchronous fsync() call leaves buffers valid and dirty.<br />
* Linux < 4.13: [https://lwn.net/Articles/724307/ fsync() errors can be lost in various ways]; also buffers are marked clean after errors, so retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure<br />
* Linux 4.13 and 4.15: [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750 fsync() only reports writeback errors that occurred after you called open()] so our schemes for closing and opening files LRU-style and handing fsync() work off to the checkpointer process can hide write-back errors; also buffers are marked clean after errors so even if you opened the file before the failure, retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure.<br />
* Linux 4.14 and Linux >= 4.16 [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 write-back error counter is initialised differently] so that somebody gets the inode's first error even if the file was closed and opened in between, but you still only get the error once (so retrying fsync() is not OK) and the error can be forgotten if the inode falls out of the inode cache (unlikely since all file descriptors referencing the inode must be closed first, and close calls fsync); buffers are still thrown away (either immediately or on memory pressure, depending on choice of fs) so you might read back an older version of the page than you most recently wrote<br />
* NetBSD: [http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html buffers are invalidated] [https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059 here] so future fsync() calls may return success despite data loss; there may also be other problems according to a [http://gnats.netbsd.org/53152 netbsd.org bug report] that was triggered by our discussion<br />
* OpenBSD: [https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867 buffers are invalidated], code similar to NetBSD; [https://marc.info/?l=openbsd-tech&m=152357572903197&w=2 OpenBSD hackers pinged for comment] [https://marc.info/?l=openbsd-tech&m=154850737529764&w=2 new OpenBSD hackers thread]; UPDATE: [https://marc.info/?l=openbsd-cvs&m=155044187426287&w=2 a recent commit changed] the behaviour, analysis needed; [https://man.openbsd.org/fsync.2 man page updated] to say "To guard against potential inconsistency, future calls will continue failing until all references to the file are closed.", which is good as long as someone holds the file open, but that isn't guaranteed in PostgreSQL (it probably should be)<br />
<br />
Closed source kernels:<br />
<br />
* AIX: unknown<br />
* HPUX: unknown<br />
* Solaris: maybe the same as Illumos, but there was apparently a [https://lists.freebsd.org/pipermail/freebsd-hackers/2016-July/049665.html great VM allocator rewrite] after Solaris reverted to closed source <br />
* Windows: unknown<br />
<br />
Note that ZFS is likely to be a special case even on Linux, because it doesn't use the regular page cache and has special handling for failures. More information needed.<br />
<br />
Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their [https://github.com/weiss/original-bsd/blob/master/sys/kern/vfs_bio.c#L377 common ancestor], but FreeBSD [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 changed it in 1999] and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX was [https://github.com/dspinellis/unix-history-repo/blob/Bell-32V-Snapshot-Development/usr/src/sys/sys/bio.c#L196 conceptually had the same problem] but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), developed around the same time and same place as POSTGRES (1986), and said in its man page it for making transactional facilities. Also fsync(1) appeared in FreeBSD 4.3 (2001), a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.<br />
<br />
== Relevant PostgreSQL commits ==</div>Ringerchttps://wiki.postgresql.org/index.php?title=Fsync_Errors&diff=35354Fsync Errors2020-09-22T02:30:53Z<p>Ringerc: Link formatting fix</p>
<hr />
<div>This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. It has sometimes been referred to as "fsyncgate 2018".<br />
<br />
== Current status ==<br />
<br />
As of [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1 this PostgreSQL 12 commit], PostgreSQL will now PANIC on fsync() failure. It was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.<br />
<br />
Linux kernel 4.13 improved <code>fsync()</code> error handling and the [https://linux.die.net/man/2/fsync man page for <code>fsync()</code> is somewhat improved] as well. See:<br />
<br />
* [https://kernelnewbies.org/Linux_4.13#Improved_block_layer_and_background_writes_error_handling Kernelnewbies for 4.13]<br />
* Particularly significant 4.13 commits include:<br />
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5660e13d2fd6af1903d4b0b98020af95ca2d638a "fs: new infrastructure for writeback error handling and reporting"]<br />
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6acec592c6bc9a4c3136e46430e14767b07f9f1a "ext4: use errseq_t based error handling for reporting data writeback errors"]<br />
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=acbf3c3452c3729829fdb0e5a52fed3cce556eb2 "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"]<br />
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed1e46aaf1bec6a12f4c89637f2c3ef4c70f18e "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"]<br />
<br />
Many thanks to Jeff Layton for work done in this area.<br />
<br />
Similar changes were made in [https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a#diff-28ed40e7ccec6683ebd46da2ca82d01c InnoDB/MySQL], [https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96 WiredTiger/MongoDB] and no doubt other software as a result of the PR around this.<br />
<br />
== Articles and news ==<br />
<br />
* [https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com The "fsyncgate 2018" mailing list thread] <br />
* [https://lwn.net/Articles/752063/ LWN.net article "PostgreSQL's fsync() surprise"]<br />
* [https://lwn.net/Articles/724307/ LWN.net article "Improved block-layer error handling"]<br />
* [https://www.usenix.org/system/files/atc20-rebello.pdf Can Applications Recover from fsync Failures?] - a USENIX 2020 paper discussing some of these topics<br />
<br />
== Research notes and OS differences ==Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? PostgreSQL believes that a successful call to fsync() means that *all* data for a file is on disk, as part of its checkpointing protocol. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss. .<br />
<br />
If you see a mistake or know something I don't, please update this document with supporting references, or ping thomas.munro@gmail.com!<br />
<br />
Open source kernels:<br />
<br />
* Darwin/macOS: [https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695 buffers are invalidated], code similar to NetBSD<br />
* DragonflyBSD: not analysed -- the source of [https://github.com/DragonFlyBSD/DragonFlyBSD/blob/00639b5df9c853cf6136257b6c6db6739c3ba189/sys/kern/vfs_bio.c#L1278 brelse] might tell us<br />
* FreeBSD: [https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2639 buffers remain dirty] (and from version 11.1 on, [https://reviews.freebsd.org/rS316941 they are dropped on failure after the device goes away]) so future fsync() calls will try again and presumably fail; [https://www.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk recent testing report], [https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html 10 year old testing report] [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 commit from over 20 years ago fixing the issue]<br />
* Illumos: [https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441 writes are retried], at least in the case of asynchronous write-back. Not yet clear to me whether failure provoked by a synchronous fsync() call leaves buffers valid and dirty.<br />
* Linux < 4.13: [https://lwn.net/Articles/724307/ fsync() errors can be lost in various ways]; also buffers are marked clean after errors, so retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure<br />
* Linux 4.13 and 4.15: [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750 fsync() only reports writeback errors that occurred after you called open()] so our schemes for closing and opening files LRU-style and handing fsync() work off to the checkpointer process can hide write-back errors; also buffers are marked clean after errors so even if you opened the file before the failure, retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure.<br />
* Linux 4.14 and Linux >= 4.16 [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 write-back error counter is initialised differently] so that somebody gets the inode's first error even if the file was closed and opened in between, but you still only get the error once (so retrying fsync() is not OK) and the error can be forgotten if the inode falls out of the inode cache (unlikely since all file descriptors referencing the inode must be closed first, and close calls fsync); buffers are still thrown away (either immediately or on memory pressure, depending on choice of fs) so you might read back an older version of the page than you most recently wrote<br />
* NetBSD: [http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html buffers are invalidated] [https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059 here] so future fsync() calls may return success despite data loss; there may also be other problems according to a [http://gnats.netbsd.org/53152 netbsd.org bug report] that was triggered by our discussion<br />
* OpenBSD: [https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867 buffers are invalidated], code similar to NetBSD; [https://marc.info/?l=openbsd-tech&m=152357572903197&w=2 OpenBSD hackers pinged for comment] [https://marc.info/?l=openbsd-tech&m=154850737529764&w=2 new OpenBSD hackers thread]; UPDATE: [https://marc.info/?l=openbsd-cvs&m=155044187426287&w=2 a recent commit changed] the behaviour, analysis needed; [https://man.openbsd.org/fsync.2 man page updated] to say "To guard against potential inconsistency, future calls will continue failing until all references to the file are closed.", which is good as long as someone holds the file open, but that isn't guaranteed in PostgreSQL (it probably should be)<br />
<br />
Closed source kernels:<br />
<br />
* AIX: unknown<br />
* HPUX: unknown<br />
* Solaris: maybe the same as Illumos, but there was apparently a [https://lists.freebsd.org/pipermail/freebsd-hackers/2016-July/049665.html great VM allocator rewrite] after Solaris reverted to closed source <br />
* Windows: unknown<br />
<br />
Note that ZFS is likely to be a special case even on Linux, because it doesn't use the regular page cache and has special handling for failures. More information needed.<br />
<br />
Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their [https://github.com/weiss/original-bsd/blob/master/sys/kern/vfs_bio.c#L377 common ancestor], but FreeBSD [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 changed it in 1999] and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX was [https://github.com/dspinellis/unix-history-repo/blob/Bell-32V-Snapshot-Development/usr/src/sys/sys/bio.c#L196 conceptually had the same problem] but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), developed around the same time and same place as POSTGRES (1986), and said in its man page it for making transactional facilities. Also fsync(1) appeared in FreeBSD 4.3 (2001), a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.<br />
<br />
== Relevant PostgreSQL commits ==</div>Ringerchttps://wiki.postgresql.org/index.php?title=Fsync_Errors&diff=35353Fsync Errors2020-09-22T02:30:14Z<p>Ringerc: Update 2018 fsync page to reflect current status</p>
<hr />
<div>This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. It has sometimes been referred to as "fsyncgate 2018".<br />
<br />
== Current status ==<br />
<br />
As of [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1 this PostgreSQL 12 commit], PostgreSQL will now PANIC on fsync() failure. It was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.<br />
<br />
[Linux kernel 4.13 improved <code>fsync()</code> error handling]() and the [man page for <code>fsync()</code> is somewhat improved](https://linux.die.net/man/2/fsync) as well. See:<br />
<br />
* [https://kernelnewbies.org/Linux_4.13#Improved_block_layer_and_background_writes_error_handling Kernelnewbies for 4.13]<br />
* Particularly significant 4.13 commits include:<br />
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5660e13d2fd6af1903d4b0b98020af95ca2d638a "fs: new infrastructure for writeback error handling and reporting"]<br />
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6acec592c6bc9a4c3136e46430e14767b07f9f1a "ext4: use errseq_t based error handling for reporting data writeback errors"]<br />
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=acbf3c3452c3729829fdb0e5a52fed3cce556eb2 "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"]<br />
* [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed1e46aaf1bec6a12f4c89637f2c3ef4c70f18e "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"]<br />
<br />
Many thanks to Jeff Layton for work done in this area.<br />
<br />
Similar changes were made in [https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a#diff-28ed40e7ccec6683ebd46da2ca82d01c InnoDB/MySQL], [https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96 WiredTiger/MongoDB] and no doubt other software as a result of the PR around this.<br />
<br />
== Articles and news ==<br />
<br />
* [https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com The "fsyncgate 2018" mailing list thread] <br />
* [https://lwn.net/Articles/752063/ LWN.net article "PostgreSQL's fsync() surprise"]<br />
* [https://lwn.net/Articles/724307/ LWN.net article "Improved block-layer error handling"]<br />
* [https://www.usenix.org/system/files/atc20-rebello.pdf Can Applications Recover from fsync Failures?] - a USENIX 2020 paper discussing some of these topics<br />
<br />
== Research notes and OS differences ==Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).<br />
<br />
What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? PostgreSQL believes that a successful call to fsync() means that *all* data for a file is on disk, as part of its checkpointing protocol. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss. .<br />
<br />
If you see a mistake or know something I don't, please update this document with supporting references, or ping thomas.munro@gmail.com!<br />
<br />
Open source kernels:<br />
<br />
* Darwin/macOS: [https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695 buffers are invalidated], code similar to NetBSD<br />
* DragonflyBSD: not analysed -- the source of [https://github.com/DragonFlyBSD/DragonFlyBSD/blob/00639b5df9c853cf6136257b6c6db6739c3ba189/sys/kern/vfs_bio.c#L1278 brelse] might tell us<br />
* FreeBSD: [https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2639 buffers remain dirty] (and from version 11.1 on, [https://reviews.freebsd.org/rS316941 they are dropped on failure after the device goes away]) so future fsync() calls will try again and presumably fail; [https://www.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk recent testing report], [https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html 10 year old testing report] [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 commit from over 20 years ago fixing the issue]<br />
* Illumos: [https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441 writes are retried], at least in the case of asynchronous write-back. Not yet clear to me whether failure provoked by a synchronous fsync() call leaves buffers valid and dirty.<br />
* Linux < 4.13: [https://lwn.net/Articles/724307/ fsync() errors can be lost in various ways]; also buffers are marked clean after errors, so retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure<br />
* Linux 4.13 and 4.15: [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750 fsync() only reports writeback errors that occurred after you called open()] so our schemes for closing and opening files LRU-style and handing fsync() work off to the checkpointer process can hide write-back errors; also buffers are marked clean after errors so even if you opened the file before the failure, retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure.<br />
* Linux 4.14 and Linux >= 4.16 [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 write-back error counter is initialised differently] so that somebody gets the inode's first error even if the file was closed and opened in between, but you still only get the error once (so retrying fsync() is not OK) and the error can be forgotten if the inode falls out of the inode cache (unlikely since all file descriptors referencing the inode must be closed first, and close calls fsync); buffers are still thrown away (either immediately or on memory pressure, depending on choice of fs) so you might read back an older version of the page than you most recently wrote<br />
* NetBSD: [http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html buffers are invalidated] [https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059 here] so future fsync() calls may return success despite data loss; there may also be other problems according to a [http://gnats.netbsd.org/53152 netbsd.org bug report] that was triggered by our discussion<br />
* OpenBSD: [https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867 buffers are invalidated], code similar to NetBSD; [https://marc.info/?l=openbsd-tech&m=152357572903197&w=2 OpenBSD hackers pinged for comment] [https://marc.info/?l=openbsd-tech&m=154850737529764&w=2 new OpenBSD hackers thread]; UPDATE: [https://marc.info/?l=openbsd-cvs&m=155044187426287&w=2 a recent commit changed] the behaviour, analysis needed; [https://man.openbsd.org/fsync.2 man page updated] to say "To guard against potential inconsistency, future calls will continue failing until all references to the file are closed.", which is good as long as someone holds the file open, but that isn't guaranteed in PostgreSQL (it probably should be)<br />
<br />
Closed source kernels:<br />
<br />
* AIX: unknown<br />
* HPUX: unknown<br />
* Solaris: maybe the same as Illumos, but there was apparently a [https://lists.freebsd.org/pipermail/freebsd-hackers/2016-July/049665.html great VM allocator rewrite] after Solaris reverted to closed source <br />
* Windows: unknown<br />
<br />
Note that ZFS is likely to be a special case even on Linux, because it doesn't use the regular page cache and has special handling for failures. More information needed.<br />
<br />
Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their [https://github.com/weiss/original-bsd/blob/master/sys/kern/vfs_bio.c#L377 common ancestor], but FreeBSD [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 changed it in 1999] and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX was [https://github.com/dspinellis/unix-history-repo/blob/Bell-32V-Snapshot-Development/usr/src/sys/sys/bio.c#L196 conceptually had the same problem] but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), developed around the same time and same place as POSTGRES (1986), and said in its man page it for making transactional facilities. Also fsync(1) appeared in FreeBSD 4.3 (2001), a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.<br />
<br />
== Relevant PostgreSQL commits ==</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33928Todo:HooksAndTracePoints2019-08-09T05:10:37Z<p>Ringerc: /* Logical rep related trace events (perf/dtrace/systemtap etc) */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Existing APIs usable from extensions ==<br />
<br />
There are a great many existing extension points in PostgreSQL. The article [[PostgresServerExtensionPoints]] lists them with references to core documentation, entrypoints in core code, etc.<br />
<br />
== TODO: New hooks, callbacks and tracepoints ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding etc ===<br />
<br />
'''CR'''<br />
<br />
Changes to improve visibility and management for logical decoding, reorder buffering, historic snapshot management and output plugin API.<br />
<br />
The reorderbuffer's resource use is totally opaque to monitoring right now. There's no useful way to know much about how it's using memory and all that can really be known about transactions spilled to disk is their xid.<br />
<br />
It's also nearly impossible to know much about what reorder buffering and logical decoding are working on right now without the use of gdb or something like perf dynamic userspace probes. That's not always practical in production environments.<br />
<br />
Suggestions:<br />
<br />
==== Logical decoding and reorder buffering stats in '''struct WalSnd''' ====<br />
<br />
Add some basic running accounting of reorder buffer stats to '''struct WalSnd''' per the following sample:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
/* Statistics for total reorder buffered txns */<br />
int32 reorderBufferedTxns;<br />
int32 reorderBufferedSnapshots;<br />
int64 reorderBufferedEventCount;<br />
int64 reorderBufferedBytes;<br />
<br />
/* Statistics for transactions spilled to disk. */<br />
int32 spillTxns;<br />
int32 spillSnapshots;<br />
int64 spillEventCount;<br />
int64 spillBytes;<br />
<br />
/*<br />
* When in ReorderBufferCommit for a txn, basic info about<br />
* the txn being processed.<br />
* <br />
* We already report the progress<br />
* lsn as the sent lsn, but it can't go backwards so we expose<br />
* the txn-specific lsn here too. And the oldest lsn relevant<br />
* to the txn is also worth knowing to give an indication of<br />
* xact duration and to compare to restart_lsn.<br />
*/<br />
TransactionId reorderBufferCommitXid;<br />
XLogRecPtr reorderBufferCommitRecEndLSN;<br />
TimestampTz reorderBufferCommitTimestamp;<br />
XLogRecPtr reorderBufferCommitXactBeginLSN;<br />
XLogRecPtr reorderBufferCommitSentRecLSN;<br />
<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Reorder buffer inspection functions ====<br />
<br />
Also (maybe) add reorderbuffer API functions to let output plugins and other tools inspect reorder buffer state within a walsender in finer detail.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
These would be fairly simple to add given that we'd already need to add a running memory accounting counter and entry counter to each ReorderBufferTXN:<br />
<br />
* '''List *ReorderBufferGetTXNs(ReorderBuffer *rb)''' or something like it to list all toplevel xids for which there are reorder buffers. Maybe just expose an iterator over '''ReorderBuffer.toplevel_by_lsn''' to avoid lots of copies?<br />
* '''void ReorderBufferTXNGetSize(ReorderBuffer *rb, ReorderBufferTXN *txn, size_t *inmemory, size_t *ondisk, uint64 *allchangecount, uint64 *rowchangecount, bool *has_catalog_changes)''' - get stats on one reorder buffered top-level txn.<br />
<br />
These would be very useful for output plugins that wish to offer some insight into xact progress, plugins that do their own spooling of processed txns, etc. They'd also be great when debugging the server.<br />
<br />
</div><br />
<br />
==== Logical rep related trace events (perf/dtrace/systemtap etc) ====<br />
<br />
Add a bunch of '''TRACE_POSTGRESQL_''' trace events for perf/dtrace/systemtap/etc for the following activities within postgres. <br />
Proposed events list follows.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
Statically defined trace events are 'very' cheap - effectively free when not in use. We already have them in some extremely hot paths in PostgreSQL like the '''BUFFER_READ''' events and the '''LWLOCK_ACQUIRE''' event. They offer a huge insight into the system's operation. Placing static events instead of relying on dynamic probes:<br />
<br />
* gives insight into production servers where debuginfo may not be present<br />
* lets us expose more useful arguments<br />
* serves to document points of interest and make them discoverable<br />
* works across server versions better since they're more stable and consistent<br />
* frees the user from having to find relevant function names and args<br />
* ... and they can be used in gdb too<br />
<br />
Events proposed:<br />
<br />
''walsender:''<br />
<br />
* walsender started<br />
* walsender sleeping<br />
** waiting for more WAL to be flushed, client activity or timeout<br />
** waiting for socket to be writeable<br />
* walsender woken (perhaps make this a postgres wide event on wait start and wait finish with reason like latch set)<br />
** tracepoint argument for how long it slept for?<br />
* walsender send buffer flushed (bytes_sent, bytes_left)<br />
* walsender sent keepalive request (lsns)<br />
* walsender got keepalive reply (lsns)<br />
* walsender sent replication data message (size)<br />
* walsender signalled<br />
* walsender state change<br />
* walsender exiting<br />
<br />
''xlogreader:''<br />
<br />
* xlogreader switched to a new segment<br />
* xlogreader fetched new page<br />
* xlogreader returned a record<br />
<br />
logical decoding:<br />
<br />
* decoding context created<br />
* decoding for new slot creation started<br />
* decoding for new slot creation finished, slot ready<br />
* logical decoding processed any record from any rmgr (start_lsn, end_lsn)<br />
* logical trace events for each rmgr and record-type<br />
* logical decoding end of txn<br />
<br />
snapbuild:<br />
<br />
* snapbuild state change (newstate)<br />
* snapbuild build snapshot<br />
* snapbuild free snapshot<br />
* snapbuild discard snapshot<br />
* serialized snapshot to disk<br />
* deserialized snapshot from disk<br />
* snapbuild export full data snapshot<br />
<br />
''Reorder buffering:''<br />
<br />
* reorder buffer created for newly seen xid (xid)<br />
* detected toplevel xid has catalog changes (rbtxn, xid)<br />
* add event to reorder buffer<br />
** All traces have (rbtxn, xid, lsn, event_kind, event_size)<br />
** change event traces also report affected relfilenode<br />
* discarded reorder buffer (rbtxn, xid)<br />
* started to spill reorder buffer to disk (rbtxn, xid)<br />
* finished spilling reorder buffer to disk (rbtxn, xid, byte_size, event_count)<br />
* discarded spilled reorder buffer (rbtxn, xid)<br />
<br />
''output plugins:''<br />
<br />
* before-ReorderBufferCommit (rbtxn, xid, commit_lsn, from_disk, has_catalog_changes)<br />
* before and after all output plugin callbacks<br />
* output plugin wrote data (size in bytes)<br />
<br />
Also add functions output plugins may call to report trace events based on what they do with rows in their callbacks, in particular one to report if the plugin discarded skipped over (discarded) a change.<br />
<br />
</div><br />
<br />
==== Logical decoding output plugin reorder buffer event filter callback ====<br />
<br />
Logical decoding output plugin callback to filter row-change events as they are decoded, before they are added to the reorder buffer.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
This one is trickier than it looks because when we're doing logical decoding we don't generally have an active historic snapshot. We only have a reliable snapshot during '''ReorderBufferCommit''' processing once the transaction commit has been seen and the already buffered changes are being passed from the reorder buffer to the output plugin's callbacks.<br />
<br />
The problem with that is that the output plugin may have no interest in potentially large amounts of the data being reorder-buffered, so PostgreSQL does a lot of work and disk writes decoding and buffering the data that the output plugin will throw away as soon as it sees it.<br />
<br />
The reorder buffer can already skip over changes that are from a different database. There's also a filter callback output plugins can use to skip reorder buffering of whole transactions based on their replication origin ID which helps replication systems not reorder-buffer transactions that were replicated from peer nodes.<br />
<br />
But plugins have 'no way to filter the data going into the reorder buffer by table or key.' All data for all tables in a non-excluded transaction is always reorder-buffered in full.<br />
<br />
That's a big problem for a few use cases including:<br />
<br />
* Replication slots that are only interested in one specific table, e.g. during a resynchronization operation<br />
* Workloads with very high churn queue tables etc that are not replicated but coexist in a database with data that must be replicated<br />
<br />
</div><br />
<br />
== TODO: New kinds of extension point ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend as described below. described below.<br />
<br />
=== Cache management and cache invalidation ===<br />
<br />
PostgreSQL has a solid cache management system in the form of its relcache and catcache. See '''utils/relcache.h''', '''utils/catcache.h''' and '''utils/inval.h'''.<br />
<br />
Complex extensions often need to perform their own caching and invalidation on various extension-defined state, such as extension configuration read from tables. At time of writing there is no way for extensions to register their own cache types and use PostgreSQL's cache management for this, so they must implement their own cache management and invalidations - usually on top of a dynahash ('''utils/dynahash.h''').<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc. They'd also be available to the user for monitoring in '''pg_locks'''.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Deadlock detection ===<br />
<br />
Extensions that make PostgreSQL instances participate in distributed systems can suffer from distributed deadlocks. PostgreSQL's existing deadlock detector could help with this if it was extensible with ways to discover new process dependency edges.<br />
<br />
Alternately, if distributed locks could be represented as PostgreSQL heavyweight locks they'd be visible in '''pg_locks''' for monitoring and the deadlock detector could possibly handle them with its existing capabilities.<br />
<br />
=== Transaction log, transaction visibility and commit ===<br />
<br />
Some kinds of distributed database systems need a distributed transaction log. <br />
<br />
Right now the PostgreSQL transaction log a.k.a. commit log ('''access/clog.h''') isn't at all extensible and is backed by a SLRU ('''access/slru.h''') on disk.<br />
<br />
There have been discussions on -hackers about adding a transaction manager abstraction but none have led to any pluggable transaction management being committed.<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== Invoking extension code for existing '''TRACE_POSTGRESQL_''' tracepoints ===<br />
<br />
Currently PosgreSQL defines '''TRACE_POSTGRESQL_''' tracepoints as thin wrappers around DTrace (see below).<br />
<br />
It'd be very useful if it were possible for extensions to intercept these and run their own code. The most obvious case is to integrate other tracing systems, particularly things like distributed / full-stack tracing engines such as Zipkin, Jaeger (OpenTracing), OpenCensus, etc.<br />
<br />
This would have to be done with extreme care as there are tracepoints in very hot paths like LWLock acquisition. We could possibly get away with a hook function pointer, but even that might have too much impact.<br />
<br />
=== Extension-defined DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
Give extensions an easy way to add new trace events to their own code, to be exposed to SDT when the extension is loaded. This probably means PGXS support for processing an extension specific '''.d''' file and linking it in + possibly some runtime hint to tell the tracing provider to look for it.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.<br />
<br />
</div></div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33927Todo:HooksAndTracePoints2019-08-09T04:41:50Z<p>Ringerc: /* Logical rep related trace events (perf/dtrace/systemtap etc) */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Existing APIs usable from extensions ==<br />
<br />
There are a great many existing extension points in PostgreSQL. The article [[PostgresServerExtensionPoints]] lists them with references to core documentation, entrypoints in core code, etc.<br />
<br />
== TODO: New hooks, callbacks and tracepoints ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding etc ===<br />
<br />
'''CR'''<br />
<br />
Changes to improve visibility and management for logical decoding, reorder buffering, historic snapshot management and output plugin API.<br />
<br />
The reorderbuffer's resource use is totally opaque to monitoring right now. There's no useful way to know much about how it's using memory and all that can really be known about transactions spilled to disk is their xid.<br />
<br />
It's also nearly impossible to know much about what reorder buffering and logical decoding are working on right now without the use of gdb or something like perf dynamic userspace probes. That's not always practical in production environments.<br />
<br />
Suggestions:<br />
<br />
==== Logical decoding and reorder buffering stats in '''struct WalSnd''' ====<br />
<br />
Add some basic running accounting of reorder buffer stats to '''struct WalSnd''' per the following sample:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
/* Statistics for total reorder buffered txns */<br />
int32 reorderBufferedTxns;<br />
int32 reorderBufferedSnapshots;<br />
int64 reorderBufferedEventCount;<br />
int64 reorderBufferedBytes;<br />
<br />
/* Statistics for transactions spilled to disk. */<br />
int32 spillTxns;<br />
int32 spillSnapshots;<br />
int64 spillEventCount;<br />
int64 spillBytes;<br />
<br />
/*<br />
* When in ReorderBufferCommit for a txn, basic info about<br />
* the txn being processed.<br />
* <br />
* We already report the progress<br />
* lsn as the sent lsn, but it can't go backwards so we expose<br />
* the txn-specific lsn here too. And the oldest lsn relevant<br />
* to the txn is also worth knowing to give an indication of<br />
* xact duration and to compare to restart_lsn.<br />
*/<br />
TransactionId reorderBufferCommitXid;<br />
XLogRecPtr reorderBufferCommitRecEndLSN;<br />
TimestampTz reorderBufferCommitTimestamp;<br />
XLogRecPtr reorderBufferCommitXactBeginLSN;<br />
XLogRecPtr reorderBufferCommitSentRecLSN;<br />
<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Reorder buffer inspection functions ====<br />
<br />
Also (maybe) add reorderbuffer API functions to let output plugins and other tools inspect reorder buffer state within a walsender in finer detail.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
These would be fairly simple to add given that we'd already need to add a running memory accounting counter and entry counter to each ReorderBufferTXN:<br />
<br />
* '''List *ReorderBufferGetTXNs(ReorderBuffer *rb)''' or something like it to list all toplevel xids for which there are reorder buffers. Maybe just expose an iterator over '''ReorderBuffer.toplevel_by_lsn''' to avoid lots of copies?<br />
* '''void ReorderBufferTXNGetSize(ReorderBuffer *rb, ReorderBufferTXN *txn, size_t *inmemory, size_t *ondisk, uint64 *allchangecount, uint64 *rowchangecount, bool *has_catalog_changes)''' - get stats on one reorder buffered top-level txn.<br />
<br />
These would be very useful for output plugins that wish to offer some insight into xact progress, plugins that do their own spooling of processed txns, etc. They'd also be great when debugging the server.<br />
<br />
</div><br />
<br />
==== Logical rep related trace events (perf/dtrace/systemtap etc) ====<br />
<br />
Add a bunch of '''TRACE_POSTGRESQL_''' trace events for perf/dtrace/systemtap/etc for the following activities within postgres.<br />
<br />
Statically defined trace events are *very* cheap, effectively free, when unused and offer a huge insight into the system's operation. Placing static events instead of relying on dynamic probes:<br />
<br />
* gives insight into production servers where debuginfo may not be present<br />
* lets us expose more useful arguments<br />
* serves to document points of interest and make them discoverable<br />
* works across server versions better since they're more stable and consistent<br />
* frees the user from having to find relevant function names and args<br />
* ... and they can be used in gdb too<br />
<br />
Proposed events list follows.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
''walsender:''<br />
<br />
* walsender started<br />
* walsender sleeping<br />
** waiting for more WAL to be flushed, client activity or timeout<br />
** waiting for socket to be writeable<br />
* walsender woken (perhaps make this a postgres wide event on wait start and wait finish with reason like latch set)<br />
** tracepoint argument for how long it slept for?<br />
* walsender send buffer flushed (bytes_sent, bytes_left)<br />
* walsender sent keepalive request (lsns)<br />
* walsender got keepalive reply (lsns)<br />
* walsender sent replication data message (size)<br />
* walsender signalled<br />
* walsender state change<br />
* walsender exiting<br />
<br />
''xlogreader:''<br />
<br />
* xlogreader switched to a new segment<br />
* xlogreader fetched new page<br />
* xlogreader returned a record<br />
<br />
logical decoding:<br />
<br />
* decoding context created<br />
* decoding for new slot creation started<br />
* decoding for new slot creation finished, slot ready<br />
* logical decoding processed any record from any rmgr (start_lsn, end_lsn)<br />
* logical trace events for each rmgr and record-type<br />
* logical decoding end of txn<br />
<br />
snapbuild:<br />
<br />
* snapbuild state change (newstate)<br />
* snapbuild build snapshot<br />
* snapbuild free snapshot<br />
* snapbuild discard snapshot<br />
* serialized snapshot to disk<br />
* deserialized snapshot from disk<br />
* snapbuild export full data snapshot<br />
<br />
''Reorder buffering:''<br />
<br />
* reorder buffer created for newly seen xid (xid)<br />
* detected toplevel xid has catalog changes (rbtxn, xid)<br />
* add event to reorder buffer<br />
** All traces have (rbtxn, xid, lsn, event_kind, event_size)<br />
** change event traces also report affected relfilenode<br />
* discarded reorder buffer (rbtxn, xid)<br />
* started to spill reorder buffer to disk (rbtxn, xid)<br />
* finished spilling reorder buffer to disk (rbtxn, xid, byte_size, event_count)<br />
* discarded spilled reorder buffer (rbtxn, xid)<br />
<br />
''output plugins:''<br />
<br />
* before-ReorderBufferCommit (rbtxn, xid, commit_lsn, from_disk, has_catalog_changes)<br />
* before and after all output plugin callbacks<br />
* output plugin wrote data (size in bytes)<br />
<br />
Also add functions output plugins may call to report trace events based on what they do with rows in their callbacks, in particular one to report if the plugin discarded skipped over (discarded) a change.<br />
<br />
</div><br />
<br />
==== Logical decoding output plugin reorder buffer event filter callback ====<br />
<br />
Logical decoding output plugin callback to filter row-change events as they are decoded, before they are added to the reorder buffer.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
This one is trickier than it looks because when we're doing logical decoding we don't generally have an active historic snapshot. We only have a reliable snapshot during '''ReorderBufferCommit''' processing once the transaction commit has been seen and the already buffered changes are being passed from the reorder buffer to the output plugin's callbacks.<br />
<br />
The problem with that is that the output plugin may have no interest in potentially large amounts of the data being reorder-buffered, so PostgreSQL does a lot of work and disk writes decoding and buffering the data that the output plugin will throw away as soon as it sees it.<br />
<br />
The reorder buffer can already skip over changes that are from a different database. There's also a filter callback output plugins can use to skip reorder buffering of whole transactions based on their replication origin ID which helps replication systems not reorder-buffer transactions that were replicated from peer nodes.<br />
<br />
But plugins have 'no way to filter the data going into the reorder buffer by table or key.' All data for all tables in a non-excluded transaction is always reorder-buffered in full.<br />
<br />
That's a big problem for a few use cases including:<br />
<br />
* Replication slots that are only interested in one specific table, e.g. during a resynchronization operation<br />
* Workloads with very high churn queue tables etc that are not replicated but coexist in a database with data that must be replicated<br />
<br />
</div><br />
<br />
== TODO: New kinds of extension point ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend as described below. described below.<br />
<br />
=== Cache management and cache invalidation ===<br />
<br />
PostgreSQL has a solid cache management system in the form of its relcache and catcache. See '''utils/relcache.h''', '''utils/catcache.h''' and '''utils/inval.h'''.<br />
<br />
Complex extensions often need to perform their own caching and invalidation on various extension-defined state, such as extension configuration read from tables. At time of writing there is no way for extensions to register their own cache types and use PostgreSQL's cache management for this, so they must implement their own cache management and invalidations - usually on top of a dynahash ('''utils/dynahash.h''').<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc. They'd also be available to the user for monitoring in '''pg_locks'''.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Deadlock detection ===<br />
<br />
Extensions that make PostgreSQL instances participate in distributed systems can suffer from distributed deadlocks. PostgreSQL's existing deadlock detector could help with this if it was extensible with ways to discover new process dependency edges.<br />
<br />
Alternately, if distributed locks could be represented as PostgreSQL heavyweight locks they'd be visible in '''pg_locks''' for monitoring and the deadlock detector could possibly handle them with its existing capabilities.<br />
<br />
=== Transaction log, transaction visibility and commit ===<br />
<br />
Some kinds of distributed database systems need a distributed transaction log. <br />
<br />
Right now the PostgreSQL transaction log a.k.a. commit log ('''access/clog.h''') isn't at all extensible and is backed by a SLRU ('''access/slru.h''') on disk.<br />
<br />
There have been discussions on -hackers about adding a transaction manager abstraction but none have led to any pluggable transaction management being committed.<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== Invoking extension code for existing '''TRACE_POSTGRESQL_''' tracepoints ===<br />
<br />
Currently PosgreSQL defines '''TRACE_POSTGRESQL_''' tracepoints as thin wrappers around DTrace (see below).<br />
<br />
It'd be very useful if it were possible for extensions to intercept these and run their own code. The most obvious case is to integrate other tracing systems, particularly things like distributed / full-stack tracing engines such as Zipkin, Jaeger (OpenTracing), OpenCensus, etc.<br />
<br />
This would have to be done with extreme care as there are tracepoints in very hot paths like LWLock acquisition. We could possibly get away with a hook function pointer, but even that might have too much impact.<br />
<br />
=== Extension-defined DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
Give extensions an easy way to add new trace events to their own code, to be exposed to SDT when the extension is loaded. This probably means PGXS support for processing an extension specific '''.d''' file and linking it in + possibly some runtime hint to tell the tracing provider to look for it.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.<br />
<br />
</div></div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33926Todo:HooksAndTracePoints2019-08-09T04:41:19Z<p>Ringerc: /* Logical rep related trace events (perf/dtrace/systemtap etc) */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Existing APIs usable from extensions ==<br />
<br />
There are a great many existing extension points in PostgreSQL. The article [[PostgresServerExtensionPoints]] lists them with references to core documentation, entrypoints in core code, etc.<br />
<br />
== TODO: New hooks, callbacks and tracepoints ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding etc ===<br />
<br />
'''CR'''<br />
<br />
Changes to improve visibility and management for logical decoding, reorder buffering, historic snapshot management and output plugin API.<br />
<br />
The reorderbuffer's resource use is totally opaque to monitoring right now. There's no useful way to know much about how it's using memory and all that can really be known about transactions spilled to disk is their xid.<br />
<br />
It's also nearly impossible to know much about what reorder buffering and logical decoding are working on right now without the use of gdb or something like perf dynamic userspace probes. That's not always practical in production environments.<br />
<br />
Suggestions:<br />
<br />
==== Logical decoding and reorder buffering stats in '''struct WalSnd''' ====<br />
<br />
Add some basic running accounting of reorder buffer stats to '''struct WalSnd''' per the following sample:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
/* Statistics for total reorder buffered txns */<br />
int32 reorderBufferedTxns;<br />
int32 reorderBufferedSnapshots;<br />
int64 reorderBufferedEventCount;<br />
int64 reorderBufferedBytes;<br />
<br />
/* Statistics for transactions spilled to disk. */<br />
int32 spillTxns;<br />
int32 spillSnapshots;<br />
int64 spillEventCount;<br />
int64 spillBytes;<br />
<br />
/*<br />
* When in ReorderBufferCommit for a txn, basic info about<br />
* the txn being processed.<br />
* <br />
* We already report the progress<br />
* lsn as the sent lsn, but it can't go backwards so we expose<br />
* the txn-specific lsn here too. And the oldest lsn relevant<br />
* to the txn is also worth knowing to give an indication of<br />
* xact duration and to compare to restart_lsn.<br />
*/<br />
TransactionId reorderBufferCommitXid;<br />
XLogRecPtr reorderBufferCommitRecEndLSN;<br />
TimestampTz reorderBufferCommitTimestamp;<br />
XLogRecPtr reorderBufferCommitXactBeginLSN;<br />
XLogRecPtr reorderBufferCommitSentRecLSN;<br />
<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Reorder buffer inspection functions ====<br />
<br />
Also (maybe) add reorderbuffer API functions to let output plugins and other tools inspect reorder buffer state within a walsender in finer detail.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
These would be fairly simple to add given that we'd already need to add a running memory accounting counter and entry counter to each ReorderBufferTXN:<br />
<br />
* '''List *ReorderBufferGetTXNs(ReorderBuffer *rb)''' or something like it to list all toplevel xids for which there are reorder buffers. Maybe just expose an iterator over '''ReorderBuffer.toplevel_by_lsn''' to avoid lots of copies?<br />
* '''void ReorderBufferTXNGetSize(ReorderBuffer *rb, ReorderBufferTXN *txn, size_t *inmemory, size_t *ondisk, uint64 *allchangecount, uint64 *rowchangecount, bool *has_catalog_changes)''' - get stats on one reorder buffered top-level txn.<br />
<br />
These would be very useful for output plugins that wish to offer some insight into xact progress, plugins that do their own spooling of processed txns, etc. They'd also be great when debugging the server.<br />
<br />
</div><br />
<br />
==== Logical rep related trace events (perf/dtrace/systemtap etc) ====<br />
<br />
Add a bunch of '''TRACE_POSTGRESQL_''' trace events for perf/dtrace/systemtap/etc for the following activities within postgres.<br />
<br />
Statically defined trace events are *very* cheap, effectively free, when unused and offer a huge insight into the system's operation. Placing static events instead of relying on dynamic probes:<br />
<br />
* gives insight into production servers where debuginfo may not be present<br />
* lets us expose more useful arguments<br />
* serves to document points of interest and make them discoverable<br />
* works across server versions better since they're more stable and consistent<br />
* frees the user from having to find relevant function names and args<br />
* ... and they can be used in gdb too<br />
<br />
Proposed events list follows.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
''walsender:''<br />
<br />
* walsender started<br />
* walsender sleeping<br />
* waiting for more WAL to be flushed, client activity or timeout<br />
* waiting for socket to be writeable<br />
* walsender woken (perhaps make this a postgres wide event on wait start and wait finish with reason like latch set)<br />
* tracepoint argument for how long it slept for?<br />
* walsender send buffer flushed (bytes_sent, bytes_left)<br />
* walsender sent keepalive request (lsns)<br />
* walsender got keepalive reply (lsns)<br />
* walsender sent replication data message (size)<br />
* walsender signalled<br />
* walsender state change<br />
* walsender exiting<br />
<br />
''xlogreader:''<br />
<br />
* xlogreader switched to a new segment<br />
* xlogreader fetched new page<br />
* xlogreader returned a record<br />
<br />
logical decoding:<br />
<br />
* decoding context created<br />
* decoding for new slot creation started<br />
* decoding for new slot creation finished, slot ready<br />
* logical decoding processed any record from any rmgr (start_lsn, end_lsn)<br />
* logical trace events for each rmgr and record-type<br />
* logical decoding end of txn<br />
<br />
snapbuild:<br />
<br />
* snapbuild state change (newstate)<br />
* snapbuild build snapshot<br />
* snapbuild free snapshot<br />
* snapbuild discard snapshot<br />
* serialized snapshot to disk<br />
* deserialized snapshot from disk<br />
* snapbuild export full data snapshot<br />
<br />
''Reorder buffering:''<br />
<br />
* reorder buffer created for newly seen xid (xid)<br />
* detected toplevel xid has catalog changes (rbtxn, xid)<br />
* add event to reorder buffer<br />
* All traces have (rbtxn, xid, lsn, event_kind, event_size)<br />
* change event traces also report affected relfilenode<br />
* discarded reorder buffer (rbtxn, xid)<br />
* started to spill reorder buffer to disk (rbtxn, xid)<br />
* finished spilling reorder buffer to disk (rbtxn, xid, byte_size, event_count)<br />
* discarded spilled reorder buffer (rbtxn, xid)<br />
<br />
''output plugins:''<br />
<br />
* before-ReorderBufferCommit (rbtxn, xid, commit_lsn, from_disk, has_catalog_changes)<br />
* before and after all output plugin callbacks<br />
* output plugin wrote data (size in bytes)<br />
<br />
Also add functions output plugins may call to report trace events based on what they do with rows in their callbacks, in particular one to report if the plugin discarded skipped over (discarded) a change.<br />
<br />
</div><br />
<br />
==== Logical decoding output plugin reorder buffer event filter callback ====<br />
<br />
Logical decoding output plugin callback to filter row-change events as they are decoded, before they are added to the reorder buffer.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
This one is trickier than it looks because when we're doing logical decoding we don't generally have an active historic snapshot. We only have a reliable snapshot during '''ReorderBufferCommit''' processing once the transaction commit has been seen and the already buffered changes are being passed from the reorder buffer to the output plugin's callbacks.<br />
<br />
The problem with that is that the output plugin may have no interest in potentially large amounts of the data being reorder-buffered, so PostgreSQL does a lot of work and disk writes decoding and buffering the data that the output plugin will throw away as soon as it sees it.<br />
<br />
The reorder buffer can already skip over changes that are from a different database. There's also a filter callback output plugins can use to skip reorder buffering of whole transactions based on their replication origin ID which helps replication systems not reorder-buffer transactions that were replicated from peer nodes.<br />
<br />
But plugins have 'no way to filter the data going into the reorder buffer by table or key.' All data for all tables in a non-excluded transaction is always reorder-buffered in full.<br />
<br />
That's a big problem for a few use cases including:<br />
<br />
* Replication slots that are only interested in one specific table, e.g. during a resynchronization operation<br />
* Workloads with very high churn queue tables etc that are not replicated but coexist in a database with data that must be replicated<br />
<br />
</div><br />
<br />
== TODO: New kinds of extension point ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend as described below. described below.<br />
<br />
=== Cache management and cache invalidation ===<br />
<br />
PostgreSQL has a solid cache management system in the form of its relcache and catcache. See '''utils/relcache.h''', '''utils/catcache.h''' and '''utils/inval.h'''.<br />
<br />
Complex extensions often need to perform their own caching and invalidation on various extension-defined state, such as extension configuration read from tables. At time of writing there is no way for extensions to register their own cache types and use PostgreSQL's cache management for this, so they must implement their own cache management and invalidations - usually on top of a dynahash ('''utils/dynahash.h''').<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc. They'd also be available to the user for monitoring in '''pg_locks'''.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Deadlock detection ===<br />
<br />
Extensions that make PostgreSQL instances participate in distributed systems can suffer from distributed deadlocks. PostgreSQL's existing deadlock detector could help with this if it was extensible with ways to discover new process dependency edges.<br />
<br />
Alternately, if distributed locks could be represented as PostgreSQL heavyweight locks they'd be visible in '''pg_locks''' for monitoring and the deadlock detector could possibly handle them with its existing capabilities.<br />
<br />
=== Transaction log, transaction visibility and commit ===<br />
<br />
Some kinds of distributed database systems need a distributed transaction log. <br />
<br />
Right now the PostgreSQL transaction log a.k.a. commit log ('''access/clog.h''') isn't at all extensible and is backed by a SLRU ('''access/slru.h''') on disk.<br />
<br />
There have been discussions on -hackers about adding a transaction manager abstraction but none have led to any pluggable transaction management being committed.<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== Invoking extension code for existing '''TRACE_POSTGRESQL_''' tracepoints ===<br />
<br />
Currently PosgreSQL defines '''TRACE_POSTGRESQL_''' tracepoints as thin wrappers around DTrace (see below).<br />
<br />
It'd be very useful if it were possible for extensions to intercept these and run their own code. The most obvious case is to integrate other tracing systems, particularly things like distributed / full-stack tracing engines such as Zipkin, Jaeger (OpenTracing), OpenCensus, etc.<br />
<br />
This would have to be done with extreme care as there are tracepoints in very hot paths like LWLock acquisition. We could possibly get away with a hook function pointer, but even that might have too much impact.<br />
<br />
=== Extension-defined DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
Give extensions an easy way to add new trace events to their own code, to be exposed to SDT when the extension is loaded. This probably means PGXS support for processing an extension specific '''.d''' file and linking it in + possibly some runtime hint to tell the tracing provider to look for it.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.<br />
<br />
</div></div>Ringerchttps://wiki.postgresql.org/index.php?title=PostgresServerExtensionPoints&diff=33925PostgresServerExtensionPoints2019-08-09T04:37:07Z<p>Ringerc: </p>
<hr />
<div>= PostgreSQL server extension points =<br />
<br />
PostgreSQL is a very extensible and pluggable engine. This article seeks to list, categorize and explain the various ways the server can be extended.<br />
<br />
It covers mainly extension points that are less well documented in the existing official documentation - accordingly it's mainly focused on the extension points usable by 'C language extensions'.<br />
<br />
Most people know about the SQL-level customisation opportunities like custom aggregates so they won't be given much attention here.<br />
<br />
== Core docs ==<br />
<br />
It's assumed that you've already read the core documentation and are thoroughly familiar with most of it, especially the [https://www.postgresql.org/docs/current/extend.html "Extending SQL"] section. Make sure you have reviewed all these chapters:<br />
<br />
* [https://www.postgresql.org/docs/current/extend-extensions.html "Packing related objects into extensions"]<br />
* [https://www.postgresql.org/docs/current/xfunc-c.html "C-language functions"]<br />
* [https://www.postgresql.org/docs/current/extend-pgxs.html "Extension building infrastructure"] (PGXS)<br />
<br />
== SQL-level extensibility ==<br />
<br />
PostgreSQL offers tons of scope for extension without the need to write or compile C code, including:<br />
<br />
* User-defined functions in multiple different languages<br />
* User-defined operators<br />
* User-defined aggregates<br />
* User-defined composite types and domains<br />
* User-defined index access methods<br />
* User-defined data types (except type input and output functions)<br />
<br />
Most of this is very well documented and won't be covered in detail here.<br />
<br />
== C-level extensibility ==<br />
<br />
Can't do it from SQL? Read on.<br />
<br />
=== C Extensions (plugins) ===<br />
<br />
A [https://www.postgresql.org/docs/current/extend-extensions.html PostgreSQL extension] can just be a SQL script with a control file. But for the purposes of this document the extensions of interest are those written in (usually) C. They're compiled to loadable loadable modules - a regular shared library with some PostgreSQL metadata and some conventions for symbols that must have specific type signatures and behaviour if exposed. <br />
<br />
C extensions can use almost all the same API as core PostgreSQL code.<br />
<br />
See '''PG_MODULE_MAGIC()''', [https://www.postgresql.org/docs/current/extend-pgxs.html PGXS], [https://www.postgresql.org/docs/current/xfunc-c.html C language functions], etc.<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
==== Existing callbacks ====<br />
<br />
===== Lifecycle callbacks =====<br />
<br />
Extensions can use postmaster and backend lifecycle callbacks including<br />
<br />
* '''before_shmem_exit'''<br />
* '''on_proc_exit'''<br />
* '''on_shmem_exit'''<br />
<br />
There are also transaction lifecycle callbacks:<br />
<br />
* '''RegisterXactCallback'''<br />
<br />
Cache invalidation callbacks:<br />
<br />
* '''CacheRegisterRelcacheCallback'''<br />
* '''CacheRegisterSyscacheCallback'''<br />
<br />
and many many more.<br />
<br />
Most of these work more like overrideable hooks in that they're generally part of the process-wide state.<br />
<br />
===== errcontext callbacks =====<br />
<br />
Extensions can define their own errcontext callbacks. When log messages ('''elog''' or '''ereport''') are prepared these errcontext callbacks are called to annotate the error message by appending to the '''CONTEXT''' field. <br />
<br />
errcontext callbacks generally follow the call-stack, with new entries pushed onto the errcontext callback stack on entry to a function and popped on exit. The errcontext stack is automatically unwound by PostgreSQL's exception handling macros '''PG_TRY()''' and '''PG_CATCH()''' so there is no need for a '''PG_CATCH()''' to restore the errcontext stack and '''PG_RE_THROW()'''.<br />
<br />
See existing usage in core for examples.<br />
<br />
'''Warning''': failing to pop an errcontext callback can have very confusing results as the context pointer will point to stack that has since been re-used so it will attempt to treat some unpredictable value as a function pointer for the errcontext callback. See [https://www.2ndquadrant.com/en/blog/dev-corner-error-context-stack-corruption/ this blog post for details].<br />
<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Extension of shared memory and IPC primitives ===<br />
<br />
Extensions may use a wide variety of core features relating to shared memory, registering their own:<br />
<br />
* shared memory segments - '''RequestAddinShmemSpace''', '''shmem_startup_hook''' and '''ShmemInitStruct''' in '''storage/shmem.h'''<br />
* lightweight lock tranches (LWLock) - '''LWLockRegisterTranche''' etc in '''storage/lwlock.h'''<br />
* latches - '''storage/latch.h'''<br />
* dynamic shared memory (DSM) - '''storage/dsm.h'''<br />
* dynamic shared memory areas (DSA) - '''utils/dsa.h'''<br />
* shared-memory queues (shm_mq) - '''storage/shm_mq.h'''<br />
* condition variables - '''storage/condition_variable.h'''<br />
<br />
Extensions may use PostgreSQL's process latches too; most of the time they can just use their own '''&MyProc->procLatch''' or set another backend's latch from its '''PGPROC''' entry.<br />
<br />
=== Background workers (bgworkers) ===<br />
<br />
Extensions may register new PostgreSQL backends that exist independently of any client connection.<br />
<br />
A bgworker runs as a child of the postmaster. It usually has full access to a particular database as if it was a user backend started on that DB. bgworkers control their own signal handling, run their own event loop, can do their own file and socket I/O, link to or dynamically load arbitrary C libraries, etc etc. They have broad access to most PostgreSQL internal facilities - they can do low level table manipulation with genam or heapam, they can use the SPI, they can start other bgworkers, etc.<br />
<br />
There are two kinds of bgworker, static and dynamic. Static workers can only be registered at '''_PG_init''' time in '''shared_preload_libraries'''. Dynamic workers can be launched at any time *after* startup completes. New code usually uses dynamic workers launched from a hook on <br />
<br />
Considerable care is needed to get background worker implementations correct. At time of writing they do not have any way to use <br />
<br />
=== Logical decoding output plugins ===<br />
<br />
The walsender and a related SQL-callable set of functions has support for plugins that interpret pre-processed WAL and transform it. This is used for logical replication amongst other things. See [https://www.postgresql.org/docs/current/logicaldecoding.html the documentation on logical decoding].<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
A non-exhaustive list includes:<br />
<br />
==== SQL-callable C functions ====<br />
<br />
==== Data types ====<br />
<br />
==== Security label providers ====<br />
<br />
=== Generic WAL (generic xlog) ===<br />
<br />
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.<br />
<br />
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.<br />
<br />
See '''generic_xlog.h''' and '''generic_xlog.c'''.<br />
<br />
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.<br />
<br />
See '''contrib/bloom.c''' for an index implementation built on top of generic WAL.<br />
<br />
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.<br />
<br />
=== Logical WAL messages ===<br />
<br />
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.<br />
<br />
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.<br />
<br />
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.<br />
<br />
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.<br />
<br />
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.</div>Ringerchttps://wiki.postgresql.org/index.php?title=PostgresServerExtensionPoints&diff=33924PostgresServerExtensionPoints2019-08-09T04:35:37Z<p>Ringerc: Created page with "# PostgreSQL server extension points PostgreSQL is a very extensible and pluggable engine. This article seeks to list, categorize and explain the various ways the server can ..."</p>
<hr />
<div># PostgreSQL server extension points<br />
<br />
PostgreSQL is a very extensible and pluggable engine. This article seeks to list, categorize and explain the various ways the server can be extended.<br />
<br />
It covers mainly extension points that are less well documented in the existing official documentation - accordingly it's mainly focused on the extension points usable by 'C language extensions'.<br />
<br />
Most people know about the SQL-level customisation opportunities like custom aggregates so they won't be given much attention here.<br />
<br />
## Core docs<br />
<br />
It's assumed that you've already read the core documentation and are thoroughly familiar with most of it, especially the [https://www.postgresql.org/docs/current/extend.html "Extending SQL"] section. Make sure you have reviewed all these chapters:<br />
<br />
* [https://www.postgresql.org/docs/current/extend-extensions.html "Packing related objects into extensions"]<br />
* [https://www.postgresql.org/docs/current/xfunc-c.html "C-language functions"]<br />
* [https://www.postgresql.org/docs/current/extend-pgxs.html "Extension building infrastructure"] (PGXS)<br />
<br />
## SQL-level extensibility<br />
<br />
PostgreSQL offers tons of scope for extension without the need to write or compile C code, including:<br />
<br />
* User-defined functions in multiple different languages<br />
* User-defined operators<br />
* User-defined aggregates<br />
* User-defined composite types and domains<br />
* User-defined index access methods<br />
* User-defined data types (except type input and output functions)<br />
<br />
Most of this is very well documented and won't be covered in detail here.<br />
<br />
## C-level extensibility<br />
<br />
<br />
=== C Extensions (plugins) ===<br />
<br />
A [https://www.postgresql.org/docs/current/extend-extensions.html PostgreSQL extension] can just be a SQL script with a control file. But for the purposes of this document the extensions of interest are those written in (usually) C. They're compiled to loadable loadable modules - a regular shared library with some PostgreSQL metadata and some conventions for symbols that must have specific type signatures and behaviour if exposed. <br />
<br />
C extensions can use almost all the same API as core PostgreSQL code.<br />
<br />
See '''PG_MODULE_MAGIC()''', [https://www.postgresql.org/docs/current/extend-pgxs.html PGXS], [https://www.postgresql.org/docs/current/xfunc-c.html C language functions], etc.<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
==== Existing callbacks ====<br />
<br />
===== Lifecycle callbacks =====<br />
<br />
Extensions can use postmaster and backend lifecycle callbacks including<br />
<br />
* '''before_shmem_exit'''<br />
* '''on_proc_exit'''<br />
* '''on_shmem_exit'''<br />
<br />
There are also transaction lifecycle callbacks:<br />
<br />
* '''RegisterXactCallback'''<br />
<br />
Cache invalidation callbacks:<br />
<br />
* '''CacheRegisterRelcacheCallback'''<br />
* '''CacheRegisterSyscacheCallback'''<br />
<br />
and many many more.<br />
<br />
Most of these work more like overrideable hooks in that they're generally part of the process-wide state.<br />
<br />
===== errcontext callbacks =====<br />
<br />
Extensions can define their own errcontext callbacks. When log messages ('''elog''' or '''ereport''') are prepared these errcontext callbacks are called to annotate the error message by appending to the '''CONTEXT''' field. <br />
<br />
errcontext callbacks generally follow the call-stack, with new entries pushed onto the errcontext callback stack on entry to a function and popped on exit. The errcontext stack is automatically unwound by PostgreSQL's exception handling macros '''PG_TRY()''' and '''PG_CATCH()''' so there is no need for a '''PG_CATCH()''' to restore the errcontext stack and '''PG_RE_THROW()'''.<br />
<br />
See existing usage in core for examples.<br />
<br />
'''Warning''': failing to pop an errcontext callback can have very confusing results as the context pointer will point to stack that has since been re-used so it will attempt to treat some unpredictable value as a function pointer for the errcontext callback. See [https://www.2ndquadrant.com/en/blog/dev-corner-error-context-stack-corruption/ this blog post for details].<br />
<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Extension of shared memory and IPC primitives ===<br />
<br />
Extensions may use a wide variety of core features relating to shared memory, registering their own:<br />
<br />
* shared memory segments - '''RequestAddinShmemSpace''', '''shmem_startup_hook''' and '''ShmemInitStruct''' in '''storage/shmem.h'''<br />
* lightweight lock tranches (LWLock) - '''LWLockRegisterTranche''' etc in '''storage/lwlock.h'''<br />
* latches - '''storage/latch.h'''<br />
* dynamic shared memory (DSM) - '''storage/dsm.h'''<br />
* dynamic shared memory areas (DSA) - '''utils/dsa.h'''<br />
* shared-memory queues (shm_mq) - '''storage/shm_mq.h'''<br />
* condition variables - '''storage/condition_variable.h'''<br />
<br />
Extensions may use PostgreSQL's process latches too; most of the time they can just use their own '''&MyProc->procLatch''' or set another backend's latch from its '''PGPROC''' entry.<br />
<br />
=== Background workers (bgworkers) ===<br />
<br />
Extensions may register new PostgreSQL backends that exist independently of any client connection.<br />
<br />
A bgworker runs as a child of the postmaster. It usually has full access to a particular database as if it was a user backend started on that DB. bgworkers control their own signal handling, run their own event loop, can do their own file and socket I/O, link to or dynamically load arbitrary C libraries, etc etc. They have broad access to most PostgreSQL internal facilities - they can do low level table manipulation with genam or heapam, they can use the SPI, they can start other bgworkers, etc.<br />
<br />
There are two kinds of bgworker, static and dynamic. Static workers can only be registered at '''_PG_init''' time in '''shared_preload_libraries'''. Dynamic workers can be launched at any time *after* startup completes. New code usually uses dynamic workers launched from a hook on <br />
<br />
Considerable care is needed to get background worker implementations correct. At time of writing they do not have any way to use <br />
<br />
=== Logical decoding output plugins ===<br />
<br />
The walsender and a related SQL-callable set of functions has support for plugins that interpret pre-processed WAL and transform it. This is used for logical replication amongst other things. See [https://www.postgresql.org/docs/current/logicaldecoding.html the documentation on logical decoding].<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
A non-exhaustive list includes:<br />
<br />
==== SQL-callable C functions ====<br />
<br />
==== Data types ====<br />
<br />
==== Security label providers ====<br />
<br />
=== Generic WAL (generic xlog) ===<br />
<br />
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.<br />
<br />
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.<br />
<br />
See '''generic_xlog.h''' and '''generic_xlog.c'''.<br />
<br />
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.<br />
<br />
See '''contrib/bloom.c''' for an index implementation built on top of generic WAL.<br />
<br />
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.<br />
<br />
=== Logical WAL messages ===<br />
<br />
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.<br />
<br />
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.<br />
<br />
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.<br />
<br />
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.<br />
<br />
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33923Todo:HooksAndTracePoints2019-08-09T04:33:44Z<p>Ringerc: </p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Existing APIs usable from extensions ==<br />
<br />
There are a great many existing extension points in PostgreSQL. The article [[PostgresServerExtensionPoints]] lists them with references to core documentation, entrypoints in core code, etc.<br />
<br />
== TODO: New hooks, callbacks and tracepoints ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding etc ===<br />
<br />
'''CR'''<br />
<br />
Changes to improve visibility and management for logical decoding, reorder buffering, historic snapshot management and output plugin API.<br />
<br />
The reorderbuffer's resource use is totally opaque to monitoring right now. There's no useful way to know much about how it's using memory and all that can really be known about transactions spilled to disk is their xid.<br />
<br />
It's also nearly impossible to know much about what reorder buffering and logical decoding are working on right now without the use of gdb or something like perf dynamic userspace probes. That's not always practical in production environments.<br />
<br />
Suggestions:<br />
<br />
==== Logical decoding and reorder buffering stats in '''struct WalSnd''' ====<br />
<br />
Add some basic running accounting of reorder buffer stats to '''struct WalSnd''' per the following sample:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
/* Statistics for total reorder buffered txns */<br />
int32 reorderBufferedTxns;<br />
int32 reorderBufferedSnapshots;<br />
int64 reorderBufferedEventCount;<br />
int64 reorderBufferedBytes;<br />
<br />
/* Statistics for transactions spilled to disk. */<br />
int32 spillTxns;<br />
int32 spillSnapshots;<br />
int64 spillEventCount;<br />
int64 spillBytes;<br />
<br />
/*<br />
* When in ReorderBufferCommit for a txn, basic info about<br />
* the txn being processed.<br />
* <br />
* We already report the progress<br />
* lsn as the sent lsn, but it can't go backwards so we expose<br />
* the txn-specific lsn here too. And the oldest lsn relevant<br />
* to the txn is also worth knowing to give an indication of<br />
* xact duration and to compare to restart_lsn.<br />
*/<br />
TransactionId reorderBufferCommitXid;<br />
XLogRecPtr reorderBufferCommitRecEndLSN;<br />
TimestampTz reorderBufferCommitTimestamp;<br />
XLogRecPtr reorderBufferCommitXactBeginLSN;<br />
XLogRecPtr reorderBufferCommitSentRecLSN;<br />
<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Reorder buffer inspection functions ====<br />
<br />
Also (maybe) add reorderbuffer API functions to let output plugins and other tools inspect reorder buffer state within a walsender in finer detail.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
These would be fairly simple to add given that we'd already need to add a running memory accounting counter and entry counter to each ReorderBufferTXN:<br />
<br />
* '''List *ReorderBufferGetTXNs(ReorderBuffer *rb)''' or something like it to list all toplevel xids for which there are reorder buffers. Maybe just expose an iterator over '''ReorderBuffer.toplevel_by_lsn''' to avoid lots of copies?<br />
* '''void ReorderBufferTXNGetSize(ReorderBuffer *rb, ReorderBufferTXN *txn, size_t *inmemory, size_t *ondisk, uint64 *allchangecount, uint64 *rowchangecount, bool *has_catalog_changes)''' - get stats on one reorder buffered top-level txn.<br />
<br />
These would be very useful for output plugins that wish to offer some insight into xact progress, plugins that do their own spooling of processed txns, etc. They'd also be great when debugging the server.<br />
<br />
</div><br />
<br />
==== Logical rep related trace events (perf/dtrace/systemtap etc) ====<br />
<br />
Add a bunch of '''TRACE_POSTGRESQL_''' trace events for perf/dtrace/systemtap/etc for the following activities within postgres.<br />
<br />
Statically defined trace events are *very* cheap, effectively free, when unused and offer a huge insight into the system's operation. Placing static events instead of relying on dynamic probes:<br />
<br />
* gives insight into production servers where debuginfo may not be present<br />
* lets us expose more useful arguments<br />
* serves to document points of interest and make them discoverable<br />
* works across server versions better since they're more stable and consistent<br />
* frees the user from having to find relevant function names and args<br />
* ... and they can be used in gdb too<br />
<br />
Proposed events list follows.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
''walsender:''<br />
<br />
* walsender started<br />
* walsender sleeping<br />
* waiting for more WAL to be flushed, client activity or timeout<br />
* waiting for socket to be writeable<br />
* walsender woken (perhaps make this a postgres wide event on wait start and wait finish with reason like latch set)<br />
* walsender send buffer flushed<br />
* walsender send buffer appended to (size)<br />
* walsender signalled<br />
* walsender state change<br />
* walsender exiting<br />
<br />
''xlogreader:''<br />
<br />
* xlogreader switched to a new segment<br />
* xlogreader fetched new page<br />
* xlogreader returned a record<br />
<br />
logical decoding:<br />
<br />
* decoding context created<br />
* decoding for new slot creation started<br />
* decoding for new slot creation finished, slot ready<br />
* logical decoding processed any record from any rmgr (start_lsn, end_lsn)<br />
* logical trace events for each rmgr and record-type<br />
* logical decoding end of txn<br />
<br />
snapbuild:<br />
<br />
* snapbuild state change (newstate)<br />
* snapbuild build snapshot<br />
* snapbuild free snapshot<br />
* snapbuild discard snapshot<br />
* serialized snapshot to disk<br />
* deserialized snapshot from disk<br />
* snapbuild export full data snapshot<br />
<br />
''Reorder buffering:''<br />
<br />
* reorder buffer created for newly seen xid (xid)<br />
* detected toplevel xid has catalog changes (rbtxn, xid)<br />
* add event to reorder buffer<br />
* All traces have (rbtxn, xid, lsn, event_kind, event_size)<br />
* change event traces also report affected relfilenode<br />
* discarded reorder buffer (rbtxn, xid)<br />
* started to spill reorder buffer to disk (rbtxn, xid)<br />
* finished spilling reorder buffer to disk (rbtxn, xid, byte_size, event_count)<br />
* discarded spilled reorder buffer (rbtxn, xid)<br />
<br />
''output plugins:''<br />
<br />
* before-ReorderBufferCommit (rbtxn, xid, commit_lsn, from_disk, has_catalog_changes)<br />
* before and after all output plugin callbacks<br />
* output plugin wrote data (size in bytes)<br />
<br />
Also add functions output plugins may call to report trace events based on what they do with rows in their callbacks, in particular one to report if the plugin discarded skipped over (discarded) a change.<br />
<br />
</div><br />
<br />
==== Logical decoding output plugin reorder buffer event filter callback ====<br />
<br />
Logical decoding output plugin callback to filter row-change events as they are decoded, before they are added to the reorder buffer.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
This one is trickier than it looks because when we're doing logical decoding we don't generally have an active historic snapshot. We only have a reliable snapshot during '''ReorderBufferCommit''' processing once the transaction commit has been seen and the already buffered changes are being passed from the reorder buffer to the output plugin's callbacks.<br />
<br />
The problem with that is that the output plugin may have no interest in potentially large amounts of the data being reorder-buffered, so PostgreSQL does a lot of work and disk writes decoding and buffering the data that the output plugin will throw away as soon as it sees it.<br />
<br />
The reorder buffer can already skip over changes that are from a different database. There's also a filter callback output plugins can use to skip reorder buffering of whole transactions based on their replication origin ID which helps replication systems not reorder-buffer transactions that were replicated from peer nodes.<br />
<br />
But plugins have 'no way to filter the data going into the reorder buffer by table or key.' All data for all tables in a non-excluded transaction is always reorder-buffered in full.<br />
<br />
That's a big problem for a few use cases including:<br />
<br />
* Replication slots that are only interested in one specific table, e.g. during a resynchronization operation<br />
* Workloads with very high churn queue tables etc that are not replicated but coexist in a database with data that must be replicated<br />
<br />
</div><br />
<br />
== TODO: New kinds of extension point ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend as described below. described below.<br />
<br />
=== Cache management and cache invalidation ===<br />
<br />
PostgreSQL has a solid cache management system in the form of its relcache and catcache. See '''utils/relcache.h''', '''utils/catcache.h''' and '''utils/inval.h'''.<br />
<br />
Complex extensions often need to perform their own caching and invalidation on various extension-defined state, such as extension configuration read from tables. At time of writing there is no way for extensions to register their own cache types and use PostgreSQL's cache management for this, so they must implement their own cache management and invalidations - usually on top of a dynahash ('''utils/dynahash.h''').<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc. They'd also be available to the user for monitoring in '''pg_locks'''.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Deadlock detection ===<br />
<br />
Extensions that make PostgreSQL instances participate in distributed systems can suffer from distributed deadlocks. PostgreSQL's existing deadlock detector could help with this if it was extensible with ways to discover new process dependency edges.<br />
<br />
Alternately, if distributed locks could be represented as PostgreSQL heavyweight locks they'd be visible in '''pg_locks''' for monitoring and the deadlock detector could possibly handle them with its existing capabilities.<br />
<br />
=== Transaction log, transaction visibility and commit ===<br />
<br />
Some kinds of distributed database systems need a distributed transaction log. <br />
<br />
Right now the PostgreSQL transaction log a.k.a. commit log ('''access/clog.h''') isn't at all extensible and is backed by a SLRU ('''access/slru.h''') on disk.<br />
<br />
There have been discussions on -hackers about adding a transaction manager abstraction but none have led to any pluggable transaction management being committed.<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== Invoking extension code for existing '''TRACE_POSTGRESQL_''' tracepoints ===<br />
<br />
Currently PosgreSQL defines '''TRACE_POSTGRESQL_''' tracepoints as thin wrappers around DTrace (see below).<br />
<br />
It'd be very useful if it were possible for extensions to intercept these and run their own code. The most obvious case is to integrate other tracing systems, particularly things like distributed / full-stack tracing engines such as Zipkin, Jaeger (OpenTracing), OpenCensus, etc.<br />
<br />
This would have to be done with extreme care as there are tracepoints in very hot paths like LWLock acquisition. We could possibly get away with a hook function pointer, but even that might have too much impact.<br />
<br />
=== Extension-defined DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
Give extensions an easy way to add new trace events to their own code, to be exposed to SDT when the extension is loaded. This probably means PGXS support for processing an extension specific '''.d''' file and linking it in + possibly some runtime hint to tell the tracing provider to look for it.<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.<br />
<br />
</div></div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33919Todo:HooksAndTracePoints2019-08-08T06:08:49Z<p>Ringerc: /* Definitions with existing examples */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Wishlist ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding ===<br />
<br />
* Hooks in reorder buffer management for memory accounting<br />
* Hooks in reorder buffer on spill to disk for memory accounting<br />
* Logical decoding output plugin callback to filter events as they are added to the reorder buffer<br />
<br />
== Definitions with existing examples ==<br />
<br />
=== C Extensions (plugins) ===<br />
<br />
A [https://www.postgresql.org/docs/current/extend-extensions.html PostgreSQL extension] can just be a SQL script with a control file. But for the purposes of this document the extensions of interest are those written in (usually) C. They're compiled to loadable loadable modules - a regular shared library with some PostgreSQL metadata and some conventions for symbols that must have specific type signatures and behaviour if exposed. <br />
<br />
C extensions can use almost all the same API as core PostgreSQL code.<br />
<br />
See '''PG_MODULE_MAGIC()''', [https://www.postgresql.org/docs/current/extend-pgxs.html PGXS], [https://www.postgresql.org/docs/current/xfunc-c.html C language functions], etc.<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
==== Existing callbacks ====<br />
<br />
===== Lifecycle callbacks =====<br />
<br />
Extensions can use postmaster and backend lifecycle callbacks including<br />
<br />
* '''before_shmem_exit'''<br />
* '''on_proc_exit'''<br />
* '''on_shmem_exit'''<br />
<br />
There are also transaction lifecycle callbacks:<br />
<br />
* '''RegisterXactCallback'''<br />
<br />
Cache invalidation callbacks:<br />
<br />
* '''CacheRegisterRelcacheCallback'''<br />
* '''CacheRegisterSyscacheCallback'''<br />
<br />
and many many more.<br />
<br />
Most of these work more like overrideable hooks in that they're generally part of the process-wide state.<br />
<br />
===== errcontext callbacks =====<br />
<br />
Extensions can define their own errcontext callbacks. When log messages ('''elog''' or '''ereport''') are prepared these errcontext callbacks are called to annotate the error message by appending to the '''CONTEXT''' field. <br />
<br />
errcontext callbacks generally follow the call-stack, with new entries pushed onto the errcontext callback stack on entry to a function and popped on exit. The errcontext stack is automatically unwound by PostgreSQL's exception handling macros '''PG_TRY()''' and '''PG_CATCH()''' so there is no need for a '''PG_CATCH()''' to restore the errcontext stack and '''PG_RE_THROW()'''.<br />
<br />
See existing usage in core for examples.<br />
<br />
'''Warning''': failing to pop an errcontext callback can have very confusing results as the context pointer will point to stack that has since been re-used so it will attempt to treat some unpredictable value as a function pointer for the errcontext callback. See [https://www.2ndquadrant.com/en/blog/dev-corner-error-context-stack-corruption/ this blog post for details].<br />
<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Extension of shared memory and IPC primitives ===<br />
<br />
Extensions may use a wide variety of core features relating to shared memory, registering their own:<br />
<br />
* shared memory segments - '''RequestAddinShmemSpace''', '''shmem_startup_hook''' and '''ShmemInitStruct''' in '''storage/shmem.h'''<br />
* lightweight lock tranches (LWLock) - '''LWLockRegisterTranche''' etc in '''storage/lwlock.h'''<br />
* latches - '''storage/latch.h'''<br />
* dynamic shared memory (DSM) - '''storage/dsm.h'''<br />
* dynamic shared memory areas (DSA) - '''utils/dsa.h'''<br />
* shared-memory queues (shm_mq) - '''storage/shm_mq.h'''<br />
* condition variables - '''storage/condition_variable.h'''<br />
<br />
Extensions may use PostgreSQL's process latches too; most of the time they can just use their own '''&MyProc->procLatch''' or set another backend's latch from its '''PGPROC''' entry.<br />
<br />
=== Background workers (bgworkers) ===<br />
<br />
Extensions may register new PostgreSQL backends that exist independently of any client connection.<br />
<br />
A bgworker runs as a child of the postmaster. It usually has full access to a particular database as if it was a user backend started on that DB. bgworkers control their own signal handling, run their own event loop, can do their own file and socket I/O, link to or dynamically load arbitrary C libraries, etc etc. They have broad access to most PostgreSQL internal facilities - they can do low level table manipulation with genam or heapam, they can use the SPI, they can start other bgworkers, etc.<br />
<br />
There are two kinds of bgworker, static and dynamic. Static workers can only be registered at '''_PG_init''' time in '''shared_preload_libraries'''. Dynamic workers can be launched at any time *after* startup completes. New code usually uses dynamic workers launched from a hook on <br />
<br />
Considerable care is needed to get background worker implementations correct. At time of writing they do not have any way to use <br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
A non-exhaustive list includes:<br />
<br />
==== SQL-callable C functions ====<br />
<br />
==== Data types ====<br />
<br />
==== Security label providers ====<br />
<br />
=== Generic WAL (generic xlog) ===<br />
<br />
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.<br />
<br />
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.<br />
<br />
See '''generic_xlog.h''' and '''generic_xlog.c'''.<br />
<br />
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.<br />
<br />
See '''contrib/bloom.c''' for an index implementation built on top of generic WAL.<br />
<br />
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.<br />
<br />
=== Logical WAL messages ===<br />
<br />
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.<br />
<br />
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.<br />
<br />
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.<br />
<br />
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.<br />
<br />
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Cache management and cache invalidation ===<br />
<br />
PostgreSQL has a solid cache management system in the form of its relcache and catcache. See '''utils/relcache.h''', '''utils/catcache.h''' and '''utils/inval.h'''.<br />
<br />
Complex extensions often need to perform their own caching and invalidation on various extension-defined state, such as extension configuration read from tables. At time of writing there is no way for extensions to register their own cache types and use PostgreSQL's cache management for this, so they must implement their own cache management and invalidations - usually on top of a dynahash ('''utils/dynahash.h''').<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc. They'd also be available to the user for monitoring in '''pg_locks'''.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Deadlock detection ===<br />
<br />
Extensions that make PostgreSQL instances participate in distributed systems can suffer from distributed deadlocks. PostgreSQL's existing deadlock detector could help with this if it was extensible with ways to discover new process dependency edges.<br />
<br />
Alternately, if distributed locks could be represented as PostgreSQL heavyweight locks they'd be visible in '''pg_locks''' for monitoring and the deadlock detector could possibly handle them with its existing capabilities.<br />
<br />
=== Transaction log, transaction visibility and commit ===<br />
<br />
Some kinds of distributed database systems need a distributed transaction log. <br />
<br />
Right now the PostgreSQL transaction log a.k.a. commit log ('''access/clog.h''') isn't at all extensible and is backed by a SLRU ('''access/slru.h''') on disk.<br />
<br />
There have been discussions on -hackers about adding a transaction manager abstraction but none have led to any pluggable transaction management being committed.<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33918Todo:HooksAndTracePoints2019-08-08T06:02:09Z<p>Ringerc: /* Wishlist for other extension point types */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Wishlist ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding ===<br />
<br />
* Hooks in reorder buffer management for memory accounting<br />
* Hooks in reorder buffer on spill to disk for memory accounting<br />
* Logical decoding output plugin callback to filter events as they are added to the reorder buffer<br />
<br />
== Definitions with existing examples ==<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
==== Existing callbacks ====<br />
<br />
===== Lifecycle callbacks =====<br />
<br />
Extensions can use postmaster and backend lifecycle callbacks including<br />
<br />
* '''before_shmem_exit'''<br />
* '''on_proc_exit'''<br />
* '''on_shmem_exit'''<br />
<br />
There are also transaction lifecycle callbacks:<br />
<br />
* '''RegisterXactCallback'''<br />
<br />
Cache invalidation callbacks:<br />
<br />
* '''CacheRegisterRelcacheCallback'''<br />
* '''CacheRegisterSyscacheCallback'''<br />
<br />
and many many more.<br />
<br />
Most of these work more like overrideable hooks in that they're generally part of the process-wide state.<br />
<br />
===== errcontext callbacks =====<br />
<br />
Extensions can define their own errcontext callbacks. When log messages ('''elog''' or '''ereport''') are prepared these errcontext callbacks are called to annotate the error message by appending to the '''CONTEXT''' field. <br />
<br />
errcontext callbacks generally follow the call-stack, with new entries pushed onto the errcontext callback stack on entry to a function and popped on exit. The errcontext stack is automatically unwound by PostgreSQL's exception handling macros '''PG_TRY()''' and '''PG_CATCH()''' so there is no need for a '''PG_CATCH()''' to restore the errcontext stack and '''PG_RE_THROW()'''.<br />
<br />
See existing usage in core for examples.<br />
<br />
'''Warning''': failing to pop an errcontext callback can have very confusing results as the context pointer will point to stack that has since been re-used so it will attempt to treat some unpredictable value as a function pointer for the errcontext callback. See [https://www.2ndquadrant.com/en/blog/dev-corner-error-context-stack-corruption/ this blog post for details].<br />
<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Extension of shared memory and IPC primitives ===<br />
<br />
Extensions may use a wide variety of core features relating to shared memory, registering their own:<br />
<br />
* shared memory segments - '''RequestAddinShmemSpace''', '''shmem_startup_hook''' and '''ShmemInitStruct''' in '''storage/shmem.h'''<br />
* lightweight lock tranches (LWLock) - '''LWLockRegisterTranche''' etc in '''storage/lwlock.h'''<br />
* latches - '''storage/latch.h'''<br />
* dynamic shared memory (DSM) - '''storage/dsm.h'''<br />
* dynamic shared memory areas (DSA) - '''utils/dsa.h'''<br />
* shared-memory queues (shm_mq) - '''storage/shm_mq.h'''<br />
* condition variables - '''storage/condition_variable.h'''<br />
<br />
Extensions may use PostgreSQL's process latches too; most of the time they can just use their own '''&MyProc->procLatch''' or set another backend's latch from its '''PGPROC''' entry.<br />
<br />
=== Background workers (bgworkers) ===<br />
<br />
Extensions may register new PostgreSQL backends that exist independently of any client connection.<br />
<br />
A bgworker runs as a child of the postmaster. It usually has full access to a particular database as if it was a user backend started on that DB. bgworkers control their own signal handling, run their own event loop, can do their own file and socket I/O, link to or dynamically load arbitrary C libraries, etc etc. They have broad access to most PostgreSQL internal facilities - they can do low level table manipulation with genam or heapam, they can use the SPI, they can start other bgworkers, etc.<br />
<br />
There are two kinds of bgworker, static and dynamic. Static workers can only be registered at '''_PG_init''' time in '''shared_preload_libraries'''. Dynamic workers can be launched at any time *after* startup completes. New code usually uses dynamic workers launched from a hook on <br />
<br />
Considerable care is needed to get background worker implementations correct. At time of writing they do not have any way to use <br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
A non-exhaustive list includes:<br />
<br />
==== SQL-callable C functions ====<br />
<br />
==== Data types ====<br />
<br />
==== Security label providers ====<br />
<br />
=== Generic WAL (generic xlog) ===<br />
<br />
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.<br />
<br />
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.<br />
<br />
See '''generic_xlog.h''' and '''generic_xlog.c'''.<br />
<br />
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.<br />
<br />
See '''contrib/bloom.c''' for an index implementation built on top of generic WAL.<br />
<br />
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.<br />
<br />
=== Logical WAL messages ===<br />
<br />
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.<br />
<br />
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.<br />
<br />
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.<br />
<br />
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.<br />
<br />
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Cache management and cache invalidation ===<br />
<br />
PostgreSQL has a solid cache management system in the form of its relcache and catcache. See '''utils/relcache.h''', '''utils/catcache.h''' and '''utils/inval.h'''.<br />
<br />
Complex extensions often need to perform their own caching and invalidation on various extension-defined state, such as extension configuration read from tables. At time of writing there is no way for extensions to register their own cache types and use PostgreSQL's cache management for this, so they must implement their own cache management and invalidations - usually on top of a dynahash ('''utils/dynahash.h''').<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc. They'd also be available to the user for monitoring in '''pg_locks'''.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Deadlock detection ===<br />
<br />
Extensions that make PostgreSQL instances participate in distributed systems can suffer from distributed deadlocks. PostgreSQL's existing deadlock detector could help with this if it was extensible with ways to discover new process dependency edges.<br />
<br />
Alternately, if distributed locks could be represented as PostgreSQL heavyweight locks they'd be visible in '''pg_locks''' for monitoring and the deadlock detector could possibly handle them with its existing capabilities.<br />
<br />
=== Transaction log, transaction visibility and commit ===<br />
<br />
Some kinds of distributed database systems need a distributed transaction log. <br />
<br />
Right now the PostgreSQL transaction log a.k.a. commit log ('''access/clog.h''') isn't at all extensible and is backed by a SLRU ('''access/slru.h''') on disk.<br />
<br />
There have been discussions on -hackers about adding a transaction manager abstraction but none have led to any pluggable transaction management being committed.<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33917Todo:HooksAndTracePoints2019-08-08T05:53:14Z<p>Ringerc: /* Definitions with existing examples */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Wishlist ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding ===<br />
<br />
* Hooks in reorder buffer management for memory accounting<br />
* Hooks in reorder buffer on spill to disk for memory accounting<br />
* Logical decoding output plugin callback to filter events as they are added to the reorder buffer<br />
<br />
== Definitions with existing examples ==<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
==== Existing callbacks ====<br />
<br />
===== Lifecycle callbacks =====<br />
<br />
Extensions can use postmaster and backend lifecycle callbacks including<br />
<br />
* '''before_shmem_exit'''<br />
* '''on_proc_exit'''<br />
* '''on_shmem_exit'''<br />
<br />
There are also transaction lifecycle callbacks:<br />
<br />
* '''RegisterXactCallback'''<br />
<br />
Cache invalidation callbacks:<br />
<br />
* '''CacheRegisterRelcacheCallback'''<br />
* '''CacheRegisterSyscacheCallback'''<br />
<br />
and many many more.<br />
<br />
Most of these work more like overrideable hooks in that they're generally part of the process-wide state.<br />
<br />
===== errcontext callbacks =====<br />
<br />
Extensions can define their own errcontext callbacks. When log messages ('''elog''' or '''ereport''') are prepared these errcontext callbacks are called to annotate the error message by appending to the '''CONTEXT''' field. <br />
<br />
errcontext callbacks generally follow the call-stack, with new entries pushed onto the errcontext callback stack on entry to a function and popped on exit. The errcontext stack is automatically unwound by PostgreSQL's exception handling macros '''PG_TRY()''' and '''PG_CATCH()''' so there is no need for a '''PG_CATCH()''' to restore the errcontext stack and '''PG_RE_THROW()'''.<br />
<br />
See existing usage in core for examples.<br />
<br />
'''Warning''': failing to pop an errcontext callback can have very confusing results as the context pointer will point to stack that has since been re-used so it will attempt to treat some unpredictable value as a function pointer for the errcontext callback. See [https://www.2ndquadrant.com/en/blog/dev-corner-error-context-stack-corruption/ this blog post for details].<br />
<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Extension of shared memory and IPC primitives ===<br />
<br />
Extensions may use a wide variety of core features relating to shared memory, registering their own:<br />
<br />
* shared memory segments - '''RequestAddinShmemSpace''', '''shmem_startup_hook''' and '''ShmemInitStruct''' in '''storage/shmem.h'''<br />
* lightweight lock tranches (LWLock) - '''LWLockRegisterTranche''' etc in '''storage/lwlock.h'''<br />
* latches - '''storage/latch.h'''<br />
* dynamic shared memory (DSM) - '''storage/dsm.h'''<br />
* dynamic shared memory areas (DSA) - '''utils/dsa.h'''<br />
* shared-memory queues (shm_mq) - '''storage/shm_mq.h'''<br />
* condition variables - '''storage/condition_variable.h'''<br />
<br />
Extensions may use PostgreSQL's process latches too; most of the time they can just use their own '''&MyProc->procLatch''' or set another backend's latch from its '''PGPROC''' entry.<br />
<br />
=== Background workers (bgworkers) ===<br />
<br />
Extensions may register new PostgreSQL backends that exist independently of any client connection.<br />
<br />
A bgworker runs as a child of the postmaster. It usually has full access to a particular database as if it was a user backend started on that DB. bgworkers control their own signal handling, run their own event loop, can do their own file and socket I/O, link to or dynamically load arbitrary C libraries, etc etc. They have broad access to most PostgreSQL internal facilities - they can do low level table manipulation with genam or heapam, they can use the SPI, they can start other bgworkers, etc.<br />
<br />
There are two kinds of bgworker, static and dynamic. Static workers can only be registered at '''_PG_init''' time in '''shared_preload_libraries'''. Dynamic workers can be launched at any time *after* startup completes. New code usually uses dynamic workers launched from a hook on <br />
<br />
Considerable care is needed to get background worker implementations correct. At time of writing they do not have any way to use <br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
A non-exhaustive list includes:<br />
<br />
==== SQL-callable C functions ====<br />
<br />
==== Data types ====<br />
<br />
==== Security label providers ====<br />
<br />
=== Generic WAL (generic xlog) ===<br />
<br />
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.<br />
<br />
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.<br />
<br />
See '''generic_xlog.h''' and '''generic_xlog.c'''.<br />
<br />
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.<br />
<br />
See '''contrib/bloom.c''' for an index implementation built on top of generic WAL.<br />
<br />
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.<br />
<br />
=== Logical WAL messages ===<br />
<br />
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.<br />
<br />
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.<br />
<br />
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.<br />
<br />
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.<br />
<br />
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33916Todo:HooksAndTracePoints2019-08-08T05:47:29Z<p>Ringerc: /* Defining various server objects from extensions */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Wishlist ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding ===<br />
<br />
* Hooks in reorder buffer management for memory accounting<br />
* Hooks in reorder buffer on spill to disk for memory accounting<br />
* Logical decoding output plugin callback to filter events as they are added to the reorder buffer<br />
<br />
== Definitions with existing examples ==<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Extension of shared memory and IPC primitives ===<br />
<br />
Extensions may use a wide variety of core features relating to shared memory, registering their own:<br />
<br />
* shared memory segments - '''RequestAddinShmemSpace''', '''shmem_startup_hook''' and '''ShmemInitStruct''' in '''storage/shmem.h'''<br />
* lightweight lock tranches (LWLock) - '''LWLockRegisterTranche''' etc in '''storage/lwlock.h'''<br />
* latches - '''storage/latch.h'''<br />
* dynamic shared memory (DSM) - '''storage/dsm.h'''<br />
* dynamic shared memory areas (DSA) - '''utils/dsa.h'''<br />
* shared-memory queues (shm_mq) - '''storage/shm_mq.h'''<br />
* condition variables - '''storage/condition_variable.h'''<br />
<br />
Extensions may use PostgreSQL's process latches too; most of the time they can just use their own '''&MyProc->procLatch''' or set another backend's latch from its '''PGPROC''' entry.<br />
<br />
=== Background workers (bgworkers) ===<br />
<br />
Extensions may register new PostgreSQL backends that exist independently of any client connection.<br />
<br />
A bgworker runs as a child of the postmaster. It usually has full access to a particular database as if it was a user backend started on that DB. bgworkers control their own signal handling, run their own event loop, can do their own file and socket I/O, link to or dynamically load arbitrary C libraries, etc etc. They have broad access to most PostgreSQL internal facilities - they can do low level table manipulation with genam or heapam, they can use the SPI, they can start other bgworkers, etc.<br />
<br />
There are two kinds of bgworker, static and dynamic. Static workers can only be registered at '''_PG_init''' time in '''shared_preload_libraries'''. Dynamic workers can be launched at any time *after* startup completes. New code usually uses dynamic workers launched from a hook on <br />
<br />
Considerable care is needed to get background worker implementations correct. At time of writing they do not have any way to use <br />
<br />
=== Lifecycle callbacks ===<br />
<br />
Extensions can use postmaster and backend lifecycle callbacks including<br />
<br />
* '''before_shmem_exit'''<br />
* '''on_proc_exit'''<br />
* '''on_shmem_exit'''<br />
<br />
=== errcontext callbacks ===<br />
<br />
Extensions can define their own errcontext callbacks. When log messages ('''elog''' or '''ereport''') are prepared these errcontext callbacks are called to annotate the error message by appending to the '''CONTEXT''' field. <br />
<br />
errcontext callbacks generally follow the call-stack, with new entries pushed onto the errcontext callback stack on entry to a function and popped on exit. The errcontext stack is automatically unwound by PostgreSQL's exception handling macros '''PG_TRY()''' and '''PG_CATCH()''' so there is no need for a '''PG_CATCH()''' to restore the errcontext stack and '''PG_RE_THROW()'''.<br />
<br />
See existing usage in core for examples.<br />
<br />
'''Warning''': failing to pop an errcontext callback can have very confusing results as the context pointer will point to stack that has since been re-used so it will attempt to treat some unpredictable value as a function pointer for the errcontext callback. See [https://www.2ndquadrant.com/en/blog/dev-corner-error-context-stack-corruption/ this blog post for details].<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
A non-exhaustive list includes:<br />
<br />
==== SQL-callable C functions ====<br />
<br />
==== Data types ====<br />
<br />
==== Security label providers ====<br />
<br />
=== Generic WAL (generic xlog) ===<br />
<br />
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.<br />
<br />
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.<br />
<br />
See '''generic_xlog.h''' and '''generic_xlog.c'''.<br />
<br />
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.<br />
<br />
See '''contrib/bloom.c''' for an index implementation built on top of generic WAL.<br />
<br />
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.<br />
<br />
=== Logical WAL messages ===<br />
<br />
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.<br />
<br />
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.<br />
<br />
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.<br />
<br />
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.<br />
<br />
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33915Todo:HooksAndTracePoints2019-08-08T05:25:32Z<p>Ringerc: /* Definitions with existing examples */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Wishlist ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding ===<br />
<br />
* Hooks in reorder buffer management for memory accounting<br />
* Hooks in reorder buffer on spill to disk for memory accounting<br />
* Logical decoding output plugin callback to filter events as they are added to the reorder buffer<br />
<br />
== Definitions with existing examples ==<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Extension of shared memory and IPC primitives ===<br />
<br />
Extensions may use a wide variety of core features relating to shared memory, registering their own:<br />
<br />
* shared memory segments - '''RequestAddinShmemSpace''', '''shmem_startup_hook''' and '''ShmemInitStruct''' in '''storage/shmem.h'''<br />
* lightweight lock tranches (LWLock) - '''LWLockRegisterTranche''' etc in '''storage/lwlock.h'''<br />
* latches - '''storage/latch.h'''<br />
* dynamic shared memory (DSM) - '''storage/dsm.h'''<br />
* dynamic shared memory areas (DSA) - '''utils/dsa.h'''<br />
* shared-memory queues (shm_mq) - '''storage/shm_mq.h'''<br />
* condition variables - '''storage/condition_variable.h'''<br />
<br />
Extensions may use PostgreSQL's process latches too; most of the time they can just use their own '''&MyProc->procLatch''' or set another backend's latch from its '''PGPROC''' entry.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
A non-exhaustive list includes:<br />
<br />
==== SQL-callable C functions ====<br />
<br />
==== Data types ====<br />
<br />
==== Security label providers ====<br />
<br />
=== Generic WAL (generic xlog) ===<br />
<br />
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.<br />
<br />
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.<br />
<br />
See '''generic_xlog.h''' and '''generic_xlog.c'''.<br />
<br />
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.<br />
<br />
See '''contrib/bloom.c''' for an index implementation built on top of generic WAL.<br />
<br />
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.<br />
<br />
=== Logical WAL messages ===<br />
<br />
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.<br />
<br />
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.<br />
<br />
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.<br />
<br />
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.<br />
<br />
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33914Todo:HooksAndTracePoints2019-08-08T05:14:42Z<p>Ringerc: </p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Wishlist ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding ===<br />
<br />
* Hooks in reorder buffer management for memory accounting<br />
* Hooks in reorder buffer on spill to disk for memory accounting<br />
* Logical decoding output plugin callback to filter events as they are added to the reorder buffer<br />
<br />
== Definitions with existing examples ==<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
A non-exhaustive list includes:<br />
<br />
==== SQL-callable C functions ====<br />
<br />
==== Data types ====<br />
<br />
==== Security label providers ====<br />
<br />
=== Generic WAL (generic xlog) ===<br />
<br />
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.<br />
<br />
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.<br />
<br />
See '''generic_xlog.h''' and '''generic_xlog.c'''.<br />
<br />
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.<br />
<br />
See '''contrib/bloom.c''' for an index implementation built on top of generic WAL.<br />
<br />
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.<br />
<br />
=== Logical WAL messages ===<br />
<br />
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.<br />
<br />
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.<br />
<br />
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.<br />
<br />
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.<br />
<br />
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33883Todo:HooksAndTracePoints2019-08-06T05:16:11Z<p>Ringerc: /* TODO: Hooks, callbacks and trace points */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Wishlist ==<br />
<br />
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.<br />
<br />
=== Logical decoding ===<br />
<br />
* Hooks in reorder buffer management for memory accounting<br />
* Hooks in reorder buffer on spill to disk for memory accounting<br />
* Logical decoding output plugin callback to filter events as they are added to the reorder buffer<br />
<br />
== Definitions with existing examples ==<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.<br />
<br />
=== DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===<br />
<br />
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.<br />
<br />
Events are defined as markers in the source code as '''TRACE_POSTGRESQL_EVENTNAME(...)''' function-like macros, which are no-ops unless trace events generation are enabled.<br />
<br />
These events can be used by trace-event aware utilities including '''perf''' (Linux), '''ebpf-tools''' (Linux), '''systemtap''' (Linux), '''DTrace''' (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be [https://sourceware.org/gdb/onlinedocs/gdb/Set-Tracepoints.html used by gdb]).<br />
<br />
The PostgreSQL implementation translates '''src/backend/utils/probes.d''' to a C header '''src/backend/utils/probes.h''' that defines '''TRACE_POSTGRESQL_''' events as wrappers for '''DTRACE_PROBE''' macros, which in turn are defined by '''/usr/include/sys/sdt.h''' as wrappers for '''_STAP_PROBE''' . That injects some asm placeholders that're used by tracing systems.<br />
<br />
At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.<br />
<br />
Extensions may duplicate the same build logic and define their own providers though.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33882Todo:HooksAndTracePoints2019-08-06T04:53:32Z<p>Ringerc: /* Definitions with existing examples = */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Definitions with existing examples ==<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33881Todo:HooksAndTracePoints2019-08-06T04:53:13Z<p>Ringerc: /* Wait Event types */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Definitions with existing examples ===<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity''' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33880Todo:HooksAndTracePoints2019-08-06T04:52:31Z<p>Ringerc: /* Abstract interfaces with function pointer implementations */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Definitions with existing examples ===<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
</syntaxhighlight><br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity'' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33879Todo:HooksAndTracePoints2019-08-06T04:51:53Z<p>Ringerc: /* Hooks */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Definitions with existing examples ===<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
```<br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
```<br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity'' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33878Todo:HooksAndTracePoints2019-08-06T04:47:46Z<p>Ringerc: /* Hooks */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Definitions with existing examples ===<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
'''/* Click to expand */'''<br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
```<br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
```<br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity'' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33877Todo:HooksAndTracePoints2019-08-06T04:47:19Z<p>Ringerc: /* Hooks */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Definitions with existing examples ===<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
''Click to expand''<br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
```<br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
```<br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity'' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33876Todo:HooksAndTracePoints2019-08-06T04:46:53Z<p>Ringerc: /* Hooks */</p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Definitions with existing examples ===<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo is:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
/* Click to show example */<br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
```<br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
```<br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity'' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.</div>Ringerchttps://wiki.postgresql.org/index.php?title=Todo:HooksAndTracePoints&diff=33875Todo:HooksAndTracePoints2019-08-06T04:46:15Z<p>Ringerc: </p>
<hr />
<div>= TODO: Hooks, callbacks and trace points =<br />
<br />
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.<br />
<br />
== Definitions with existing examples ===<br />
<br />
=== C implementations of SQL-callable functions ===<br />
<br />
This is the most common extension point and very well known so I won't go into detail here. Extensions expose a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function. It uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define another with metadata about the function. Then registers it in its extension script with:<br />
<br />
<pre><br />
CREATE FUNCTION ... LANGUAGE 'c'<br />
</pre><br />
<br />
to expose it to SQL callers.<br />
<br />
=== Pre-defined '''dlsym''' extension points ===<br />
<br />
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if `shared_preload_libraries`) or a backend.<br />
<br />
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.<br />
<br />
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.<br />
<br />
=== Rendezvous variables ===<br />
<br />
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.<br />
<br />
==== Why rendezvous variables? ====<br />
<br />
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.<br />
<br />
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).<br />
<br />
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.<br />
<br />
==== Using rendezvous variables ====<br />
<br />
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
extern void **find_rendezvous_variable(const char *varName);<br />
</syntaxhighlight><br />
<br />
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.<br />
<br />
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.<br />
<br />
=== Hooks ===<br />
<br />
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.<br />
<br />
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.<br />
<br />
See separate article on entry points for extending PostgreSQL for list of existing hooks.<br />
<br />
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. An real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo is:<br />
<br />
<div class="toccolours mw-collapsible" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
<br />
static ProcessUtility_hook_type next_ProcessUtility_hook;<br />
<br />
static void<br />
demo_ProcessUtility_hook(PlannedStmt *pstmt,<br />
const char *queryString, ProcessUtilityContext context,<br />
ParamListInfo params,<br />
QueryEnvironment *queryEnv,<br />
DestReceiver *dest, char *completionTag)<br />
{<br />
/* Do something silly to show how the hook can work */<br />
if (IsA(parsetree, TransactionStmt))<br />
{<br />
TransactionStmt *stmt = (TransactionStatement)parsetree;<br />
if (stmt->kind == TRANS_STMT_PREPARE && !is_superuser())<br />
ereport(ERROR,<br />
(errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));<br />
}<br />
<br />
/* Call next hook if registered, or original postgres stmt */<br />
if (next_ProcessUtility_hook)<br />
next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
else<br />
standard_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);<br />
<br />
if (completionTag)<br />
ereport(LOG,<br />
(errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));<br />
}<br />
<br />
void<br />
_PG_init(void)<br />
{<br />
next_ProcessUtility_hook = ProcessUtility_hook;<br />
ProcessUtility_hook = demo_ProcessUtility_hook;<br />
}<br />
</syntaxhighlight><br />
</div><br />
<br />
==== Existing hooks ====<br />
<br />
To list all hooks that follow the convention of `HookName_hook_type HookName_hook` and are exposed as public API, run<br />
<br />
<pre><br />
git grep "PGDLLIMPORT .*_hook_type" src/include/<br />
</pre><br />
<br />
At time of writing these hooks were:<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;"><br />
<br/><br />
<syntaxhighlight lang="C" line='line'><br />
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;<br />
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;<br />
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;<br />
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;<br />
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;<br />
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;<br />
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;<br />
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;<br />
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;<br />
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;<br />
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;<br />
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;<br />
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;<br />
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;<br />
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;<br />
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;<br />
</syntaxhighlight><br />
</div><br />
<br />
=== Callbacks ===<br />
<br />
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.<br />
<br />
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
/*...*/<br />
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);<br />
/*...*/<br />
typedef void (*GucStringAssignHook) (const char *newval, void *extra);<br />
/*...*/<br />
extern void DefineCustomStringVariable(const char *name,<br />
/*...*/<br />
GucStringCheckHook check_hook,<br />
GucStringAssignHook assign_hook,<br />
GucShowHook show_hook);<br />
<br />
</syntaxhighlight><br />
<br />
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.<br />
<br />
=== Abstract interfaces with function pointer implementations ===<br />
<br />
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.<br />
<br />
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.<br />
<br />
One of many examples is the logical decoding interface. PostgreSQL calls:<br />
<br />
<syntaxhighlight lang="C" line='line'><br />
void _PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
</syntaxhighlight><br />
<br />
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.<br />
<br />
```<br />
void<br />
_PG_output_plugin_init(OutputPluginCallbacks *cb)<br />
{<br />
AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);<br />
<br />
cb->startup_cb = pg_decode_startup;<br />
cb->begin_cb = pg_decode_begin_txn;<br />
cb->change_cb = pg_decode_change;<br />
cb->truncate_cb = pg_decode_truncate;<br />
cb->commit_cb = pg_decode_commit_txn;<br />
cb->filter_by_origin_cb = pg_decode_filter;<br />
cb->shutdown_cb = pg_decode_shutdown;<br />
cb->message_cb = pg_decode_message;<br />
}<br />
```<br />
<br />
... each of which conforms to a specific signature and is invoked at specific points in execution.<br />
<br />
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.<br />
<br />
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.<br />
<br />
See '''contrib/test_decoding/test_decoding.c''' for example usage.<br />
<br />
=== Defining various server objects from extensions ===<br />
<br />
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.<br />
<br />
TODO: list them?<br />
<br />
== Wishlist for other extension point types ==<br />
<br />
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.<br />
<br />
=== Wait Event types ===<br />
<br />
Extensions have access to the '''PG_WAIT_EXTENSION''' WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via '''pg_stat_activity'' and other wait-event aware interfaces.<br />
<br />
=== Heavyweight lock types and tags ===<br />
<br />
Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc.<br />
<br />
TODO: set out example for how it might work<br />
<br />
=== Parser syntax extension points ===<br />
<br />
Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's '''flex''' and '''bison''' based '''LALR(1)''' parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.<br />
<br />
Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in '''WITH(...)''' lists such as in '''COPY ... WITH (FORMAT CSV, ...)'''. <br />
<br />
Add your proposed points and use cases here.</div>Ringerc