TODO: Hooks, callbacks and trace points

This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.

Existing APIs usable from extensions

There are a great many existing extension points in PostgreSQL. The article PostgresServerExtensionPoints lists them with references to core documentation, entrypoints in core code, etc.

TODO: New hooks, callbacks and tracepoints

Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.

Logical decoding etc

CR

Changes to improve visibility and management for logical decoding, reorder buffering, historic snapshot management and output plugin API.

The reorderbuffer's resource use is totally opaque to monitoring right now. There's no useful way to know much about how it's using memory and all that can really be known about transactions spilled to disk is their xid.

It's also nearly impossible to know much about what reorder buffering and logical decoding are working on right now without the use of gdb or something like perf dynamic userspace probes. That's not always practical in production environments.

Suggestions:

Logical decoding and reorder buffering stats in struct WalSnd

Add some basic running accounting of reorder buffer stats to struct WalSnd per the following sample:

        /* Statistics for total reorder buffered txns */
        int32           reorderBufferedTxns;
        int32           reorderBufferedSnapshots;
        int64           reorderBufferedEventCount;
        int64           reorderBufferedBytes;

        /* Statistics for transactions spilled to disk. */
        int32           spillTxns;
        int32           spillSnapshots;
        int64           spillEventCount;
        int64           spillBytes;

        /*
         * When in ReorderBufferCommit for a txn, basic info about
         * the txn being processed.
         *
         * We already report the progress
         * lsn as the sent lsn, but it can't go backwards so we expose
         * the txn-specific lsn here too. And the oldest lsn relevant
         * to the txn is also worth knowing to give an indication of
         * xact duration and to compare to restart_lsn.
         */
        TransactionId   reorderBufferCommitXid;
        XLogRecPtr      reorderBufferCommitRecEndLSN;
        TimestampTz     reorderBufferCommitTimestamp;
        XLogRecPtr      reorderBufferCommitXactBeginLSN;
        XLogRecPtr      reorderBufferCommitSentRecLSN;
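
As a rough illustration of how such running counters might be maintained, here is a minimal standalone sketch. Everything in it is hypothetical: DemoWalSndStats, DemoTxn and the demo_ functions are stand-ins invented for this example, not PostgreSQL code, and it assumes the "total" counters include changes that have since been spilled to disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the counters proposed for struct WalSnd. */
typedef struct DemoWalSndStats
{
    int32_t reorderBufferedTxns;
    int64_t reorderBufferedEventCount;
    int64_t reorderBufferedBytes;
    int32_t spillTxns;
    int64_t spillEventCount;
    int64_t spillBytes;
} DemoWalSndStats;

/* Hypothetical per-transaction running totals kept with each txn. */
typedef struct DemoTxn
{
    int64_t events;         /* all changes seen for this txn */
    int64_t bytes;
    int64_t spilledEvents;  /* the subset already written to disk */
    int64_t spilledBytes;
} DemoTxn;

/* A new transaction is seen for the first time. */
static void
demo_txn_start(DemoWalSndStats *st, DemoTxn *txn)
{
    *txn = (DemoTxn) {0};
    st->reorderBufferedTxns++;
}

/* A change of the given size is queued for the transaction. */
static void
demo_add_change(DemoWalSndStats *st, DemoTxn *txn, size_t size)
{
    txn->events++;
    txn->bytes += (int64_t) size;
    st->reorderBufferedEventCount++;
    st->reorderBufferedBytes += (int64_t) size;
}

/* Everything still in memory for the transaction is written to disk. */
static void
demo_spill(DemoWalSndStats *st, DemoTxn *txn)
{
    int64_t newEvents = txn->events - txn->spilledEvents;
    int64_t newBytes = txn->bytes - txn->spilledBytes;

    if (newEvents == 0)
        return;
    if (txn->spilledEvents == 0)
        st->spillTxns++;        /* first spill for this txn */
    st->spillEventCount += newEvents;
    st->spillBytes += newBytes;
    txn->spilledEvents = txn->events;
    txn->spilledBytes = txn->bytes;
}

/* The transaction committed or aborted: remove its contribution. */
static void
demo_txn_finish(DemoWalSndStats *st, DemoTxn *txn)
{
    assert(st->reorderBufferedTxns > 0);
    st->reorderBufferedTxns--;
    st->reorderBufferedEventCount -= txn->events;
    st->reorderBufferedBytes -= txn->bytes;
    if (txn->spilledEvents > 0)
    {
        st->spillTxns--;
        st->spillEventCount -= txn->spilledEvents;
        st->spillBytes -= txn->spilledBytes;
    }
}
```

The point of the sketch is that each update is a handful of integer additions on paths that already touch the transaction, so the accounting itself should be cheap.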

Reorder buffer inspection functions

Also (maybe) add reorderbuffer API functions to let output plugins and other tools inspect reorder buffer state within a walsender in finer detail.

These would be fairly simple to add given that we'd already need to add a running memory accounting counter and entry counter to each ReorderBufferTXN:

List *ReorderBufferGetTXNs(ReorderBuffer *rb) or something like it to list all toplevel xids for which there are reorder buffers. Maybe just expose an iterator over ReorderBuffer.toplevel_by_lsn to avoid lots of copies?
void ReorderBufferTXNGetSize(ReorderBuffer *rb, ReorderBufferTXN *txn, size_t *inmemory, size_t *ondisk, uint64 *allchangecount, uint64 *rowchangecount, bool *has_catalog_changes) - get stats on one reorder buffered top-level txn.

These would be very useful for output plugins that wish to offer some insight into xact progress, plugins that do their own spooling of processed txns, etc. They'd also be great when debugging the server.
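
To make the shape of such an API more concrete, the following standalone sketch mocks up an iterator plus a per-transaction size query. The DemoTXN and DemoReorderBuffer types and all demo_ functions are hypothetical simplifications made up for this example; the real ReorderBuffer and ReorderBufferTXN structs are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for ReorderBufferTXN. */
typedef struct DemoTXN
{
    uint32_t    xid;
    size_t      inmemory;           /* bytes currently held in memory */
    size_t      ondisk;             /* bytes spilled to disk */
    uint64_t    allchangecount;
    uint64_t    rowchangecount;
    bool        has_catalog_changes;
    struct DemoTXN *next;           /* next toplevel txn in LSN order */
} DemoTXN;

/* Hypothetical stand-in for ReorderBuffer. */
typedef struct DemoReorderBuffer
{
    DemoTXN    *toplevel_by_lsn;
} DemoReorderBuffer;

/* Iterate without copying: pass NULL to get the first txn. */
static DemoTXN *
demo_rb_next_txn(DemoReorderBuffer *rb, DemoTXN *prev)
{
    return prev == NULL ? rb->toplevel_by_lsn : prev->next;
}

/* Report stats for one txn; any output pointer may be NULL. */
static void
demo_txn_get_size(DemoReorderBuffer *rb, DemoTXN *txn,
                  size_t *inmemory, size_t *ondisk,
                  uint64_t *allchangecount, uint64_t *rowchangecount,
                  bool *has_catalog_changes)
{
    (void) rb;
    assert(txn != NULL);
    if (inmemory) *inmemory = txn->inmemory;
    if (ondisk) *ondisk = txn->ondisk;
    if (allchangecount) *allchangecount = txn->allchangecount;
    if (rowchangecount) *rowchangecount = txn->rowchangecount;
    if (has_catalog_changes) *has_catalog_changes = txn->has_catalog_changes;
}

/* Example consumer: total bytes buffered across all toplevel txns. */
static size_t
demo_rb_total_bytes(DemoReorderBuffer *rb)
{
    size_t      total = 0;

    for (DemoTXN *txn = demo_rb_next_txn(rb, NULL); txn != NULL;
         txn = demo_rb_next_txn(rb, txn))
    {
        size_t      mem, disk;

        demo_txn_get_size(rb, txn, &mem, &disk, NULL, NULL, NULL);
        total += mem + disk;
    }
    return total;
}
```

An iterator of this kind avoids building a List of transactions on every call, which is the copy-avoidance idea floated above.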

Logical rep related trace events (perf/dtrace/systemtap etc)

Add a bunch of TRACE_POSTGRESQL_ trace events for perf/dtrace/systemtap/etc for the following activities within postgres. Proposed events list follows.

Statically defined trace events are very cheap, effectively free when not in use. We already have them in some extremely hot paths in PostgreSQL like the BUFFER_READ events and the LWLOCK_ACQUIRE event. They offer a great deal of insight into the system's operation. Placing static events instead of relying on dynamic probes:

gives insight into production servers where debuginfo may not be present
lets us expose more useful arguments
serves to document points of interest and make them discoverable
works across server versions better since they're more stable and consistent
frees the user from having to find relevant function names and args
... and they can be used in gdb too

Events proposed:

walsender:

walsender started
walsender sleeping
- waiting for more WAL to be flushed, client activity or timeout
- waiting for socket to be writeable
walsender woken (perhaps make this a postgres wide event on wait start and wait finish with reason like latch set)
- tracepoint argument for how long it slept for?
walsender send buffer flushed (bytes_sent, bytes_left)
walsender sent keepalive request (lsns)
walsender got keepalive reply (lsns)
walsender sent replication data message (size)
walsender signalled
walsender state change
walsender exiting

xlogreader:

xlogreader switched to a new segment
xlogreader fetched new page
xlogreader returned a record

logical decoding:

decoding context created
decoding for new slot creation started
decoding for new slot creation finished, slot ready
logical decoding processed any record from any rmgr (start_lsn, end_lsn)
logical trace events for each rmgr and record-type
logical decoding end of txn

snapbuild:

snapbuild state change (newstate)
snapbuild build snapshot
snapbuild free snapshot
snapbuild discard snapshot
serialized snapshot to disk
deserialized snapshot from disk
snapbuild export full data snapshot

Reorder buffering:

reorder buffer created for newly seen xid (xid)
detected toplevel xid has catalog changes (rbtxn, xid)
add event to reorder buffer
- All traces have (rbtxn, xid, lsn, event_kind, event_size)
- change event traces also report affected relfilenode
discarded reorder buffer (rbtxn, xid)
started to spill reorder buffer to disk (rbtxn, xid)
finished spilling reorder buffer to disk (rbtxn, xid, byte_size, event_count)
discarded spilled reorder buffer (rbtxn, xid)

output plugins:

before-ReorderBufferCommit (rbtxn, xid, commit_lsn, from_disk, has_catalog_changes)
before and after all output plugin callbacks
output plugin wrote data (size in bytes)

Also add functions output plugins may call to report trace events based on what they do with rows in their callbacks, in particular one to report if the plugin skipped over (discarded) a change.

Logical decoding output plugin reorder buffer event filter callback

Logical decoding output plugin callback to filter row-change events as they are decoded, before they are added to the reorder buffer.

This one is trickier than it looks because when we're doing logical decoding we don't generally have an active historic snapshot. We only have a reliable snapshot during ReorderBufferCommit processing once the transaction commit has been seen and the already buffered changes are being passed from the reorder buffer to the output plugin's callbacks.

The problem with that is that the output plugin may have no interest in potentially large amounts of the data being reorder-buffered, so PostgreSQL does a lot of work and disk writes decoding and buffering the data that the output plugin will throw away as soon as it sees it.

The reorder buffer can already skip over changes that are from a different database. There's also a filter callback output plugins can use to skip reorder buffering of whole transactions based on their replication origin ID which helps replication systems not reorder-buffer transactions that were replicated from peer nodes.

But plugins have no way to filter the data going into the reorder buffer by table or key. All data for all tables in a non-excluded transaction is always reorder-buffered in full.

That's a big problem for a few use cases including:

Replication slots that are only interested in one specific table, e.g. during a resynchronization operation
Workloads with very high churn queue tables etc that are not replicated but coexist in a database with data that must be replicated
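
A possible shape for such a callback is sketched below as a standalone toy. It is not an existing PostgreSQL API: DemoFilterChangeCB, DemoFilterState and the demo_ functions are invented for illustration. Because no historic snapshot is available at this stage, the sketch assumes the plugin can only look at what is in the WAL record itself (here just a relfilenode) and compares it against a set it resolved ahead of time. Following the convention of the existing origin filter callback, returning true means "filter out".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: return true to skip buffering the change. */
typedef bool (*DemoFilterChangeCB) (void *plugin_private, uint32_t relfilenode);

/* Hypothetical plugin state: relfilenodes the plugin cares about. */
typedef struct DemoFilterState
{
    const uint32_t *wanted;
    size_t      nwanted;
} DemoFilterState;

/* Example plugin callback: keep only changes to the wanted relfilenodes. */
static bool
demo_filter_by_relfilenode(void *plugin_private, uint32_t relfilenode)
{
    const DemoFilterState *state = plugin_private;

    for (size_t i = 0; i < state->nwanted; i++)
    {
        if (state->wanted[i] == relfilenode)
            return false;       /* keep: buffer this change */
    }
    return true;                /* filter out: never reaches the buffer */
}

/*
 * Toy decoding loop. Each element of "changes" is the relfilenode of one
 * decoded row change. Returns how many changes would be reorder-buffered.
 */
static size_t
demo_decode_changes(const uint32_t *changes, size_t nchanges,
                    DemoFilterChangeCB filter, void *plugin_private)
{
    size_t      buffered = 0;

    assert(changes != NULL || nchanges == 0);
    for (size_t i = 0; i < nchanges; i++)
    {
        if (filter != NULL && filter(plugin_private, changes[i]))
            continue;           /* skipped before any buffering or spilling */
        buffered++;
    }
    return buffered;
}
```

A real design would also have to cope with relfilenodes changing when a table is rewritten, which this toy ignores.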

TODO: New kinds of extension point

There are other sorts of functionality in PostgreSQL that are not presently extensible at all. It would be wonderful to be able to extend some of these, as described below.

Cache management and cache invalidation

PostgreSQL has a solid cache management system in the form of its relcache and catcache. See utils/relcache.h, utils/catcache.h and utils/inval.h.

Complex extensions often need to perform their own caching and invalidation on various extension-defined state, such as extension configuration read from tables. At time of writing there is no way for extensions to register their own cache types and use PostgreSQL's cache management for this, so they must implement their own cache management and invalidations - usually on top of a dynahash (utils/dynahash.h).
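
The hand-rolled pattern extensions tend to end up with looks roughly like the standalone sketch below: a cached value, a validity flag, and an invalidation function that just clears the flag so the next lookup reloads. All names here are hypothetical, and a plain global variable stands in for the configuration table. In a real extension the invalidation function would typically be driven by one of the callback registration functions in utils/inval.h, such as CacheRegisterRelcacheCallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stands in for extension configuration stored in a table. */
static int  demo_config_table_value = 10;

/* Hypothetical extension-private cache entry. */
typedef struct DemoConfigCache
{
    bool        valid;
    int         value;
    int         loads;          /* how many times we had to reload */
} DemoConfigCache;

/* Invalidation callback body: cheap, just mark the entry stale. */
static void
demo_cache_invalidate(DemoConfigCache *cache)
{
    assert(cache != NULL);
    cache->valid = false;
}

/* Lookup: reload lazily on first use after an invalidation. */
static int
demo_cache_get(DemoConfigCache *cache)
{
    assert(cache != NULL);
    if (!cache->valid)
    {
        cache->value = demo_config_table_value;   /* the "expensive" read */
        cache->loads++;
        cache->valid = true;
    }
    return cache->value;
}
```

Note that without an invalidation the cache keeps returning the old value even after the underlying data changes, which is exactly the bookkeeping each extension currently has to get right on its own.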

Wait Event types

Extensions have access to the PG_WAIT_EXTENSION WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via pg_stat_activity and other wait-event aware interfaces.

Heavyweight lock types and tags

Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc. Extension-defined heavyweight locks would also be visible to the user for monitoring in pg_locks.

TODO: set out example for how it might work

Deadlock detection

Extensions that make PostgreSQL instances participate in distributed systems can suffer from distributed deadlocks. PostgreSQL's existing deadlock detector could help with this if it were extensible with ways to discover new process dependency edges.

Alternately, if distributed locks could be represented as PostgreSQL heavyweight locks they'd be visible in pg_locks for monitoring and the deadlock detector could possibly handle them with its existing capabilities.
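
The idea of letting an extension contribute extra dependency edges can be illustrated with a toy wait-for graph. The sketch below is standalone and entirely hypothetical: the real deadlock detector works on lock wait queues rather than an adjacency matrix, and DemoExtraEdgeCB and the demo_ names are invented for this example. It runs a depth-first search for a cycle through a given process, consulting an optional callback for edges the local lock manager cannot see.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_PROCS 8

/* Hypothetical callback: does "waiter" wait on "holder" via a remote lock? */
typedef bool (*DemoExtraEdgeCB) (int waiter, int holder);

/* Edges known to the local lock manager: [waiter][holder]. */
static bool demo_local_edge[DEMO_MAX_PROCS][DEMO_MAX_PROCS];

static bool
demo_has_edge(int waiter, int holder, DemoExtraEdgeCB extra)
{
    return demo_local_edge[waiter][holder] ||
        (extra != NULL && extra(waiter, holder));
}

/* Depth-first search for a path from "node" back to "start". */
static bool
demo_visit(int node, int start, bool *visited, DemoExtraEdgeCB extra)
{
    for (int next = 0; next < DEMO_MAX_PROCS; next++)
    {
        if (!demo_has_edge(node, next, extra))
            continue;
        if (next == start)
            return true;        /* found a cycle through start */
        if (!visited[next])
        {
            visited[next] = true;
            if (demo_visit(next, start, visited, extra))
                return true;
        }
    }
    return false;
}

/* True if "start" is part of a wait-for cycle. */
static bool
demo_deadlocked(int start, DemoExtraEdgeCB extra)
{
    bool        visited[DEMO_MAX_PROCS] = {false};

    assert(start >= 0 && start < DEMO_MAX_PROCS);
    visited[start] = true;
    return demo_visit(start, start, visited, extra);
}

/* Example extension callback: process 1 waits on process 0 remotely. */
static bool
demo_remote_edge(int waiter, int holder)
{
    return waiter == 1 && holder == 0;
}
```

With only the local edge 0 -> 1 there is no cycle; once the extension reports the remote edge 1 -> 0, the same search detects the deadlock.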

Transaction log, transaction visibility and commit

Some kinds of distributed database systems need a distributed transaction log.

Right now the PostgreSQL transaction log a.k.a. commit log (access/clog.h) isn't at all extensible and is backed by an SLRU (access/slru.h) on disk.

There have been discussions on -hackers about adding a transaction manager abstraction but none have led to any pluggable transaction management being committed.

Parser syntax extension points

Mechanisms to allow the parser to be extended with addin-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's flex and bison based LALR(1) parser, which is implemented using C code generation at compile time and compiled along with the rest of the server, then statically linked to the server executable.

Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in WITH(...) lists such as in COPY ... WITH (FORMAT CSV, ...).

Add your proposed points and use cases here.

Invoking extension code for existing TRACE_POSTGRESQL_ tracepoints

Currently PostgreSQL defines TRACE_POSTGRESQL_ tracepoints as thin wrappers around DTrace (see below).

It'd be very useful if it were possible for extensions to intercept these and run their own code. The most obvious case is to integrate other tracing systems, particularly things like distributed / full-stack tracing engines such as Zipkin, Jaeger (OpenTracing), OpenCensus, etc.

This would have to be done with extreme care as there are tracepoints in very hot paths like LWLock acquisition. We could possibly get away with a hook function pointer, but even that might have too much impact.
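
For reference, the hook function pointer approach would look something like the following standalone sketch, which mirrors the usual PostgreSQL hook idiom of a global pointer that is NULL by default and that extensions chain onto. None of these names exist in PostgreSQL: demo_trace_hook, DEMO_TRACE and the rest are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: event name plus one integer argument. */
typedef void (*demo_trace_hook_type) (const char *event, long arg);

/* NULL unless some extension has installed a hook. */
static demo_trace_hook_type demo_trace_hook = NULL;

/* Stand-in for a tracepoint macro that also offers the event to the hook. */
#define DEMO_TRACE(event, arg) \
    do { \
        if (demo_trace_hook != NULL) \
            demo_trace_hook((event), (arg)); \
    } while (0)

/* "Core" code containing a tracepoint. */
static long
demo_acquire_lock(long lockid)
{
    DEMO_TRACE("lock-acquire", lockid);
    return lockid;
}

/* "Extension" side: remember the previous hook and chain to it. */
static demo_trace_hook_type demo_prev_trace_hook = NULL;
static long demo_events_seen = 0;

static void
demo_ext_trace(const char *event, long arg)
{
    demo_events_seen++;         /* e.g. forward to another tracing system */
    if (demo_prev_trace_hook != NULL)
        demo_prev_trace_hook(event, arg);
}

static void
demo_ext_install(void)
{
    assert(demo_trace_hook != demo_ext_trace);  /* install only once */
    demo_prev_trace_hook = demo_trace_hook;
    demo_trace_hook = demo_ext_trace;
}
```

Even in the uninstalled case every tracepoint pays for a load and a branch, which is the cost that would need measuring on paths as hot as LWLock acquisition.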

Extension-defined DTrace/perf/SystemTap/etc statically defined trace events (SDTs)

Give extensions an easy way to add new trace events to their own code, to be exposed to SDT when the extension is loaded. This probably means PGXS support for processing an extension specific .d file and linking it in + possibly some runtime hint to tell the tracing provider to look for it.

PostgreSQL accepts the configure option --enable-dtrace to generate DTrace-compatible statically defined tracepoint events. On Linux this usually uses systemtap.

Events are defined as markers in the source code as TRACE_POSTGRESQL_EVENTNAME(...) function-like macros, which are no-ops unless trace event generation is enabled.

These events can be used by trace-event aware utilities including perf (Linux), ebpf-tools (Linux), systemtap (Linux), DTrace (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. They can also be used by gdb.

The PostgreSQL implementation translates src/backend/utils/probes.d to a C header src/backend/utils/probes.h that defines TRACE_POSTGRESQL_ events as wrappers for DTRACE_PROBE macros, which in turn are defined by /usr/include/sys/sdt.h as wrappers for _STAP_PROBE. That injects some asm placeholders that are used by tracing systems.

At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.

Extensions may duplicate the same build logic and define their own providers though.
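
An extension that defines its own provider would end up with wrapper macros along the lines of the sketch below. This is a hand-written approximation for illustration only: the DEMO_ENABLE_SDT build flag, the demoext provider name and the item__done probe are all hypothetical, and a real build would normally generate the header from a .d file rather than write it by hand. The sketch relies only on the DTRACE_PROBE1 macro from sys/sdt.h when tracing is enabled, and compiles to a no-op otherwise.

```c
#include <assert.h>

#ifdef DEMO_ENABLE_SDT
/* Tracing build: emit a real SDT marker via the systemtap header. */
#include <sys/sdt.h>
#define TRACE_DEMOEXT_ITEM_DONE(n)  DTRACE_PROBE1(demoext, item__done, n)
#else
/* Default build: the tracepoint compiles away entirely. */
#define TRACE_DEMOEXT_ITEM_DONE(n)  do {} while (0)
#endif

/* Extension code with a tracepoint at a point of interest. */
static int
demo_process_items(int nitems)
{
    int         done = 0;

    assert(nitems >= 0);
    for (int i = 0; i < nitems; i++)
    {
        done++;
        TRACE_DEMOEXT_ITEM_DONE(done);
    }
    return done;
}
```

The function behaves identically with or without DEMO_ENABLE_SDT defined; only the presence of the marker in the compiled object differs.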
