Difference between revisions of "Todo:HooksAndTracePoints"

From PostgreSQL wiki
(Definitions with existing examples)
Line 3: Line 3:
 
This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.
 
  
== Wishlist ==
+
== Existing APIs usable from extensions ==
  
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.
+
There are a great many existing extension points in PostgreSQL. The article [[PostgresServerExtensionPoints]] lists them with references to core documentation, entrypoints in core code, etc.
  
=== Logical decoding ===
+
== TODO: New hooks, callbacks and tracepoints ==
  
* Hooks in reorder buffer management for memory accounting
+
Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.
* Hooks in reorder buffer on spill to disk for memory accounting
 
* Logical decoding output plugin callback to filter events as they are added to the reorder buffer
 
 
 
== Definitions with existing examples ==
 
 
 
=== C Extensions (plugins) ===
 
 
 
A [https://www.postgresql.org/docs/current/extend-extensions.html PostgreSQL extension] can just be a SQL script with a control file. But for the purposes of this document the extensions of interest are those written in (usually) C. They're compiled to loadable modules: regular shared libraries with some PostgreSQL metadata and some conventions for symbols that, if exposed, must have specific type signatures and behaviour.
 
 
 
C extensions can use almost all the same API as core PostgreSQL code.
 
 
 
See '''PG_MODULE_MAGIC()''', [https://www.postgresql.org/docs/current/extend-pgxs.html PGXS], [https://www.postgresql.org/docs/current/xfunc-c.html C language functions], etc.
 
 
 
=== C implementations of SQL-callable functions ===
 
 
 
This is the most common extension point and very well known so I won't go into detail here. An extension exposes a C linkage symbol with the signature '''Datum funcname(PG_FUNCTION_ARGS)''' for the function and uses the PostgreSQL '''PG_FUNCTION_INFO_V1''' macro to define a second symbol carrying metadata about the function. It then registers the function in its extension script with:
 
 
 
<pre>
 
CREATE FUNCTION ... LANGUAGE 'c'
 
</pre>
 
 
 
to expose it to SQL callers.
 
 
 
=== Pre-defined '''dlsym''' extension points ===
 
 
 
PostgreSQL defines a few function signatures that extensions may (or must) define. Each must expose a specific symbol and accept a specific signature. The most obvious is '''void _PG_init(void)''', which PostgreSQL calls when it loads an extension into the postmaster (if listed in '''shared_preload_libraries''') or a backend.
 
 
 
We try not to define too many of these as they're an inconvenient interface. The server must '''dlsym(...)''' them from the extension after '''dlopen(...)'''ing it so they're a bit clumsy.
 
 
 
Try to avoid adding these. It's better to use hooks, callbacks, etc, where possible, and then register them from '''_PG_init'''.
 
 
 
=== Rendezvous variables ===
 
 
 
Rendezvous variables are a PostgreSQL facility to allow extensions to connect with each other once they're loaded and share functionality. They use the '''find_rendezvous_variable(...)''' entrypoint.
 
 
 
==== Why rendezvous variables? ====
 
 
 
Extensions are compiled independently from each other. They generally don't want to rely on a specific extension load order and often cannot access the shared library of other extensions at compile-time. So they cannot generally pass other extension libraries as '''-l''' arguments to their linker at link-time. If they did it might confuse the other extension as it wouldn't get its '''_PG_init''' called at the right point in the extension lifecycle.
 
 
 
Use of '''extern''' symbols defined in other extensions will still create unresolved symbols to be resolved at dynamic link time. But extensions' symbols are not visible to the dynamic linker when it's resolving another extension's symbols and you'll get an unresolved symbol error at load-time. That's because PostgreSQL doesn't load extensions with '''RTLD_GLOBAL''' (for good reasons).
 
 
 
So the usual "call an '''extern''' function and let the dynamic linker sort it out" approach won't work.
 
  
==== Using rendezvous variables ====
+
=== Logical decoding etc ===
 
 
To handle these linkage difficulties PostgreSQL exposes 'rendezvous variables' via the fmgr. See '''include/fmgr.h''':
 
 
 
<syntaxhighlight lang="C" line='line'>
 
extern void **find_rendezvous_variable(const char *varName);
 
</syntaxhighlight>
 
  
These let one extension expose a named variable with a void pointer to a struct of extension-defined type. This is usually a struct full of callbacks to serve as an extension C API.
+
'''CR'''
  
For a core usage example see plpgsql's plugin support in '''src/pl/plpgsql/src/pl_handler.c'''.
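To make the pattern concrete, here is a minimal standalone sketch. It is not PostgreSQL code: '''toy_find_rendezvous_variable''' is a hypothetical stand-in for '''find_rendezvous_variable''' backed by a tiny fixed-size table, and '''ProviderApi''', '''provider_init''', '''consumer_call''' and the variable name "provider_api" are all made up for illustration. A real extension would include '''fmgr.h''' and call the real function from '''_PG_init'''.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for find_rendezvous_variable(); NOT the real implementation. */
#define TOY_MAX_VARS 8

typedef struct ToyRendezvousEntry
{
    const char *name;   /* caller-owned string, assumed to outlive the table */
    void       *value;  /* the rendezvous slot itself */
} ToyRendezvousEntry;

static ToyRendezvousEntry toy_vars[TOY_MAX_VARS];
static int toy_nvars = 0;

/* Return a stable pointer to the void* slot for varName, creating it as NULL on first use. */
static void **
toy_find_rendezvous_variable(const char *varName)
{
    for (int i = 0; i < toy_nvars; i++)
        if (strcmp(toy_vars[i].name, varName) == 0)
            return &toy_vars[i].value;

    assert(toy_nvars < TOY_MAX_VARS);
    toy_vars[toy_nvars].name = varName;
    toy_vars[toy_nvars].value = NULL;
    return &toy_vars[toy_nvars++].value;
}

/* Hypothetical C API that a "provider" extension publishes as a struct of callbacks. */
typedef struct ProviderApi
{
    int (*add_one) (int x);
} ProviderApi;

static int
provider_add_one(int x)
{
    return x + 1;
}

static ProviderApi provider_api = {provider_add_one};

/* What the provider would do from its _PG_init(). */
static void
provider_init(void)
{
    *toy_find_rendezvous_variable("provider_api") = &provider_api;
}

/* What a consumer would do. It looks the slot up lazily, so load order doesn't matter. */
static int
consumer_call(int x)
{
    ProviderApi *api = *toy_find_rendezvous_variable("provider_api");

    return api ? api->add_one(x) : -1;  /* -1: provider not loaded yet */
}
```

Note that the consumer never links against the provider. Both sides only agree on the variable name and the struct layout, usually via a shared header.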
+
Changes to improve visibility and management for logical decoding, reorder buffering, historic snapshot management and output plugin API.
  
=== Hooks ===
+
The reorderbuffer's resource use is totally opaque to monitoring right now. There's no useful way to know much about how it's using memory and all that can really be known about transactions spilled to disk is their xid.
  
A "hook" is a global variable of pointer-to-function type. PostgreSQL calls the hook function 'instead of' a standard postgres function if the variable is set at the relevant point in execution of some core routine. The hook variable is usually set by extension code to run new code before and/or after existing core code, usually from '''shared_preload_libraries''' or '''session_preload_libraries'''.
+
It's also nearly impossible to know much about what reorder buffering and logical decoding are working on right now without the use of gdb or something like perf dynamic userspace probes. That's not always practical in production environments.
  
If the hook variable was already set when an extension loads the extension must remember the previous hook value and call it; otherwise it generally calls the original core PostgreSQL routine.
+
Suggestions:
  
See separate article on entry points for extending PostgreSQL for list of existing hooks.
+
==== Logical decoding and reorder buffering stats in '''struct WalSnd''' ====
  
An example is the '''ProcessUtility_hook''' which is used to intercept and wrap, or entirely suppress, utility commands. A utility command is any "non plannable" SQL command, anything other than '''SELECT'''/'''INSERT'''/'''UPDATE'''/'''DELETE'''. A real example can be found in '''contrib/pg_stat_statements/pg_stat_statements.c''', but a trivial demo (click to expand) is:
+
Add some basic running accounting of reorder buffer stats to '''struct WalSnd''' per the following sample:
  
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;">
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;">
 
<syntaxhighlight lang="C" line='line'>
 
<syntaxhighlight lang="C" line='line'>
  
static ProcessUtility_hook_type next_ProcessUtility_hook;
+
        /* Statistics for total reorder buffered txns */
 
+
        int32          reorderBufferedTxns;
static void
+
        int32          reorderBufferedSnapshots;
demo_ProcessUtility_hook(PlannedStmt *pstmt,
+
        int64          reorderBufferedEventCount;
                                          const char *queryString, ProcessUtilityContext context,
+
         int64          reorderBufferedBytes;
                                          ParamListInfo params,
 
                                          QueryEnvironment *queryEnv,
 
                                          DestReceiver *dest, char *completionTag)
 
{
 
  /* Do something silly to show how the hook can work */
 
  if (IsA(pstmt->utilityStmt, TransactionStmt))
 
  {
 
    TransactionStmt *stmt = (TransactionStmt *) pstmt->utilityStmt;
 
    if (stmt->kind == TRANS_STMT_PREPARE && !superuser())
 
         ereport(ERROR,
 
                (errmsg("MyDemoExtension prohibits non-superusers from using PREPARE TRANSACTION")));
 
  }
 
  
  /* Call next hook if registered, or original postgres stmt */
+
        /* Statistics for transactions spilled to disk. */
  if (next_ProcessUtility_hook)
+
        int32          spillTxns;
    next_ProcessUtility_hook(pstmt, queryString, context, params, queryEnv, dest, completionTag);
+
        int32          spillSnapshots;
  else
+
        int64          spillEventCount;
    standard_ProcessUtility(pstmt, queryString, context, params, queryEnv, dest, completionTag);
+
        int64          spillBytes;
  
   if (completionTag)
+
        /*
     ereport(LOG,
+
        * When in ReorderBufferCommit for a txn, basic info about
            (errmsg("MyDemoExtension allowed utility statement %s to run", completionTag)));
+
        * the txn being processed.
}
+
        *
 +
        * We already report the progress
 +
        * lsn as the sent lsn, but it can't go backwards so we expose
 +
        * the txn-specific lsn here too. And the oldest lsn relevant
 +
        * to the txn is also worth knowing to give an indication of
 +
        * xact duration and to compare to restart_lsn.
 +
        */
 +
        TransactionId   reorderBufferCommitXid;
 +
        XLogRecPtr      reorderBufferCommitRecEndLSN;
 +
        TimestampTz     reorderBufferCommitTimestamp;
 +
        XLogRecPtr      reorderBufferCommitXactBeginLSN;
 +
        XLogRecPtr      reorderBufferCommitSentRecLSN;
  
void
 
_PG_init(void)
 
{
 
  next_ProcessUtility_hook = ProcessUtility_hook;
 
  ProcessUtility_hook = demo_ProcessUtility_hook;
 
}
 
 
</syntaxhighlight>
 
</syntaxhighlight>
 
</div>
 
</div>
  
==== Existing hooks ====
+
==== Reorder buffer inspection functions ====
  
To list all hooks that follow the convention of '''HookName_hook_type HookName_hook''' and are exposed as public API, run:
+
Also (maybe) add reorderbuffer API functions to let output plugins and other tools inspect reorder buffer state within a walsender in finer detail.
 
 
<pre>
 
git grep "PGDLLIMPORT .*_hook_type" src/include/
 
</pre>
 
 
 
At time of writing these hooks were:
 
  
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;">
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;">
<br/>
 
<syntaxhighlight lang="C" line='line'>
 
src/include/catalog/objectaccess.h:extern PGDLLIMPORT object_access_hook_type object_access_hook;
 
src/include/commands/explain.h:extern PGDLLIMPORT ExplainOneQuery_hook_type ExplainOneQuery_hook;
 
src/include/commands/explain.h:extern PGDLLIMPORT explain_get_index_name_hook_type explain_get_index_name_hook;
 
src/include/commands/user.h:extern PGDLLIMPORT check_password_hook_type check_password_hook;
 
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorStart_hook_type ExecutorStart_hook;
 
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorRun_hook_type ExecutorRun_hook;
 
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorFinish_hook_type ExecutorFinish_hook;
 
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorEnd_hook_type ExecutorEnd_hook;
 
src/include/executor/executor.h:extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 
src/include/fmgr.h:extern PGDLLIMPORT needs_fmgr_hook_type needs_fmgr_hook;
 
src/include/fmgr.h:extern PGDLLIMPORT fmgr_hook_type fmgr_hook;
 
src/include/libpq/auth.h:extern PGDLLIMPORT ClientAuthentication_hook_type ClientAuthentication_hook;
 
src/include/optimizer/paths.h:extern PGDLLIMPORT set_rel_pathlist_hook_type set_rel_pathlist_hook;
 
src/include/optimizer/paths.h:extern PGDLLIMPORT set_join_pathlist_hook_type set_join_pathlist_hook;
 
src/include/optimizer/paths.h:extern PGDLLIMPORT join_search_hook_type join_search_hook;
 
src/include/optimizer/plancat.h:extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;
 
src/include/optimizer/planner.h:extern PGDLLIMPORT planner_hook_type planner_hook;
 
src/include/optimizer/planner.h:extern PGDLLIMPORT create_upper_paths_hook_type create_upper_paths_hook;
 
src/include/parser/analyze.h:extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
 
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_permissive;
 
src/include/rewrite/rowsecurity.h:extern PGDLLIMPORT row_security_policy_hook_type row_security_policy_hook_restrictive;
 
src/include/storage/ipc.h:extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
src/include/tcop/utility.h:extern PGDLLIMPORT ProcessUtility_hook_type ProcessUtility_hook;
 
src/include/utils/elog.h:extern PGDLLIMPORT emit_log_hook_type emit_log_hook;
 
src/include/utils/lsyscache.h:extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;
 
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_relation_stats_hook_type get_relation_stats_hook;
 
src/include/utils/selfuncs.h:extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
</syntaxhighlight>
 
</div>
 
  
=== Callbacks ===
+
These would be fairly simple to add given that we'd already need to add a running memory accounting counter and entry counter to each ReorderBufferTXN:
  
PostgreSQL accepts callback functions in a wide variety of places. Function pointers can be passed to individual postgres API functions for immediate use or to store in created objects for later invocation. They're distinct from hooks mainly in that they're scoped to some object or function call, not chained off a global variable.
+
* '''List *ReorderBufferGetTXNs(ReorderBuffer *rb)''' or something like it to list all toplevel xids for which there are reorder buffers. Maybe just expose an iterator over '''ReorderBuffer.toplevel_by_lsn''' to avoid lots of copies?
 +
* '''void ReorderBufferTXNGetSize(ReorderBuffer *rb, ReorderBufferTXN *txn, size_t *inmemory, size_t *ondisk, uint64 *allchangecount, uint64 *rowchangecount, bool *has_catalog_changes)''' - get stats on one reorder buffered top-level txn.
  
For example, extension-defined GUCs can register hooks that're called before and after the GUC value is changed. See '''include/utils/guc.h''':
+
These would be very useful for output plugins that wish to offer some insight into xact progress, plugins that do their own spooling of processed txns, etc. They'd also be great when debugging the server.
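As a rough illustration of why running counters make such an accessor cheap, here's a standalone sketch. Everything in it is hypothetical: '''ToyTxn''' is not '''ReorderBufferTXN''', and '''toy_txn_get_size''' only borrows the shape of the '''ReorderBufferTXNGetSize''' signature suggested above. It also assumes spilled bytes equal in-memory bytes, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy per-transaction accounting; field and function names are hypothetical. */
typedef struct ToyTxn
{
    size_t      inmemory_bytes;
    size_t      ondisk_bytes;
    uint64_t    all_changes;
    uint64_t    row_changes;
    bool        has_catalog_changes;
} ToyTxn;

/* Called whenever a change is queued, so readers never need to rescan the queue. */
static void
toy_txn_add_change(ToyTxn *txn, size_t size, bool is_row_change)
{
    txn->inmemory_bytes += size;
    txn->all_changes++;
    if (is_row_change)
        txn->row_changes++;
}

/* Called when buffered changes are spilled: the bytes move from memory to disk. */
static void
toy_txn_spill(ToyTxn *txn)
{
    txn->ondisk_bytes += txn->inmemory_bytes;
    txn->inmemory_bytes = 0;
}

/* Shaped after the proposed accessor; O(1) because it only copies counters. */
static void
toy_txn_get_size(const ToyTxn *txn, size_t *inmemory, size_t *ondisk,
                 uint64_t *allchangecount, uint64_t *rowchangecount,
                 bool *has_catalog_changes)
{
    assert(txn != NULL);
    *inmemory = txn->inmemory_bytes;
    *ondisk = txn->ondisk_bytes;
    *allchangecount = txn->all_changes;
    *rowchangecount = txn->row_changes;
    *has_catalog_changes = txn->has_catalog_changes;
}
```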
  
<syntaxhighlight lang="C" line='line'>
+
</div>
/*...*/
 
typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource source);
 
/*...*/
 
typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 
/*...*/
 
extern void DefineCustomStringVariable(const char *name,
 
                                      /*...*/
 
                                      GucStringCheckHook check_hook,
 
                                      GucStringAssignHook assign_hook,
 
                                      GucShowHook show_hook);
 
  
</syntaxhighlight>
+
==== Logical rep related trace events (perf/dtrace/systemtap etc) ====
  
Another example is '''MemoryContext''' callbacks, where a callback can be registered to perform destructor-like actions via '''MemoryContextRegisterResetCallback(...)'''.
+
Add a bunch of '''TRACE_POSTGRESQL_''' trace events for perf/dtrace/systemtap/etc for the following activities within postgres.
  
==== Existing callbacks ====
+
Statically defined trace events are *very* cheap, effectively free, when unused and offer a huge insight into the system's operation. Placing static events instead of relying on dynamic probes:
  
===== Lifecycle callbacks =====
+
* gives insight into production servers where debuginfo may not be present
 +
* lets us expose more useful arguments
 +
* serves to document points of interest and make them discoverable
 +
* works across server versions better since they're more stable and consistent
 +
* frees the user from having to find relevant function names and args
 +
* ... and they can be used in gdb too
  
Extensions can use postmaster and backend lifecycle callbacks including
+
Proposed events list follows.
  
* '''before_shmem_exit'''
+
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;">
* '''on_proc_exit'''
 
* '''on_shmem_exit'''
 
  
There are also transaction lifecycle callbacks:
+
''walsender:''
  
* '''RegisterXactCallback'''
+
* walsender started
 +
* walsender sleeping
 +
  * waiting for more WAL to be flushed, client activity or timeout
 +
  * waiting for socket to be writeable
 +
* walsender woken (perhaps make this a postgres wide event on wait start and wait finish with reason like latch set)
 +
* walsender send buffer flushed
 +
* walsender send buffer appended to (size)
 +
* walsender signalled
 +
* walsender state change
 +
* walsender exiting
  
Cache invalidation callbacks:
+
''xlogreader:''
  
* '''CacheRegisterRelcacheCallback'''
+
* xlogreader switched to a new segment
* '''CacheRegisterSyscacheCallback'''
+
* xlogreader fetched new page
 +
* xlogreader returned a record
  
and many many more.
+
logical decoding:
  
Most of these work more like overrideable hooks in that they're generally part of the process-wide state.
+
* decoding context created
 +
* decoding for new slot creation started
 +
* decoding for new slot creation finished, slot ready
 +
* logical decoding processed any record from any rmgr (start_lsn, end_lsn)
 +
* logical trace events for each rmgr and record-type
 +
* logical decoding end of txn
  
===== errcontext callbacks =====
+
snapbuild:
  
Extensions can define their own errcontext callbacks. When log messages ('''elog''' or '''ereport''') are prepared these errcontext callbacks are called to annotate the error message by appending to the '''CONTEXT''' field.
+
* snapbuild state change (newstate)
 +
* snapbuild build snapshot
 +
* snapbuild free snapshot
 +
* snapbuild discard snapshot
 +
* serialized snapshot to disk
 +
* deserialized snapshot from disk
 +
* snapbuild export full data snapshot
  
errcontext callbacks generally follow the call-stack, with new entries pushed onto the errcontext callback stack on entry to a function and popped on exit. The errcontext stack is automatically unwound by PostgreSQL's exception handling macros '''PG_TRY()''' and '''PG_CATCH()''' so there is no need for a '''PG_CATCH()''' to restore the errcontext stack and '''PG_RE_THROW()'''.
+
''Reorder buffering:''
  
See existing usage in core for examples.
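The push-on-entry, pop-on-exit discipline can be sketched in standalone C as below. This is a toy model, not PostgreSQL's implementation: '''ToyErrorContextCallback''', '''toy_error_context_stack''', '''toy_collect_context''' and '''load_file''' are hypothetical names, and the "error report" is just a string buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model of an errcontext callback stack; names are illustrative only. */
typedef struct ToyErrorContextCallback
{
    struct ToyErrorContextCallback *previous;
    void        (*callback) (void *arg, char *buf, size_t buflen);
    void       *arg;
} ToyErrorContextCallback;

static ToyErrorContextCallback *toy_error_context_stack = NULL;

/* Walk from innermost to outermost entry, letting each callback append a context line. */
static void
toy_collect_context(char *buf, size_t buflen)
{
    buf[0] = '\0';
    for (ToyErrorContextCallback *c = toy_error_context_stack; c != NULL; c = c->previous)
        c->callback(c->arg, buf, buflen);
}

/* Callback that appends one line describing what load_file() was doing. */
static void
load_file_context(void *arg, char *buf, size_t buflen)
{
    size_t      used = strlen(buf);

    assert(used < buflen);
    snprintf(buf + used, buflen - used, "while loading \"%s\"\n", (const char *) arg);
}

static void
load_file(const char *path, char *ctxbuf, size_t ctxlen)
{
    ToyErrorContextCallback cb;

    /* Push on entry. Note that cb lives in this function's stack frame. */
    cb.previous = toy_error_context_stack;
    cb.callback = load_file_context;
    cb.arg = (void *) path;
    toy_error_context_stack = &cb;

    /* Stand-in for work that reports a message and so collects context. */
    toy_collect_context(ctxbuf, ctxlen);

    /* Pop before returning, otherwise the stack keeps a pointer to a dead frame. */
    toy_error_context_stack = cb.previous;
}
```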
+
* reorder buffer created for newly seen xid (xid)
 +
* detected toplevel xid has catalog changes (rbtxn, xid)
 +
* add event to reorder buffer
 +
  * All traces have (rbtxn, xid, lsn, event_kind, event_size)
 +
  * change event traces also report affected relfilenode
 +
* discarded reorder buffer (rbtxn, xid)
 +
* started to spill reorder buffer to disk (rbtxn, xid)
 +
* finished spilling reorder buffer to disk (rbtxn, xid, byte_size, event_count)
 +
* discarded spilled reorder buffer (rbtxn, xid)
  
'''Warning''': failing to pop an errcontext callback can have very confusing results as the context pointer will point to stack that has since been re-used so it will attempt to treat some unpredictable value as a function pointer for the errcontext callback. See [https://www.2ndquadrant.com/en/blog/dev-corner-error-context-stack-corruption/ this blog post for details].
+
''output plugins:''
  
 +
* before-ReorderBufferCommit (rbtxn, xid, commit_lsn, from_disk, has_catalog_changes)
 +
* before and after all output plugin callbacks
 +
* output plugin wrote data (size in bytes)
  
=== Abstract interfaces with function pointer implementations ===
+
Also add functions output plugins may call to report trace events based on what they do with rows in their callbacks, in particular one to report that the plugin skipped over (discarded) a change.
  
In many places PostgreSQL follows the pseudo-OO C convention of defining an interface as a struct of function pointers, then calling methods of the interface via the function pointers.
+
</div>
  
Some other extension point is generally used to register these, such as a dlsym'd function, a callback, etc.
+
==== Logical decoding output plugin reorder buffer event filter callback ====
  
One of many examples is the logical decoding interface. PostgreSQL calls:
+
Logical decoding output plugin callback to filter row-change events as they are decoded, before they are added to the reorder buffer.
  
<syntaxhighlight lang="C" line='line'>
+
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;">
void _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
</syntaxhighlight>
 
  
when loading an extension library as an output plugin. This assigns extension-defined function pointers to members of the passed '''OutputPluginCallbacks''' struct, e.g.
+
This one is trickier than it looks because when we're doing logical decoding we don't generally have an active historic snapshot. We only have a reliable snapshot during '''ReorderBufferCommit''' processing once the transaction commit has been seen and the already buffered changes are being passed from the reorder buffer to the output plugin's callbacks.
  
<syntaxhighlight lang="C" line='line'>
+
The problem with that is that the output plugin may have no interest in potentially large amounts of the data being reorder-buffered, so PostgreSQL does a lot of work and disk writes decoding and buffering the data that the output plugin will throw away as soon as it sees it.
void
 
_PG_output_plugin_init(OutputPluginCallbacks *cb)
 
{
 
    AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
 
  
    cb->startup_cb = pg_decode_startup;
+
The reorder buffer can already skip over changes that are from a different database. There's also a filter callback output plugins can use to skip reorder buffering of whole transactions based on their replication origin ID which helps replication systems not reorder-buffer transactions that were replicated from peer nodes.
    cb->begin_cb = pg_decode_begin_txn;
 
    cb->change_cb = pg_decode_change;
 
    cb->truncate_cb = pg_decode_truncate;
 
    cb->commit_cb = pg_decode_commit_txn;
 
    cb->filter_by_origin_cb = pg_decode_filter;
 
    cb->shutdown_cb = pg_decode_shutdown;
 
    cb->message_cb = pg_decode_message;
 
}
 
</syntaxhighlight>
 
  
... each of which conforms to a specific signature and is invoked at specific points in execution.
+
But plugins have 'no way to filter the data going into the reorder buffer by table or key.' All data for all tables in a non-excluded transaction is always reorder-buffered in full.
  
Many of these share a common state structure defined in PostgreSQL's headers and passed to each callback in the interface. For logical decoding that's '''LogicalDecodingContext''' from '''include/replication/logical.h'''.
+
That's a big problem for a few use cases including:
  
To allow extensions to track their own state PostgreSQL usually defines a '''void*''' private data member in such state structures, e.g. '''output_plugin_private''' in '''LogicalDecodingContext'''.
+
* Replication slots that are only interested in one specific table, e.g. during a resynchronization operation
 +
* Workloads with very high churn queue tables etc that are not replicated but coexist in a database with data that must be replicated
  
See '''contrib/test_decoding/test_decoding.c''' for example usage.
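Stripped of all PostgreSQL specifics, the convention looks roughly like the standalone sketch below. The names ('''ToyContext''', '''ToyCallbacks''', '''toy_plugin_init''', '''toy_core_decode''') are hypothetical analogues of '''LogicalDecodingContext''', '''OutputPluginCallbacks''' and '''_PG_output_plugin_init''', not real API.

```c
#include <assert.h>
#include <stddef.h>

/* Shared state passed to every callback; plugin_private is opaque to the caller. */
typedef struct ToyContext
{
    void       *plugin_private;
} ToyContext;

/* The "interface": a struct of function pointers filled in by the plugin. */
typedef struct ToyCallbacks
{
    void        (*startup_cb) (ToyContext *ctx);            /* required */
    void        (*change_cb) (ToyContext *ctx, int nbytes); /* optional */
} ToyCallbacks;

/* Plugin-side state, reachable only through ctx->plugin_private. */
typedef struct CounterState
{
    int         changes;
    long        bytes;
} CounterState;

static CounterState counter_state;  /* a real plugin would allocate this instead */

static void
counter_startup(ToyContext *ctx)
{
    counter_state.changes = 0;
    counter_state.bytes = 0;
    ctx->plugin_private = &counter_state;
}

static void
counter_change(ToyContext *ctx, int nbytes)
{
    CounterState *st = ctx->plugin_private;

    st->changes++;
    st->bytes += nbytes;
}

/* Plugin entry point: assign implementations to the interface members. */
static void
toy_plugin_init(ToyCallbacks *cb)
{
    cb->startup_cb = counter_startup;
    cb->change_cb = counter_change;
}

/* "Core" side: load the interface, then drive it without knowing the plugin's types. */
static void
toy_core_decode(ToyContext *ctx, const int *sizes, int n)
{
    ToyCallbacks cb = {0};

    toy_plugin_init(&cb);
    assert(cb.startup_cb != NULL);
    cb.startup_cb(ctx);
    for (int i = 0; i < n; i++)
        if (cb.change_cb != NULL)
            cb.change_cb(ctx, sizes[i]);
}
```

The caller only ever sees the function pointers and a '''void*''', so the plugin can change its private state layout freely.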
+
</div>
 
 
=== Extension of shared memory and IPC primitives ===
 
 
 
Extensions may use a wide variety of core features relating to shared memory, registering their own:
 
 
 
* shared memory segments - '''RequestAddinShmemSpace''', '''shmem_startup_hook''' and '''ShmemInitStruct''' in '''storage/shmem.h'''
 
* lightweight lock tranches (LWLock) - '''LWLockRegisterTranche''' etc in '''storage/lwlock.h'''
 
* latches - '''storage/latch.h'''
 
* dynamic shared memory (DSM) - '''storage/dsm.h'''
 
* dynamic shared memory areas (DSA) - '''utils/dsa.h'''
 
* shared-memory queues (shm_mq) - '''storage/shm_mq.h'''
 
* condition variables - '''storage/condition_variable.h'''
 
 
 
Extensions may use PostgreSQL's process latches too; most of the time they can just use their own '''&MyProc->procLatch''' or set another backend's latch from its '''PGPROC''' entry.
 
 
 
=== Background workers (bgworkers) ===
 
 
 
Extensions may register new PostgreSQL backends that exist independently of any client connection.
 
 
 
A bgworker runs as a child of the postmaster. It usually has full access to a particular database as if it was a user backend started on that DB. bgworkers control their own signal handling, run their own event loop, can do their own file and socket I/O, link to or dynamically load arbitrary C libraries, etc etc. They have broad access to most PostgreSQL internal facilities - they can do low level table manipulation with genam or heapam, they can use the SPI, they can start other bgworkers, etc.
 
 
 
There are two kinds of bgworker, static and dynamic. Static workers can only be registered at '''_PG_init''' time in '''shared_preload_libraries'''. Dynamic workers can be launched at any time *after* startup completes. New code usually uses dynamic workers launched from a hook on
 
 
 
Considerable care is needed to get background worker implementations correct. At time of writing they do not have any way to use 
 
 
 
=== Defining various server objects from extensions ===
 
 
 
Extensions can create all sorts of server objects. GUCs (configuration variables) are one of many such examples along with all the usual SQL-visible stuff implemented with SQL-callable C functions like index access methods.
 
 
 
A non-exhaustive list includes:
 
 
 
==== SQL-callable C functions ====
 
 
 
==== Data types ====
 
  
==== Security label providers ====
+
== TODO: New kinds of extension point ==
  
=== Generic WAL (generic xlog) ===
+
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend as described below.
 
 
Generic WAL is a PostgreSQL feature that lets extensions create and use relations with pages in an extension-defined format.
 
 
 
The extension writes custom WAL records with extension-defined payloads. PostgreSQL applies the WAL in a crash-safe, consistent manner on the master and any physical replicas. The extension may then read pages from the relation for whatever purpose it needs.
 
 
 
See '''generic_xlog.h''' and '''generic_xlog.c'''.
 
 
 
Note that 'extensions may not register redo callbacks for generic WAL' so they cannot run their own code during crash-recovery or replica WAL replay. Extensions may only read the relation's pages once the changes are applied.
 
 
 
See '''contrib/bloom/''' for an index implementation built on top of generic WAL.
 
 
 
There is not currently any logical decoding support for generic WAL records. They cannot be reorder-buffered and there is no output plugin callback that accepts them.
 
 
 
=== Logical WAL messages ===
 
 
 
Logical WAL messages provide a WAL-consistent, crash-safe and optionally-transactional one-way communication channel from upstream bgworkers and user backends/functions to downstream receivers of logical decoding output plugin data streams.
 
 
 
Extensions may write "logical WAL messages" with a label string and an arbitrary extension-defined payload to WAL. The label is used to allow extensions to identify their own messages and ignore messages from other extensions. These logical WAL messages are passed to a message handler callback on all logical decoding output plugins that implement the handler. The output plugin is expected to know which messages it is interested in and ignore the rest. The output plugin may use the message content to change plugin state internally and/or write a message in a plugin-defined format to its client output stream.
 
 
 
There are two message types. Transactional messages are reorder-buffered and decoded as part of a transaction. Non-transactional messages are not reorder-buffered; instead the output plugin's message handler callback is invoked as soon as the message is decoded from WAL.
 
 
 
See '''replication/message.h''' and the '''message_cb''' callback in '''struct OutputPluginCallbacks''' in '''replication/output_plugin.h'''.
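The "know your own messages and ignore the rest" rule might look like the standalone sketch below. It is not written against PostgreSQL headers: '''ToyPluginState''', '''toy_message_cb''' and the label "my_extension" are hypothetical, and the parameter list is only loosely modelled on '''message_cb''' (where the label is called the prefix).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MY_PREFIX "my_extension"    /* hypothetical label used by "our" extension */

/* Toy plugin state; a real plugin would keep this in its private data pointer. */
typedef struct ToyPluginState
{
    int         handled;
    int         ignored;
    size_t      last_size;
} ToyPluginState;

/* Toy message handler: act only on messages carrying our own label. */
static void
toy_message_cb(ToyPluginState *state, bool transactional,
               const char *prefix, size_t message_size, const char *message)
{
    assert(prefix != NULL);
    (void) transactional;   /* a real handler may treat the two kinds differently */
    (void) message;         /* payload format is entirely extension-defined */

    if (strcmp(prefix, MY_PREFIX) != 0)
    {
        /* Some other extension's message: not ours to interpret. */
        state->ignored++;
        return;
    }

    state->handled++;
    state->last_size = message_size;
}
```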
 
 
 
Logical WAL messages are treated as no-ops during crash recovery redo and physical replica replay. They have no effect on the heap and there are no callbacks or hooks that can handle them at redo time. They're ignored by everything except logical decoding.
 
 
 
== Wishlist for other extension point types ==
 
 
 
There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend.
 
  
 
=== Cache management and cache invalidation ===
 
Line 358: Line 216:
 
Add your proposed points and use cases here.
 
  
=== DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===
+
=== Invoking extension code for existing '''TRACE_POSTGRESQL_''' tracepoints ===
 +
 
 +
Currently PostgreSQL defines '''TRACE_POSTGRESQL_''' tracepoints as thin wrappers around DTrace (see below).
 +
 
 +
It'd be very useful if it were possible for extensions to intercept these and run their own code. The most obvious case is to integrate other tracing systems, particularly things like distributed / full-stack tracing engines such as Zipkin, Jaeger (OpenTracing), OpenCensus, etc.
 +
 
 +
This would have to be done with extreme care as there are tracepoints in very hot paths like LWLock acquisition. We could possibly get away with a hook function pointer, but even that might have too much impact.
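To give a feel for the "hook function pointer" option, here is a standalone sketch. None of these names exist in PostgreSQL: '''toy_trace_hook''', '''TOY_TRACE''' and '''toy_lock_acquire''' are hypothetical. With no hook installed, each tracepoint costs a load and a branch, which is exactly the overhead that would need measuring on hot paths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: receives an event name and one integer argument. */
typedef void (*toy_trace_hook_type) (const char *event, long arg);

/* NULL means no extension is listening. */
static toy_trace_hook_type toy_trace_hook = NULL;

/* Tracepoint wrapper: test the pointer first so the unused case stays cheap. */
#define TOY_TRACE(event, arg) \
    do { \
        if (toy_trace_hook != NULL) \
            toy_trace_hook((event), (arg)); \
    } while (0)

static long toy_events_seen = 0;
static long toy_last_arg = 0;

/* Example listener, standing in for a bridge to some external tracing system. */
static void
toy_counting_tracer(const char *event, long arg)
{
    assert(event != NULL);
    toy_events_seen++;
    toy_last_arg = arg;
}

/* Stand-in for a hot-path routine that contains a tracepoint. */
static void
toy_lock_acquire(long lock_id)
{
    TOY_TRACE("lwlock-acquire", lock_id);
    /* ... the real work would happen here ... */
}
```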
 +
 
 +
=== Extension-defined DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs) ===
 +
 
 +
Give extensions an easy way to add new trace events to their own code, to be exposed to SDT when the extension is loaded. This probably means PGXS support for processing an extension specific '''.d''' file and linking it in + possibly some runtime hint to tell the tracing provider to look for it.
 +
 
 +
<div class="toccolours mw-collapsible mw-collapsed" style="width:100%; overflow:auto;">
  
 
PostgreSQL accepts the configure option '''--enable-dtrace''' to generate [http://dtrace.org/guide/chp-sdt.html DTrace-compatible statically defined tracepoint events ]. Usually this uses [https://sourceware.org/systemtap/ systemtap] on Linux.
 
Line 371: Line 241:
  
 
Extensions may duplicate the same build logic and define their own providers though.
 
 +
 +
</div>

Revision as of 04:33, 9 August 2019

TODO: Hooks, callbacks and trace points

This TODO/wishlist sub-section is intended for all users and developers to edit to add their own thoughts on desired extension points within the core PostgreSQL codebase.

Existing APIs usable from extensions

There are a great many existing extension points in PostgreSQL. The article PostgresServerExtensionPoints lists them with references to core documentation, entrypoints in core code, etc.

TODO: New hooks, callbacks and tracepoints

Add the hooks, callbacks, etc you'd like to see added here along with why they'd be useful and any considerations of performance impact etc, categorizing them where it makes sense.

Logical decoding etc

CR

Changes to improve visibility and management for logical decoding, reorder buffering, historic snapshot management and output plugin API.

The reorderbuffer's resource use is totally opaque to monitoring right now. There's no useful way to know much about how it's using memory and all that can really be known about transactions spilled to disk is their xid.

It's also nearly impossible to know much about what reorder buffering and logical decoding are working on right now without the use of gdb or something like perf dynamic userspace probes. That's not always practical in production environments.

Suggestions:

Logical decoding and reorder buffering stats in struct WalSnd

Add some basic running accounting of reorder buffer stats to struct WalSnd per the following sample:

    /* Statistics for total reorder buffered txns */
    int32           reorderBufferedTxns;
    int32           reorderBufferedSnapshots;
    int64           reorderBufferedEventCount;
    int64           reorderBufferedBytes;

    /* Statistics for transactions spilled to disk. */
    int32           spillTxns;
    int32           spillSnapshots;
    int64           spillEventCount;
    int64           spillBytes;

    /*
     * When in ReorderBufferCommit for a txn, basic info about
     * the txn being processed.
     *
     * We already report the progress
     * lsn as the sent lsn, but it can't go backwards so we expose
     * the txn-specific lsn here too. And the oldest lsn relevant
     * to the txn is also worth knowing to give an indication of
     * xact duration and to compare to restart_lsn.
     */
    TransactionId   reorderBufferCommitXid;
    XLogRecPtr      reorderBufferCommitRecEndLSN;
    TimestampTz     reorderBufferCommitTimestamp;
    XLogRecPtr      reorderBufferCommitXactBeginLSN;
    XLogRecPtr      reorderBufferCommitSentRecLSN;

Reorder buffer inspection functions

Also (maybe) add reorderbuffer API functions to let output plugins and other tools inspect reorder buffer state within a walsender in finer detail.

These would be fairly simple to add given that we'd already need to add a running memory accounting counter and entry counter to each ReorderBufferTXN:

  • List *ReorderBufferGetTXNs(ReorderBuffer *rb) or something like it to list all toplevel xids for which there are reorder buffers. Maybe just expose an iterator over ReorderBuffer.toplevel_by_lsn to avoid lots of copies?
  • void ReorderBufferTXNGetSize(ReorderBuffer *rb, ReorderBufferTXN *txn, size_t *inmemory, size_t *ondisk, uint64 *allchangecount, uint64 *rowchangecount, bool *has_catalog_changes) - get stats on one reorder buffered top-level txn.

These would be very useful for output plugins that wish to offer some insight into xact progress, plugins that do their own spooling of processed txns, etc. They'd also be great when debugging the server.
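
A minimal sketch of why running counters make the proposed ReorderBufferTXNGetSize cheap. The DemoReorderBufferTXN struct and the demo_ functions are hypothetical stand-ins invented for illustration, not the real reorder buffer structures; only the shape of the getter follows the signature proposed above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for ReorderBufferTXN with running counters. */
typedef struct DemoReorderBufferTXN
{
	uint32_t	xid;
	size_t		inmemory_bytes;		/* bytes currently buffered in memory */
	size_t		ondisk_bytes;		/* bytes spilled to disk so far */
	uint64_t	change_count;		/* all buffered events */
	uint64_t	row_change_count;	/* row-change events only */
	bool		has_catalog_changes;
} DemoReorderBufferTXN;

/* Called wherever a change is queued: keeps the counters current. */
static void
demo_txn_add_change(DemoReorderBufferTXN *txn, size_t size, bool is_row_change)
{
	txn->inmemory_bytes += size;
	txn->change_count++;
	if (is_row_change)
		txn->row_change_count++;
}

/* Called when the transaction is spilled: memory use moves to disk. */
static void
demo_txn_spill(DemoReorderBufferTXN *txn)
{
	txn->ondisk_bytes += txn->inmemory_bytes;
	txn->inmemory_bytes = 0;
}

/* Shaped like the proposed getter: a constant-time read of the counters. */
static void
demo_txn_get_size(const DemoReorderBufferTXN *txn,
				  size_t *inmemory, size_t *ondisk,
				  uint64_t *allchangecount, uint64_t *rowchangecount,
				  bool *has_catalog_changes)
{
	*inmemory = txn->inmemory_bytes;
	*ondisk = txn->ondisk_bytes;
	*allchangecount = txn->change_count;
	*rowchangecount = txn->row_change_count;
	*has_catalog_changes = txn->has_catalog_changes;
}
```

Because the counters are maintained as changes are queued and spilled, an output plugin could poll the getter often without walking the change list.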

Logical rep related trace events (perf/dtrace/systemtap etc)

Add a bunch of TRACE_POSTGRESQL_ trace events for perf/dtrace/systemtap/etc for the following activities within postgres.

Statically defined trace events are *very* cheap, effectively free, when unused and offer a huge insight into the system's operation. Placing static events instead of relying on dynamic probes:

  • gives insight into production servers where debuginfo may not be present
  • lets us expose more useful arguments
  • serves to document points of interest and make them discoverable
  • works across server versions better since they're more stable and consistent
  • frees the user from having to find relevant function names and args
  • ... and they can be used in gdb too

Proposed events list follows.

walsender:

  • walsender started
  • walsender sleeping
      • waiting for more WAL to be flushed, client activity or timeout
      • waiting for socket to be writeable
  • walsender woken (perhaps make this a postgres wide event on wait start and wait finish with reason like latch set)
  • walsender send buffer flushed
  • walsender send buffer appended to (size)
  • walsender signalled
  • walsender state change
  • walsender exiting

xlogreader:

  • xlogreader switched to a new segment
  • xlogreader fetched new page
  • xlogreader returned a record

logical decoding:

  • decoding context created
  • decoding for new slot creation started
  • decoding for new slot creation finished, slot ready
  • logical decoding processed any record from any rmgr (start_lsn, end_lsn)
  • logical trace events for each rmgr and record-type
  • logical decoding end of txn

snapbuild:

  • snapbuild state change (newstate)
  • snapbuild build snapshot
  • snapbuild free snapshot
  • snapbuild discard snapshot
  • serialized snapshot to disk
  • deserialized snapshot from disk
  • snapbuild export full data snapshot

Reorder buffering:

  • reorder buffer created for newly seen xid (xid)
  • detected toplevel xid has catalog changes (rbtxn, xid)
  • add event to reorder buffer
      • all traces have (rbtxn, xid, lsn, event_kind, event_size)
      • change event traces also report affected relfilenode
  • discarded reorder buffer (rbtxn, xid)
  • started to spill reorder buffer to disk (rbtxn, xid)
  • finished spilling reorder buffer to disk (rbtxn, xid, byte_size, event_count)
  • discarded spilled reorder buffer (rbtxn, xid)

output plugins:

  • before-ReorderBufferCommit (rbtxn, xid, commit_lsn, from_disk, has_catalog_changes)
  • before and after all output plugin callbacks
  • output plugin wrote data (size in bytes)

Also add functions output plugins may call to report trace events based on what they do with rows in their callbacks, in particular one to report if the plugin skipped over (discarded) a change.

Logical decoding output plugin reorder buffer event filter callback

Logical decoding output plugin callback to filter row-change events as they are decoded, before they are added to the reorder buffer.

This one is trickier than it looks because when we're doing logical decoding we don't generally have an active historic snapshot. We only have a reliable snapshot during ReorderBufferCommit processing once the transaction commit has been seen and the already buffered changes are being passed from the reorder buffer to the output plugin's callbacks.

The problem with that is that the output plugin may have no interest in potentially large amounts of the data being reorder-buffered, so PostgreSQL does a lot of work and disk writes decoding and buffering the data that the output plugin will throw away as soon as it sees it.

The reorder buffer can already skip over changes that are from a different database. There's also a filter callback output plugins can use to skip reorder buffering of whole transactions based on their replication origin ID which helps replication systems not reorder-buffer transactions that were replicated from peer nodes.

But plugins have no way to filter the data going into the reorder buffer by table or key. All data for all tables in a non-excluded transaction is always reorder-buffered in full.

That's a big problem for a few use cases including:

  • Replication slots that are only interested in one specific table, e.g. during a resynchronization operation
  • Workloads with very high churn queue tables etc that are not replicated but coexist in a database with data that must be replicated
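
A rough sketch of what such a filter callback could look like. The callback type demo_filter_change_cb, the DemoChange struct and the decode loop are all hypothetical. The sketch filters on relfilenode on the assumption that, without a historic snapshot, the plugin cannot resolve a relation name and only has the identifier carried in the WAL record.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a decoded row change. */
typedef struct DemoChange
{
	uint32_t	xid;			/* transaction that made the change */
	uint32_t	relfilenode;	/* on-disk relation identifier */
	size_t		size;			/* bytes the change would occupy */
} DemoChange;

/* Hypothetical callback: return true to skip buffering this change. */
typedef bool (*demo_filter_change_cb) (const DemoChange *change, void *arg);

/* Example plugin filter: keep only changes for one relfilenode. */
static bool
demo_filter_single_table(const DemoChange *change, void *arg)
{
	uint32_t	wanted = *(const uint32_t *) arg;

	return change->relfilenode != wanted;
}

/*
 * Stand-in for the decoding loop: returns the bytes that would be
 * reorder-buffered after applying the filter (a NULL filter keeps all).
 */
static size_t
demo_buffer_changes(const DemoChange *changes, size_t nchanges,
					demo_filter_change_cb filter, void *arg)
{
	size_t		buffered = 0;

	for (size_t i = 0; i < nchanges; i++)
	{
		if (filter != NULL && filter(&changes[i], arg))
			continue;			/* never enters the reorder buffer */
		buffered += changes[i].size;
	}
	return buffered;
}
```

The point of filtering this early is that skipped changes cost neither memory nor spill-to-disk I/O, which is where the two use cases above lose most of their time.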

TODO: New kinds of extension point

There are other sorts of functionality in PostgreSQL that are not presently extensible at all. Some of these would be wonderful to be able to extend, as described below.

Cache management and cache invalidation

PostgreSQL has a solid cache management system in the form of its relcache and catcache. See utils/relcache.h, utils/catcache.h and utils/inval.h.

Complex extensions often need to perform their own caching and invalidation on various extension-defined state, such as extension configuration read from tables. At time of writing there is no way for extensions to register their own cache types and use PostgreSQL's cache management for this, so they must implement their own cache management and invalidations - usually on top of a dynahash (utils/dynahash.h).

Wait Event types

Extensions have access to the PG_WAIT_EXTENSION WaitEvent type, but have no ability to define their own finer grained wait events. This limits how well complex extensions can be traced and monitored via pg_stat_activity and other wait-event aware interfaces.

Heavyweight lock types and tags

Being able to extend PostgreSQL's heavyweight locks with new lock types would be immensely useful for distributed and clustered applications. They often have to re-implement significant parts of the lock manager, and their own locks aren't then visible to the core deadlock detector etc. Extension-defined lock types would also be visible to users for monitoring in pg_locks.

TODO: set out example for how it might work

Deadlock detection

Extensions that make PostgreSQL instances participate in distributed systems can suffer from distributed deadlocks. PostgreSQL's existing deadlock detector could help with this if it was extensible with ways to discover new process dependency edges.

Alternately, if distributed locks could be represented as PostgreSQL heavyweight locks they'd be visible in pg_locks for monitoring and the deadlock detector could possibly handle them with its existing capabilities.
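
To illustrate the first idea, here is a toy wait-for graph in which an extension contributes extra edges through a callback before cycle detection runs. This is not how PostgreSQL's deadlock detector is implemented; demo_edge_provider, the fixed-size matrix and the depth-first search are all assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_PROCS 8

/* waits_for[a][b] is true when process a waits for process b. */
typedef struct DemoWaitGraph
{
	bool		waits_for[DEMO_MAX_PROCS][DEMO_MAX_PROCS];
} DemoWaitGraph;

/* Hypothetical callback through which an extension adds its own edges. */
typedef void (*demo_edge_provider) (DemoWaitGraph *graph);

static void
demo_add_edge(DemoWaitGraph *graph, int waiter, int holder)
{
	graph->waits_for[waiter][holder] = true;
}

/* Depth-first search; state: 0 = unvisited, 1 = on stack, 2 = done. */
static bool
demo_visit(const DemoWaitGraph *graph, int proc, int *state)
{
	state[proc] = 1;
	for (int next = 0; next < DEMO_MAX_PROCS; next++)
	{
		if (!graph->waits_for[proc][next])
			continue;
		if (state[next] == 1)
			return true;		/* back edge: a wait cycle */
		if (state[next] == 0 && demo_visit(graph, next, state))
			return true;
	}
	state[proc] = 2;
	return false;
}

/* Returns true if the graph, plus any extension edges, has a cycle. */
static bool
demo_has_deadlock(DemoWaitGraph *graph, demo_edge_provider provider)
{
	int			state[DEMO_MAX_PROCS];

	if (provider != NULL)
		provider(graph);
	memset(state, 0, sizeof(state));
	for (int proc = 0; proc < DEMO_MAX_PROCS; proc++)
	{
		if (state[proc] == 0 && demo_visit(graph, proc, state))
			return true;
	}
	return false;
}

/* Example provider: process 2 waits for process 1 via a remote lock. */
static void
demo_remote_edges(DemoWaitGraph *graph)
{
	demo_add_edge(graph, 2, 1);
}
```

With only the local edge "1 waits for 2" there is no cycle; once the provider adds the remote edge "2 waits for 1" the same search reports a deadlock.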

Transaction log, transaction visibility and commit

Some kinds of distributed database systems need a distributed transaction log.

Right now the PostgreSQL transaction log a.k.a. commit log (access/clog.h) isn't at all extensible and is backed by an SLRU (access/slru.h) on disk.

There have been discussions on -hackers about adding a transaction manager abstraction but none have led to any pluggable transaction management being committed.

Parser syntax extension points

Mechanisms to allow the parser to be extended with extension-defined syntax are requested semi-regularly on the -hackers list. This is a much harder problem than it looks though, especially with PostgreSQL's flex and bison based LALR(1) parser, which is generated as C code at build time, compiled along with the rest of the server, and statically linked into the server executable.

Some more targeted extension points are probably possible in places where the syntax can be well-bounded. For example, it might be practical to allow extensions to register new elements in WITH(...) lists such as in COPY ... WITH (FORMAT CSV, ...).
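
One possible shape for that kind of bounded extension point is a registry mapping option names to handler callbacks. The names demo_register_with_option and demo_apply_with_option are invented for illustration; nothing like them exists in core, and a real design would also need to deal with option value types and error reporting.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_OPTIONS 16

/* Hypothetical handler: receives the option's value and caller state. */
typedef bool (*demo_option_handler) (const char *value, void *state);

typedef struct DemoOptionEntry
{
	const char *name;
	demo_option_handler handler;
} DemoOptionEntry;

static DemoOptionEntry demo_options[DEMO_MAX_OPTIONS];
static size_t demo_noptions = 0;

/* Case-insensitive comparison, since option names are keywords. */
static bool
demo_name_equals(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == *b;
}

static const DemoOptionEntry *
demo_find_option(const char *name)
{
	for (size_t i = 0; i < demo_noptions; i++)
	{
		if (demo_name_equals(demo_options[i].name, name))
			return &demo_options[i];
	}
	return NULL;
}

/* Extension side: register a new WITH (...) element. False if full or duplicate. */
static bool
demo_register_with_option(const char *name, demo_option_handler handler)
{
	if (demo_noptions >= DEMO_MAX_OPTIONS || demo_find_option(name) != NULL)
		return false;
	demo_options[demo_noptions].name = name;
	demo_options[demo_noptions].handler = handler;
	demo_noptions++;
	return true;
}

/* Core side: called for options the built-in code does not recognise. */
static bool
demo_apply_with_option(const char *name, const char *value, void *state)
{
	const DemoOptionEntry *entry = demo_find_option(name);

	if (entry == NULL)
		return false;			/* caller reports an unrecognised option */
	return entry->handler(value, state);
}

/* Example handler that simply records the value it was given. */
static bool
demo_set_compression(const char *value, void *state)
{
	*(const char **) state = value;
	return true;
}
```

The grammar stays unchanged in this approach: the parser already accepts a generic name/value list, so only the code that interprets the list needs an extension point.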

Add your proposed points and use cases here.

Invoking extension code for existing TRACE_POSTGRESQL_ tracepoints

Currently PostgreSQL defines TRACE_POSTGRESQL_ tracepoints as thin wrappers around DTrace (see below).

It'd be very useful if it were possible for extensions to intercept these and run their own code. The most obvious case is to integrate other tracing systems, particularly things like distributed / full-stack tracing engines such as Zipkin, Jaeger (OpenTracing), OpenCensus, etc.

This would have to be done with extreme care as there are tracepoints in very hot paths like LWLock acquisition. We could possibly get away with a hook function pointer, but even that might have too much impact.
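
A minimal sketch of the hook-function-pointer approach, assuming a hypothetical demo_trace_hook variable and DEMO_TRACE macro (neither exists in PostgreSQL). When no extension has installed a hook, each tracepoint still costs a pointer load and a branch, which is the overhead that would need measuring on hot paths such as LWLock acquisition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook type: receives the event name and one argument. */
typedef void (*demo_trace_hook_type) (const char *event, uint64_t arg);

/* NULL means no extension is listening. */
static demo_trace_hook_type demo_trace_hook = NULL;

/* Hypothetical tracepoint wrapper: just a pointer test when unused. */
#define DEMO_TRACE(event, arg) \
	do { \
		if (demo_trace_hook != NULL) \
			demo_trace_hook((event), (uint64_t) (arg)); \
	} while (0)

/* Example extension-side hook that only counts events. */
static uint64_t demo_events_seen = 0;

static void
demo_counting_hook(const char *event, uint64_t arg)
{
	(void) event;
	(void) arg;
	demo_events_seen++;
}

/* Stand-in for a hot path containing a tracepoint. */
static void
demo_lock_acquire(uint64_t lock_id)
{
	DEMO_TRACE("lwlock-acquire", lock_id);
}
```

An extension would set demo_trace_hook = demo_counting_hook at load time, following the same save-and-chain convention used for other hook variables if more than one listener is expected.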

Extension-defined DTrace/Perf/SystemTAP/etc statically defined trace events (SDTs)

Give extensions an easy way to add new trace events to their own code, to be exposed to SDT when the extension is loaded. This probably means PGXS support for processing an extension specific .d file and linking it in + possibly some runtime hint to tell the tracing provider to look for it.

PostgreSQL accepts the configure option --enable-dtrace to generate DTrace-compatible statically defined tracepoint events. Usually this uses systemtap on Linux.

Events are defined as markers in the source code as TRACE_POSTGRESQL_EVENTNAME(...) function-like macros, which are no-ops unless trace event generation is enabled.

These events can be used by trace-event aware utilities including perf (Linux), ebpf-tools (Linux), systemtap (Linux), DTrace (Solaris/FreeBSD), etc to observe PostgreSQL's behaviour non-invasively. (They can also be used by gdb).

The PostgreSQL implementation translates src/backend/utils/probes.d to a C header src/backend/utils/probes.h that defines TRACE_POSTGRESQL_ events as wrappers for DTRACE_PROBE macros, which in turn are defined by /usr/include/sys/sdt.h as wrappers for _STAP_PROBE. That injects some asm placeholders that are used by tracing systems.
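
The layering can be illustrated with stand-in macros. Everything prefixed DEMO_ here is hypothetical and only mimics the shape of a generated header; the real macros come from probes.h and sys/sdt.h, and the real probe emits a placeholder instruction rather than incrementing a counter.

```c
/* Count of fired probes, only so the sketch has an observable effect. */
static int demo_probes_fired = 0;

#ifdef DEMO_ENABLE_DTRACE
/* Stand-in for the tracing provider's macro layer. */
#define DEMO_DTRACE_PROBE1(provider, name, arg1) \
	do { demo_probes_fired++; (void) (arg1); } while (0)

/* Stand-in for a generated TRACE_POSTGRESQL_ style wrapper. */
#define DEMO_TRACE_TRANSACTION_START(xid) \
	DEMO_DTRACE_PROBE1(demo, transaction__start, xid)
#else
/* Without tracing support the marker compiles to nothing. */
#define DEMO_TRACE_TRANSACTION_START(xid) do {} while (0)
#endif

/* Stand-in for server code that contains a trace marker. */
static unsigned int
demo_start_transaction(unsigned int xid)
{
	DEMO_TRACE_TRANSACTION_START(xid);
	return xid;
}
```

Building with -DDEMO_ENABLE_DTRACE switches the marker on; without it the call site contains no trace code at all, which is why unused static events are so cheap.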

At present PostgreSQL extensions don't have any way to use PostgreSQL's own tracepoint generation to add their own tracepoints in extension code.

Extensions may duplicate the same build logic and define their own providers though.