Slony Feature Brainstorming

This collects ideas from meetings held in November/December 2010 for the next major version of Slony.

It is quite likely that many of these features won't get implemented, based on a "more work than they're worth" evaluation.

The point of the exercise has been to capture a goodly set of ideas to make it possible to effectively evaluate what might be worth implementing.

slon extensions

Slon monitoring

per Chris Browne
slon records in a queryable form what it's working on
Requires writing (+COMMIT) at the start of the event loop
[As noted by Yehuda] This ought to also be useful with slonik, to allow indicating "whazzup"?
Bug #175

Debate took place surrounding various mechanisms. The only one without dramatically unacceptable properties was to add a table to the Slony-I schema.

SNMP: Existing standard for network-based monitoring; Downside - requires extra infrastructure including libraries and possibly additional tooling to use it; Bigger downside: SNMP support planned to be removed - see Bug #170
Spread: Fast; Downside - requires extra infrastructure that's not particularly "standard"; Would require substantial coding effort to ensure there are components listening; Would require not-insubstantial effort to ensure that there is suitable configuration
NOTIFY/LISTEN: Built-in to Postgres - no new infrastructure needed; In Postgres 9.0+ can carry payload; Only becomes visible upon COMMIT; Results are not stored; listeners must pre-declare their interest
SQL table: Built-in to Postgres - no new infrastructure needed; Only becomes visible upon COMMIT; Another table to manage and clean up, but if we limit the data stored (e.g. - no history) there won't be much maintenance required.

Monitoring Requirements

Crucial facts that we want to know about:

Is replication behind?
What are components (e.g. - slon, slonik) doing?
Note that slon has numerous threads
Recent abnormal states for events (e.g. - error messages)
Are any non-SYNC events outstanding?
Backlog volume?
What is the cluster's configuration?

Replication Behind?

The existing view sl_status captures this from a what-is-confirmed perspective. That is not perfect, but it is not obvious that there is high priority to enhancing this.

What are components doing?

Nothing relevant is captured in a usable fashion.

It is thought that what we may do is to add a table where each thread would capture what am I doing? (which would replace whatever was previously being done)

This table would contain a tuple for:

Each remote worker thread
Cleanup thread
Each remote listener thread
Local SYNC thread

It would track things such as:

Time processing started
What thread/process am I?
What node am I for?
What am I doing?
Possibly in several pieces, to cover the following sorts of facts:
1. Event ID
2. Event type
  Though this could be pulled from sl_event, given node and event ID
3. Additional event activity
  Again, could be pulled from sl_event, given node and event ID
4. Backend PID
  This would support the slonik scenario; if the PID is dead, then the activity is clearly complete (or dead!), and may be safely cleaned out.

Note that the contents of this table should be quite tiny; a tuple per slon thread on a node.

This also needs to be able to capture what slonik is doing; this seems more troublesome.

It is possible to have multiple slonik instances acting concurrently - multiple concurrent events!
There is no natural "event loop" such that slonik activities would be expected to clean themselves up over time
Capturing backend PID for the connection allows knowing when slonik processes are done. They ought to be self-cleaning, but we have to be able to cope with failure, otherwise we're worsening things.

Suggested slon implementation

Two approaches emerged for establishing connections to capture this monitoring data

Each thread opens its own DB connection
Unacceptable: Leads to enormous increase in use of DB connections that are mostly basically idle
Establish a "monitoring thread"
1. A message queue allows other threads to stow entries (complete with timestamps) that the monitoring thread periodically flushes to the database
2. It is plausible that this thread could be merged into the existing local SYNC thread which isn't terribly busy. But that is a performance optimization.

Recent abnormal states for events

This captures messages about the most recent problem that occurred, storing:

Time of abnormality
Event ID
Node ID
Description / Error Message

non-SYNC events outstanding?

This information is already captured, and may be revealed by running a query that asks, on the source node, for all events that are:

Not SYNC events
Have not been confirmed by the subscriber

Backlog volume

Bug #166

This seems troublesome; calculating the number of sl_log_* tuples involved in a particular SYNC requires running the same complex query that the remote_worker thread uses to determine which tuples are to be applied.

This is a query that is complex to generate that is fairly expensive to run.

Note that Bug #167 is changing this query.

Cluster configuration

This is captured as Bug #176

There is an existing tool that does some analysis of cluster configuration; see test_slony_state.pl

It is desirable to have something that generates diagrams of the relationships between nodes, capturing:

Nodes
Subscription Sets, and the paths they take
Paths between nodes
Listen paths

It would be nice for the Subscription Set diagram to include indication of replication state/lag for each node, indicating things like:

Event Number
Events Behind Parent
Time Behind Parent
Events Behind Origin
Time Behind Origin

Faster Replication - COPY Protocol

Use COPY + Triggers on sl_log_*
per Jan Wieck
New encoding of tuple information
Triggers do direct heap updates
Eliminates overhead of parsing each statement
COPY implicitly introduces streaming
- Eliminates need for current memory management logic for processing large tuples
Should reduce amount of work done by slons to parse sl_log_* cursors, generating I/U/D streams
- Requires separating out log shipping into a separate daemon, as there are users known to depend on being able to parse log shipping data as INSERT/UPDATE/DELETE statements
Based on limited interest this does not appear to be a high priority at the moment

SYNC pipelining

per Jan Wieck
open 2 connections to source DB, start pulling new data while the previous request is pushing I/U/D requests to the subscriber
Might be unnecessary if using COPY+triggers to stream data
Maximum value comes if the time required to get to the point of %f seconds delay for first row is equal to the remainder of the time required to process the SYNC
- If query to start up the cursor on sl_log_* takes most of the time, then there's not much gained by starting the next one early
- If query to process sl_log_* data takes most of the time, again, there's not much gained by starting the next one early
The instances where this would actually make a performance difference seems to be very narrow

Compress sequences of DELETE requests

per Chris Browne
Note that TRUNCATE is already replicated
Two ways to interpret it...

log trigger compresses them upon receipt, essentially merging subsequent requests into a single sl_log_* tuple
This changes the INSERT-only understanding of sl_log_*
slon remote worker thread compresses them when processing a SYNC

Overall, this seems a pretty dubious feature.

SNMP

per Chris Browne
Some minimal SNMP functionality was added in 2005
Untouched since; requires --with-netsnmp option to ./Configure
Should we improve it, or drop it?
list thread

In discussions of 2010-11-30, some debate over whether this is worth augmenting to support some of the monitoring requirements discussed elsewhere
- In favour, SNMP exists and is a standard protocol for network-based monitoring
- Contrary, several things were observed:
  - Unknown whether the present code works
  - For us to use SNMP within Slony-I for monitoring draws in a dependency on SNMP libraries and possibly other tooling.
    We sure wouldn't want to mandate installing and configuring something like Nagios as a component of Slony-I.
  - In contrast, consider that we already have libpq used everywhere; for us to capture some monitoring information in a table requires no additional components.

Proposal to remove it: Bug #170

Deal better with sl_log backlog

When replication is backlogged the sl_log tables can grow to be huge. This slows down replication.
Have slony create new sl_log_x tables as needed.
See mailing list discussion
Note that Vivek Khera has noticed it happens less when he's using SSD versus spinning disks, so evidently disk performance has material effect on this.

Bug #167 has some further observations:

During catch up of a major backlog, the receiving slon slows down further and further, sometimes to a point where it cannot keep up any longer with the rate of new changes being logged.

This is caused by a problem in the query from the remoteWorker selecting from sl_log. The lower boundary for the log tuples is the min_xid of the first snapshot OR the xip-vector of the first snapshot. That OR clause causes the planner to fail to create a lower scankey for the index scan against sl_log. It therefore always scans from the beginning with a filter condition. At some point this becomes so expensive that it even switches to a seqscan.

The attempted fix for this will be to extract those transactions, that were in progress at the first snapshot and are no longer in progress at the second, out into another UNION. This will allow the plan to use a lower bound scankey for the bulk of the log rows and pull the few transactions from the current OR clause in with a nested loop of index scans.

-- Jan Wieck

Isolated Node

Per discussion on list
Allow configuring that some nodes should be ignored for the purpose of confirmations.
This allows the cleanup thread to trim out sl_log_(1|2) data.
FAILOVER may already support the notion that a "isolated" node might be lost.
An interesting extension: SUBSCRIBE SET could automatically mark the "child" node as isolated until such time as the subscription has completed and caught up.
- This only works if only one set is involved; if there are multiple subscriptions, shunning only works out well for the first one.
- It will work just fine for multiple subscriptions. The point here is to free the master and other forwarders than the data provider for the node, that is busy subscribing, from keeping copies of the log.
This needs a better name. Possibilities:
- isolated
- nondependent

Bug #174

Parentage Check

At time of node start up, check to see if the nodes providing my subscriptions believe my node exists
- If they do, all is well
- If they don't, then presumably I'm a failed node
- If a connection cannot be established, then warn of this (probably with a pretty quick timeout) but continue, for now...

Bug 171

Use application_name

per Chris Browne
If on a version that supports GUC application_name, connections should capture "this is slon"
application_name is available on PG 9.0+
The value we use should likely include various (possibly all) of the following:
- slon
- ID of the node being managed by slon
- ID of the secondary node being queried
- perhaps the cluster name?

Bug 172

Bugs Known

#53 - high RAM usage with high table #

#53
Likely involves replacing array with hash table
Possible BSD-licensed hash table implementations
- C Hash Table
  See also ryantenney / chashtable @ GitHub
  Note: Used in Xen
- UTHash
- lookup3 - Public Domain

#80 - slon daemon restarts itself in a loop after failover

#80

#81 - duplicate key sl_nodelock-pkey and duplicate slon(8) processes not detected

#81

#111 - UNSUBSCRIBE SET cancels outstanding SUBSCRIBE SET

#111

#126 - client side KEEPALIVE on connections

#126

#137 - EXECUTE SCRIPT not applied in right order

#137
Shift DDL from sl_event to sl_log_(1|2)
Allow DBA to specify which locks EXECUTE SCRIPT requires in the slonik script

#152 - DDL noisy - might be handled well by "Other Health Criteria" ideas

#152

#163 - Change to use TIMESTAMPTZ in Slony-defined tables

#163

#166 : Size of SYNC

Bug #166
Useful to ask how much work is involved in particular SYNCs
Allows evaluating questions like:
- Is this SYNC enormous?
- Is slon busy processing a huge SYNC?

Alternatively, if SYNC is small, long running SYNC suggests something broken

If we add a stored function that evaluates this, it should allow a significant simplification of C logic in src/slon/remote_worker.c which presently does the following:

Pull outstanding transaction lists for two events
Generate a potentially enormous WHERE clause indicating the set of transactions that need to be excluded when pulling tuples from sl_log_(1|2)
Compress that WHERE clause in order that the query not be so long that the Postgres parser falls over

Instead, we imagine that a function may do the following:

Create a temp table to capture transaction ID values
Truncate that table, if it already exists
Pull the transaction list described above, and stow it in the temp table

The WHERE clause would be altered to draw from the temp table instead of writing values explicitly into the WHERE clause. This would eliminate a pretty substantial body of code from src/slon/remote_worker.c, including:

The function compress_actionseq (about 280 lines of code)
Portions of sync_helper()
Portions of sync_event()

Open questions:

Is this a material savings?
Is there a problem with repeatedly running TRUNCATE against the temp table, once for each set of SYNC events that is processed?
Would we want an index on the temp table?

slonik extensions

ABORT

Initially mentioned by Vivek Khera
Fail if replication is behind
- event count - "within 3 events is OK"
- time interval "behind by no more than 15 seconds is OK"
Other Health Criteria
- per Chris Browne
- Is node there?
- Is replication set there?
- Is subscription active?
- Run SQL, fail if ERROR

Bug #173

Implicit WAIT FOR EVENT

proposal
Initially mentioned by Chris Browne
Should track (per Yehuda) some status information so that if it's waiting for a sustained period of time, this isn't a mystery to the user or to people monitoring things.
Slonik commands that require inter-node coordination should check to see if all relevant slon processes are running
- notable exceptions include STORE NODE, STORE PATH
- Slonik should warn or error out if the slon isn't running

Controlling Implicit WAIT FOR EVENT

To support both legacy slonik scripts and new ones, the following features are suggested:

slonik should have a command line option that deactivates "auto-wait"
It may be desirable for uses to control wait behavior inside scripts, hence we should add two slonik commands:
- activate auto wait
- deactivate auto wait

Bug #179

DATE

suggested by Stuart Bishop
Basically, notion of being able to get timestamps during a Slonik script

Bug #180

TRY BLOCK

Chris, Stuart conversation
Perhaps we should forbid running non-transactional commands within a TRY block
Non-transactional commands
- WAIT FOR EVENT
- FAILOVER
- EXECUTE SCRIPT
- STORE NODE
Perhaps TRY should be eliminated?

A good definition of what this means has not yet emerged, so we don't have anything amenable to implementation.

Specified Preamble

per Steve Singer
-p option draws in preamble automagically
See [1]

Bug #183

Bulk Adding Tables

per Steve Singer
set add table (... tables='public.*', ..);
set add table (.. tables='billing.(!password)', ...);
See [2], [3]

Bug #181

fewer defaults for SET ADD TABLE

per Steve Singer
Automatically determine an ID
Automatically determine the origin
See [4]

Bug #182

automate logging

per Chris Browne
-v option leads to all requests being logged
might it be valuable to allow specifying syslog parameters?
Verbose logging to syslog eliminates need for DATE command
Existing ECHO command allows simulating this manually; DATE would complete that.

Advisory Locks

Provide a method so that applications can detect advisory locks. Per Stuart Bishop [5]

There wasn't much agreement on this. Initial plan: If other changes go well, it may make sense to take a look at this later.

New FAILOVER

General proposal, that FAILOVER be a much more sophisticated command, allowing:

Dropping nodes considered dead
Doing several failovers of sets as one request

Thus, something like:

  failover (dead nodes=(1,2,4),
            set id=1, backup node=3,
            set id=2, backup node=5,
            set id=3, backup node=3);

Failover should check various conditions and abort if any are the case
- There need to be paths to support communications to let the new masters catch up
- Slons need to be running for nodes that are needed to let masters catch up
- If a node hosts a subscription that cannot be kept that subscription may be marked dead
  - Automatically kill that subscription?
  - Refuse failover until subscription is marked dead?
    This requires a dead subscription clause be added...

Bug #178

Interactive Mode

line by line running custom command

General Features

Commit timestamps

per Jan Wieck
Requires PostgreSQL extension
Eliminates need for periodic generation of SYNC events
Simplifies queries for searching for sl_log_* data
Enables carryover of commit times to subscribers

DDL Triggers

per Jan Wieck
Requires PostgreSQL extension
Enables automatic generation of DDL_SCRIPT events
Discussed as... DDL_Triggers

Not planning to do this at this point. That might be reconsidered if someone volunteered to implement DDL Triggers.

pull lexxer from postgres

From TODO, probably per Peter Eisentraut
Note that this may be a fair bit more complex than the code in src/parsestatements

Seems unworthwhile for now. It seems likely that it will be easier to fix bugs in src/parsestatements than in this, and it's certainly tougher to implement this than to use the present parser.

Report on the size of pending SYNC requests

Bug 166
Could be implemented as an extension to test_slony_state.pl or inside of slonik

Log Shipping

Bug 177
Afilias doesn't use it; wouldn't mind diminishing the code base by its removal.
There appear to be some users of it that would be injured by its removal, so that seems like a no-go.
An attractive idea: Split slon into two forms:

The "usual" one which manages nodes that are databases amenable to forwarding, failover, and such
Log-shipping-only nodes

Note that there is a lot of log-shipping functionality strewn throughout remote_worker.c. If split out, this might make the log shipping node code simple, and regular slon simpler, too.

Common Admin Conninfo

There are several tools known that need to know administrative conninfo connection information:

slon processes likely use this connection to access the node that the slon manages
slonik requires connections potentially to all nodes, and presently uses ADMIN CONNINFO preamble material to configure this.
Administrative tools such as the configuration analysis tool, test_slony_state.pl
The slon-ctl scripts should also be considered in this
Unfortunately, not quite useful enough. The relevant sample file is bases.h, which oversimplifies by assuming that nodes have a single host/port/username identification.

It is suggested that we create a common "properties file" to capture this in a form that might be reused by various common tools.

Bug #169

SlonyBrainstorming