Slony Feature Brainstorming
This collects ideas from meetings held in November/December 2010 for the next major version of Slony.
It is quite likely that many of these features won't get implemented, based on a "more work than they're worth" evaluation.
The point of the exercise has been to capture a goodly set of ideas to make it possible to effectively evaluate what might be worth implementing.
slon extensions
Slon monitoring
- per Chris Browne
- slon records in a queryable form what it's working on
- Requires writing (+COMMIT) at the start of the event loop
- [As noted by Yehuda] This ought to also be useful with slonik, to allow indicating "whazzup"?
- Bug #175
Debate took place surrounding various mechanisms. The only one without dramatically unacceptable properties was to add a table to the Slony-I schema.
- SNMP
- Existing standard for network-based monitoring
- Downside - requires extra infrastructure including libraries and possibly additional tooling to use it
- Bigger downside: SNMP support planned to be removed - see Bug #170
- Spread
- Fast
- Downside - requires extra infrastructure that's not particularly "standard"
- Would require substantial coding effort to ensure there are components listening
- Would require not-insubstantial effort to ensure that there is suitable configuration
- NOTIFY/LISTEN
- Built-in to Postgres - no new infrastructure needed
- In Postgres 9.0+ can carry payload
- Only becomes visible upon COMMIT
- Results are not stored; listeners must pre-declare their interest
- SQL table
- Built-in to Postgres - no new infrastructure needed
- Only becomes visible upon COMMIT
- Another table to manage and clean up, but if we limit the data stored (e.g. - no history) there won't be much maintenance required.
Monitoring Requirements
Crucial facts that we want to know about:
- Is replication behind?
- What are components (e.g. - slon, slonik) doing?
Note that slon has numerous threads
- Recent abnormal states for events (e.g. - error messages)
- Are any non-SYNC events outstanding?
- Backlog volume?
- What is the cluster's configuration?
Replication Behind?
The existing view sl_status captures this from a what-is-confirmed perspective. That is not perfect, but it is not obvious that enhancing it is a high priority.
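For illustration, a lag check can already be run against sl_status; the following is a minimal sketch, where "_mycluster" stands in for the cluster's schema and the thresholds are arbitrary example values:
    SELECT st_origin, st_received, st_lag_num_events, st_lag_time
      FROM "_mycluster".sl_status
     WHERE st_lag_time > interval '1 minute'
        OR st_lag_num_events > 100;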
What are components doing?
Nothing relevant is captured in a usable fashion.
The likely approach is to add a table where each thread records "what am I doing?", with each new entry replacing whatever that thread previously recorded; a sketch of such a table appears at the end of this subsection.
This table would contain a tuple for:
- Each remote worker thread
- Cleanup thread
- Each remote listener thread
- Local SYNC thread
It would track things such as:
- Time processing started
- What thread/process am I?
- What node am I for?
- What am I doing?
Possibly in several pieces, to cover the following sorts of facts:
- Event ID
- Event type
Though this could be pulled from sl_event, given node and event ID
- Additional event activity
Again, could be pulled from sl_event, given node and event ID
- Backend PID
This would support the slonik scenario; if the PID is dead, then the activity is clearly complete (or dead!), and may be safely cleaned out.
Note that the contents of this table should be quite tiny; a tuple per slon thread on a node.
This also needs to be able to capture what slonik is doing; this seems more troublesome.
- It is possible to have multiple slonik instances acting concurrently - multiple concurrent events!
- There is no natural "event loop" such that slonik activities would be expected to clean themselves up over time
- Capturing backend PID for the connection allows knowing when slonik processes are done. They ought to be self-cleaning, but we have to be able to cope with failure, otherwise we're worsening things.
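Purely as an illustrative sketch of what such a table might look like (none of these names are settled; "_mycluster" stands in for the cluster schema), with one tuple per slon thread or slonik connection:
    CREATE TABLE "_mycluster".sl_components (
        co_actor      text PRIMARY KEY,       -- e.g. 'local_sync', 'remote worker 3'
        co_pid        integer NOT NULL,       -- backend PID of the connection
        co_node       integer NOT NULL,       -- node this thread/process works for
        co_starttime  timestamptz NOT NULL,   -- when the current activity began
        co_event      bigint,                 -- event being processed, if any
        co_eventtype  text,                   -- could instead be pulled from sl_event
        co_activity   text                    -- free-form "what am I doing?"
    );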
Suggested slon implementation
Two approaches emerged for establishing connections to capture this monitoring data:
- Each thread opens its own DB connection
Unacceptable: leads to an enormous increase in the number of DB connections, most of them sitting idle
- Establish a "monitoring thread"
- A message queue allows other threads to stow entries (complete with timestamps) that the monitoring thread periodically flushes to the database
- It is plausible that this thread could be merged into the existing local SYNC thread which isn't terribly busy. But that is a performance optimization.
Recent abnormal states for events
This captures messages about the most recent problem that occurred, storing:
- Time of abnormality
- Event ID
- Node ID
- Description / Error Message
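A hypothetical shape for this capture (names are illustrative only, and it could equally be folded into the monitoring table sketched earlier):
    CREATE TABLE "_mycluster".sl_recent_errors (
        re_time     timestamptz NOT NULL,  -- time of abnormality
        re_event    bigint,                -- event ID
        re_node     integer,               -- node ID
        re_message  text                   -- description / error message
    );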
non-SYNC events outstanding?
This information is already captured, and may be revealed by running a query that asks, on the source node, for all events that are:
- Not SYNC events
- Have not been confirmed by the subscriber
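For example, something along these lines run on the origin node (a sketch; node 2 stands in for the subscriber being checked, and "_mycluster" for the cluster schema):
    SELECT e.ev_origin, e.ev_seqno, e.ev_type
      FROM "_mycluster".sl_event e
     WHERE e.ev_type <> 'SYNC'
       AND NOT EXISTS (
             SELECT 1
               FROM "_mycluster".sl_confirm c
              WHERE c.con_origin = e.ev_origin
                AND c.con_received = 2
                AND c.con_seqno >= e.ev_seqno);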
Backlog volume
This seems troublesome; calculating the number of sl_log_* tuples involved in a particular SYNC requires running the same complex query that the remote_worker thread uses to determine which tuples are to be applied.
This query is complex to generate and fairly expensive to run.
Note that Bug #167 is changing this query.
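A much cheaper, if rough, proxy is simply to look at the overall size of the log tables via the statistics views; note this counts all rows in sl_log_*, not the rows belonging to any particular SYNC ("_mycluster" stands in for the cluster schema):
    SELECT schemaname, relname, n_live_tup
      FROM pg_stat_user_tables
     WHERE schemaname = '_mycluster'
       AND relname IN ('sl_log_1', 'sl_log_2');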
Cluster configuration
This is captured as Bug #176
There is an existing tool that does some analysis of cluster configuration; see test_slony_state.pl
It is desirable to have something that generates diagrams of the relationships between nodes, capturing:
- Nodes
- Subscription Sets, and the paths they take
- Paths between nodes
- Listen paths
It would be nice for the Subscription Set diagram to include indication of replication state/lag for each node, indicating things like:
- Event Number
- Events Behind Parent
- Time Behind Parent
- Events Behind Origin
- Time Behind Origin
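The raw material for such diagrams is already in the configuration tables; as a sketch, the sort of query a diagramming tool might start from (assuming the standard sl_subscribe and sl_path catalogs, with "_mycluster" as the cluster schema):
    SELECT s.sub_set, s.sub_provider, s.sub_receiver, s.sub_active, p.pa_conninfo
      FROM "_mycluster".sl_subscribe s
      LEFT JOIN "_mycluster".sl_path p
        ON p.pa_server = s.sub_provider
       AND p.pa_client = s.sub_receiver;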
Faster Replication - COPY Protocol
- Use COPY + Triggers on sl_log_*
- per Jan Wieck
- New encoding of tuple information
- Triggers do direct heap updates
- Eliminates overhead of parsing each statement
- COPY implicitly introduces streaming
- Eliminates need for current memory management logic for processing large tuples
- Should reduce amount of work done by slons to parse sl_log_* cursors, generating I/U/D streams
- Requires separating out log shipping into a separate daemon, as there are users known to depend on being able to parse log shipping data as INSERT/UPDATE/DELETE statements
- Based on limited interest this does not appear to be a high priority at the moment
SYNC pipelining
- per Jan Wieck
- open 2 connections to source DB, start pulling new data while the previous request is pushing I/U/D requests to the subscriber
- Might be unnecessary if using COPY+triggers to stream data
- Maximum value comes if the time required to reach the "%f seconds delay for first row" point is roughly equal to the remainder of the time required to process the SYNC
- If query to start up the cursor on sl_log_* takes most of the time, then there's not much gained by starting the next one early
- If query to process sl_log_* data takes most of the time, again, there's not much gained by starting the next one early
- The instances where this would actually make a performance difference seem to be very narrow
Compress sequences of DELETE requests
- per Chris Browne
- Note that TRUNCATE is already replicated
- Two ways to interpret it...
- log trigger compresses them upon receipt, essentially merging subsequent requests into a single sl_log_* tuple
This changes the INSERT-only understanding of sl_log_*
- slon remote worker thread compresses them when processing a SYNC
Overall, this seems a pretty dubious feature.
SNMP
- per Chris Browne
- Some minimal SNMP functionality was added in 2005
- Untouched since; requires --with-netsnmp option to ./Configure
- Should we improve it, or drop it?
- list thread
- In discussions of 2010-11-30, some debate over whether this is worth augmenting to support some of the monitoring requirements discussed elsewhere
- In favour, SNMP exists and is a standard protocol for network-based monitoring
- Contrary, several things were observed:
- Unknown whether the present code works
- For us to use SNMP within Slony-I for monitoring draws in a dependency on SNMP libraries and possibly other tooling.
We sure wouldn't want to mandate installing and configuring something like Nagios as a component of Slony-I.
- In contrast, consider that we already have libpq used everywhere; for us to capture some monitoring information in a table requires no additional components.
Proposal to remove it: Bug #170
Deal better with sl_log backlog
- When replication is backlogged the sl_log tables can grow to be huge. This slows down replication.
- Have slony create new sl_log_x tables as needed.
- See mailing list discussion
- Note that Vivek Khera has noticed it happens less when using SSDs versus spinning disks, so evidently disk performance has a material effect on this.
Bug #167 has some further observations:
During catch up of a major backlog, the receiving slon slows down further and further, sometimes to a point where it cannot keep up any longer with the rate of new changes being logged.
This is caused by a problem in the query from the remoteWorker selecting from sl_log. The lower boundary for the log tuples is the min_xid of the first snapshot OR the xip-vector of the first snapshot. That OR clause causes the planner to fail to create a lower scankey for the index scan against sl_log. It therefore always scans from the beginning with a filter condition. At some point this becomes so expensive that it even switches to a seqscan.
The attempted fix for this will be to extract those transactions, that were in progress at the first snapshot and are no longer in progress at the second, out into another UNION. This will allow the plan to use a lower bound scankey for the bulk of the log rows and pull the few transactions from the current OR clause in with a nested loop of index scans.
-- Jan Wieck
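The following is not the actual query slon generates, only a stripped-down illustration of the shape of the change, with made-up transaction IDs, column names following the 2.0-series sl_log_1 layout, and the upper-bound and other conditions omitted:
    -- current shape of the lower bound:
    SELECT log_actionseq, log_cmddata
      FROM "_mycluster".sl_log_1
     WHERE log_origin = 1
       AND (log_txid >= 12345                 -- min_xid of the first snapshot
            OR log_txid IN (12001, 12107));   -- xip-vector of the first snapshot

    -- proposed shape: the in-progress transactions get their own UNION branch,
    -- so the first branch keeps a plain lower-bound scankey on log_txid:
    SELECT log_actionseq, log_cmddata
      FROM "_mycluster".sl_log_1
     WHERE log_origin = 1 AND log_txid >= 12345
    UNION ALL
    SELECT log_actionseq, log_cmddata
      FROM "_mycluster".sl_log_1
     WHERE log_origin = 1 AND log_txid IN (12001, 12107);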
Isolated Node
- Per discussion on list
- Allow configuring that some nodes should be ignored for the purpose of confirmations.
- This allows the cleanup thread to trim out sl_log_(1|2) data.
- FAILOVER may already support the notion that an "isolated" node might be lost.
- An interesting extension: SUBSCRIBE SET could automatically mark the "child" node as isolated until such time as the subscription has completed and caught up.
- This only works if only one set is involved; if there are multiple subscriptions, shunning only works out well for the first one.
- It will work just fine for multiple subscriptions. The point here is to free the master, and forwarders other than the data provider for the node that is busy subscribing, from keeping copies of the log.
- This needs a better name. Possibilities:
- isolated
- nondependent
Parentage Check
- At time of node start up, check to see if the nodes providing my subscriptions believe my node exists
- If they do, all is well
- If they don't, then presumably I'm a failed node
- If a connection cannot be established, then warn of this (probably with a pretty quick timeout) but continue, for now...
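A minimal sketch of the check, run against each provider listed in this node's sl_subscribe entries (node 2 stands in for "my" node ID, "_mycluster" for the cluster schema):
    SELECT no_id, no_active
      FROM "_mycluster".sl_node
     WHERE no_id = 2;
    -- no row returned by a provider suggests this node has been dropped or failed over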
Use application_name
- per Chris Browne
- If on a version that supports GUC application_name, connections should capture "this is slon"
- application_name is available on PG 9.0+
- The value we use should likely include various (possibly all) of the following:
- slon
- ID of the node being managed by slon
- ID of the secondary node being queried
- perhaps the cluster name?
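For example, slon could set it right after connecting (the value format below is only an illustration), and monitoring can then pick the connections out of pg_stat_activity:
    SET application_name = 'slon:mycluster:node1';
    SELECT datname, application_name FROM pg_stat_activity;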
Bugs Known
#53 - high RAM usage with high table #
- #53
- Likely involves replacing array with hash table
- Possible BSD-licensed hash table implementations
- C Hash Table
See also ryantenney / chashtable @ GitHub
Note: Used in Xen
- UTHash
- lookup3 - Public Domain
- C Hash Table
#80 - slon daemon restarts itself in a loop after failover
#81 - duplicate key sl_nodelock-pkey and duplicate slon(8) processes not detected
#111 - UNSUBSCRIBE SET cancels outstanding SUBSCRIBE SET
#126 - client side KEEPALIVE on connections
#137 - EXECUTE SCRIPT not applied in right order
- #137
- Shift DDL from sl_event to sl_log_(1|2)
- Allow DBA to specify which locks EXECUTE SCRIPT requires in the slonik script
#152 - DDL noisy - might be handled well by "Other Health Criteria" ideas
#163 - Change to use TIMESTAMPTZ in Slony-defined tables
#166 : Size of SYNC
- Bug #166
- Useful to ask how much work is involved in particular SYNCs
- Allows evaluating questions like:
- Is this SYNC enormous?
- Is slon busy processing a huge SYNC?
Alternatively, if the SYNC is small, a long-running SYNC suggests something is broken
If we add a stored function that evaluates this, it should allow a significant simplification of C logic in src/slon/remote_worker.c which presently does the following:
- Pull outstanding transaction lists for two events
- Generate a potentially enormous WHERE clause indicating the set of transactions that need to be excluded when pulling tuples from sl_log_(1|2)
- Compress that WHERE clause in order that the query not be so long that the Postgres parser falls over
Instead, we imagine that a function may do the following:
- Create a temp table to capture transaction ID values
- Truncate that table, if it already exists
- Pull the transaction list described above, and stow it in the temp table
The WHERE clause would be altered to draw from the temp table instead of writing values explicitly into the WHERE clause. This would eliminate a pretty substantial body of code from src/slon/remote_worker.c, including:
- The function compress_actionseq (about 280 lines of code)
- Portions of sync_helper()
- Portions of sync_event()
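In outline, the approach sketched above might look like this on the subscriber side; this is purely illustrative (the temp table name and the filled-in transaction IDs are made up, and column names follow the 2.0-series sl_log_1 layout):
    CREATE TEMP TABLE sl_sync_xids (xid bigint);
    -- reused for each SYNC group processed on this connection:
    TRUNCATE sl_sync_xids;
    INSERT INTO sl_sync_xids VALUES (1234567), (1234569);  -- filled by the new function
    -- the log selection then references the temp table rather than an
    -- explicitly written-out list of transaction IDs:
    SELECT log_actionseq, log_cmdtype, log_cmddata
      FROM "_mycluster".sl_log_1
     WHERE log_origin = 1
       AND log_txid IN (SELECT xid FROM sl_sync_xids);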
Open questions:
- Is this a material savings?
- Is there a problem with repeatedly running TRUNCATE against the temp table, once for each set of SYNC events that is processed?
- Would we want an index on the temp table?
slonik extensions
ABORT
- Initially mentioned by Vivek Khera
- Fail if replication is behind
- event count - "within 3 events is OK"
- time interval "behind by no more than 15 seconds is OK"
- Other Health Criteria
- per Chris Browne
- Is node there?
- Is replication set there?
- Is subscription active?
- Run SQL, fail if ERROR
Implicit WAIT FOR EVENT
- proposal
- Initially mentioned by Chris Browne
- Should track (per Yehuda) some status information so that if it's waiting for a sustained period of time, this isn't a mystery to the user or to people monitoring things.
- Slonik commands that require inter-node coordination should check to see if all relevant slon processes are running
- notable exceptions include STORE NODE, STORE PATH
- Slonik should warn or error out if the slon isn't running
Controlling Implicit WAIT FOR EVENT
To support both legacy slonik scripts and new ones, the following features are suggested:
- slonik should have a command line option that deactivates "auto-wait"
- It may be desirable for users to control wait behavior inside scripts, hence we should add two slonik commands:
- activate auto wait
- deactivate auto wait
DATE
- suggested by Stuart Bishop
- Basically, notion of being able to get timestamps during a Slonik script
TRY BLOCK
- Chris, Stuart conversation
- Perhaps we should forbid running non-transactional commands within a TRY block
- Non-transactional commands
- WAIT FOR EVENT
- FAILOVER
- EXECUTE SCRIPT
- STORE NODE
- Perhaps TRY should be eliminated?
A good definition of what this means has not yet emerged, so we don't have anything amenable to implementation.
Specified Preamble
- per Steve Singer
- -p option draws in preamble automagically
- See [1]
Bulk Adding Tables
- per Steve Singer
- set add table (... tables='public.*', ..);
- set add table (.. tables='billing.(!password)', ...);
- See [2], [3]
fewer defaults for SET ADD TABLE
- per Steve Singer
- Automatically determine an ID
- Automatically determine the origin
- See [4]
automate logging
- per Chris Browne
- -v option leads to all requests being logged
- might it be valuable to allow specifying syslog parameters?
- Verbose logging to syslog eliminates need for DATE command
- Existing ECHO command allows simulating this manually; DATE would complete that.
Advisory Locks
Provide a method so that applications can detect advisory locks. Per Stuart Bishop [5]
There wasn't much agreement on this. Initial plan: If other changes go well, it may make sense to take a look at this later.
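For reference, any advisory locks currently held in a database can already be observed through pg_locks, e.g.:
    SELECT locktype, classid, objid, pid, granted
      FROM pg_locks
     WHERE locktype = 'advisory';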
New FAILOVER
General proposal, that FAILOVER be a much more sophisticated command, allowing:
- Dropping nodes considered dead
- Doing several failovers of sets as one request
Thus, something like:
failover (dead nodes=(1,2,4), set id=1, backup node=3, set id=2, backup node=5, set id=3, backup node=3);
- Failover should check various preconditions and abort if any of them do not hold
- There need to be paths to support communications to let the new masters catch up
- Slons need to be running for nodes that are needed to let masters catch up
- If a node hosts a subscription that cannot be kept, that subscription may be marked dead
- Automatically kill that subscription?
- Refuse failover until subscription is marked dead?
This requires a dead subscription clause be added...
Interactive Mode
- run custom commands interactively, line by line
General Features
Commit timestamps
- per Jan Wieck
- Requires PostgreSQL extension
- Eliminates need for periodic generation of SYNC events
- Simplifies queries for searching for sl_log_* data
- Enables carryover of commit times to subscribers
DDL Triggers
- per Jan Wieck
- Requires PostgreSQL extension
- Enables automatic generation of DDL_SCRIPT events
- Discussed as... DDL_Triggers
Not planning to do this at this point. That might be reconsidered if someone volunteered to implement DDL Triggers.
pull lexer from postgres
- From TODO, probably per Peter Eisentraut
- Note that this may be a fair bit more complex than the code in src/parsestatements
Seems unworthwhile for now. It seems likely that it will be easier to fix bugs in src/parsestatements than in this, and it's certainly tougher to implement this than to use the present parser.
Report on the size of pending SYNC requests
- Bug 166
- Could be implemented as an extension to test_slony_state.pl or inside of slonik
Log Shipping
- Bug 177
- Afilias doesn't use it; wouldn't mind diminishing the code base by its removal.
- There appear to be some users of it that would be injured by its removal, so that seems like a no-go.
- An attractive idea: Split slon into two forms:
- The "usual" one which manages nodes that are databases amenable to forwarding, failover, and such
- Log-shipping-only nodes
- Note that there is a lot of log-shipping functionality strewn throughout remote_worker.c. If split out, this might make the log shipping node code simple, and regular slon simpler, too.
Common Admin Conninfo
There are several known tools that need administrative conninfo (connection information):
- slon processes likely use this connection to access the node that the slon manages
- slonik requires connections potentially to all nodes, and presently uses ADMIN CONNINFO preamble material to configure this.
- Administrative tools such as the configuration analysis tool, test_slony_state.pl
- The slon-ctl scripts should also be considered in this
Unfortunately, this is not quite useful enough. The relevant sample file is bases.h, which oversimplifies by assuming that nodes have a single host/port/username identification.
It is suggested that we create a common "properties file" to capture this in a form that might be reused by various common tools.