Overview of features
At the initial meeting discussing this feature list, a vote for "what is important to me?" was taken for some of the major features related to building PostgreSQL clusters, to help prioritize what order they should be considered in. Results:
- Export snapshots to other sessions - 11
- Global deadlock detection - 9
- API into the Parser / Parser as an independent module - 9
- Start/stop archiving at runtime - 8
- Function scan push-down - 7
- DDL trigger - 7
- Modification trigger into core - 6
- Commit order of transactions - 5
- Dynamic shared memory allocation - 5
- XID feed - 4
To further fine-tune what work is useful for which projects and when it might be done, the general sort of questions that need to be answered for each of these features include:
- What feature does this help add from a user perspective?
- Which replication projects would be expected to see an improvement from this addition?
- What makes it difficult to implement?
- Are there any other items on the list this depends on, or that it is expected to have a significant positive/negative interaction with?
- What replication projects include a feature like this already, or a prototype of a similar one, that might be used as a proof of concept or example implementation?
- Who is already working on it/planning to work on it/needs it for their related project?
Prioritized features in core
Export snapshots to other sessions
For any kind of parallel querying (be it on the same or across multiple nodes), you need to be able to export a snapshot of a transaction to another backend - from any point in time of the origin transaction. This includes the full XIP array (list of transactions in progress at the time of snapshot creation) as well as making sure the data that's already written (but uncommitted) by that transaction is available to the destination backend. That is a no-op for a single node, but needs care for remote backends. Additionally, some access controlling information needs to be transferred, to ensure parallel querying isn't a security hole. A worker backend for parallel querying should never need to write any data, so it should also be forced into a read-only mode. And the origin transaction should not be allowed to continue with another query before having "collected" all worker backends that attached to its snapshot.
An initial implementation that added snapshot export on a single node, such as what's started by the recent "Synchronized Snapshots" work, is a useful partial step here--particularly given how common multi-core servers are now. When making snapshots consistent among multiple nodes, you also need to be careful to make local-only tasks such as autovacuum and analyze not disturb shared transactions, lest those turn into global long running transactions impacting the whole cluster.
- have an example from Simon / Jan
- needed for snapshot cloning mechanism
- best way is to export it to shared memory
- Synchronized Snapshots from Joachim Wieland works on some similar issues (share the snapshot among multiple backends on a single node) in the context of pg_dump
- copies snapshot out of private memory into a small file
- another function allows you to set your snapshot from the file
- but it overwrites your current snapshot in memory (could be bad! :D)
- Set current XID to someone elses? What are the side effects?
- Push the transaction management out? into an API? store the status of the transaction ID outside
Global deadlock information
Cluster-wide deadlocks are possible when trying to maintain a set of identical replicas across multiple nodes in any multi-master implementation. Single-master clusters have no need for this feature.
There are two aspects to this item: detection/notification of a deadlock and how to handle deadlock resolution. When there's a deadlock between a local transaction and one that's to be applied from a remote node (or even between two remote ones), that needs to be detected and then resolved. And to ensure congruency between nodes, they must take the same measures to resolve the deadlock, such as aborting identical transaction(s).
A simple notification API is not sufficient. That could result in one node being notified and then aborting transaction A to resolve a deadlock, while another node notifies it aborted transaction B. You'd end up having to abort two (or more) transactions instead of just one to resolve a conflict, and be concerned about whether each node had the same data at the end too.
Some proposed implementations:
- Intercept the deadlock, turning off the existing deadlock resolver, and leaving the resolution of the deadlock to the clustering solution. Such an interception API could register a callback that replaces the current deadlock resolver. Upon detection of a deadlock, the callback would get a list of transaction ids that are part of the lock cycle. It's then up to that callback to chose one to abort to resolve the conflict.
- An experimental PostgresForest feature used a global wait-for graph to detect a global deadlock (see Distributed deadlock). The pg_locks information on each node can be used to build the global wait-for graph, and then you decide to kill/cancel one of the transactions involved.
- A lock_timeout GUC to help here was suggested; see 1, 2
A primary concern with this feature is how quickly global deadlocks can be detected. You always need to find a compromise between waiting long enough to not kill transactions just because of high contention, but still react promptly enough to to resolve real deadlocks.
- statement_timeout is one way to mitigate global deadlocks, but it introduces a delay and doesn't necessarily resolve the consistency issues
- Postgres-R is impacted by this issue, and older versions once did some work on this area, but there's no full resolution available from its implementation.
API into the Parser / Parser as an independent module
Statement-based replication projects and load balancing middleware often need to decide where a statement will be executed at. If the statement is a read-only SELECT, it can be routed to an alternate node. If PostgreSQL provided its raw parser as a C library, it would be useful for this purpose, because the replication/middleware would not need to re-implement their own SQL parser to make this sort of decision.
In addition, there are some types of statements that can be expected to change their values if executed on another node, including:
When using statement-based replication, you need to be careful to replace these values with constants that reflect their current values on the master before replicating them. Having a hook exposing this data at the parser level makes it easier to identify these troublesome constructs and fix them before replicating them.
- pgpool-II already built its own SQL parser that handles most of these cases
- If it had the plan directly, it could do better, and building other projects would be easier
- Just first step, before going to the catalog (?)
Start/stop archiving at runtime
Several of the replication approaches that use PostgreSQL PITR create a base backup using PITR recovery, initially synchronize themselves with the current state of things using an initial set of WAL segments, and then keep in sync afterwards using another mechanism. Keeping archiving turned on all the time for these apps is a waste of resources. It would be simpler for them if archiving could be toggled on and off easily.
This capability would also help with the situation where a standby node has failed, and it's no longer possible to deliver archives to it. Right now that can result in disk space running out on the master when it fills with WAL segments that can't be recycled.
Currently, changing the archive_mode on the server requires a full server restart. You can change the archive_command and just ask for the configuration to be reloaded though. This allows two work-arounds to this issue. You can just change the command to something different to enable/disable archiving. Or you can keep it the same the whole time, and just make your archive_command script looks for external input about whether to save the log files it's being asked to archive, or if it should just throw them away.
- What does this take? Basic outline, reason for current behavior, details on transaction issues
- Skype could handle failed nodes better (specific issues?)
- Pg-2 (?)
- pgpool-II does "on line recovery" using a base backup and WAL archiving, but once that's complete it would like to be able to turn off archiving.
Function scan push-down
Consider a statement that executes a query on a remote server using the dblink protocol:
SELECT * FROM dblink('SELECT i FROM remote_tbl') AS (i int) r WHERE i = 5
Since the remote server doesn't know about the WHERE filter that will be applied here, it has to send all of the rows in remote_tbl back to the server running the query. The server executing the query then has to filter through all of them, which is inefficient. If the protocols for set-returning functions (SRFs) were extended, it would be possible to push the "WHERE i = 5" down to the remote server. Then it would only need to return the needed rows instead.
The SQL/MED Foreign Data Wrapper will be able to handle WHERE conditions and pass them to the external server. Foreign tables should support inheritance and table partitioning for scale-out clustering. The main parent table could be partitioned into multiple foreign tables, with each foreign table connected to different foreign servers. It can be used like the partitioned remote function call in PL/Proxy.
Current PostgreSQL triggers are associated with a specific table. In order to be able to replicate DDL statements such as CREATE or ALTER TABLE, a new type of trigger that operates against that statement rather than against a specific object needs to be created.
DDL triggers would be useful for many replication approaches, except for ones that use WAL shipping where the DDL already is replicated. This feature would also be more useful for logging purposes, for example in heavily monitored or secure deployments, than the current approach of setting "log_statement='ddl'" and parsing the log files.
Modification trigger into core / Generalized Data Queue
Two of the replication systems that work using triggers to discover data modifications, Slony and Londiste, treat the resulting change set as a queue. It's inserted into the database using the same commit guarantees as the underlying data, and then dequeued for commit onto the secondary system. Major disadvantages of this approach are its overhead, concerns about all the dead database rows left behind by the queue churn, and the lag that's inherent in any polling queue.
If this data were saved directly by the core database, with an index based on the transaction ID to provide ordering, this could be made much more efficient. The implementation of the modification trigger and database portion of the queue could turn into a "Generalized Data Queue" scheme, shared between the two replication systems (and others like them).
This area is specified in more detail at General Modification Trigger and Generalized Data Queue Spec
- Skytools has a universal trigger
- pgQ specific triggers, simple names etc.
Commit order of transactions
- Order A / B (b gets lock first, and blocks A, will commit in reverse order)
- read from the WAL ? -> need an interface to it (PITR has the same issue, can specify where to stop, but you don't know which transaction # to give it --> because we don't know what the commit order time is) --> make it optional
- would help multimaster implementation
- commit order & commit timestamp
- two-phase commit order (?)
- maybe multiple ?
- xid -> give me timestamp (look it up in the WAL -> issue: being able
PostgreSQL allocates all of its shared memory in a single, fixed size block on database startup. Several of the clustering features might benefit from being able to allocate shared memory dynamically. For example, "Export snapshots to other sessions" would benefit from having space to save its snapshot information in shared memory, and the amount needed for that would be difficult to predict in advance.
A possible implementation has been proposed here: dynamically allocating chunks from shared memory (Markus Wanner)
PostgreSQL uses a constantly advancing transaction id (XID), visible using the txid_current() function. In order to implement synchronous multi-master clusters, servers would need to accept XIDs injected from external sources. That would allow synchronizing all the XIDs within the cluster.
Postgres-XC has an implementation of this concept using its Global Transaction Manager. Having a more generic API for this purpose would both benefit other projects and potentially simplify that one.
Here, dedicated process called GTM (Global Transaction Manager) is responsible for providing unique and ordered XID called GXID (Global Transaction ID) to each backend running in separate servers. In this way, all the transactions running in multiple servers are given unique XID. When SQL node (called Coordinator in XC) is asking a Data Node to handle statements, it forwards GXID given by GTM to each Data Node involved.
Vacuum and Vacuum Analyze also need GXID to run consistently.
Creation of primary keys online (in the background)
Primary keys in PostgreSQL become a UNIQUE constraint against the table, one that's checked using a unique index. The server allows creating unique indexes concurrently--without locking the table--but with several limitations and a complicated implementation: "perform two scans of the table, and in addition it must wait for all existing transactions that could potentially use the index to terminate". And REINDEX doesn't have a similar concurrent build option available at all.
Not being able to REINDEX CONCURRENTLY is a general issue that would be helpful to resolve. And specifically in the clustering context, you don't necessarily have visibility into "all existing transactions that could potentially use the index", which means a better approach for building primary key indexes when creating the cluster for the first time is needed.
Node partitioning with constraint exclusion
- similar to what we want to do with partitioning on different nodes
- hooks in to share this logic across databases
- Get function scans working ? --> create on one node a partition table where the partitions across multiple machines
- can't do that easily.. function scan executes completely before anything else happens..
- need to push where clause into the function scan, Partitioning module exists? ???
- Foreign data wrappers - different speeds on different tables
- Push the WHERE clause into a function scan -
- Partitioning across multiple nodes - transaction parallelism and write scalability
- essential component of federated queries
User level DDL triggers