General Modification Trigger and Generalized Data Queue Spec
Solving different problems:
- Replication ala Slony/Londiste transactional queues
- Replication to heterogenous systems, possibly non-transactional
- Differential update in Materialized Views.
Issues with current "queue table" implementations:
- Data write and seek/scan overhead.
- and vacuum overhead
- No good way to include transactional DDL changes
- Issues with large data
- Cannot ensure that replication trigger gets fired last
- Delay inherent in "poll" model
Benefits of "queue table" implementations:
- Can ensure that only committed data gets replicated
- Persistent storage
Suggestions for Implementation
- Non-table storage for row modification
- Send items to another server
- Implement non-persistent option (global temp tables)
- Write a DEFERRABLE trigger based on the current constraint-trigger code.
- Work on a compact format for row storage (protobuf?)
- Use 3rd-party queues (AMQ, AMQP, Spread?) for transport.
- Come up with way to inject DDL into queue at appropriate point.
Suggestion (cbbrowne)... This seems like an orthogonal matter in two ways:
- Capturing DDL information is different from capturing data modifications
- Capturing the point at which to inject the DDL into the queue is separate, perhaps using a combination of:
- Shared sequence to indicate ordering
- Commit timestamp information (may also be a sequence number of some sort) to control access to boundaries between sets of changes
- Replication trigger must NOT be fired last, but after all BEFORE triggers and before all (non-replication) AFTER triggers...
- DDL event positioning is solved problem - the usual after trigger "agreeable order" logic applies here:
- Start transaction.
- Execute DDL, this will take locks on appropriate objects.
- Take event number from seq - here all potentially conflicting operations are waiting on locks, thus have not yet their event-id assigned from seq.
- Insert event into queue.
- Commit, releasing locks.
- Queue table efficiency seems also solved problem:
- Use efficient fetch query (PgQ)
- Tables should be INSERT-only and rotated often - eg with 5-10min period - to keep tables as small as possible. See PgQ on how to do it without locking problems.
- Minimal reading overhead - active data is in memory.
- Minimal VACUUM overhead.
- Minimal index maintenance overhead.
Modification triggers must be fired after all BEFORE triggers and before all AFTER triggers. They can access the statement type (INSERT/UPDATE/DELETE) and old and new tuples, but cannot modify the new tuple nor cancel the statement unless an error case.
DDL triggers are discussed in the separated section. However, we might handle TRUNCATEs in the queue layer because modification events before TRUNCATEs will be discardable in some cases.
We have some choices to record modifications in the queues; Full description vs. Primary keys only. They have trade-off, so we would be better to have options to choose which we use.
- Full description queues
- They include not only primary keys of modified tuples but also the non-key fields. They are used by Slony and Londiste.
- We can avoid JOINs the queue and the original table on replay.
- We can replay all of the actual operations.
- The size of queues will be relatively larger then PK-only queues.
- PK-only queues
- They include only primary keys of modified tuples. An UPDATE of PK is recorded with a combination of DELETE and INSERT. They are used by Oracle's Materialized View Log.
- The size of queues will be relatively smaller then full description queues.
- We can easily merge multiple UPDATE operations on the same tuples.
- We need JOINs of the queue and the original table on replay to retrieve actual data.
- We can only reconstruct the final status of the table. Transition state is impossible.
See also discussion in GDQ Implementation.
The queue can be implemented with standard tables, and Slony and PgQ also use them, but we can use another data structure for them because GDQ requires only INSERT/SELECT/TRUNCATE. Fully-features of standard tables are not always needed. However, the consumers (subscribers) of the queues don't always read the data FIFO. Some consumers read the queue in random order because they requires snapshot based grouping of the modification records.
- A queue table should consist of multiple tables to be INSERT-only. We rotate them with some period, and TRUNCATE old log records to avoid overhead of DELETEs.
- The queue could be implemented with on-memory or non-WAL-logged storage engines. If we use such kinds of optimizations, the replicated servers must re-copy all of the table contents after the master's crash.
- Use non-table storage for the queue, like some WAL style flat file, to avoid overheads.
- Use custom WAL records for replication. WALSender can send replication records to standby servers.
- A disadvantage of using WAL is that consumer might require old records in some cases. WALSender needs to read old WAL records from archive, but it's impossible.
- Custom storage for GDQ might be kind of a SQL/MED foreign tables. We could extend the SQL standard for foreign tables to handle INSERT commands, and use them as the storage.
We have a choice of syntax to create queues and register consumers for the queues. Which is better?
- PgQ has pgq.create_queue() and pgq.register_consumer().
- Flexible, but the table name appears as a string literal (single-quoted).
- Oracle has CREATE MATERIALIZED VIEW LOG and CREATE MATERIALIZED VIEW.
- Another SQL idea is to use ALTER TABLE, like ADD MODIFICATION QUEUE.
Slony's sl_log stores each change to be propagated to subscriber nodes. The log consists of two tables, sl_log_1 and sl_log_2, to sweep old records effectively.
CREATE TABLE sl_log ( log_origin integer, -- Origin node from which the change came log_xid xxid, -- Transaction ID on the origin node log_tableid integer, -- The table ID that this log entry is to affect log_actionseq bigint, log_cmdtype char(1), -- U = Update, I = Insert, D = DELETE log_cmddata text, -- The data needed to perform the log action );
In version 2.1, log_cmdtype extends a little, and further is expected in 2.2. Values are:
PgQ defines a text-based generalized queue, and Londiste uses those queues through triggers. Each queue consists of multiple tables (default is 3) for the same reason as Slony.
CREATE TABLE pgq.event_template ( ev_id bigint, -- event's id, supposed to be unique per queue ev_time timestamptz, -- when the event was inserted ev_txid bigint, -- transaction id which inserted the event ev_owner int4, -- subscription id that wanted to retry this ev_retry int4, -- how many times the event has been retried, NULL for new events ev_type text, -- I/U/D ev_data text, -- partial SQL statement or column values urlencoded ev_extra1 text, -- table name ev_extra2 text, -- (not used in Londiste) ev_extra3 text, -- (not used in Londiste) ev_extra4 text -- (not used in Londiste) );
Materialized View Logs in Oracle Database
Oracle Database supports row-based replication based on Materialized View Logs. The logs are used for differential updates in Materialized Views, but also for replication combined with DATABASE LINK. See Planning Your Replication Environment for details. PostgreSQL might learn the design of Replication/Generalized Data Queue from Oracle Database, where the row-based logical replication is based on 3 fundamental modules:
- Generalized Data Queue
- Materialized Views with Differential Updates
- Inter-database Connections
Materialized View Logs only includes primary keys in default, but users can add any columns in the log. Simple materialized views require only primary keys are required for differential updates. The actual data seem to be retrieved from the original table using JOINed with the primary keys. On the other hand, complex materialized views (ex. including aggregations) require referencing column data. Note that Oracle's logs do not contain modified or inserted columns values unlike Slony and Londiste. Changes of primary keys are described as a combination of DELETE and INSERT.
CREATE TABLE oracle_mlog_pk ( pk_1 pk_type_1, -- first primary key pk_2 pk_type_2, -- second primary key ... pk_N pk_type_N, -- Nth primary key snaptime timestamptz, -- first snapshot sets initial refresh time dmltype char(1), -- Type of DML old_new char(1), -- O/N/U (= D/I/U in Slony) change_vector bytea -- Used for subquery and LOB snapshots (?) );