SQL2011Temporal

From PostgreSQL wiki
Jump to navigationJump to search

This page introduces the two typical dimensions of temporal databases: application time and system time.

Application Time (sometimes called Valid Time) tracks the history of the thing out in the world, in contrast to System Time (sometimes called Transaction Time), which tracks the history of when the database itself was changed. These histories are orthogonal dimensions. Whereas the database itself maintains System Time, the user maintains Application Time.

Application Time

In an Application-Time table, each row includes a start time and end time indicating when it was true. Conceptually, a single row with start time and end time is equivalent to many single-timestamp rows, each asserting that the row was true at that instant.

Conventionally, temporal tables use closed/open intervals: the start time is included, but the end time is excluded. With range types this is written [start,end). Following this pattern means that adjacent records for the same entity "snap together" without any special case logic.

Because Application Time is under user control, special syntax is required to make changes, and integrity constraints are helpful to prevent user error.

Work to add application time and system time to Postgres is ongoing. For progress in application time, see ApplicationTimeProgress.

PERIODS

In SQL:2011, the start and end columns must be combined by declaring a PERIOD, which is table metadata with its own syntax. You can declare a period like this:

 CREATE TABLE t (
   valid_from date NOT NULL,
   valid_til date NOT NULL,
   PERIOD FOR valid_at (valid_from, valid_til)
 );

Or if you have an existing table, you can say this:

 ALTER TABLE t ADD PERIOD FOR valid_at (valid_from, valid_til);
 ALTER TABLE t DROP PERIOD FOR valid_at;

A period's start/end columns can be a date, timestamp, or timestamptz. They must be NOT NULL. A period implicitly adds a CHECK constraint forcing the start column to be strictly before the end column.

A period is not a column, but it shares the column namespace. An Application-Time period can be named anything except system_time, which is only used for System Time.

Periods have several limitations:

  • They don't appear in SELECT *.
  • They can't be selected explicitly with SELECT valid_at.
  • They can't appear in a view.
  • They can't be passed to a function or operator.
  • They can't be returned from a function or operator.
  • They can't appear in GROUP BY or be aggregated.

Periods can appear in predicates with certain special syntax: EQUALS, OVERLAPS, CONTAINS, SUCCEEDS, PRECEDES, IMMEDIATELY SUCCEEDS, IMMEDIATELY PRECEDES. In all cases, both operands must be periods, although CONTAINS also accepts a scalar datetime value for its second operand. The operands may be references to periods declared on tables, but they may also be literals, written PERIOD (<start>, <end>). In a literal, the start and end values *can* be null, and they cause the overall predicate to be null.

In Postgres, the start and end columns can be combined in a single range type column, or even a multirange. We accept these anywhere the standard accepts a period. We plan to support SQL:2011 periods as well, implemented internally as hidden generated range columns. We will not necessarily add all the predicate syntax, since we already have rangetype operators for those needs.

Primary Keys and Unique Constraints

In a temporal table, a single entity can have multiple rows, as long as they are for distinct parts of history. The user can define both primary keys and unique constraints by specifying first the "scalar" parts of the key, followed by one range/multirange/PERIOD. The syntax looks like this:

 CREATE TABLE t (
   id int,
   valid_at daterange,
   PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
 );

You can use period or multirange column as well. More than one scalar key part is supported. The syntax for unique constraints is similar.

Because temporal primary keys are implemented internally as exclusion constraints (with some special handling to forbid empty ranges/multiranges), they use GiST indexes. Therefore you must CREATE EXTENSION btree_gist to include types with no built-in GiST opclass. This includes integers and other "scalar" types. Without that extension, your scalar key parts must also be rangetypes or something with a built-in GiST opclass. If btree_gist is not available (for example in Postgres's core regression test suite), one hack is to use int4range in place of int, and always use one-wide values, so instead of 4, [4,5).

Foreign Keys

A temporal table may declare a foreign key referencing another temporal table. Both tables must have an Application-Time period/range/multirange. The referencing table requires that its whole history be covered by the referenced table---but not necessarily by a single row. The syntax looks like this:

 CREATE TABLE pk (
   id int,
   valid_at daterange,
   PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
 );
 CREATE TABLE fk (
   parent_id int,
   valid_at daterange,
   FOREIGN KEY (parent_id, PERIOD valid_at) REFERENCES pk (id, PERIOD valid_at)
 );

Conceptually there is no reason to forbid a temporal table referencing a non-temporal table, and maybe vice versa (if we assume the non-temporal referencing table is satisfied by "any time" but applies cascades to "all time"). See Snodgrass 126-131 for all combinations. The standard doesn't propose any way to mix temporal and non-temporal tables, but perhaps Postgres will support it eventually.

If a temporal foreign key has an action of CASCADE, SET NULL, or SET DELETE, the action is applied to the referencing table using a temporal update/delete (see below).

DML

No special command is needed for INSERT: the user just supplies the start and end times for the new record. But for UPDATE and DELETE, the user needs a way to "target" a specific stretch of history. There is special syntax for that:

 UPDATE t
 FOR PORTION OF valid_at FROM '2000-01-01' TO '2020-01-01'
 [AS ...]
 SET foo = bar
 [FROM ...]
 [WHERE ...];
 DELETE FROM t
 FOR PORTION OF valid_at FROM '2001-01-01' TO '2002-01-01'
 [AS ...]
 [USING ...]
 [WHERE ...];

Postgres also permits an alternate syntax, useful for multiranges (and potentially user-defined types) that don't fit into a start-to-end pattern:

 UPDATE t
 FOR PORTION OF valid_at ('[2000-01-01,2020-01-01)')
 [AS ...]
 SET foo = bar
 [FROM ...]
 [WHERE ...];
 DELETE FROM t
 FOR PORTION OF valid_at ('[2001-01-01,2002-01-01)')
 [AS ...]
 [USING ...]
 [WHERE ...];

The commands are implicitly limited only to rows that overlap the targeted portion of history.

In an update, the row's application time is automatically changed according to the targeted portion. The new start/end times are the old values intersected with the target.

Then if the row included any history outside the targeted portion, the database inserts new row(s) with the old values (but appropriately updated start/end times) to preserve that history. In effect these new rows cover old_history - targeted_history. For range or period columns, there can be zero, one, or two such "temporal leftover" records. For multiranges two are never required, only zero or one.

In a delete, the old row is deleted, but "temporal leftovers" are inserted as for update.

The targeted start/end bounds must be constant. Literals are permitted, as are functions like NOW(). Column references are not permitted (but in principle they could be).

System Time

This section deals with a proposal for system time tables in PostgreSQL. It contains design and implementation.

(It is not fully compliant with SQL:2011.)

System period temporal tables maintain historical versions of data. Modifying operations (UPDATE, DELETE, TRUNCATE) on these tables don't cause permanent changes to entries, but create new versions of them. Thus user can easily get to the past states of the table.

Following picture shows basic design of system period temporal tables. In next sections we will discuss each part in more details

Design.jpg

Current table

Current table stores current versions of rows. It has the same structure and name as original table before adding temporal support, except four technical columns (in this example begin with underscore):

  • _system_time – time range specifying when this version is valid. Versions in current table has no upper bound of validity.
  • _entry_id – id of entry. Groups multiple versions of same entry. Useful in cases when user doesn’t provide primary key for the table or in case of ID change (it will be impossible to find which versions belongs together).
  • _transaction_id – id of transaction that created this version. Serves for grouping of the versions created by the same transaction. It is also used for checking if row was previously modified by current transaction.
  • _user_id – id of user that executed modifying operation. Can serve for finding the originator of data changes.

Values for these columns are always generated by system and user cannot set them. To ensure backward compatibility, these columns can be marked as implicitly hidden. It means that they are not returned by the select * statement unless user explicitly list them and values are not required in case of the insert statement without column list (just like system columns).

Index on _system_time column is added for speed up queries for versions valid in specified time or time range.

History table

History table stores historical versions of rows, which are inserted by triggers on every modifying operation executed on current table. It has the same structure and indexes as current table, but it doesn’t have any constraints. History tables are insert only and creator should prevent other users from executing updates or deletes by correct user rights settings. Otherwise the history can be violated.

There is an internal inheritance-like relation between current and history table (current inherits history) to enable queries on whole history of rows (current and historical rows) by executing selects on history table.

History table cannot be dropped. It is dropped automatically when current relation is dropped.

Triggers

Triggers for storing old versions of rows to history table are inspired by referential integrity triggers. They are fired for each row after UPDATE and DELET and before each TRUNCATE statement. They use dynamically prepared statements which are stored in the hash table on first call and reused on subsequent calls in the session.

Operations

INSERT

Operation insert is simple. Row is inserted to current table with following values in system generated columns:

  • _system_time – tstzrange(clock_timestamp(), null, ‘[)’)
  • _entry_id – nextval from sequence
  • _transaction_id – txid_current()
  • _user_id – current_user

UPDATE

In case of update we need to ensure that old version of row is stored in history table only if it is first modifying operation on that row in current transaction. Otherwise the history may contain inconsistent states of rows (caused for example by use of deferred constraints) created inside the transaction, which will normally never be visible to other transactions.

These values are generated by system on update (_entry_id is not updated):

  • _system_time – case when _transaction_id = txid_current() then _system_time else tstzrange(clock_timestamp(), null, ‘[)’) end
  • _transaction_id – txid_current()
  • _user_id – current_user

Here is simplified example of trigger for storing old version of rows on update operation:

CREATE OR REPLACE FUNCTION update_history() RETURNS trigger AS
$$
BEGIN
    INSERT INTO person_history 
        VALUES(
            old.id,
            old.first_name,
            old.last_name,
            old.department_id
            tstzrange(lower(old._system_time), lower(new._system_time), ‘[)’)
            old._entry_id,
            old._transaction_id,
            old._user_id
        );
    RETURN null;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_history
    AFTER UPDATE ON person
    FOR EACH ROW
    WHEN _transaction_id <> txid_current()
    EXECUTE PROCEDURE update_history();

DELETE

On delete operation we only need to store rows being deleted into history table. Here is simplified example of trigger serving for this purpose:

CREATE OR REPLACE FUNCTION delete_history() RETURNS trigger AS
$$
BEGIN
    INSERT INTO person_history 
        VALUES(
            old.id,
            old.first_name,
            old.last_name,
            old.department_id
            tstzrange(lower(old._system_time), clock_timestamp(), ‘[)’)
            old._entry_id,
            old._transaction_id,
            old._user_id
        );
    RETURN null;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER delete_history
    AFTER DELETE ON person
    FOR EACH ROW
    EXECUTE PROCEDURE delete_history();

TRUNCATE

On truncate we only need to store all rows from table being truncated into history table. Here is simplified example of trigger serving for this purpose:

CREATE OR REPLACE FUNCTION truncate_history() RETURNS trigger AS
$$
BEGIN
    INSERT INTO person_history 
        (select
            id,
            first_name,
            last_name,
            department_id
            tstzrange(lower(_system_time), clock_timestamp(), ‘[)’)
            _entry_id,
            _transaction_id,
            _user_id,
         from person
        );
    RETURN null;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER truncate_history
    BEFORE TRUNCATE ON person
    FOR EACH STATEMENT
    EXECUTE PROCEDURE truncate_history();

DDL operations

DDL operations like ALTER TABLE, CREATE/DROP INDEX, GRANT, REVOKE etc. can be automatically propagated to history table using event triggers.

SQL Syntax extensions

Column options:

  • GENERATED ALWAYS AS – specifies that this column is generated by system and user cannot assign values to this column. Specific system generated columns:
    • SYSTEM TIME
    • ENTRY ID
    • TRANSACTION ID
    • USER ID
  • IMPLICITLY HIDDEN – implicitly hidden column is not returned by the select * statement unless user explicitly name it and values are not required in case of the insert statement without column list (just like system columns).

Command for associating history table with current table

  • ALTER TABLE <current table> ADD VERSIONING USE HISTORY TABLE <history table>

Example

CREATE TABLE person (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    department_id INTEGER REFERENCES department(id),
    _system_time TSTZRANGE GENERATED ALWAYS AS SYSTEM TIME IMPLICITLY HIDDEN,
    _entry_id BIGINT GENERATED ALWAYS AS ENTRY ID IMPLICITLY HIDDEN,
    _transaction_id BIGINT GENERATED ALWAYS AS TRANSACTION ID IMPLICITLY HIDDEN,
    _user_id INTEGER GENERATED ALWAYS AS USER ID IMPLICITLY HIDDEN 
);
CREATE TABLE person_history(LIKE person);
ALTER TABLE person ADD VERSIONING USE HISTORY TABLE person_history;


Bitemporal

When a table supports both Application Time and System Time, it is called "bitemporal". Such tables have two start/end pairs (or multiranges, etc), one for each dimension. Conceptually a single row maps out an area in 2-D space. It signfies that the database asserted from such-and-such a time that such a statement was true for such-and-such a time.

In general Application Time and System Time do not interact. Primary keys and foreign keys only include Application Time. Whenever the table changes, those changes are recorded in the System-Time dimension. New rows, as well as the new version of updated rows, have a System-Time from current_time to Infinity. Old rows that are deleted or changed yield a copy with their end-time set to current_time. The temporal update/delete leftovers are treated like other new rows: their System-Time is from current_time to Infinity.

Most likely we would store "currently true" records in the main table and "previously true" records in one or more separate history tables. If we store System Time and Application Time in the same table, a primary key or unique constraint's index could include both ranges, but the constraint should only enforce exclusion for "currently true" records.

References

Some extensions exist which provide temporal tables functionality: