RangeTypes

From PostgreSQL wiki
Jump to navigationJump to search

Introduction

Note: this is a work-in-progress draft.

Range types are an interval type in the mathematical sense — that is, they have an absolute beginning and end, and represent a set of points. For example, the interval [10, 20) may represent all integers 10 through 19 (a square bracket means "inclusive" and a round bracket means "exclusive"). These range types are important for many purposes; one of which is modeling temporal data.

Ranges may also be represented by two columns, of course. However, using a single data type has many important advantages:

  • ability to perform index searches using predicates like "contains" or "overlaps"
  • ability to make use of Exclusion Constraints
  • no ambiguity about inclusivity/exclusivity of end points
  • much easier to write easily-understandable queries
  • easier to manipulate inclusivity/exclusivity of end points (for discrete ranges only, discussed below), and easier to display differently without changing the representation on disk

Scope and Future Work

There are other interesting topics that are slightly outside of the scope of this project (at least in the first iteration). One of those things is "sets of sets" (http://scottrbailey.wordpress.com/2009/10/06/timespan_sets/), which would be very useful for allowing, for example, the union of two ranges that aren't adjacent or overlapping. This increases complexity and introduces new behaviors and representations, so to make this iteration of the project manageable, that will be left out for now.

Discrete vs. Continuous Ranges

There may be better names for these concepts. Conceptually, a discrete range is a range over a type where there is a concept of "the next value after" some given value (e.g. a range of INTEGERs). A continuous range is a range over a type where you can't define "the next value after" some given value (e.g. NUMERICs).

Continuous Ranges

A continuous range requires only that you have a total order on the underlying type.

Continuous ranges often work best when following a convention, such as "represent all ranges as closed-open". That works well for many applications, but is not as flexible:

  • If your convention is closed-closed, you can represent a range that consists of only one point, but you cannot represent two ranges that are adjacent.
  • If your convention is anything else, you cannot represent a range that consists of only one point.
  • If you mix conventions within one application, then you end up with possibly all four conventions, and it may be a challenge to maintain consistency.

There is a particular danger with continuous ranges over discrete types. For instance, if you try to make DATE ranges continuous, then it would be impossible to determine that the DATE ranges [2009-01-01, 2009-01-03] and [2009-01-01, 2009-01-04) are equal — which is sure to lead to confusion and inconsistency. In theory this is a problem for any discrete type, but in practice may not be a problem for some (e.g. FLOAT8 is technically discrete, but treating it as continuous may not pose a practical problem).

Discrete Ranges

A discrete range requires a total order and also requires that the underlying type have some "next" and "previous" function. For example, INTEGERs can work as a discrete range because "+1" and "-1" are the "next" and "previous" functions, respectively.

Types such as NUMERIC can be made to work as a discrete range by specifying a granule. For instance, specifying a granule of 1.0 would make a numeric range equivalent to an integer range (but in a larger domain, without a 32- or 64-bit limit). In addition to a next and prev function, it's also useful to have addition, subtraction, multiplication and division for storing and manipulating granules.

Floating point types (including floating-point timestamps) can cause problems for discrete ranges, because the granule is variable-width. In theory, floating-point timestamps are discrete, and there is a well-defined "next" and "previous" function; however there are not any useful functions to determine how many discrete points exist between two floats. Also, there is just generally a higher chance of confusion.

Any discrete range can be displayed in any of the possible conventions: closed-closed, closed-open, open-closed, or open-open. This is useful, but it also raises the question "what should the default output be?".

Additionally, much of the temporal database literature discusses discrete ranges only[1].

Summary

I don't see how we can get around supporting both discrete ranges and continuous ranges. The potential problems with supporting only one seem to be too great.

That being said, we probably need a way to guide users away from these pitfalls and toward a sane type definition. Some concerns along these lines are expressed here:

Concepts

The book "Temporal Data and the Relational Model" by Date, Darwen, and Lorentzos provides a basis for many of these concepts.

Component Types

There are several relevant types when constructing a range:

  • RTYPE This is the type of the range itself, e.g. "PERIOD" might be a range of TIMESTAMPs.
  • STYPE This is the type that makes up the range. For a "PERIOD" range type, the STYPE is TIMESTAMP.
  • DTYPE This is the type that represents a relative value of the STYPE, i.e. the type of the result when you subtract two values of type STYPE. For a "PERIOD" range type, the DTYPE is INTERVAL.

Special Values

A range may be NULL, just like any other value; and so can either of the components of the range. The semantics of a range component being NULL are left as an exercise for the reader (for the time being).

A range component may also be the special value "+/- INFINITY".

A range may be empty, as well, meaning that it contains no points at all. An empty range is represented by the string 'EMPTY'. Additionally, you can construct empty ranges by using the same value for the start and end points, and specifying the inclusivity as either "[)" or "(]". All empty ranges are equal -- e.g. "[ 10, 10 )" is equal to "( 20, 20 ]".

Some STYPEs, such as TIMESTAMP, allow values such as "infinity". The range types do not handle these values specially, they behave the same way as in the subtype. However, the "infinity" in the TIMESTAMP type will compare less than the special range boundary "+INFINITY".

Generic API

Functions

empty(RTYPE) -> BOOL                       -- Is the range empty?
lower_bound(RTYPE) -> STYPE                -- Lower bound of range
upper_bound(RTYPE) -> STYPE                -- Upper bound of range
lb_null(RTYPE) -> BOOL                     -- Is the lower bound null?
ub_null(RTYPE) -> BOOL                     -- Is the upper bound null?
lb_inf(RTYPE) -> BOOL                      -- Is the lower bound negative infinity?
ub_inf(RTYPE) -> BOOL                      -- Is the upper bound infinity?
length(RTYPE) -> DTYPE                     -- Size of the range.
range_offset(STYPE, RTYPE) -> DTYPE        -- Offset of LHS within the range on the RHS; error if empty.
range_offset(RTYPE, STYPE) -> DTYPE        -- Offset of RHS within the range on the LHS; error if empty.


contains(RTYPE, STYPE) -> BOOL             -- Contains?
contained_by(RTYPE, STYPE) -> BOOL         -- Contained by?
distance(RTYPE, RTYPE) -> DTYPE            -- If RHS is strictly after the LHS, then the distance between end of LHS and
                                              start of RHS. If LHS is strictly after the RHS, then the distance between
                                              the end of RHS and the start of LHS. Otherwise, an exception is raised.
                                              Error if either are empty.
equals(RTYPE, RTYPE) -> BOOL               -- Equals? All empty intervals are equal.
nequals(RTYPE, RTYPE) -> BOOL              -- Not equals?
contains(RTYPE, RTYPE) -> BOOL             -- Contains?
contained_by(RTYPE, RTYPE) -> BOOL         -- Contained by?
before(RTYPE, RTYPE) -> BOOL               -- Strictly left of? Error if either are empty.
after(RTYPE, RTYPE) -> BOOL                -- Strictly right of? Error if either are empty.
adjacent(RTYPE, RTYPE) -> BOOL             -- Adjacent? That is, are they touching but not overlapping, e.g. "[10, 20)"
                                              and "[20, 21)"? Error if either are empty.
overlaps(RTYPE, RTYPE) -> BOOL             -- Overlaps?
overleft(RTYPE, RTYPE) -> BOOL             -- Over-left? (LHS's end point is less than or equal to RHS's end point). Error if either are empty.
overright(RTYPE, RTYPE) -> BOOL            -- Over-right? (LHS's start point is greater that or equal to RHS's start point). Error if either are empty.
range_union(RTYPE, RTYPE) -> RTYPE         -- Union of two ranges. Error if the ranges are neither adjacent nor overlapping.
range_intersect(RTYPE, RTYPE) -> RTYPE     -- Intersection of two ranges.
minus(RTYPE, RTYPE) -> RTYPE               -- Difference of two ranges. Error if the result is not a single range.

Operators

?  (RTYPE) -> BOOL          -- empty
=   (RTYPE, RTYPE) -> BOOL  -- equals
!=  (RTYPE, RTYPE) -> BOOL  -- nequals
<>  (RTYPE, RTYPE) -> BOOL  -- nequals
&&  (RTYPE, RTYPE) -> BOOL  -- overlaps
@>  (RTYPE, RTYPE) -> BOOL  -- contains
<@  (RTYPE, RTYPE) -> BOOL  -- contained_by
<<  (RTYPE, RTYPE) -> BOOL  -- before
>>  (RTYPE, RTYPE) -> BOOL  -- after
&<  (RTYPE, RTYPE) -> BOOL  -- overleft
&>  (RTYPE, RTYPE) -> BOOL  -- overright
-|- (RTYPE, RTYPE) -> BOOL  -- adjacent
@>  (RTYPE, STYPE) -> BOOL  -- contains
<@  (STYPE, RTYPE) -> BOOL  -- contained_by
#   (RTYPE) -> DTYPE        -- length
<-> (RTPYE, RTYPE) -> DTYPE -- distance
&   (RTYPE, RTYPE) -> RTYPE -- rintersect
|   (RTYPE, RTYPE) -> RTYPE -- runion
*   (RTYPE, RTYPE) -> RTYPE -- rintersect
+   (RTYPE, RTYPE) -> RTYPE -- runion
-   (RTYPE, RTYPE) -> RTYPE -- minus

GiST Indexing Strategy

The GiST opclass will use the range type itself for the GiST predicates. The picksplit function can be identical to that in the "seg" contrib module.

Defining a Range Type

There are two possible implementations of a Range Types framework for defining new range types, and associated DDL. We need to choose one approach.

Approach 1

This is the most direct approach. The idea is to have PostgreSQL manage the input, output, and representation; and to understand the difference between a discrete range and a continuous range.

For instance, for a discrete range type you would need to specify something like:

 CREATE TYPE <name> AS RANGE
 (
   SUBTYPE = <stype>,
   DTYPE   = <dtype>,
   GRANULE = <granule>,
   CMPFUNC = <cmpfunc>,
   ADDFUNC = <addfunc>,
   SUBFUNC = <subfunc>,
   MULFUNC = <mulfunc>,
   DIVFUNC = <divfunc>
 );

Rather than using a "granule", we could specify a "next" and "previous" function for the same effect. The multiply and divide functions are only necessary if using granules in the representation, e.g. <start, num_granules>. If the begin and end points are used, multiplication and division are unnecessary.

A continuous range can be specified like:

 CREATE TYPE <name> AS RANGE
 (
   SUBTYPE = <stype>,
   DTYPE   = <dtype>,
   CMPFUNC = <cmpfunc>,
   ADDFUNC = <addfunc>,
   SUBFUNC = <subfunc>
 );

Representation

Approach 2

This approach is closer to a type interface where the representation itself is more encapsulated and hidden from PostgreSQL. PostgreSQL is then able to offer the aforementioned generic api (including an GiST opclass), and also has knowledge about range types and how to act on them (which could be used for syntax sugar, etc).

In this approach the user specifies how the range should be input, output, and represented. The API would then just need to know how to access the lower bound, upper bound, inclusivity/exclusivity, and it would need a comparator for the base types. For instance:

 CREATE TYPE <name> AS RANGE
 (
   STYPE       = <stype>,
   DTYPE       = <dtype>,
   CONSTRUCTOR = <constructor_func>,
   LBOUND      = <get_lbound_func>,
   UBOUND      = <get_ubound_func>,
   MIN         = <get_min_func>,
   MAX         = <get_max_func>,
   FLAGS       = <get_flags>,        -- null/inf bounds
   STYPE_CMP   = <stype_cmp_func>,
   WIDTH       = <get_width_func>,   -- for GiST penalty function
   [INPUT]     = <input_func>,
   [OUTPUT]    = <output_func>
 );

The advantage is that it sidesteps the complexity of PostgreSQL understanding the difference between a discrete range and a continuous one.

The disadvantage is that it's a little harder to check for sanity at compile time, and it would place a greater burden on the user when defining the type.

References

  • Date, Darwen, Lorentzos: Temporal Data and the Relational Model pp. 344–47.