Rethinking datatypes

From PostgreSQL wiki
Jump to navigationJump to search

Rethinking datatypes

Currently a type in PostgreSQL is really nothing more than that a set of type input/output functions and some details about storage. All operators on these types are merely incidental to the definition. This leads to various odd things, such as determining if a type can be ordered by looking for a B-tree index method for that type.

The problem is essentially that things like order, hashing and equality are properties of the type and what should really be happening is that the B-tree index class should be using the properties of the type to build itself, rather than the other way round.

But it's more complicated than that: a type can have multiple ways to order, hash or to test equality. The obvious example is strings, but you could come up with a number of ways to order points for example, all of which are useful for different purposes.

One way of describing these are "collations" but that is generally used when referring to strings. The mathematical term might be "ordered set".

Benefits

Autocreation of operator classes and operators

While the above is a bit of an abstract discussion, there are a number of real benefits to thinking about types this way.

For example, if as part of the type definition you could provide an order function and a hash function that have the following properties:

order: type × type → integer
order(a,b) < 0  <=>  a < b
order(a,b) = 0  <=>  a = b
order(a,b) > 0  <=>  a > b
hash: type → integer
a = b  implies  hash(a) = hash(b)

Then you could automatically create a consistant set of comparison operators (<,<=,=,>=,>), a B-tree operator class and a hash operator class.

Having either an order or a hash is entirely optional ofcourse. Without a hash function you couldn't build hash indexes for example. The Unique node would be able to use either the order() or the hash() function, however, there could be an assumption that calculating a hash is going to be faster than determining order.

Important note: the term "equality" does not necessarily mean that two values are exactly the same. Strictly speaking an equality operator divides a set into a number of equivalence classes. So it is perfectly reasonable to have an order function that determined that the strings "Hello" and "héllo" are the same. Just as long as it is consistant.

Sort-keys

One of the issues with sorting large amounts of data is that you need on the order of n*log(n) comparison operations. Calling functions in PostgreSQL is not cheap and when comparing objects like strings the costs become huge. However, if you consider what your sorting an ordered set like above, there is an alternative.

What you do is provide a sortkey function with the following properties:

sortkey: type → integer
sortkey(a) < sortkey(b)   implies a < b

Note that this is a stricter definition than the hash function, so not all types will be able to provide it. However, if present it can be used to speed up sorting considerably. The code currently treats the first datum specially, but storing it directly reather than as part of the tuple. A sortkey is a more compact representation, with the advantage that the comparison can be done without calling the comparison function at all most of the time. Instead, we compare the integers first and only if they're equal do we call the actual comparison function.

Some testing will need to be done to see if this is faster than storing the collation key directly for strings. However, collation keys for strings tend to be 1.5 to 2 times the size of the original string. The sortkey would then be the first 4 bytes as an integer.

Related items

This items is closely related to many other things. For example:

  • The use of the B-tree operator class by sort nodes and the hash operator class by Hash nodes.
  • SQL COLLATE support is really going to bring to a head the issue of having multiple ways to sort a single datatype.
  • The SQL spec is advancing the use of "order" to many places, such as for example window functions. Proper optimisation of these thing may require advanced notions of what "order" is.

Implementation

Nowhere. This page is really just a discussion about possible future directions for the PostgreSQL type system.