Todo:Collate

From PostgreSQL wiki
Jump to navigationJump to search

Note: This page is out of date; what it describes has been implemented as of 2011. See Todo for future tasks.

SQL standard COLLATE support

A collation is a set of rules about how a set of objects should be ordered. In general it only applies to strings because that's the only case that has any kind of debate. This is also the only case the SQL standard worries about. However, in the general case you could probably have a meaningful discussion about how geometric shapes could be ordered.

References

  • Charset/collate support and function parameters [1]
  • Proof of concept COLLATE support with patch [2]
  • For review: Initial support for COLLATE [3]

This patch covers all the catalog changes and syntax changes required for SQL complaint COLLATE support. It also has all the necesary changes for b-tree to work with these collations. The only part missing is teaching the planner what indexes may be used for what COLLATE statement, and underlying support for multiple collations (see Todo:ICU).

This patch has severely bitrotted.

  • Removing SORTFUNC_LT/REVLT [4]

Related todo items

ICU collation support

Status

  • Design submitted [5] - not reviewed
  • Partial Patch - abandoned

Overview of changes required

  • The collation is a property of a column, not of a value. At any point in a query only a single collation is current. So fields need to be added to the executor to track this.
  • Functions need to know what collation their arguments are, so a field is to be added to the fmgr structure.
  • Changes to grammer to allow:
    • CREATE TABLE .. ( ..., field COLLATE <collation>, ... )
    • ALTER TABLE
    • CREATE INDEX .. ON ( field COLLATE <collation> )
    • CREATE INDEX .. ON ( (expr COLLATE <collation>) )
    • expr :: expr COLLATE <collation>
    • Possibly: CREATE COLLATE ..

Discussion

PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised. Note that this is intrinsically broken because you can choose the encoding per database but usually collation implementations only work on a particular encoding. It's also a major problem because virtually all databases have some strings that should be considered as plain ascii strings (such as path names, urls, or even codes like 'Y' and 'N') even if they have localizable strings as well.

POSIX collation support basically consists of two functions: strxfrm() and strcoll(). The former takes a string and converts it to another string that can be passed to strcmp() to do the actual comparison. The following statement is true:

strcoll(a,b) == strcmp( strxfrm(a), strxfrm(b) )

POSIX collation only works on the current locale. This is somewhat restrictive, so most major operating systems have provided variation that work with multiple locales. This includes Win32, Linux/glibc and Mac OS X. OSes which don't include the BSDs, HPUX and Solaris.

For this reason you can actually hack multi-locale sorting in some operating systems using a pg_strxfrm() function and creating a functional index on that. [6]

However, it is undesirable to have the COLLATE support depend on which platform you are running. It is possible to call setlocale() repeatedly, though the performance will be variable across platforms. It would be fine on glibc but abysmal on solaris, for example. It should be possible to defend against performance problems when only one locale is actively being used in a given query though.

Supporting ICU would give us uniform support across all platforms.

Possible Problems

Fill me in...