Todo:ICU

From PostgreSQL wiki
Jump to navigationJump to search

Collation using ICU

Currently PostgreSQL relies on the underlying operating system to provide collation support. This means that each platform has a slightly different way of doing collation. Some like the BSDs historically didn't support UTF-8 collation (see [1] [2] for recent progress). Some like glibc have fairly complete collation support.

ICU is described at http://www.icu-project.org/

It's available under what appears to be a BSD-compatible license.

References

Discussion about using ICU to prevent index corruption

Discussion about integrating Palle Girgensohn's ICU patch

Discussion about ICU in context of abbreviated keys and glibc's strxfrm mess

Palle Girgensohn's ICU patch updated with COLLATE support

Discussion on Peter Eisentraut's ICU patch

Status

An patch for ICU support has existed for several years in FreeBSD ports [3]. It only covers a small part of what is necessary.

A todo list for integrating Palle's patch has been posted to the hackers mailing list.

PostgreSQL Plus Advanced Server already has support for ICU. EnterpriseDB might decide to contribute those changes back.

The SQL syntax and parser/catalog support is a separate item, see Todo:Collate.

Overview of changes required

Basically, the collation support provided by ICU needs to be done either instead of the current system support, or in addition to. Given that client programs will probably be using the system collation, at the very least it should be an option.

ICU supports a very flexible scheme for collation. In addition to just providing standard collations for many languages, it also allows users to customise collations and create their own. This does create the question about how to represent the collation to the server.

The native format of ICU is UTF-16 (not to be confused with UCS-2 which doesn't support all of Unicode). Now, PostgreSQL doesn't support this encoding at all (for various reasons) and we do want to avoid the overhead of converting every string to UTF-16 before comparison.

Fortunatly ICU provides an alternative, iterators [4]. An iterator is configured on a string which returns one character at a time. What will need to happen for non-Unicode encodings is that they will have an iterator that converts the characters one at a time for the collation. PostgreSQL already contains all the necessary tables to do this.

For large scale sorting ICU recommends doing a conversion from the strings to sort-keys which can be compared with just memcmp(). This is the same machanism as in POSIX with strxfrm(). However, currently PostgreSQL has no way to store these sort-keys.

Discussion

This topic has been discussed on and off for many years, most recently [5] (see attached JS-Kit discussion). The objections appear to be:

  1. ICU is a large library
  2. ICU prefers to use Unicode internally, though it does not require strings to be stored in Unicode
  3. It won't match the results returned by sort on the command-line

The pros are:

  1. ICU is cross-platform, it supports at least the platforms postgresql does. [6]
  2. ICU is generally faster than the system collation support (glibc [7])
  3. ICU is more flexible, allowing users to create custom collators
  4. Sorting will match Java sorting

No research has been done on the effect of collation for non-UTF-8 locales. In general any encoding will require some iterator for collation, only UTF-8 has a builtin iterator.

It should be noted that this is not a blocker for SQL COLLATE support. Using just setlocale() and strcoll_l (supported on Windows, GLibc and MacOSX) we can get excellent performance on all the commonly used platforms. However, this does leave BSD/MacOSX users requiring Unicode collation out in the cold. (MacOSX has support for unicode collation internally [8], but does not support it in any encoding other than UTF-16. Hence it isn't available in the C library).