Collations

From PostgreSQL wiki
Jump to: navigation, search

If the ordering of strings changes due to collation definition changes, a btree index (or more rarely, a check constraint or partition) can become corrupted. GNU libc 2.28, for example, will change the ordering of many strings for all locales, and in recent memory German and Hungarian had subtle changes on Glibc that broke people's indexes. Some reports:

To quote from Unicode Technical Standard:

"Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned."

What are we doing about it?

Peter Eisentraut worked on adding ICU support to PostgreSQL 10. ICU is an alternative provider of collations (and many other things) that is much more powerful than libc, and, crucially, it can report a version string that you can use to detect when its underlying ordering rules change. There are problems with the way that we model versions in releases 10, 11 and 12:

  • you can't use ICU for the "default" collation yet, and that's what almost all users use almost all of the time
  • the versions are tracked in a way that doesn't actually force you to rebuild your indexes or even help you find out which ones need to be rebuilt, it just trusts that you've done so when you enter the DDL command ALTER COLLATION ... REFRESH VERSION to clear the version mismatch warning it spits out whenever opening an index
  • it still uses potentially corrupted indexes even when it generates the warning, and users might not see the warning

In PostgreSQL 13 or 14, we hope to make several improvements:

Other ideas for the future:

  • At least DB2 and SQL Server have notions of collation versions, and some allow multiple versions to be used at the same time. We might eventually want to support that (would ICU64 and ICU65 be different providers, or the same provider with different versions, or ...?)
  • We might also want to refuse to use an index we believe to be potentially corrupted.
  • We might want to extend per-database object version tracking to other database objects, such as CHECK constraints and partitions, that could be affected by ordering changes.

One interesting thing to note is that the reason for many of the recent changes on various operating systems (Windows, FreeBSD, GNU, ...) is that they're all giving up on maintaining their own collation systems and switching to CLDR, so in the end they'll probably all agree with ICU anyway. Then the only changes will come from gradual adjustments and improvements made by the Unicode CLDR projects, though these will flow into different operating systems and libraries at different speeds... so it won't matter much whether you use ICU or libc, but they'll sometimes be out of sync and we'll always have to deal with versions and changes anyway, despite the general convergence.