StateOfICU
From PostgreSQL wiki
Intro
Collation is how strings are compared and sorted. The simplest approach is to "memcmp" the strings, which is what the C locale does.
Collation affects predicates, but it also affects the structure of an index, which depends on a consistent ordering.
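As a rough illustration of the difference, the Python sketch below sorts the same words by raw UTF-8 bytes (which is effectively what memcmp, and so the C locale, does) and by a case-folded key. The case-folded key is only a stand-in for a linguistic collation, not what ICU or glibc actually compute, and the word list is made up for the example.

```python
words = ["cherry", "apple", "Banana"]

# Byte-wise order, analogous to memcmp() on UTF-8 strings (C locale).
memcmp_order = sorted(words, key=lambda s: s.encode("utf-8"))

# Crude stand-in for a linguistic collation: compare case-folded text first,
# then fall back to the original string to break ties deterministically.
linguistic_like_order = sorted(words, key=lambda s: (s.casefold(), s))

print(memcmp_order)           # ['Banana', 'apple', 'cherry']
print(linguistic_like_order)  # ['apple', 'Banana', 'cherry']
```

Uppercase "B" (0x42) sorts before lowercase "a" (0x61) in byte order, which is why "Banana" comes first under memcmp.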
Providers
More complex collations use a provider, which may be either ICU or libc (glibc on most Linux systems). A different provider, or a different version of the same provider, may produce a different collation order, which risks corrupting indexes (necessitating REINDEX).
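The corruption risk can be sketched with a toy "index": a sorted list searched by binary search. This is not how PostgreSQL B-trees are implemented; `build_index`, `lookup`, `old_key`, and `new_key` are hypothetical names, and the two key functions merely stand in for two provider versions that disagree about ordering.

```python
import bisect

def build_index(values, key):
    """Return the values sorted under the given collation key (a toy index)."""
    return sorted(values, key=key)

def lookup(index, value, key):
    """Binary-search the index, assuming it is sorted under `key`."""
    keys = [key(v) for v in index]
    i = bisect.bisect_left(keys, key(value))
    return i < len(index) and index[i] == value

def old_key(s):
    # Stand-in for the ordering in effect when the index was built.
    return s.encode("utf-8")

def new_key(s):
    # Stand-in for a different ordering after a provider or version change.
    return (s.casefold(), s)

values = ["Banana", "apple", "cherry", "Date"]
index = build_index(values, old_key)

found_before = lookup(index, "apple", old_key)   # search matches build order
found_after = lookup(index, "apple", new_key)    # search assumes a new order

# Rebuilding under the new ordering (the analogue of REINDEX) repairs lookups.
rebuilt = build_index(values, new_key)
found_rebuilt = lookup(rebuilt, "apple", new_key)

print(found_before, found_after, found_rebuilt)  # True False True
```

The row is still physically present in the stale index, but the binary search walks in the wrong direction and never reaches it, which is the kind of silent wrong answer an ordering change can cause.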
Benefits of ICU
- Platform-independent semantics
- Performance
- Abbreviated keys
- Seems to be faster in general (at least compared to some versions of glibc)
- http://smalldatum.blogspot.com/2023/05/postgres-16beta1-looks-good-vs-sysbench.html
- Features (see https://www.postgresql.org/docs/devel/collation.html#ICU-CUSTOM-COLLATIONS):
- Case-insensitive and/or accent-insensitive
- Ignore punctuation
- Treat sequence of digits as a single number
- Not libc
- libc collations can change between versions
- limited ability to control which version of libc you use
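To make the feature list above more concrete, here is a minimal Python sketch of what each behaviour means for comparisons. It uses only the standard library and does not call ICU; the three key functions are hypothetical approximations written for this example, and real ICU collations handle many more cases (locale tailoring, multi-level comparison, and so on).

```python
import re
import unicodedata

def case_accent_insensitive_key(s):
    """Drop combining marks after NFD decomposition, then case-fold."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def ignore_punctuation_key(s):
    """Drop characters whose Unicode general category is punctuation (P*)."""
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("P"))

def numeric_key(s):
    """Treat each run of ASCII digits as one number, so 'file10' > 'file9'."""
    parts = re.findall(r"[0-9]+|[^0-9]+", s)
    # Tag digit runs with 0 and text runs with 1 so tuples stay comparable.
    return [(0, int(p), "") if "0" <= p[0] <= "9" else (1, 0, p) for p in parts]

same_word = case_accent_insensitive_key("Résumé") == case_accent_insensitive_key("resume")
same_ignoring_punct = ignore_punctuation_key("co-op") == ignore_punctuation_key("coop")
files = sorted(["file10", "file9", "file2"], key=numeric_key)

print(same_word, same_ignoring_punct, files)
```

With a plain byte-wise sort, "file10" would come before "file2" and "file9"; the numeric key orders them as file2, file9, file10.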
Risks
- Unknown unknowns
- Ordering differences
- (Though that can also happen with different libcs, or different versions of any provider)
- ICU has its own bugs
- u_versionToString(): https://unicode-org.atlassian.net/browse/ICU-22215 ("astonishingly bad" -- Robert Haas)
- some "obsolete" locales are no longer recognized in newer versions of ICU
- C
- fr_FR@euro
- de__PHONEBOOK
Done
- Canonicalization to language tag
- Consistency in interpretation of "und"
- Handle language tags in ICU < 54
- New built-in collations (Peter Eisentraut):
- UNICODE: root collation
- UCS_BASIC: code point order (memcmp for UTF-8)
- ICU rules
- Documentation
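The description of UCS_BASIC above as "code point order (memcmp for UTF-8)" relies on a property of UTF-8: comparing encoded bytes gives the same order as comparing code points. The Python sketch below checks this on a handful of arbitrarily chosen sample strings; the UTF-16 comparison is added here only as a contrast and is not something the page discusses.

```python
# Sample strings spanning ASCII, Latin-1, CJK, a fullwidth form, and a
# supplementary-plane character (chosen only for this illustration).
samples = ["a", "\u00e9", "\uff5e", "\U0001f600", "Z", "\u4e2d"]

# Order by Unicode code point values.
by_code_point = sorted(samples, key=lambda s: [ord(ch) for ch in s])

# Order by raw encoded bytes, as memcmp() would see them.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(samples, key=lambda s: s.encode("utf-16-be"))

same_as_utf8 = by_code_point == by_utf8_bytes
same_as_utf16 = by_code_point == by_utf16_bytes

print(same_as_utf8, same_as_utf16)  # True False
```

The UTF-16 order differs because U+1F600 is encoded with surrogates starting at 0xD83D, which compares lower than the code unit 0xFF5E even though the code point itself is higher.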
TODO
- Redefine iculocale=C/POSIX (and "C.anything"/"POSIX.anything"?) to mean memcmp/pg_ascii
- Make LOCALE (and --locale) apply to ICU
- the fact that locale doesn't apply to ICU creates a situation described as "maximally confusing"
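The first TODO item amounts to a name-matching rule. A minimal sketch of that rule follows, assuming exact, case-sensitive matching; `is_c_like_locale` and its `include_suffixed` flag are hypothetical and only mirror the open question of whether "C.anything" and "POSIX.anything" should be included. It is not PostgreSQL code.

```python
def is_c_like_locale(name, include_suffixed=True):
    """Return True if `name` would be treated as memcmp ordering in this sketch.

    include_suffixed reflects the open question in the TODO item: whether
    names such as "C.UTF-8" or "POSIX.UTF-8" should count as well.
    """
    base_names = ("C", "POSIX")
    if name in base_names:
        return True
    if include_suffixed:
        return any(name.startswith(base + ".") for base in base_names)
    return False

for candidate in ["C", "POSIX", "C.UTF-8", "en-US", "und"]:
    print(candidate, is_c_like_locale(candidate))
```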
Questions
Opinions about ICU technically?
- Quality?
- Performance?
- User experience?
Direction: opinions about pushing users toward ICU to be the preferred collation provider?
Timing: opinions about the steps that have been taken or should be taken in version 16? Defaults?
Notes
- ...