CollationStatus
Introduction
Collation in Postgres is actually two closely-related subsystems:
1. Text ordering: sort order for ORDER BY, and semantics of < <= >= >.
2. Casing and Character Classification: case conversion, case folding, and pattern matching (ILIKE, regex [[:alpha:]], etc.) semantics.
What follows is an outline of recent progress, and a general direction for the future.
Builtin Provider
The builtin provider was introduced in version 17 and offers locales that use code point ordering semantics combined with Unicode casing and character classification.
Code point ordering is not a natural language collation, so it's not ideal for human consumption, but has the following advantages:
- Much faster than natural language collation, with the same performance as the "C" locale
- Stable ordering, which avoids index inconsistencies
- Better for interacting consistently with other systems without needing to coordinate the collation library versions. This makes it less likely to cause problems with FDWs, some kinds of TableAMs or IndexAMs, etc.
Meanwhile, it still offers natural language semantics for casing and pattern matching. Those semantics are based directly on Unicode, and the Unicode data is updated with each major version of Postgres.
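To make the tradeoff concrete, here is a minimal sketch of what code point ordering looks like in practice. It is written in Python purely for illustration and is not Postgres code. It relies on the fact that Python compares strings by code point, and the word list is made up for the example.

```python
# Illustrative only: Python compares str values by code point, which is
# the kind of ordering described above. This is not Postgres code.
words = ["apple", "Zebra", "éclair", "zoo", "Apple"]

# Code point ordering: uppercase ASCII sorts before lowercase ASCII, and
# non-ASCII letters such as "é" (U+00E9) sort after all ASCII letters.
by_code_point = sorted(words)

# Comparing the UTF-8 encodings byte by byte yields the same order,
# so no language-specific collation rules are needed to compare.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Casing follows Unicode rules independently of the sort order.
lowered = [w.lower() for w in by_code_point]

print(by_code_point)                   # ['Apple', 'Zebra', 'apple', 'zoo', 'éclair']
print(by_code_point == by_utf8_bytes)  # True
print(lowered)                         # ['apple', 'zebra', 'apple', 'zoo', 'éclair']
```

A natural language collation would typically place "éclair" near "apple" rather than after "zoo", which is why this ordering is described as not ideal for human consumption.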
Improved Database Collation & Multibyte Support
In version 19, the database default collation is used consistently by other subsystems, like Full Text Search. Previously, many parts of Postgres still depended on the global libc locale, which could be confusing for users who chose ICU or builtin as the locale provider.
Similarly, there's better multibyte support, for instance when extracting ILIKE prefixes to use as an index key, and in the fuzzystrmatch contrib module.
Code Organization
The code has gone through several major refactoring efforts to fully define the semantics of each provider in method tables. Previously, the semantics were mostly defined by branching on the provider kind at the call site.
Method tables draw clearer boundaries of responsibility: callers do not need to understand or make assumptions about each provider. They also make it much easier to introduce a new provider or replace an existing one, which could allow providers to become extensible in the future.
Future
- Built-in case insensitive collation?
- Avoid pushing natural language collation into lower layers (i.e. indexes)?
- Dependency tracking, versioning, and migration to new provider versions?
- Extensible providers?