Introduction

Collation in Postgres actually comprises two closely related subsystems:

1. Text ordering: sort order for ORDER BY, and the semantics of the comparison operators <, <=, >=, and >.

2. Casing and Character Classification: case conversion, case folding, and pattern matching (ILIKE, regex [[:alpha:]], etc.) semantics.

What follows is an outline of recent progress, and a general direction for the future.

Builtin Provider

The builtin provider was introduced in version 17 and offers locales that use code point ordering semantics combined with Unicode casing and character classification.

Code point ordering is not a natural language collation, so it's not ideal for human consumption, but has the following advantages:

Much faster, with the same performance as the "C" locale
Stable ordering that does not change when a collation library is upgraded, which avoids index inconsistencies
Better for interacting consistently with other systems without needing to coordinate the collation library versions. This makes it less likely to cause problems with FDWs, some kinds of TableAMs or IndexAMs, etc.

Meanwhile, it still offers natural language semantics for casing and pattern matching. Those semantics are based directly on Unicode, and the supported Unicode version is updated with each major Postgres version.
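
As a rough illustration of that combination (an analogy, not Postgres code): Python also compares strings by code point, while its casing and classification methods follow Unicode rules. The sketch below uses that analogy; the word list and variable names are made up for the example.

```python
# Illustration only: Python's str comparison is by Unicode code point, which is
# analogous to (but not the same code as) the builtin provider's ordering.
words = ["zebra", "Apple", "\u00e9clair", "apple", "Zebra"]  # "\u00e9" is "é"


def code_point_key(s: str) -> list[int]:
    """Explicit code point sort key; matches Python's default str ordering."""
    return [ord(ch) for ch in s]


# Uppercase ASCII sorts before lowercase ASCII, and "é" (U+00E9) sorts after
# "z" (U+007A). That is not how a dictionary would order these words.
ordered = sorted(words, key=code_point_key)
print(ordered)

# Casing and character classification still follow Unicode rules, so
# non-ASCII letters are handled as letters.
lowered = "\u00c9CLAIR".lower()      # "ÉCLAIR" lowercased
is_alpha = "\u00e9".isalpha()        # "é" is classified as alphabetic
print(lowered, is_alpha)
```

The ordering is cheap and stable because it depends only on code point values, while the casing behavior depends on Unicode data.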

Improved Database Collation & Multibyte Support

In version 19, the database default collation is used consistently by other subsystems, like Full Text Search. Previously, many parts of Postgres still depended on the global libc locale, which could be confusing for users who chose ICU or builtin as the locale provider.

Similarly, there's better multibyte support, for instance when extracting ILIKE prefixes to use as an index key, and in the fuzzystrmatch contrib module.
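
To make the ILIKE prefix idea concrete, here is a minimal sketch in Python of one plausible rule: take the leading characters of the pattern up to the first wildcard or the first character that has distinct upper and lower case forms. The function name ilike_fixed_prefix and the exact stopping rule are assumptions for illustration, not the Postgres implementation. The point is that the scan works on whole characters, never on individual bytes of a multibyte encoding.

```python
def ilike_fixed_prefix(pattern: str) -> str:
    """Hypothetical helper: longest leading run of a pattern that matches
    literally regardless of case, usable as an index search prefix."""
    prefix = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "%_":          # wildcard: the fixed prefix ends here
            break
        if ch == "\\":          # escape: take the next character literally
            i += 1
            if i >= len(pattern):
                break
            ch = pattern[i]
        # A character with case variants could match more than one string,
        # so it cannot be part of a case-insensitive fixed prefix.
        if ch.lower() != ch.upper():
            break
        prefix.append(ch)
        i += 1
    return "".join(prefix)


print(ilike_fixed_prefix("\u65e5\u672c-42%abc"))  # multibyte but caseless characters
print(ilike_fixed_prefix("2024-\u00e9%"))         # stops before the cased letter
```

Python strings are sequences of code points, so the loop steps over characters by construction. A byte-oriented scan of UTF-8 would need extra care to avoid cutting a character in half.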

Code Organization

The code has gone through several major refactoring efforts to fully define the semantics of each provider in method tables. Previously, the semantics were mostly defined by branching on the provider kind at the call site.

Method tables draw clearer boundaries of responsibility, so callers do not need to understand or make assumptions about each provider. They also make it much easier to introduce a new provider or replace an existing one, which could allow providers to become extensible in the future.
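
The general shape of a method table can be sketched as follows. Postgres itself is written in C, where such a table would be a struct of function pointers; this Python version only shows the idea, and every name in it (CollationMethods, CODE_POINT, CASE_FOLDED, sort_strings) is hypothetical.

```python
from dataclasses import dataclass
from functools import cmp_to_key
from typing import Callable


@dataclass(frozen=True)
class CollationMethods:
    """Hypothetical method table: the operations every provider must supply."""
    compare: Callable[[str, str], int]
    lower: Callable[[str], str]


def _cmp(a: str, b: str) -> int:
    # Three-way comparison; Python compares str values by code point.
    return (a > b) - (a < b)


# Two illustrative providers. Neither mirrors a real Postgres provider exactly.
CODE_POINT = CollationMethods(compare=_cmp, lower=str.lower)
CASE_FOLDED = CollationMethods(
    compare=lambda a, b: _cmp(a.casefold(), b.casefold()),
    lower=str.lower,
)


def sort_strings(methods: CollationMethods, items: list[str]) -> list[str]:
    """Caller code: it uses the table and never asks which provider it has."""
    return sorted(items, key=cmp_to_key(methods.compare))


print(sort_strings(CODE_POINT, ["b", "a", "A"]))
print(sort_strings(CASE_FOLDED, ["b", "a", "A"]))
```

Adding another provider in this sketch means building one more table; sort_strings does not change, which is the boundary the refactoring aims for.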

Future

Built-in case insensitive collation?
Avoid pushing natural language collation into lower layers (i.e. indexes)?
Dependency tracking, versioning, and migration to new provider versions?
Extensible providers?
