Locale data changes

From PostgreSQL wiki
Jump to navigationJump to search

The following information focuses on operating systems using the GNU C library (glibc), which includes most popular Linux distributions. All versions of PostgreSQL are affected. Other operating systems could in principle have the same issues, but we have not gathered any information about that.

PostgreSQL uses locale data provided by the operating system’s C library for sorting text. Sorting happens in a variety of contexts, including for user output, merge joins, B-tree indexes, and range partitions. In the latter two cases, sorted data is persisted to disk. If the locale data in the C library changes during the lifetime of a database, the persisted data may become inconsistent with the expected sort order, which could lead to erroneous query results and other incorrect behavior. For example, if an index is not sorted in a way that an index scan is expecting it, a query could fail to find data that is actually there, and an update could insert duplicate data that should be disallowed. Similarly, in a partitioned table, a query could look in the wrong partition and an update could write to the wrong partition. Therefore, it is essential to the correct operation of a database that the locale definitions do not change incompatibly during the lifetime of a database.

Operating system vendors, and in particular the authors of the GNU C library, change the locale data from time to time in minor ways to correct mistakes or add support for more languages. While this in theory violates the above rule, it has historically affected few users and has not received wide attention. However, in glibc version 2.28, released 2018-08-01, a major update to the locale data has been included, which can potentially affect the data of many users. It should be noted that the update itself is legitimate, as it brings the locale data in line with current international standards. But problems are bound to happen if these updates are applied to an existing PostgreSQL system.

The integration of glibc updates into Linux distributions is the domain of the operating system vendor. We expect that vendors of long-term support Linux distributions will not apply incompatible locale updates to their distribution within a given release, but this is only an expectation, as we cannot predict or influence future actions. Moreover, PostgreSQL currently has no way to detect an incompatible glibc update. Therefore, some manual care is required in planning any updates or upgrades.

A recording of the 2024 pgconf.dev talk on this topic is available at https://www.youtube.com/watch?v=KTA6oau7tl8

What is affected

Situations that are potentially affected involve changed locale data being applied to an existing instance or binary-equivalent instance, in particular:

  • Changing locale data on a running instance (even if restarted).
    • This includes in particular upgrading the Linux distribution to a new major release while keeping the PostgreSQL data directory around.
    • Using pg_upgrade (e.g. when upgrading both the operating system and the PostgreSQL major at the same time) does not avoid the problem.
  • Using streaming replication to a standby instance with different locale data. (The standby is then potentially corrupted, but the primary is fine.)
  • Restoring a binary backup (e.g., pg_basebackup) on a system with different locale data.

Not affected are situations where the data is transported in a logical (not binary) way, including:

  • Backups/upgrades using pg_dump
  • Logical replication

Testing collation

Using the OS "sort" utility is an easy way to see if the collation has changed:

( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.UTF-8 sort

These two strings will sort differently on different locale versions. Compare the output on the old and new OS version.

Note: If these two sort the same, there might still be other differences.

Affected data types

Affected:

  • btree indexes on text, varchar, char (including multi-column indexes where one column is textual)

Unaffected:

  • bytea
  • tsvector gin indexes
  • pg_trgm indexes
  • numeric data types: int, bigint, numeric, float, ...
  • custom data types like geometry (PostGIS)

What to do

When an instance needs to be upgraded to a new glibc release, for example to upgrade the operating system, then after the upgrade

  • All indexes involving columns of type text, varchar, char, and citext should be reindexed before the instance is put into production.
  • Range-partitioned tables using those types in the partition key should be checked to verify that all rows are still in the correct partitions. (This is quite unlikely to be a problem, only with particularly obscure partitioning bounds.)
  • To avoid downtime due to reindexing or repartitioning, consider upgrading using logical replication.
  • Databases or table columns using the “C” or “POSIX” locales are not affected. All other locales are potentially affected.
  • Table columns using collations with the ICU provider are not affected.

What indexes are affected

Use this SQL query in each database to find out which indexes are affected:

SELECT DISTINCT indrelid::regclass::text, indexrelid::regclass::text, collname, pg_get_indexdef(indexrelid) 
FROM (SELECT indexrelid, indrelid, indcollation[i] coll FROM pg_index, generate_subscripts(indcollation, 1) g(i)) s 
  JOIN pg_collation c ON coll=c.oid
WHERE collprovider IN ('d', 'c') AND collname NOT IN ('C', 'POSIX');

What Linux distributions are affected

To help users assess the situation with their current operating system, we have assembled the following information. Note again that this is only a report of the current situation and that we cannot influence what these vendors do or might do in the future.

Debian

Version 8 (jessie) and 9 (stretch) use the old locale data. We don’t expect any incompatible changes inside those releases. Upgrades from version 8 to 9 are safe.

Version 10 (buster) uses the new locale data. Therefore, caution will be necessary when upgrading.

See also:

Ubuntu

Ubuntu up to version 18.04 (bionic) uses the old locale data.

The new glibc 2.28 locale data is introduced in version 18.10 (cosmic) (not LTS). Upgrading from bionic-or-older to cosmic-or-newer needs the mitigation steps outlined above.

RHEL/CentOS

Versions 6 and 7 use the old locale data. Upgrades from version 6 to 7 are safe, unless the de_DE.UTF-8 locale is used, which has seen a similar change between those releases. (All other locales, including other de_*.UTF-8 locales like de_AT, are safe.)

Version 8 uses the new locale data. Therefore, caution will be necessary when upgrading.

SuSE Linux Enterprise

SLE12 uses the old locale data.

SLE15 up till service pack 2 (SLESP2) shipped with glibc-2.26 and used the old locale data. SLE15 Service Pack 3 updated the glibc version to 2.31, so uses the new locale data. Therefore, caution will be necessary when upgrading.

macOS

While macOS is not using glibc it shows a similar problem due to extensive changes in many collations between macOS 10.15 Catalina and macOS 11 Big Sur. Old versions had symlinks for all their UTF8 LC_COLLATE files pointing to a dummy with essentially byte order collation, like 'C'. Starting with macOS 11, Apple changed about half of them to point to their ISO8859-1 (Latin 1) counterparts instead:

ls -l /usr/share/locale/*.UTF-8/LC_COLLATE
( echo "5£"; echo "£5" ) | LC_COLLATE=en_US.UTF-8 sort

add yours here

References