Abbreviated keys glibc issue

From PostgreSQL wiki
Revision as of 00:27, 30 March 2016 by Aglio (talk | contribs) (first draft of wiki page)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

PostgreSQL 9.5 Abbreviated Keys Regression

PostgreSQL 9.5.0 shipped with a feature which sped up sorts and indexing for all character columns (TEXT, VARCHAR and CHAR) by using an algorithm called "abbreviated key sorting". This was quite a compelling feature because it sped up sorts and index builds for some users as much as 20X.

As of 9.5.2, the project has had to disable this feature for index builds for any non-C locale because it produces corrupt indexes in some cases. Users may need to REINDEX some indexes.

This issue does not affect users of version 9.4 and earlier.

What's the bug?

Users whose text data uses a collation other than "C" locale found that, in some cases, indexes built with the abbreviated key feature would fail to find existing rows. In worst cases, this could cause unique key violations. An investigation in pgsql-bugs identified the problem.

In real language locales, the abbreviated keys feature uses the strxfrm() function to create a sortable hash key for the first part of the string. After the bug report, developers discovered that many versions of glibc (Linux's implementation of the C library) have buggy implementations of strxfrm() that, in some locales, can produce string comparison results that do not match strcoll(), resulting in indexes which return inconsistent results. This certainly applies to versions of glibc before 2.22, and may apply to other versions as well. Since older versions of glibc are far more widespread than the newer versions, it is not safe to depend on reliable behavior. See [the Red Hat bug](https://bugzilla.redhat.com/show_bug.cgi?id=1320356) for more discussion of this.

This issue does not affect the simplified "C" locale because it does not use either of those functions. It probably does not affect many locales (there is no evidence of issues with en_US, for example), but as there is no way to do exhaustive testing on all locales and platforms, it is considered unsafe in any non-C locale.

What's my locale?

You can check your database collation and locale with the \l shortcut from psql, for example:

   postgres=# \l+
                      List of databases
     Name     |  Owner   | Encoding | Collate | Ctype |  Size   | 
 -------------+----------+----------+---------+-------+---------+-
  libdata     | libdata  | UTF8     | pt_BR   | pt_BR | 18 MB   | 
  postgres    | postgres | UTF8     | C       | C     | 6976 kB | 
  subscribers | postgres | UTF8     | C       | C     | 7169 kB | 
 (5 rows)

For this example, the "subscribers" database could safely use abbreviated key indexing, whereas the "libdata" database could not.

Since PostgreSQL allows you to set collations per column, however, it's worth searching for indexes on columns with non-default collations. A query to do that is as follows:

 COLLATION SEARCH QUERY HERE
 

What do I have to do?

After updating your server to 9.5.2 (or later), you will want to REINDEX each index which was created with non-C locale just in case it is corrupt due to this issue. Since REINDEX is a blocking operation, however, you may wish to instead do this as a two-step concurrent operation, for example:

 CREATE INDEX CONCURRENTLY table1_charcol2_a ON table1(charcol2);
 DROP INDEX CONCURRENTLY table_charcol2;

How critical doing this reindexing is depends on whether you believe you are likely to experience the issue.

When will I get my faster index builds back?

Nobody knows at this point. The bug is a hard problem to solve, not only because of the number of versions and collations involved, but because there are stability and consistency issues with strxfrm() and strcoll(). For example, see this thread.

If you have ideas to offer, please join the pgsql-hackers thread discussing the issue.