Key normalization

From PostgreSQL wiki

Jump to: navigation, search

This page is a draft design document for an enhancement to the representation of B-Tree index tuples called key normalization, which is intended to enable a number of other optimizations. Because of the fact that these B-Tree optimizations are interrelated, it isn't currently clear which should be implemented first, or how to do so without disadvantaging other cases, which may hold back implementations of any single one of them. The goal of this document is to provide clarity.

This wiki page was raised on a pgsql-hackers thread in July of 2017.



Key normalization is an enhancement to the representation of index tuples. It should only be applied to those index tuples that "separate the key space": high key items at the leaf level, and all page items (both high keys and pointers to the lower level of the tree) within internal pages. Index tuples that separate the key space are called pivot tuples in the nbtree README. Normalization should not be applied to "real" items that point to tuples in the heap (non-pivot tuples), since key normalization will generally leave the index tuple in a state where it's impossible to reclaim the original data.

The basic idea is that the representation of the final pivot index tuple is made suitable to perform all comparisons as a simple memcmp() operation within routines like _bt_compare() (actually, strcmp() style semantics should be used, without using a real NUL-terminated string). As an alternative to the nbtree opclass working with an on-disk representation structured in a rather unconstrained way, along with a custom comparator for use with that representation, the opclass can instead generate a normalized key from a base index tuple. It generates it using an encoding scheme (this will probably become "B-Tree support function 4"). Normalized keys are allowed to be larger than the original index tuple.

Index scans for indexes that have opclasses that all support key normalization can let nbtree generate whole binary strings that give the same answer to comparisons with memcmp() as the original representation would with a scankey consisting of one or more standard comparators (B-Tree support function 1). Index scans work the same way as before on the leaf level, except that the high key check also uses a normalized key there.

Optimizations enabled by key normalization

Key normalization is an enabler of several further optimizations, because binary strings are extremely flexible. Normalized keys are essentially strings of bytes, or strings of wider integers (big endian) that can be easily concatenated, inverted, and truncated. Specific optimizations include:

Simpler comparisons:

  • The simple fact that a memcmp() is all that is needed in all cases, without exception.
    • No complex index_getattr() calls needed in the majority of calls to _bt_compare(). They're only needed for the conventional index tuple representation that items that actually point to the heap will continue to use.
    • Normalized keys are a flattened, one-pass representation. A single pointer to the current start position in the scan key's normalized key can therefore be maintained. Once the start position for future comparisons advances, it never needs to go backwards. In other words, normalized keys enable skipping earlier scan key bytes that are known to not be interesting to internal page binary search comparisons at this level of the tree; we can reason about the properties of keys transitively, from the parent's downlink. (This can usefully interact with prefix compression and abbreviated keys in the ItemId/item pointer array.)

Smaller keys (on average):

  • Prefix compression + suffix truncation on internal pages.
    • Normalized keys are a flattened, one-pass representation. They provide a way to combine prefix compression and suffix truncation that is far simpler and at least as effective as any alternative approach.
    • Smaller items in internal pages increases fan-in, possibly making B-Tree shorter overall.
    • We might as well remove trailing bytes (i.e. perform suffix truncation) to make a new high key at the leaf level as small as possible while still correctly separating what belongs in the left and right leaf pages after the split.
    • Suffix truncation is a classic technique, described first in a 1977 paper by Rudolf Bayer, who is generally credited with having invented B-Trees in the early 1970s. Berkeley DB has long supported this technique [1].
  • No need for multiple varlena headers in multi-column cases, or even for one. IndexTuple size is all that you need.
    • Even on alignment-picky platforms, multiple attributes don't have to access an aligned varlena header at multiple offsets per IndexTuple.
  • Prefix compression in internal pages, further increasing fan-in.
    • A "low key" is needed in addition to the existing "high key" in order to make prefix compression work without complicated bookkeeping, and having normalized keys makes that less expensive, and far simpler.

VACUUM/bloat control:

  • Making some heap TIDs part of the key space, participating as an implicit "final column", so that equal index TIDs reliably stay in heap order.
    • Has benefits for controlling bloat in some workloads.
    • Increases locality of access with the heap for index scans when there are duplicates.
    • Suffix truncation is really needed so that adding a heap TID in internal pages doesn't kill fan-in for the majority of cases that will see no benefit from that.

CPU cache efficiency:

  • Abbreviated keys in the ItemId array [2].
    • Cache misses during B-Tree binary searches are thought to be a problem for some workloads [3].
    • There is space for these that is reclaimable from ItemIdData.lp_len -- this can just be the first 15 bits of the normalized key. This is redundant with IndexTuple length, and is only accessed in a few places within nbtree and bufpage.c index routines today. 15 bits gets you surprisingly far within internal pages.
    • Internal page prefix compression can be used to make it far more likely that high entropy abbreviated keys are always available. Prefix compression may be an essential enabler of abbreviated keys, in fact. Since the prefix compression "low key" is the same as the downlink in the parent, it doesn't need to participate in comparisons at all. We can just advance our offset into the flattened normalized scan key such that the comparisons of abbreviated keys are against the correct point in the scan key.

Building normalized keys

Key normalization is an "all or nothing" optimization. There should probably be a support function 4 (key normalization routine) for all the core datatypes that currently have sortsupport, which is where the optimization makes sense. Even simple int4/int8 indexes can be effectively optimized in this way, as their comparisons won't be notably more expensive, but the benefits with many duplicates can still be considerable.


Steps in generating a normalized key with one attribute:

  1. Place NULL bit in first bit in first byte.
    1. We reserve one "NULL bit" per attribute. If the attribute is NULLS FIRST, then NULL columns do not have the bit set. If it is NULLS LAST, then NULL columns have the bit set.
    2. If a value is NULL for some attribute x, there is no need to include any placeholder/padding bytes after the NULL bit. We can skip on to the next attribute x + 1 immediately; the very next bit is the x + 1 NULL bit. The x + 1 opclass supplied normalized key, if any, follows. This works because x + 1 will only ever be compared against other index tuples whose x attribute is also NULL.
  2. Ask an opclass for a normalized key based on a base attribute value -- no need to pass any information other than the value itself, and the attribute's collation.
  3. DESC is encoded directly -- just get the complement of the opclass-supplied encoded binary string for that attribute.

Normalized keys can and will be larger than the base index tuple that they are based on, although they will often actually be shorter, even when truncation or compression are not used.

Notes about opclass encoding schemes for common types

  • Types like int4, whose on-disk representation is two's complement signed integers (little endian or big endian, depending on platform) are encoded as big endian unsigned integers. This can happen without any storage overhead compared to the original representation for that portion of the final normalized key. This can be done easily: perform integer conversion to unsigned, add "INT_MAX + 1" to that, and do a byteswap iff on a little-endian platform [4].
  • Booleans only need a single bit.
  • Text will be supported, but only when the C collation or an ICU supplied collation are in use. Variable-length types like text must having a terminating NUL byte. The opclass doesn't add the NUL byte, because we don't want to get confused if and when the opclass binary key is inverted (where we actually compute its complement and store that instead) in order to support DESC.

It may or may not be worth versioning opclass encoding schemes separately from versioning text collations.

Breaking the 1/3 of a page size restriction


If one attempts to insert values above about 1/3 of BLCKSZ into a B-Tree index today, an error is raised: "Values larger than 1/3 of a buffer page cannot be indexed" [5]. This is necessary because we need to be able to store at least 3 items on every page (high key, plus two distinct real items), and this is a simple way to enforce that.

This raises a problem for us: If a normalized key can be larger than the original, and can in theory even be arbitrarily larger, then why is it okay that that could lead to a situation where we seemingly have to break this rule? It would be especially bad if the error could be seen by an innocent party that is only inserting a small key that would produce a small normalized key, since that's completely unpredictable and cannot even be worked around. It's unpredictable because the choice of split point (which implies what the new high key for new left page will be -- it comes from the first item on the right) has little to do with the inserter.


The solution here is counter-intuitive: silently truncate any bytes at the end of the normalized key that go over the 1/3 of a page restriction. This needs to happen regardless of whether the normalized key is for a scankey, or is for storage as a leaf high key (which is copied to become the downlink in the parent in the second phase of page split).

This is correct because:

  1. When the untruncated scankey normalized key for some index scan isn't >= number of bytes we can fit, but that isn't the case for one or more separator keys in the index (they were truncated or precisely filled maximum string size), then it's clear that nothing is broken because any truncated bytes would not have participated in any comparison anyway.
  2. When it's the other way around, then it's also clear that nothing is broken, again because those truncated bytes would not have participated in any comparison anyway.
  3. In all other cases, we consider the keys equal. We have always had a special adaptation to the original Lehman & Yao invariant which saves us by moving one to the left when keys are equal -- this only happens within internal page binary searches, which is exactly what we want, too.
  4. We treat equal leaf high keys as a special case when they're truncated -- inserters can no longer assume that either sibling page is just as good when this happens, leeway that we take advantage of today allowed by special duplicate handling in Postgres. With heap TID added to the keyspace as a "unique-ifier" (described elsewhere in this document), this really should only happen in rare cases where truncation was needed.

This "special adaptation to the original Lehman & Yao invariant" is needed today to support equal keys, since Lehman & Yao assume that keys cannot be fully equal (they must all be unique). Our (truncated) keys are "equal" in the sense that at least the first few KBs are equal. We can start on the next level down by following the downlink "one to the left" from an internal page, just as we do for actually-equal items today.

Extract on this from the top of the nbtree README:

Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page. This does not work for nonunique keys (for example, if we have enough equal keys to spread across several leaf pages, there *must* be some equal bounding keys in the first level up). Therefore we assume Ki <= v <= Ki+1 instead. A search that finds exact equality to a bounding key in an upper tree level must descend to the left of that key to ensure it finds any equal keys in the preceding page. An insertion that sees the high key of its target page is equal to the key to be inserted has a choice whether or not to move right, since the new key could go on either page. (Currently, we try to find a page where there is room for the new key without a split.)

This "use Ki <= v <= Ki+1 instead" approach essentially works by pretending that every item really is unique, though it just so happens that nobody is ever interested in searching based on a final attribute that ensures uniqueness. If no one actually cares about the imaginary final attribute for the purposes of searching, then it doesn't actually matter that it doesn't really exist (and won't actually affect ordering at the leaf level, where no special tricks can maintain the useful illusion of uniqueness). This is true at least as far as correctness is concerned.

The "heap TID in internal pages" optimization, if also implemented, would have the effect of making this "use Ki <= v <= Ki+1 instead" business unnecessary by ensuring that only non-unique keys make it into internal pages (all keys would be unique, actually, because heap TID would be a first class part of the keyspace). Key normalization enables suffix truncation which enables removing this alteration to the Lehman & Yao design, but not quite, because key normalization is dependent on the alteration itself to handle one edge case!

The solution described here depends on having a standardized representation that we always just memcmp(), because that way the implementation is able to reason about the transitive properties of partial keys.

How big can normalized keys get, and is it worth it?

Opclass authors are not contractually obligated to keep normalized key size below any particular limit, such as limiting the size to no greater than the size of the original datum. Still, for key normalization to be at all effective, the storage overhead cannot be excessive. While it is true that techniques like suffix truncation will greatly reduce the size of normalized keys on disk in practice, the worst case is still important, though this is only likely to matter when there are many duplicates, or many items that are duplicated in early attributes. There will be cases when the total fan-in is hurt by key normalization. The costs need to be measured, and weighed against the other possible benefits.

Even if we end up with substantially larger keys, it will probably still be worth it once both prefix compression and suffix truncation are both implemented. The effective size of index tuples can still be far smaller on average.

ICU ucol_getSortKey() key size

ICU is actually quite good at producing small strxfrm() style binary keys. Performance benchmark:

Latin text requires about 1.47x as much storage when stored as binary sort key generated by ICU, per their benchmark. Overhead for East Asian languages seems to be even less than that, when weighed against the storage overhead in bytes using UTF-8. The size of its binary sort keys is far smaller than those generated by glibc's strxfrm(). Note that this is not true when ucol_nextSortKeyPart() is used instead of ucol_getSortKey(), because ucol_nextSortKeyPart() doesn't use order-preserving compression.

If suffix truncation is used, then separator keys will only be as many bits as is needed to store the would-be normalized key for the first item on the right up to and including its first distinguishing bit. The new separator (the new leaf highkey) must be greater than the last item on the left, but can be a perfectly minimal separator of the keyspace, down to the last bit.

ICU, text equality semantics, and hashing

One blocker to using strxfrm() style binary keys today is the strcmp() tie-breaker that is used for strcoll() [6]. It clearly would not be okay if the original string had to be appended on top of the ICU generated binary key in order to ensure correct behavior. There is interest in removing this restriction anyway, in order to implement true case insensitive collations. The hash opclass hash function of the text type (which is expected to have equivalent equality semantics to that of the text default B-Tree operator class) could hash the binary key itself, rather than the original [7], at least for some collations.

There is a way to get ICU to promise to never indicate that two non-identical strings are equal -- adding an "identical level". This is exactly equivalent to just adding the original value to the key yourself anyway, which is considered bad practice by the Unicode consortium [8] on both performance and usability grounds, and so it might well be a better idea to do it the other way around and change hashing instead. Unfortunately, this would necessitate a REINDEX when Postgres 10 ICU users upgraded to a version that made that change, but it might be a good idea to do so sooner rather than later. Besides, not that many people will use ICU with Postgres 10, because it only works with per-column collations in that version, although this restriction will probably be lifted soon [9].

Hashing this way probably wouldn't be too slow. ICU provides its own ucol_keyHashCode() function, which produces 32-bit hashes, and has been part of the stable API for a long time. It appears to be designed specifically for hashing sort keys [10]. It's just a hash function, which isn't particularly worth adapting, but that it exists illustrates that this approach is idiomatic. Also, we could probably get away with hashing only partial sort keys, or hashing sort keys that were generated based only on primary weights, which are cheaper to generate yet almost certainly no less useful as hash values.

Most importantly, the benchmark linked to above shows that ICU's strcoll() equivalent has far superior performance to that of glibc -- it's frequently over twice as fast. This is before we even factor in that ICU's strcoll() has a variant that doesn't require a NUL-terminated string, unlike glibc's strcoll(), which has to copy a text datum's data to a buffer on the stack to NUL-terminate first.

Suffix truncation of normalized keys

High keys at the leaf level, and all internal items, are all merely separator keys. A pointer in a parent page has index tuple values that represent a lower bound on child page. It's immediate neighbor downlink index tuple (if any) is a lower bound for its own child, but is also an upper bound of the first downlink's child. Together, these values represent the minimum and maximum values that belong on the child pointed to by the first downlink.

Even today, these are just separators. For example, the values that appear in the root page of an index must be based on some index tuple that was inserted at some point in the past, but there is no reason to think that that original inserted index tuple wasn't removed by VACUUM long ago. The job of separators is to separate, and nobody really cares about anything else, provided they do that job correctly.

Lehman & Yao/PostgreSQL style B-Tree without suffix truncation

This suggests a pretty obvious optimization: Why not just pick a contrived value that accomplishes this with the smallest possible storage overhead in the first place? This optimization is suffix truncation. A B-Tree implementation with this optimization is sometimes said to be a "prefix B-Tree", which is not to be confused with an implementation that implements prefix compression, a rather different optimization that is complementary.

Lehman & Yao/PostgreSQL style B-Tree with suffix truncation

The logic for picking the point we truncate at can be completely generic with normalized keys, but would be complicated with other approaches based on storing values that are compared using the original representation.


When we run out of space on a leaf page, and cannot compact the page with LP_DEAD recycling [11], it's clear that a leaf page split is needed.

  1. Have the pick split logic give some weight to how early distinguishing bytes appear for the first item on the right page after a page split.
    1. Ideally, it will differ from the left hand item (the last on the right) fairly early on.
    2. We ought to be willing to make a small sacrifice in how well balanced the page split will be in terms of space utilization in support of this new goal.
    3. We don't even bother making totally optimal decisions about space utilization today [12], so this doesn't seem too sensitive.
  2. Perform key normalization on the item at the left of the split point, and at the right of the split point.
  3. Store as our new high key in the new left leaf page the minimal distinguishing normalized key, down to the bit.
    1. In essence, we truncate the right item so that it's one bit different to the left item.

Note that this only happens on the leaf level; never at the internal levels. This is our only opportunity to truncate; truncating at higher levels is wrong.

Note also that we're now done -- no extra steps are needed on the next level up. This is because during the second phase of a page split (where we insert a new downlink to the new right half), the implementation already gets its new downlink for the right page by retrieving the high key in the new left page on the leaf level [13]. All internal items end up as truncated normalized keys as a consequence of inserting these extra steps into leaf page splits. We end up with separator keys that have a consistent representation (they use normalized keys), and with all real keys that have a consistent representation (they use the ordinary representation -- same as today).

Prefix compression of normalized keys

Suffix truncation will be particularly effective if we can also store a prefix on each internal page -- a "low key" can be stored, which is conceptually the exact opposite of a high key. While it might be a bit more efficient to store a key lower bound that works with all index tuples that are currently on the page, it will be far simpler to use a value that will work with any possible index tuple that belongs on the page. This should be the same as the internal page's parent's downlink for the page (could be negative infinity at leftmost page on a level, defeating our simple prefix compression).

In practice, most normalized key index tuples only need to store the bytes that distinguish their entry. This could easily be as little as 1 or 2 bytes, even where the typical size of uncompressed/untruncated normalized keys is far larger. Normalized keys are effective despite potentially being significantly larger than the original representation because that overhead can easily be made up for through page-level compression, and suffix truncation. A new low key will be needed on the right page following an internal page split, and the items that end up in the right half of the split will need to be re-compressed, but with normalized keys that's guaranteed to leave us with index tuples that are no larger than they were with the original compressed representation. This really does greatly simplify space accounting, dealing with the 1/3 of a page restriction, etc. Keeping subtle, hard to test page split code as simple as possible almost seems essential.

Using existing "minus infinity" index tuple as a low key

Currently, the first item on all internal pages is the minus infinity tuple. It's always truncated to zero attributes, and there is hard-coded logic within _bt_compare() to treat it as less than any possible index tuple during a descent of the tree. However, only the minus infinity tuple on the leftmost page on an entire level is truly minus infinity. All other internal pages have minus infinity tuples that are better understood as "minus infinity within the subtree rooted at this page". Logically, nothing stops us from not truncating away all attributes when creating a new first item for the right half during an internal page split. The page split code could just not do that (we only need to tweak _bt_pgaddtup() and _bt_sortaddtup(), so that they don't truncate). This would naturally work as a low key, since it's naturally a stable lower bound, since the truncated minus infinity item's original key values live on in the parent page -- it becomes the downlink in the parent.

Leaf pages

A low key for prefix compression on leaf pages might be a good idea too, but that will have to work in a fairly different way, and in any case is not within the scope of this wiki page.

Making all items in the index unique by treating heap TID as an implicit last attribute

Having unique keys seems to be a widespread convention that B-Trees follow (e.g., it is assumed by the Lehman & Yao algorithm). This can mean adding an artificial tie-breaking "unique-ifier" for user keys that don't have that property. After all, systems that perform index tuple deletion in line with DELETE DML (probably almost all others) need to be able to quickly and reliably identify which index tuple to kill; not all indexes will be unique indexes.

Even today, CREATE INDEX has tuplesort.c treat heap TID as a tie-breaker [14]. This optimization is about reliably preserving that property as an index is modified, in order to realize certain particular performance benefits, including but not limited to the obvious benefit of having locality of access between leaf pages and the heap pages that they point to.

The idea has come up a number of times before; Tom Lane briefly described some of the tricky details of making it work in 2013 [15]. Funnily enough, the issue sometimes comes up when somebody complains about the inaccuracy of pg_stats.correlation with many duplicates [16][17]. Vadim B. Mikheev, who wrote much of the original nbtree code, was an early advocate of doing this [18].

There is a 2018-06-14 prototype patch that implements this, and also implements suffix truncation, although it is not based on key normalization. This limited whole-attribute implementation of suffix truncation has shown significant promise as a way of improving space utilization during random insertions.

Having non-HOT UPDATEs go to the correct leaf page immediately when there are many duplicates/bloat

If an insertion scankey for an insert that originates from UPDATE DML contains the new heap TID, certain optimizations are possible. We don't have to wade through many duplicates at the leaf level to find freespace to store our new index tuple. Instead, we jump straight to the correct leaf page from its parent page, not going through potentially several siblings prior to a leaf page that has free space.

The cost of having wider index tuples in the internal pages in the face of many duplicates (where suffix truncation won't work) at a minimum buys us being able to skip through all items on neighboring, full leaf pages most of the time.

Avoiding unnecessary unique index enforcement

The "jump to the correct leaf page immediately" optimization is always possible when the index isn't a unique index, and is possible even if the index is a unique index once the executor consults its bitmapset of updated columns and finds that it didn't change relevant indexed attributes within the preceding heap update (when the new row version was created whose TID the new index tuple is to point to). If the row is already updated, and the values haven't changed in the index, then the row lock in the heap is sufficient as a "value lock" already, so checking for conflicting visible tuples is pointless, and potentially rather expensive. It's expensive in part because multiple buffer locks are held when the _bt_check_unique() checks are underway (an exclusive buffer lock on the first leaf page the value could be on [19], perhaps an additional shared buffer lock for later leaf pages held concurrently with the first leaf page lock [20], and finally a shared lock on a heap page, where we find visibility information [21]). All of these buffer locks may be held during random heap I/O.

When trying to find a split location in the event of many duplicates, we can move right many times, but we will eventually "get tired" [22]. This means that there is a 1% probability of giving up and accepting a page split every time freespace isn't found on a page. We can give up and do a split, rather than continuing in the hope of finding free space. This was found to make a big difference in worst case performance back in August of 2000 [23]. This suggests that it's quite possible for there to be a very large number of duplicate leaf pages in real world cases.

Retail IndexTuple deletion

Unique identifiers for all index tuples may allow us to teach VACUUM to do "retail IndexTuple deletions" when there are very few dead tuples. Rather than scanning the entire index in physical order through the ambulkdelete interface, each tuple to be killed would be located by using a real index scan. Along with the visibility map, this could enable very frequent VACUUMs that aggressively control bloat. This is something that Andres Freund has talked about before (this is from memory; couldn't find link to the archives). Vadim also suggested this approach be followed at a much earlier time [24]

VACUUM and nbtree page deletion

Postgres B-Trees do not have page compaction. In other words, under-utilized pages cannot be merged together, unless and until the left hand sibling is completely empty, and so can undergo page deletion [25]. Some other systems can merge underutilized nodes, but it is hard to make work with Lehman & Yao's technique, and is not in the scope of this wiki page.

There are probably UPDATE heavy workloads where VACUUM does a bad job of keeping the index balanced in the face of short term periods where recycling can't go ahead (say, during a long running transaction). Since the order that we reclaim space from leaf pages containing duplicates to make way for new duplicate index tuples is essentially undefined, leaf pages can never be entirely deleted, despite only having a very small number of items. If the bloat was contained to a subset of the keyspace, VACUUM would have a better chance of deleting pages, and ultimately removing downlinks in the parent that were added during a long running transaction. Matters are generally not that likely to get so bad as to have many leaf pages full of duplicates, but when that does happen it's important that a short term problem does not lead to long term non-reversible damage to how balanced the B-Tree is.

There is a snippet query available that can usefully summarize the keyspace of a B-Tree, which uses contrib/pageinspect. This can be used to provide feedback about how balanced a B-Tree index is over time.

The optimization may also help with index bloat due to interactions with HOT pruning and/or nbtree LP_DEAD recycling. This is more speculative.

Personal tools