Key normalization
This page is a draft design document for an enhancement to the representation of B-Tree index tuples called key normalization, which is intended to enable a number of other optimizations. Because these B-Tree optimizations are interrelated, it isn't currently clear which should be implemented first, or how to implement one without disadvantaging the others, which may hold back implementation of any single one of them. The goal of this document is to provide clarity.
The topic of this wiki page was first raised on a pgsql-hackers thread in July of 2017.
June 2019 update: Following the Postgres 12 improvements to nbtree, key normalization now seems like it is only likely to be compelling in order to add abbreviated keys to internal nbtree pages. This approach seems like a good compromise, since it will allow search style scan keys to continue to work with leaf level high keys (see 29b64d1de7c77ffb5cb10696693e6ed8a6fc481c) without special effort, while still having most of the upsides of a design that actually stores whole normalized keys in internal pages.
Definition
Key normalization is an enhancement to the representation of index tuples. It should only be applied to those index tuples that "separate the key space": high key items at the leaf level, and all page items (both high keys and pointers to the lower level of the tree) within internal pages. Index tuples that separate the key space are called pivot tuples in the nbtree README. Normalization should not be applied to "real" items that point to tuples in the heap (non-pivot tuples), since key normalization will generally leave the index tuple in a state where it's impossible to reclaim the original data.
The basic idea is that the representation of the final pivot index tuple is made suitable for performing all comparisons as a simple memcmp() operation within routines like _bt_compare() (actually, strcmp()-style semantics are needed, though without relying on a real NUL-terminated string). Rather than having the nbtree opclass work with an on-disk representation structured in a rather unconstrained way, along with a custom comparator for use with that representation, the opclass can instead generate a normalized key from a base index tuple, using an encoding scheme (this will probably become "B-Tree support function 4"). Normalized keys are allowed to be larger than the original index tuple.
Index scans on indexes whose opclasses all support key normalization can have nbtree generate whole binary strings that give the same answer to memcmp() comparisons as the original representation would give to a scan key consisting of one or more standard comparators (B-Tree support function 1). Index scans work the same way as before at the leaf level, except that the high key check also uses a normalized key there.
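As a rough illustration, a comparison of two normalized keys might look like the following sketch. The function name and signature are hypothetical, not taken from nbtree; it just shows the memcmp()-plus-length-tiebreak ("strcmp() style") semantics described above.

```c
#include <string.h>

/*
 * Hypothetical comparison of two normalized keys.  memcmp() decides the
 * order byte by byte; when one key is a prefix of the other, the shorter
 * key sorts first (strcmp()-like semantics, but without relying on a NUL
 * terminator, since normalized keys may legitimately contain 0x00 bytes).
 */
static int
normkey_compare(const unsigned char *a, size_t alen,
                const unsigned char *b, size_t blen)
{
    size_t      minlen = (alen < blen) ? alen : blen;
    int         cmp = memcmp(a, b, minlen);

    if (cmp != 0)
        return cmp;
    if (alen < blen)
        return -1;              /* a is a proper prefix of b */
    if (alen > blen)
        return 1;
    return 0;
}
```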
Optimizations enabled by key normalization
Key normalization is an enabler of several further optimizations, because binary strings are extremely flexible. Normalized keys are essentially strings of bytes, or strings of wider integers (big endian) that can be easily concatenated, inverted, and truncated. Specific optimizations include:
Simpler comparisons:
- The simple fact that a memcmp() is all that is needed in all cases, without exception.
- No complex index_getattr() calls needed in the majority of calls to _bt_compare(). They're only needed for the conventional index tuple representation, which items that actually point to the heap will continue to use.
- Normalized keys are a flattened, one-pass representation. A single pointer to the current start position in the scan key's normalized key can therefore be maintained. Once the start position for future comparisons advances, it never needs to go backwards. In other words, normalized keys enable skipping earlier scan key bytes that are known to not be interesting to internal page binary search comparisons at this level of the tree; we can reason about the properties of keys transitively, from the parent's downlink. (This can usefully interact with prefix compression and abbreviated keys in the ItemId/item pointer array.)
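Below is a minimal sketch of how a binary search might use such an advancing start offset, reusing the hypothetical normkey_compare() from the sketch above. The NormKey struct, the pivot array, and the assumption that every pivot on the page is at least 'skip' bytes long are all illustrative simplifications, not actual nbtree code.

```c
#include <stddef.h>

typedef struct NormKey
{
    const unsigned char *data;
    size_t      len;
} NormKey;

/*
 * Hypothetical binary search over an internal page's pivot tuples.  'skip'
 * bytes of the scan key are assumed to already match every pivot on the page
 * (e.g. because the parent's downlink shared that prefix), so they never
 * need to be compared again at this level of the tree or below.
 */
static int
normkey_binsrch(const NormKey *scankey, size_t skip,
                const NormKey *pivots, int npivots)
{
    int         low = 0;
    int         high = npivots;

    while (low < high)
    {
        int         mid = low + (high - low) / 2;

        /* Assumes every pivot is at least 'skip' bytes long */
        if (normkey_compare(scankey->data + skip, scankey->len - skip,
                            pivots[mid].data + skip,
                            pivots[mid].len - skip) <= 0)
            high = mid;
        else
            low = mid + 1;
    }
    return low;                 /* offset of first pivot >= scan key */
}
```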
Smaller keys (on average):
- Prefix compression + suffix truncation on internal pages.
- Normalized keys are a flattened, one-pass representation. They provide a way to combine prefix compression and suffix truncation that is far simpler and at least as effective as any alternative approach.
- Smaller items in internal pages increase fan-in, possibly making the B-Tree shorter overall.
- We might as well remove trailing bytes (i.e. perform suffix truncation) to make a new high key at the leaf level as small as possible while still correctly separating what belongs in the left and right leaf pages after the split.
- Suffix truncation is a classic technique, described first in a 1977 paper by Rudolf Bayer, who is generally credited with having invented B-Trees in the early 1970s. Berkeley DB has long supported this technique [1].
- No need for multiple varlena headers in multi-column cases, or even for one. IndexTuple size is all that you need.
- Even on alignment-picky platforms, there is no need to access an aligned varlena header at multiple offsets per IndexTuple in multi-attribute cases.
- Prefix compression in internal pages, further increasing fan-in.
- A "low key" is needed in addition to the existing "high key" in order to make prefix compression work without complicated bookkeeping, and having normalized keys makes that less expensive, and far simpler.
CPU cache efficiency:
- Abbreviated keys in the ItemId array [2].
- Cache misses during B-Tree binary searches are thought to be a problem for some workloads [3].
- Space for these can be reclaimed from ItemIdData.lp_len -- that field can simply hold the first 15 bits of the normalized key instead. The length it currently stores is redundant with the IndexTuple length, and is only accessed in a few places within nbtree and bufpage.c index routines today. 15 bits gets you surprisingly far within internal pages.
- Internal page prefix compression can be used to make it far more likely that high entropy abbreviated keys are always available. Prefix compression may be an essential enabler of abbreviated keys, in fact. Since the prefix compression "low key" is the same as the downlink in the parent, it doesn't need to participate in comparisons at all. We can just advance our offset into the flattened normalized scan key such that the comparisons of abbreviated keys are against the correct point in the scan key.
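A rough sketch of the bit packing involved follows. The function is hypothetical and only shows how 15 bits of a normalized key could fit where ItemIdData.lp_len currently lives; it says nothing about how nbtree or bufpage.c would actually be reorganized.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Take the first 15 bits of a normalized key, so they can be stored in the
 * 15 bits of ItemIdData.lp_len (whose length information is redundant with
 * the IndexTuple header).  Purely illustrative.
 */
static uint16_t
normkey_abbreviate15(const unsigned char *key, size_t len)
{
    uint16_t    abbrev = 0;

    if (len > 0)
        abbrev = (uint16_t) key[0] << 7;        /* bits 14..7 */
    if (len > 1)
        abbrev |= key[1] >> 1;                  /* bits 6..0 */

    return abbrev;              /* always fits in 15 bits */
}
```

A binary search could then compare these 15-bit values as plain unsigned integers first, only falling back to the full tuple comparison on a tie.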
Building normalized keys
Key normalization is an "all or nothing" optimization. There should probably be a support function 4 (key normalization routine) for all the core datatypes that currently have sortsupport, which is where the optimization makes sense. Even simple int4/int8 indexes can be effectively optimized in this way: their comparisons won't become notably more expensive, while the benefits with many duplicates can still be considerable.
Algorithm
Steps in generating a normalized key with one attribute:
- Place the NULL bit in the first bit of the first byte.
- We reserve one "NULL bit" per attribute. If the attribute is NULLS FIRST, then NULL columns do not have the bit set. If it is NULLS LAST, then NULL columns have the bit set.
- If a value is NULL for some attribute x, there is no need to include any placeholder/padding bytes after the NULL bit. We can skip on to the next attribute x + 1 immediately; the very next bit is the x + 1 NULL bit. The x + 1 opclass supplied normalized key, if any, follows. This works because x + 1 will only ever be compared against other index tuples whose x attribute is also NULL.
- Ask an opclass for a normalized key based on a base attribute value -- no need to pass any information other than the value itself, and the attribute's collation.
- DESC is encoded directly -- just get the complement of the opclass-supplied encoded binary string for that attribute.
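The following sketch walks through these steps for a single attribute. It is illustrative only: the NULL "bit" occupies a whole byte for readability, and the opclass encoding routine is a hypothetical callback standing in for B-Tree support function 4.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-datum opclass encoding routine ("support function 4") */
typedef size_t (*normkey_encode_fn) (const void *datum, unsigned char *out);

/*
 * Append one attribute's contribution to a normalized key.  Returns the
 * number of bytes written to 'out'.
 */
static size_t
normkey_append_attr(const void *datum, bool isnull,
                    bool nulls_first, bool desc,
                    normkey_encode_fn encode, unsigned char *out)
{
    size_t      len = 0;

    /* NULLS FIRST: NULLs leave the indicator clear; NULLS LAST: NULLs set it */
    out[len++] = (isnull == nulls_first) ? 0x00 : 0x01;

    if (isnull)
        return len;             /* no padding; next attribute follows at once */

    len += encode(datum, out + len);

    /* DESC: store the complement of the opclass-supplied bytes only */
    if (desc)
    {
        for (size_t i = 1; i < len; i++)
            out[i] = (unsigned char) ~out[i];
    }
    return len;
}
```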
Normalized keys can and will be larger than the base index tuple that they are based on, although they will often actually be shorter, even when truncation or compression are not used.
Notes about opclass encoding schemes for common types
- Types like int4, whose on-disk representation is a two's complement signed integer (little endian or big endian, depending on platform), are encoded as big endian unsigned integers. This can happen without any storage overhead compared to the original representation for that portion of the final normalized key. It can be done easily: convert the integer to unsigned, add "INT_MAX + 1" to it, and byteswap iff on a little-endian platform [4] (see the sketch after this list).
- Booleans only need a single bit.
- Text will be supported, but only when the C collation or an ICU-supplied collation is in use. Variable-length types like text must have a terminating NUL byte. The opclass itself doesn't add the NUL byte, because we don't want to get confused if and when the opclass binary key is inverted (where we actually compute its complement and store that instead) in order to support DESC; the terminator can be appended after any inversion has been applied.
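A minimal sketch of the int4 scheme described above (the function name is hypothetical):

```c
#include <stdint.h>

/*
 * Flipping the sign bit maps the signed range onto the unsigned range in
 * order (the same effect as adding "INT_MAX + 1"), and emitting the bytes
 * most-significant first makes memcmp() order match numeric order regardless
 * of host endianness.
 */
static void
normkey_encode_int4(int32_t value, unsigned char out[4])
{
    uint32_t    u = (uint32_t) value ^ UINT32_C(0x80000000);

    out[0] = (unsigned char) (u >> 24);
    out[1] = (unsigned char) (u >> 16);
    out[2] = (unsigned char) (u >> 8);
    out[3] = (unsigned char) u;
}
```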
It may or may not be worth versioning opclass encoding schemes separately from versioning text collations.
Breaking the 1/3 of a page size restriction
If one attempts to insert values above about 1/3 of BLCKSZ into a B-Tree index today, an error is raised: "Values larger than 1/3 of a buffer page cannot be indexed" [5]. This is necessary because we need to be able to store at least 3 items on every page (high key, plus two distinct real items), and this is a simple way to enforce that.
This raises a problem for us: if a normalized key can be larger than the original, and can in theory even be arbitrarily larger, then how can we be sure that this never leads to a situation where we seemingly have to break this rule? It would be especially bad if the error could be seen by an innocent party that is only inserting a small key that would produce a small normalized key, since that's completely unpredictable and cannot even be worked around. It's unpredictable because the choice of split point (which implies what the new high key for the new left page will be -- it comes from the first item on the right) has little to do with the inserter.
Perhaps this problem can be solved, but right now a solution doesn't present itself.
How big can normalized keys get, and is it worth it?
Opclass authors are not contractually obligated to keep normalized key size below any particular limit, such as limiting the size to no greater than the size of the original datum. Still, for key normalization to be at all effective, the storage overhead cannot be excessive. While it is true that techniques like suffix truncation will greatly reduce the size of normalized keys on disk in practice, the worst case is still important, though this is only likely to matter when there are many duplicates, or many items that are duplicated in early attributes. There will be cases when the total fan-in is hurt by key normalization. The costs need to be measured, and weighed against the other possible benefits.
Even if we end up with substantially larger keys, it will probably still be worth it once prefix compression and suffix truncation are both implemented. The effective size of index tuples can still be far smaller on average.
ICU ucol_getSortKey() key size
ICU is actually quite good at producing small strxfrm() style binary keys. Performance benchmark: http://site.icu-project.org/charts/collation-icu4c48-glibc
Latin text requires about 1.47x as much storage when stored as a binary sort key generated by ICU, per their benchmark. Overhead for East Asian languages seems to be even less than that, when weighed against the storage overhead in bytes using UTF-8. ICU's binary sort keys are also far smaller than those generated by glibc's strxfrm(). Note that this is not true when ucol_nextSortKeyPart() is used instead of ucol_getSortKey(), because ucol_nextSortKeyPart() doesn't use order-preserving compression.
If suffix truncation is used, then separator keys will only be as many bits as is needed to store the would-be normalized key for the first item on the right up to and including its first distinguishing bit. The new separator (the new leaf highkey) must be greater than the last item on the left, but can be a perfectly minimal separator of the keyspace, down to the last bit.
ICU, text equality semantics, and hashing
One blocker to using strxfrm() style binary keys today is the strcmp() tie-breaker that is used for strcoll() [6]. It clearly would not be okay if the original string had to be appended on top of the ICU generated binary key in order to ensure correct behavior. There is interest in removing this restriction anyway, in order to implement true case insensitive collations. The hash opclass hash function of the text type (which is expected to have equivalent equality semantics to that of the text default B-Tree operator class) could hash the binary key itself, rather than the original [7], at least for some collations.
There is a way to get ICU to promise to never indicate that two non-identical strings are equal -- adding an "identical level". This is exactly equivalent to just adding the original value to the key yourself anyway, which is considered bad practice by the Unicode consortium [8] on both performance and usability grounds, and so it might well be a better idea to do it the other way around and change hashing instead. Unfortunately, this would necessitate a REINDEX when Postgres 10 ICU users upgraded to a version that made that change, but it might be a good idea to do so sooner rather than later. Besides, not that many people will use ICU with Postgres 10, because it only works with per-column collations in that version, although this restriction will probably be lifted soon [9].
Hashing this way probably wouldn't be too slow. ICU provides its own ucol_keyHashCode() function, which produces 32-bit hashes, and has been part of the stable API for a long time. It appears to be designed specifically for hashing sort keys [10]. It's just a hash function, which isn't particularly worth adapting, but the fact that it exists at all illustrates that this approach is idiomatic. Also, we could probably get away with hashing only partial sort keys, or hashing sort keys that were generated based only on primary weights, which are cheaper to generate yet almost certainly no less useful as hash values.
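A minimal sketch of what hashing an ICU sort key could look like, using only documented ICU C API calls (ucol_open(), u_strFromUTF8(), ucol_getSortKey(), ucol_keyHashCode()); the locale, buffer sizes, and error handling are arbitrary simplifications:

```c
/* e.g. build with: cc icu_hash.c $(pkg-config --cflags --libs icu-uc icu-i18n) */
#include <unicode/ucol.h>
#include <unicode/ustring.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    UErrorCode  status = U_ZERO_ERROR;
    UCollator  *coll = ucol_open("en_US", &status);
    UChar       ustr[64];
    int32_t     ulen;
    uint8_t     sortkey[256];
    int32_t     keylen;
    int32_t     hash;

    u_strFromUTF8(ustr, 64, &ulen, "Hello, world", -1, &status);
    if (U_FAILURE(status))
        return 1;

    /* Sort key bytes are memcmp()-comparable; result includes a trailing NUL */
    keylen = ucol_getSortKey(coll, ustr, ulen, sortkey, (int32_t) sizeof(sortkey));

    /* 32-bit hash of the sort key itself, not of the original text */
    hash = ucol_keyHashCode(sortkey, keylen);
    printf("sort key is %d bytes, hash is %d\n", (int) keylen, (int) hash);

    ucol_close(coll);
    return 0;
}
```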
Most importantly, the benchmark linked to above shows that ICU's strcoll() equivalent has far superior performance to that of glibc -- it's frequently over twice as fast. This is before we even factor in that ICU's strcoll() has a variant that doesn't require a NUL-terminated string, unlike glibc's strcoll(), which forces us to copy a text datum's data to a buffer on the stack in order to NUL-terminate it first.
Suffix truncation of normalized keys
Update: Postgres 12 has whole-column suffix truncation within nbtree. It may not be independently worth pursuing suffix truncation based on normalized keys, so the following section should be considered context for the wider discussion.
High keys at the leaf level, and all internal items, are all merely separator keys. A downlink in a parent page has index tuple values that represent a lower bound on its child page. Its immediate neighbor downlink index tuple (if any) is a lower bound for its own child, but is also an upper bound for the first downlink's child. Together, these values represent the minimum and maximum values that belong on the child pointed to by the first downlink.
Even today, these are just separators. For example, the values that appear in the root page of an index must be based on some index tuple that was inserted at some point in the past, but there is no reason to think that that original inserted index tuple wasn't removed by VACUUM long ago. The job of separators is to separate, and nobody really cares about anything else, provided they do that job correctly.
This suggests a pretty obvious optimization: why not just pick a contrived value that accomplishes this with the smallest possible storage overhead in the first place? This optimization is suffix truncation. A B-Tree implementation with this optimization is sometimes said to be a "prefix B-Tree", which is not to be confused with one that implements prefix compression, a rather different but complementary optimization.
The logic for picking the point we truncate at can be completely generic with normalized keys, but would be complicated with other approaches based on storing values that are compared using the original representation.
Algorithm
When we run out of space on a leaf page, and cannot compact the page with LP_DEAD recycling [11], it's clear that a leaf page split is needed.
- Have the pick split logic give some weight to how early distinguishing bytes appear for the first item on the right page after a page split.
- Ideally, it will differ from the left hand item (the last item on the left page) fairly early on.
- We ought to be willing to make a small sacrifice in how well balanced the page split will be in terms of space utilization in support of this new goal.
- We don't even bother making totally optimal decisions about space utilization today [12], so this doesn't seem too sensitive.
- Perform key normalization on the item at the left of the split point, and at the right of the split point.
- Store the minimal distinguishing normalized key (down to the bit) as the new high key of the new left leaf page.
- In essence, we truncate the right item so that it differs from the left item by as little as a single bit.
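The following sketch shows the core of such a truncation decision at byte granularity (the design above goes down to the bit, but the idea is the same). The function name is hypothetical, and it assumes the two normalized keys are passed in already built.

```c
#include <stddef.h>

/*
 * Decide how many leading bytes of "firstright" (the would-be first item on
 * the new right page) to keep as the new high key, given "lastleft" (the
 * last item on the new left page).  Keeping everything up to and including
 * the first distinguishing byte yields a separator that is greater than
 * lastleft, yet no greater than firstright.
 */
static size_t
normkey_truncate_len(const unsigned char *lastleft, size_t leftlen,
                     const unsigned char *firstright, size_t rightlen)
{
    size_t      minlen = (leftlen < rightlen) ? leftlen : rightlen;

    for (size_t i = 0; i < minlen; i++)
    {
        if (lastleft[i] != firstright[i])
            return i + 1;       /* first distinguishing byte included */
    }

    /*
     * lastleft is a prefix of firstright (or the two are entirely equal,
     * e.g. duplicates): keep one byte past the shared prefix when there is
     * one, otherwise the whole key.
     */
    return (minlen < rightlen) ? minlen + 1 : rightlen;
}
```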
Note that this only happens on the leaf level; never at the internal levels. This is our only opportunity to truncate; truncating at higher levels is wrong.
Note also that we're now done -- no extra steps are needed on the next level up. This is because during the second phase of a page split (where we insert a new downlink to the new right half), the implementation already gets its new downlink for the right page by retrieving the high key in the new left page on the leaf level [13]. All internal items end up as truncated normalized keys as a consequence of inserting these extra steps into leaf page splits. We end up with separator keys that have a consistent representation (they use normalized keys), and with all real keys that have a consistent representation (they use the ordinary representation -- same as today).
Prefix compression of normalized keys
Suffix truncation will be particularly effective if we can also store a prefix on each internal page -- a "low key" can be stored, which is conceptually the exact opposite of a high key. While it might be a bit more efficient to store a key lower bound that works with all index tuples that are currently on the page, it will be far simpler to use a value that will work with any possible index tuple that belongs on the page. This should be the same as the internal page's parent's downlink for the page (it could be negative infinity at the leftmost page on a level, defeating our simple prefix compression).
In practice, most normalized key index tuples only need to store the bytes that distinguish their entry. This could easily be as little as 1 or 2 bytes, even where the typical size of uncompressed/untruncated normalized keys is far larger. Normalized keys are effective despite potentially being significantly larger than the original representation because that overhead can easily be made up for through page-level compression, and suffix truncation. A new low key will be needed on the right page following an internal page split, and the items that end up in the right half of the split will need to be re-compressed, but with normalized keys that's guaranteed to leave us with index tuples that are no larger than they were with the original compressed representation. This really does greatly simplify space accounting, dealing with the 1/3 of a page restriction, etc. Keeping subtle, hard to test page split code as simple as possible almost seems essential.
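A rough sketch of what page-level prefix compression against a low key amounts to, at byte granularity and with hypothetical names (where the stored prefix length would live in the real tuple or page representation is left open):

```c
#include <stddef.h>
#include <string.h>

/* Length of the prefix that 'key' shares with the page's low key */
static size_t
normkey_shared_prefix(const unsigned char *lowkey, size_t lowlen,
                      const unsigned char *key, size_t keylen)
{
    size_t      minlen = (lowlen < keylen) ? lowlen : keylen;
    size_t      i = 0;

    while (i < minlen && lowkey[i] == key[i])
        i++;
    return i;
}

/*
 * Store only the suffix that follows the shared prefix; decompression is
 * just concatenating the low key prefix back on.  Returns the number of
 * suffix bytes written, and the prefix length via *prefixlen.
 */
static size_t
normkey_compress(const unsigned char *lowkey, size_t lowlen,
                 const unsigned char *key, size_t keylen,
                 unsigned char *out, size_t *prefixlen)
{
    *prefixlen = normkey_shared_prefix(lowkey, lowlen, key, keylen);
    memcpy(out, key + *prefixlen, keylen - *prefixlen);
    return keylen - *prefixlen;
}
```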
Using existing "minus infinity" index tuple as a low key
Currently, the first item on all internal pages is the minus infinity tuple. It's always truncated to zero attributes, and there is hard-coded logic within _bt_compare() to treat it as less than any possible index tuple during a descent of the tree. However, only the minus infinity tuple on the leftmost page of an entire level is truly minus infinity. All other internal pages have minus infinity tuples that are better understood as "minus infinity within the subtree rooted at this page". Logically, nothing stops us from not truncating away all attributes when creating a new first item for the right half during an internal page split. The page split code could just not do that (we only need to tweak _bt_pgaddtup() and _bt_sortaddtup(), so that they don't truncate). This would naturally work as a low key: it's a stable lower bound, because the truncated minus infinity item's original key values live on in the parent page -- it becomes the downlink in the parent.
Leaf pages
A low key for prefix compression on leaf pages might be a good idea too, but that will have to work in a fairly different way, and in any case is not within the scope of this wiki page.