NBTree Prefix Truncation
Prefix truncation
Prefix truncation allows us to skip saving duplicated data, and potentially skip comparing these duplicated attributes. This can be a huge win for multi-column indexes where not all attributes can be safely deduplicated, and for inner pages of indexes with a lot of duplicate prefixes (e.g. non-unique indexes, and multi-column indexes).
In this research document, we truncate away attributes of index tuples that they share when compared to a reference index tuple, and store the count of truncated attributes in the tuple. As btree-indexes have an absolute limit of 2^12 - 1 columns (as of this moment), this count should not take more than 2 bytes. Prefix truncation, when correctly applied, will therefore start saving data once the 2 first bytes of a tuple can be truncated away.
Once we've compared the truncated columns of a tuple, we can reuse the compare-result of these columns for other tuples that have been truncated with the same reference tuple. This allows us to potentially improve the binary search efficiency of pages by a large margin without getting into problems with concurrent page deletion (as detailed in [1]).
Prefix truncation relative to an arbitrary tuple
You could truncate attributes away based on the attributes shared by your right (or left) neighbour tuple.
This, however has the following limitations:
1.) Tuple delete efficiency
When the attribute is extracted from an arbitrary sibling index tuple, tuple deletion can become a O(n) operation, because it needs to check all tuples on the page, and potentially update all of them to point to a different index tuple.
2.) Tuple attribute extraction efficiency
When a prefix-truncated tuple points to another tuple to indicate where the truncated attributes are stored, and this other tuple may also be truncated, this makes an attribute lookup effectively traversing a linked list, which is O(n).
This issue can be solved by having all tuples prefix-truncated against a single static tuple on that page. A good candidate is the highkey - it is available on all pages except the rightmost on each level, and is positioned at a well-known offset.
Prefix truncation relative to the page highkey
The HighKey is a static value on each page, and can only become more specific to that page through page splits. Page deletions will grow the key space, but as deletions (at this moment) only happen for empty pages, that case is not a problem here - there is no truncated key that would need de-truncation on an empty page.
Additional benefits to highkey-based prefix truncation are that the highkey already is compared to the searchkey in _bt_moveright, so that once you have a pin on the page and start searching, for a prefix-truncated tuple you can compare the amount of truncated attributes with the amount of equal attributes with the high-key, and compare those to skip comparing several attributes. This is in addition to 'normal' keeping track of equal attributes for the high and low of the binary search.
But, prefix truncation that uses a page's highkey as a reference
needs a lot of consideration, especially with how postgres has
currently implemented nbtrees:
1.) Suffix-truncation; -INF in high keys
A suffix-truncated HighKey with N untruncated attributes only shares (up to) N-1 attributes with the data on that page, because to the -INF used as the seperator will put all tuples with a shared prefix of N attributes on the next page over. This limits the amount of attributes that can be truncated away, and with that limits the effectiveness of prefix truncation.
Example:
A page split between two indexed tuples, (New York, 1010) and (New Jersey, 1010) would currently result in a page (New York, 1010) with high key (New Jersey, <-INF>), and a right page of (New Jersey, 1010). As you can see, this puts a 'foreign' value onto the left page, limiting the ability of prefix truncation to truncate away common attributes based on the P_HIKEY.
This limitation can be solved by altering suffix truncation to instead truncate the left key, and interpret truncated attributes as positive infinity. The example above would then result in a left page (New York, 1010) with high key (New York, +INF) and a right page (New Jersey, 1010). It is clear that prefix truncation could benefit from this situation, as there are more similar values on the same page.
A side effect of the solution would be that index tuples that sort between the lastleft and firstright of a page split would now be inserted on the right page instead of on the left page. I do not know whether or not this could be a problem, except that some optimizations could not apply anymore (or would be less effective).
It is my understanding that this solution can be achieved in a backwards-compatible manner by using a spare bit in the high-key's t_tid.t_offset to indicate that this highkey has has a +INF truncated value.
This optimization could also be limited to only inner pages; such that the
2.) index_getattr, attribute extraction
BTree currently makes extensive use of index_getattr to extract each attribute from the tuple. This cannot directly be used with prefix-truncated tuples, due to the possibly incorrect offsets in Form_pg_attribute->attcacheoff which is used by index_getattr to efficiently find the offset of the attribute in the stored index tuple. This can be mitigated by explicitly setting the truncated attributes to NULL, but that would potentially add the overhead of a NULL-bitmap, and would force index_getattr into a slow path:
index_getattr is horribly non-performant for tuples which have early null or varlen attributes, due to the inability to use the pg_indatt_desc->attcacheoff offset cache.
The need to set the attribute to NULL, and the inefficient iterations over index attributes with index_getattr could be solved through iterating over the tuple's attributes whilst keeping track of the current offset in the tuple, instead of requesting each attribute sequentially through index_getattr. This could be done using new iterating helper functions and structures, which would keep track of the current offset in the tuple.
The solution to this limitation would be a useful addition to the current toolset in btrees, as I've seen large amounts of time being wasted in index_getattr (see also this footnote).
3.) reindex, rightmost page insertions
When we use the highkey as the reference tuple for on-page prefix truncation, we cannot truncate away the prefix of tuples located on the rightmost page. This is problematic for reindex and indexes on incremental columns, as they insert on the rightmost page, which has no highkey and therefore cannot benefit from the prefix truncation.
The problem for reindex can be solved by constructing indexes right-to-left instead of left-to-right. This way, each tuple is written to a page that has a definite highkey (except the rightmost page of each level), and can therefore benefit from prefix truncation. This does not solve the problem of no prefix truncation on rightmost pages, but it does work around the problem for indexes with random insertion patterns and index creation.
4.) leaf pages, deduplication
On inner pages we can freely truncate prefixes away, as they are nothing but guides to the binary search algorithm. On leaf pages, the truncation of prefixes is actually not distinguishable from attribute-level deduplication, which means the same problems from tuple deduplication exist for prefix truncation, and subsequently the same limitations apply as for 'normal' deduplication. So, prefix attribute deduplication on leaf pages can only happen for allequalimage indexes (or, prefix attributes that are equalimage, whenever we get to storing per-attribute equalimage-ness).
For optimal use of prefix truncation, whilst keeping in mind the limitations of non-allequalimage-attributes, it would be useful to add a 'prefix equal image' attribute to the index (where keyattributes >= prefixequalimage), such that prefix truncation can still happen for indexes that do not have allequalimage, but could benefit from deduplication through prefix truncation. Similarly, it could be useful for other index optimizations to add a precomputed 'att-is-equalimage' bitmap to the btree index data, from which we can derive the allEqualImage and prefixEqualImage fields and other future optimizations based on the equalImage-ness of attributes.
5.) representation, pivot tuples
For pivot tuples, we potentially need to store an additional number in each tuple. This can be achieved in two ways:
- using part of the space available for the number of attributes stored in this tuple (12) and using the reclaimed bits. This approach allows for up to 63 truncated attributes, but also limits the maximum amount of attributes in an index that can make use of prefix truncation: currently, technically indexes with up to 4095 columns are possible (by setting the correct values in header files). If we require that all attributes are truncatable, the max number of of attributes we can support in a btree-index would drop to 63, just shy of double the default amount. This approach would not require any additional flag bits specific to prefix truncation, but would limit adoption in versions of PG that are compiled with INDEX_MAX_KEYS>=64.
- using a flag bit, and adding an extra data field of 1 or 2 bytes in the stored data, similar how to the workings of the BT_PIVOT_HEAP_TID_ATTR flag. This would also be backwards compatible, as there are flag bits left in BT_STATUS_OFFSET_MASK. However, it would increase the minimum size of a prefix-truncated pivot tuple. That is not too much of a problem, though: A saving of the equivalent of 2 1-aligned attributes, or 1 2-aligned attributes would already make this investment worthwhile by being able to skip comparing these attributes.
6.) representation, leaf tuples
For leaf tuples, we only need to store the amount of attributes we've truncated away. As there are no unused flag bits, the only way to store more information is to use INDEX_ALT_TID_MASK, fully claiming the itup->t_tid->ip_posid field. This requires us to find a new place to store the posid containd, which should be possible by adding a 2-byte data holder that contains this ip_posid.
Progress
These are the current results of my personal research branch on nbtree prefix truncation / prefix-skip search. I have found several limitations, most of which are solvable without a major version change in the physical storage format.
My research branch is mostly work on the proposed highkey-based solution, and contains a complete solution to #2 (attribute extraction) and I have a solution to #3 somewhere in my history as well.
For highkey limitations 1, 4, 5 and 6 I would like to have some feedback before I start to work on the actual implementation, as tuple de/serialization is not a part I've familiarized myself with yet.
My current priorities are, in order:
- #2, IndexTuple attribute extraction improvements: ready. Patch will be provided on-list shortly, for CF 2021-07. includes short patch with page-wise dynamic prefix truncation.
- #5, PT inner page tuples: work is starting
- #3, backwards index build: Patch requires updates.
- #1, (optional) +INF truncation, which improves the overall efficiency in #5
- #6, PT leaf tuples
- #4, attIsEqualImage for index atts / nPrefixEqualImage. This will increast the efficiency of #6
Footnotes
_bt_compare profile
For an index `(text COLLATE NONSTANDARD, text COLLATE NONSTANDARD, text COLLATE NONSTANDARD)` with short columns, the profile of that _bt_compare flamegraph section was:
74% FunctionCall2Coll -- thanks lookup_collation_cache! 17.4% nocache_index_getattr 4% BTreeTupleIsPivot ...tailings
effectively, nocache_index_getattr was slow and taking up >50% (17.4 %pt out of 26%pt) of the overhead of _bt_compare when we assume all compare function calls are useful work.