Better indexing for ranges GSoC 2012
From PostgreSQL wiki
Range types are significant new feature of PostgreSQL. Indexing of range types is necesary to provide efficient search of ranges. The current approach for GiST indexing implemented for 9.2 holds ranges in both internal and leaf pages entries. This approach could be very inefficient in the case of highly overlapping ranges and "@>", "<@" operators, because the cost of search is similar to the cost of search using "&&" operator. Mapping ranges into 2d-space could handle such cases much more efficiently. This project is focused on implementating 2d-space mapping based GiST and SP-GiST operator classes for range types.
The priorities of this project is following:
- GiST and SP-GiST operator classes for ranges.
- Selectivity estimations for ranges.
- GiST and SP-GiST operator classes for arrays.
Minimal completeness criteria is only #1, but real project goal is to do best on all mentioned options.
Better ranges indexing
The current indexing approach implemented for 9.2 defines a range in an internal page as a bounding range of underlying ranges. So, if a leaf page contain ranges (a1, b1), (a2, b2), ... (an, bn), a corresponding entry of an internal page would be (min(a1, a2, ... an), max(b1, b2, ... bn)). However, some research papers  recommend to map ranges into 2d-space. In this case range (a,b) will be presented as a point with coordinates a and b.
In the case of such mapping "&&", "@>", "<@" search operators have a corresponding rectangular area on the 2d-space. There is a proof of concept message  which shows a dramatic benefit using existing spatial operator classes. However, use of spatial operator classes is inconvinient and it doesn't take care of inclusive and non-inclusive bounds and infinities. That's why it's important to implement specific operator classes for such indexing of ranges. Therefore the following 2d-space trees could be implemented for range indexing:
- R-tree using GiST
- Quad-tree using SP-GiST
- KD-tree using SP-GiST
Selectivity estimation for ranges
The second goal of this project is to provide a better selectivity esitmation for &&, @>, <@ operators. One idea is to collect the following statistics:
- Histogram of "density" of ranges.
- Histogram of ranges length.
and do selectivity estimations for &&, @>, <@ according to them.
GiST and SP-GiST indexing for arrays
PostgreSQL core supports index-based search for operators "@>", "<@" and "&&" on arrays only using GIN. The intarray contrib module also provides GiST index support for integer arrays. However, similar GiST indexing is possible for other array types, not just integer arrays. This project is focused on the implementation of universal GiST indexing for arrays and implementation of experimental SP-GiST indexing of arrays.
The proposed GiST indexing for arrays is quite similar to those implemented in the intarray contrib module, but it has following differences. The following representations of array are possible:
- Original array
- Array of hashes of original array elements (suitable when array element is larger than its hash)
- Signature (bitmap where bits corresponding to array element hashes are set)
The second two representations are lossy. Representation is selected based on array size. Any of the mentioned representations could be used for both leaf and internal entries. Representation selection for internal entries would be runtime, i.e. no gist__int_ops vs. gist__intbig_ops like dilemma is planned.
SP-GiST indexing for arrays is quite a hard task, and I have the following idea about how it would be possible. A leaf tuple could be represented in one of the ways mentioned for GiST indexing before. An inner tuple node represents number of bits in signature and the inner tuple prefix represents set of bits in signature. Bits mentioned in inner tuple node and prefix must be not set in the signatures corresponding to all underlying arrays. Thus, if it requires the presence of some bits enumerated in prefix or node then subtree could be skipped during index scan.
- Bela Stantic, Rodney Topor, Justin Terry, Abdul Sattar, "Advanced Indexing Technique for Temporal Data"
- Until June 1
Implement 2d-mapping basid GiST opclass for ranges
- June 1 - June 17
Implement 2d-mapping based SP-GiST quad-tree for ranges
- June 18 - June 24
Implement 2d-mapping based SP-GiST k-d tree for ranges
- June 25 - July 8
Comprehensive testing on various datasets. Conclusion about applicability of various opclasses
- July 9 - July 12
Rework opclasses according to testing results.
- July 13 - August 2
Implement better statistics for ranges with selectivity estimation for &&, <@, @> etc. operators
- From August 3
Testing and refactoring.