GIN generalization

General

The initial purpose of GIN generalization was to improve in-core fulltext search so that it becomes comparable on search speed with stand-alone solutions like Sphinx, Solr etc. After some investigation it appears that GIN generalization could be also useful in following areas (at least):

Similarity search on arrays
Positioned n-grams (could provide better acceleration than just n-grams)
Search of regular expressions (inverse task: search regular expressions which matches given string)

Slides‎

Prototype

Prototype was demonstrated on PGConf.EU 2012 Full-text search in PostgreSQL in milliseconds (Oleg Bartunov, Alexander Korotkov). General conclusion is that we can even outperform Sphinx in search speed with new GIN.

Prototype is pretty big and tangled. So we decided to split it into several clear patches.

Part 1: New posting lists/trees storage

This patch introduces following:

Store additional information in posting lists/trees (store arbitrary Datum with each item pointer)
Corresponding interface changes
Posting lists/trees compression

Question to discuss: how to handle partial match with additional information?

Part 2: Fast scan

This patch addresses problem of "frequest_term && rare_term". It implements algorithm which is able to skip parts of posting tree while scan. It introduces new interface method "preConsistent" which is used for preliminary checks and allows false positives on imput.

Part 3: Ordering in index

This patch implements returning ordered data from GIN index. Now it behaves like KNN-GiST. It inroduces new interface function which calculates distance. When GIN index have to return ordered data it calculates distance for each item pointer (using additional information), sort that dataset, and then returns results one by one. This is probably a wrong desing. Probably we should teach planner/executor that GIN can return value of "column op constant", then other planner node would be responsible for sorting.

Question to discuss: what is right design for ordering using GIN index?

Part 4: TSearch opclass revision

This patch modifies tsvector_ops to support features of parts 1-3. It stores compressed representation of positions vector as additional information. It introduces "tsvector <-> tsquery" operator which returns inverse value to ts_rank(). Also it implements new GIN interface method to calculate this operator from index.

Part 5: RUM access method

PostgreSQL 9.6 introduced custom AM and generic WAL, which made possible development of a new AM as an extension - RUM ( slides), which implements all the features of Generalized GIN. RUM is available from the Postgres Professional [repository].

GIN generalization

Contents

General

Prototype

Part 1: New posting lists/trees storage

Part 2: Fast scan

Part 3: Ordering in index

Part 4: TSearch opclass revision

Part 5: RUM access method

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Tools

Search