GIN generalization

From PostgreSQL wiki
Jump to navigationJump to search

General

The initial purpose of GIN generalization was to improve in-core fulltext search so that it becomes comparable on search speed with stand-alone solutions like Sphinx, Solr etc. After some investigation it appears that GIN generalization could be also useful in following areas (at least):

  • Similarity search on arrays
  • Positioned n-grams (could provide better acceleration than just n-grams)
  • Search of regular expressions (inverse task: search regular expressions which matches given string)

Slides‎

Prototype

Prototype was demonstrated on PGConf.EU 2012 Full-text search in PostgreSQL in milliseconds (Oleg Bartunov, Alexander Korotkov). General conclusion is that we can even outperform Sphinx in search speed with new GIN.

Prototype is pretty big and tangled. So we decided to split it into several clear patches.

Part 1: New posting lists/trees storage

This patch introduces following:

  • Store additional information in posting lists/trees (store arbitrary Datum with each item pointer)
  • Corresponding interface changes
  • Posting lists/trees compression

Question to discuss: how to handle partial match with additional information?

Part 2: Fast scan

This patch addresses problem of "frequest_term && rare_term". It implements algorithm which is able to skip parts of posting tree while scan. It introduces new interface method "preConsistent" which is used for preliminary checks and allows false positives on imput.

Part 3: Ordering in index

This patch implements returning ordered data from GIN index. Now it behaves like KNN-GiST. It inroduces new interface function which calculates distance. When GIN index have to return ordered data it calculates distance for each item pointer (using additional information), sort that dataset, and then returns results one by one. This is probably a wrong desing. Probably we should teach planner/executor that GIN can return value of "column op constant", then other planner node would be responsible for sorting.

Question to discuss: what is right design for ordering using GIN index?

Part 4: TSearch opclass revision

This patch modifies tsvector_ops to support features of parts 1-3. It stores compressed representation of positions vector as additional information. It introduces "tsvector <-> tsquery" operator which returns inverse value to ts_rank(). Also it implements new GIN interface method to calculate this operator from index.

Part 5: RUM access method

PostgreSQL 9.6 introduced custom AM and generic WAL, which made possible development of a new AM as an extension - RUM ( slides), which implements all the features of Generalized GIN. RUM is available from the Postgres Professional [repository].