GIN generalization

From PostgreSQL wiki

(Difference between revisions)
Jump to: navigation, search
(Part 2: Fast scan)
(Part 3: Ordering in index)
Line 27: Line 27:
  
 
== Part 3: Ordering in index ==
 
== Part 3: Ordering in index ==
 +
 +
This patch implements returning ordered data from GIN index. Now it behaves like KNN-GiST. It inroduces new interface function which calculates distance. When GIN index have to return ordered data it calculates distance for each item pointer (using additional information), sort that dataset, and then returns results one by one. This is probably a wrong desing. Probably we should teach planner/executor that GIN can return value of "column op constant", then other planner node would be responsible for sorting.
 +
 +
'''Question to discuss: what is right design for ordering using GIN index?'''
  
 
== Part 4: TSearch opclass revision ==
 
== Part 4: TSearch opclass revision ==

Revision as of 11:15, 25 April 2013

Contents

General

The initial purpose of GIN generalization was to improve in-core fulltext search so that it becomes comparable on search speed with stand-alone solutions like Sphinx, Solr etc. After some investigation it appears that GIN generalization could be also useful in following areas (at least):

  • Similarity search on arrays
  • Positioned n-grams (could provide better acceleration than just n-grams)
  • Search of regular expressions (inverse task: search regular expressions which matches given string)

Prototype

Prototype was demonstrated on PGConf.EU 2012 Full-text search in PostgreSQL in milliseconds (Oleg Bartunov, Alexander Korotkov). General conclusion is that we can even outperform Sphinx in search speed with new GIN.

Prototype is pretty big and tangled. So we decided to split it into several clear patches.

Part 1: New posting lists/trees storage

This patch introduces following:

  • Store additional information in posting lists/trees (store arbitrary Datum with each item pointer)
  • Corresponding interface changes
  • Posting lists/trees compression

Question to discuss: how to handle partial match with additional information?

Part 2: Fast scan

This patch addresses problem of "frequest_term && rare_term". It implements algorithm which is able to skip parts of posting tree while scan. It introduces new interface method "preConsistent" which is used for preliminary checks and allows false positives on imput.

Part 3: Ordering in index

This patch implements returning ordered data from GIN index. Now it behaves like KNN-GiST. It inroduces new interface function which calculates distance. When GIN index have to return ordered data it calculates distance for each item pointer (using additional information), sort that dataset, and then returns results one by one. This is probably a wrong desing. Probably we should teach planner/executor that GIN can return value of "column op constant", then other planner node would be responsible for sorting.

Question to discuss: what is right design for ordering using GIN index?

Part 4: TSearch opclass revision

Personal tools