PGCon2013 Unconference Future of Full-Text Search

From PostgreSQL wiki
Jump to navigationJump to search

Notes:

Dan Scott talked about how Evergreen (http://evergreen-ils.org) has been using full text search (FTS) for the last 7 years or so. One of the things Evergreen does is to pre-munge the text to work around decisions made by the default parser that are not optimal (for example, ISSNs have the pattern "\d{4}-\d{4}", which PostgreSQL treats as two signed integers); this impacts performance as the munging has to be done at both indexing and query time. Dan would love to see would be a more flexible parser, that would be possible to prototype in Perl (or other languages) first for correctness, then converted to C for performance, but currently the parser takes function arguments of type "internal" that essentially block that approach. Oleg Bartunov noted that externalizing the function arguments to support writing a parser function in different languages was do-able, but warned that it is difficult or impossible to write a custom parser that will satisfy everybody.

Dan - After the parser does its work, Solr/Sphinx offer pluggable and different normalizers for text. This is similar to how PostgreSQL's dictionaries work, but there is a broader base of contributors to the dedicated text search engines and Emanuel's talk on Sphinx suggested that the "lemmatizer" offered advantages over, say, stemming dictionaries.

Oleg Bartunov / Alexander Korotkov - "we are faster than sphinx" -- some slides shown to compare performance, taken from presentation given at PGCon.EU (http://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf) based on experimental enhancements to GIN that embed additional metadata in the index; only a slight increase in index size for significant performance gains.

Oleg - We are still looking for sponsors to work on improving performance (if we get real commitments, then maybe their work might make it to PostgreSQL 9.4)

Emanuel Calvo - We need more documentation on the code to explain how features work in FTS, there needs to be a README file to explain...

Future Ideas:

  • Maybe next PGCon/PGConf.EU/PG-Open(less likely) tutorial -- step by step how to write queries, parsers,
  • Some documentation efforts need to be setup; maybe an area on the wiki
  • Dan thinking to contribute technical documentation for FTS; beginning after July 1

Which should we prioritize? -- Flexible parser, or massive lightning speed? (we want both, of course!)

Tutorial should include architecture too: there's the fast database, but you could cache or use pooling for the application connecting to handle speed

Need improved support with OpenOffice dictionaries; for example, affix dictionaries?

Needs more documentation explaining ts_rank, ts_rankcd?, cover density, and maybe other relevance ranking theory? There's a real possibility of implementing other, newer relevance ranking algorithms and publishing papers on the effort. One barrier may be the fees required for access to and/or restrictions around dissemination of the TREC document collection used to evaluate ranking algorithms at the annual TREC conferences (http://trec.nist.gov/)

Phrase search was worked on by Oleg and Teodor five years ago, might be able to kick off the dust and bring it back to core for possible inclusion towards 9.4 before it bitrots completely.

General commitment towards keeping discussion going on PGSQL-hackers list.