Full Text Indexing with PostgreSQL

From PostgreSQL wiki
Jump to navigationJump to search

This article discusses full text indexing with the contrib/fulltextindex module only. More advanced indexing is available via the contrib/tsearch module, which is not covered here.

The contrib/fulltextindex module is a quick and easy way of achieving basic text indexing under PostgreSQL. The module is available in 7.1 and below with the ability to index one column per table, and in 7.2 is able to index an arbitrary number of columns per table.

This form of full text indexing works by recording all possible suffixes of the words it indexes. For instance, the word 'sydney' would have strings of Sydney, 'ydney', 'dney', 'ney' and 'ey' stored. This means that no matter what part of the word Sydney your user types, it will be found in the index.

Anyway - that's technical stuff! Just follow the steps below and you don't even need to know how it works!

Installation

Once you have configured and installed PostgreSQL, you need to change to the contrib/fulltextindex directory in the source tarball. Typing 'gmake all' will compile the module, and 'gmake install' will install it. The installation will setup at least an fti.so shared library and an fti.pl index generation convenience script. I recommend looking at the README.fti file.

To enable full text indexing in one of your databases, issue the following SQL:

CREATE FUNCTION fti() RETURNS opaque AS '/path/to/fti.so' LANGUAGE 'C';

If you want the function to be enabled automatically in any new databases you create, simply add it to the 'template1' database.

Creating An Index On An Empty Table

To index a table that is yet to be filled with data, you need to follow these steps:

1. Create a new table to hold the index. I suggest giving it the same name as the table it's indexing, but with '_fti' appended.

eg.

      CREATE TABLE articles_fti (
      	string type,
      	id oid
      );

Where 'type' should match the type of the fields to index, or at least be as large as the fields it's indexing. Note that this assumed the existence of an 'articles' table.

2. Properly index the tables. This includes indexing both the indexed table and the full text index table.

eg.

      CREATE INDEX "articles_fti_string_idx" ON articles_fti(string);
      CREATE INDEX "articles_fti_id_idx" ON articles_fti(id); 
      CREATE INDEX "articles_oid_idx" ON articles(oid);

This indexing scheme makes lookups, insertions and deletions fast.

3. Create the trigger that will be fired whenever changes are made in the indexed table, to update the full text index table.

eg.

      CREATE TRIGGER "articles_fti_trigger" AFTER UPDATE OR INSERT OR DELETE
      ON articles FOR EACH ROW 
      EXECUTE PROCEDURE fti(articles_fti, title, body, keywords);

Where in this case 'articles_fti' is the table where the index for 'articles' is kept, and 'title', 'body' and 'keywords' are the fields in the 'articles' table that we are indexing.

Note: In PostgreSQL 7.1 and below, you can only index ONE column, not multiple columns.

4. Now just insert, update and delete the 'articles' table however you like and the full text index will be kept updated.

Indexing An Existing Table

There are a couple of situations where you might want to index an existing table. You might have an already existing table that you want to index and then use a trigger to keep up-to-date, or instead of using a trigger you can schedule a cron job to re-index the table once a day (or something like that).

Anyway, you use fti.pl.

Let's suppose that the 'articles' table mentioned above already has data in it. Follow these steps to index it.

1. Perform step one above to create the full text index table.

2. Run fti.pl: /path/to/fti.pl -d mydb -t articles -c title,body,keywords -f /dev/stdout | sort -u > fti.sql

Note that the '-c' option can only specify a single column on PostgreSQL 7.1 and below.

3. Log in to the database with psql. Use this command to fill the full text index table. (Making sure you've already emptied it!)

      \copy articles_fti from fti.sql

4. Now, perform steps two, three and four above to complete the process.

If you ever need to regenerate the index from scratch, remember to first drop the indicies on it and recreate them afterwards - this will speed things up greatly.

Using The Index

Now that we have created our index, it's time to use it for fast querying! The best way to illustrate this is with an example:

Suppose we are searching our articles table for the words 'perth' and 'stralia'. (The user left the 'au' off the front of 'australia'!)

The query is constructed like this:

	SELECT 
		DISTINCT(a.*) 
	FROM 
		articles a, 
		articles_fti f1, 
		articles_fti f2 
	WHERE 
		f1.string ~ '^perth 
		AND f2.string ~ '^stralia' 
		AND f1.id=f2.id 
		AND a.oid=f1.id;


Seem complicated? Well, there's a few things you have to note:

1. You need to add a table alias to the 'articles_fti' table for every keyword you use.

2. The '~' operator indicates a regular expression. The '^' means "match beginning of word". The tricky use of these operators ensures that PostgreSQL's btree indexes will work.

3. You must lowercase all the keywords, as the full text index will store only lowercase versions.

4. By default, words of two letters or more will be indexed.

Optimising Your Queries

If you have indexed as above, then the query should already be optimal. You can check this, however, by putting and 'EXPLAIN' in front of the SELECT query above.

There are three operations that need to be optimised: JOINing the full text and indexed tables (based on id and oid); deleting from the full text table based on id; and simply finding matching text in the string column.

If you don't like the idea of a trigger that keeps the index up-to-date, you can forgoe it altogether and just manually regenerate the index whenever you like. Of course, if your data never changes - you don't need a trigger at all.