Loose indexscan

From PostgreSQL wiki

Jump to: navigation, search

Performance Snippets

Loose indexscan

Works with PostgreSQL

8.4

Written in

SQL

Depends on

Nothing

Selecting Distinct Values

The term "loose indexscan" is used in some other databases for the operation of using a btree index to retrieve the distinct values of a column efficiently; rather than scanning all equal values of a key, as soon as a new value is found, restart the search by looking for a larger value. This is much faster when the index has many equal keys.

Postgres does not support loose indexscans natively, but they can be emulated using a recursive CTE as follows:

WITH RECURSIVE
     t AS (SELECT min(col) AS col FROM tablename
           UNION ALL
           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);

The recursive part of the query processes one new row in each pass, finding the smallest column value greater than that on the previous pass; the main query adds in a single null result if needed. The above query is therefore equivalent to

SELECT DISTINCT col FROM tablename;

but runs orders of magnitude faster when "col" has only a proportionally small number of distinct values.

Making use of a non-leading column of an index

The same technique can be used to allow a query to benefit from an index which has a leading column with few distinct values (and which would not naturally be specified in the query), and a 2nd column which is highly specific and is used by the query. The leading column can be introduced into the query in a way which does not alter the meaning.

To set up the example:

CREATE TABLE tablename (col integer NOT NULL, col2 integer NOT NULL);
INSERT INTO tablename SELECT floor(random()*5) AS col, (random()*1000000)::integer AS col2 FROM generate_series(1,10000000);
CREATE INDEX ON tablename (col,col2 );

The natural specification of the query, which cannot use the index and takes 1758 ms:

SELECT count(*) FROM tablename WHERE col2=9854 ;

The modified version which can benefit from the index, and runs in 2 ms:

SELECT count(*) FROM tablename WHERE col2=9854 AND col IN 
(
  WITH RECURSIVE
     t AS (SELECT min(col) AS col FROM tablename
           UNION ALL
           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
  SELECT col FROM t
);

Writing a correct version of this query would be much more complex if col could be NULL

Personal tools