Loose indexscan

From PostgreSQL wiki

(Difference between revisions)
Jump to: navigation, search
(useful for queries other than distinct, as well)
 
Line 1: Line 1:
 
{{SnippetInfo|Loose indexscan|version=8.4|lang=SQL|category=Performance}}
 
{{SnippetInfo|Loose indexscan|version=8.4|lang=SQL|category=Performance}}
 
+
= Selecting Distinct Values =
 
The term "loose indexscan" is used in some other databases for the operation of using a btree index to retrieve the distinct values of a column efficiently; rather than scanning all equal values of a key, as soon as a new value is found, restart the search by looking for a larger value. This is much faster when the index has many equal keys.
 
The term "loose indexscan" is used in some other databases for the operation of using a btree index to retrieve the distinct values of a column efficiently; rather than scanning all equal values of a key, as soon as a new value is found, restart the search by looking for a larger value. This is much faster when the index has many equal keys.
  
Line 22: Line 22:
  
 
but runs orders of magnitude faster when "col" has only a proportionally small number of distinct values.
 
but runs orders of magnitude faster when "col" has only a proportionally small number of distinct values.
 +
 +
= Making use of a non-leading column of an index =
 +
 +
The same technique can be used to allow a query to benefit from an index which has a leading column with few distinct values (and which would not naturally be specified in the query), and a 2nd column which is highly specific and is used by the query.  The leading column can be introduced into the query in a way which does not alter the meaning.
 +
 +
To set up the example:
 +
<source lang="SQL">
 +
create table tablename (col integer NOT NULL, col2 integer NOT NULL);
 +
insert into tablename select floor(random()*5) as col, (random()*1000000)::integer as col2 from generate_series(1,10000000);
 +
create index on tablename (col,col2 );
 +
</source>
 +
 +
The natural specification of the query, which cannot use the index and takes 1758 ms:
 +
 +
<source lang="SQL">
 +
select count(*) from tablename where col2=9854 ;
 +
</source>
 +
 +
The modified version which can benefit from the index, and runs in 2 ms:
 +
<source lang="SQL">
 +
select count(*) from tablename where col2=9854 and col in
 +
(
 +
  WITH RECURSIVE
 +
    t AS (SELECT min(col) AS col FROM tablename
 +
          UNION ALL
 +
          SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
 +
  SELECT col FROM t
 +
);
 +
</source>
 +
 +
Writing a correct version of this query would be much more complex if col could be NULL
  
 
[[Category:SQL]]
 
[[Category:SQL]]

Latest revision as of 21:47, 18 January 2013

Performance Snippets

Loose indexscan

Works with PostgreSQL

8.4

Written in

SQL

Depends on

Nothing

[edit] Selecting Distinct Values

The term "loose indexscan" is used in some other databases for the operation of using a btree index to retrieve the distinct values of a column efficiently; rather than scanning all equal values of a key, as soon as a new value is found, restart the search by looking for a larger value. This is much faster when the index has many equal keys.

Postgres does not support loose indexscans natively, but they can be emulated using a recursive CTE as follows:

WITH RECURSIVE
     t AS (SELECT min(col) AS col FROM tablename
           UNION ALL
           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);

The recursive part of the query processes one new row in each pass, finding the smallest column value greater than that on the previous pass; the main query adds in a single null result if needed. The above query is therefore equivalent to

SELECT DISTINCT col FROM tablename;

but runs orders of magnitude faster when "col" has only a proportionally small number of distinct values.

[edit] Making use of a non-leading column of an index

The same technique can be used to allow a query to benefit from an index which has a leading column with few distinct values (and which would not naturally be specified in the query), and a 2nd column which is highly specific and is used by the query. The leading column can be introduced into the query in a way which does not alter the meaning.

To set up the example:

CREATE TABLE tablename (col integer NOT NULL, col2 integer NOT NULL);
INSERT INTO tablename SELECT floor(random()*5) AS col, (random()*1000000)::integer AS col2 FROM generate_series(1,10000000);
CREATE INDEX ON tablename (col,col2 );

The natural specification of the query, which cannot use the index and takes 1758 ms:

SELECT count(*) FROM tablename WHERE col2=9854 ;

The modified version which can benefit from the index, and runs in 2 ms:

SELECT count(*) FROM tablename WHERE col2=9854 AND col IN 
(
  WITH RECURSIVE
     t AS (SELECT min(col) AS col FROM tablename
           UNION ALL
           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
  SELECT col FROM t
);

Writing a correct version of this query would be much more complex if col could be NULL

Personal tools