From PostgreSQL wiki
Selecting Distinct Values
The term "loose indexscan" is used in some other databases for the operation of using a btree index to retrieve the distinct values of a column efficiently; rather than scanning all equal values of a key, as soon as a new value is found, restart the search by looking for a larger value. This is much faster when the index has many equal keys.
Postgres does not support loose indexscans natively, but they can be emulated using a recursive CTE as follows:
WITH RECURSIVE t AS ( SELECT MIN(col) AS col FROM tbl UNION ALL SELECT (SELECT MIN(col) FROM tbl WHERE col > t.col) FROM t WHERE t.col IS NOT NULL ) SELECT col FROM t WHERE col IS NOT NULL UNION ALL SELECT NULL WHERE EXISTS(SELECT 1 FROM tbl WHERE col IS NULL);
The recursive part of the query processes one new row in each pass, finding the smallest column value greater than that on the previous pass; the main query adds in a single null result if needed. The above query is therefore equivalent to
SELECT DISTINCT col FROM tbl;
but runs orders of magnitude faster when "col" has only a proportionally small number of distinct values.
col IS NOT NULL
col is defined
NOT NULL the query can be simplified.
Also using alternative syntax with
ORDER BY /
LIMIT 1, which typically results in a simpler (and faster) query plan:
WITH RECURSIVE t AS ( (SELECT col FROM tbl ORDER BY col LIMIT 1) -- parentheses required UNION ALL SELECT (SELECT col FROM tbl WHERE col > t.col ORDER BY col LIMIT 1) FROM t WHERE t.col IS NOT NULL ) SELECT col FROM t WHERE col IS NOT NULL;
Making use of a non-leading column of an index
The same technique can be used to allow a query to benefit from an index which has a leading column with few distinct values (and which would not naturally be specified in the query), and a 2nd column which is highly specific and is used by the query. The leading column can be introduced into the query in a way which does not alter the meaning.
To set up the example:
CREATE TABLE tbl (col INTEGER NOT NULL, col2 INTEGER NOT NULL); INSERT INTO tbl SELECT trunc(random()*5) AS col, (random()*1000000)::INT AS col2 FROM generate_series(1,10000000); CREATE INDEX ON tbl (col, col2);
The natural specification of the query, which typically doesn't use the index takes 1758 ms:
SELECT COUNT(*) FROM tbl WHERE col2 = 9854;
The modified version can benefit from the index and runs in 2 ms:
SELECT COUNT(*) FROM tbl WHERE col2 = 9854 AND col IN ( WITH RECURSIVE t AS (SELECT MIN(col) AS col FROM tbl UNION ALL SELECT (SELECT MIN(col) FROM tbl WHERE col > t.col) FROM t WHERE t.col IS NOT NULL) SELECT col FROM t );
A correct version of this query would be much more complex if
col could be NULL.