Index-only scans
Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.
During a regular index scan, indexes are traversed, in a manner similar to any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Btree indexes contain what are technically redundant copies of the column data that is indexed.
PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).
Example queries where index-only scans could be used in principle
Assuming that there is some (non-expression) index on a column (typically a primary key):
select count(*) from categories;
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):
select 1st_indexed_col, 2nd_indexed_col from categories;
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:
select indexed_col from categories where indexed_col in (4, 5, 6);
Index-only scans and index-access methods
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may support index-only scans.
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0; amname | amcanreturn --------+-------------- btree | btcanreturn spgist | spgcanreturn (2 rows)
SP-GiST opclasses may or may not imply that the actual on-disk index is "lossy"; there will be full redundant copies of Datums stored only for certain operator classes, and so index-only scans are only actually supported by some SP-GiST indexes. Support for additional index AMs will probably follow in a future release of PostgreSQL - GiST and GIN operator classes like btree_gist and btree_gin, or in 9.3, SP-GiST's "quad tree over a range" opclass, are not lossy, and so could in principle support index-only scans. Also, even with lossy indexes, it is still possible in principle to solve "select count(*)" queries, which may follow in a future release.
The Visibility Map (and other relation forks)
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork"; an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking which tuples are visible to all transactions at a high level. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transaction has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples (row versions) in a heap page are visible, the page's corresponding bit may be set.
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is a FSM for both heap and index relations (with the sole exception of hash index relations, which have none).
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.
In PostgreSQL 8.4, the current freespace map implementation was added. It made the freespace map an on-disk relation fork. The previous implementation required administrators to guestimate the number of relations, and the required freespace map size for each, so that the freespace map existed only in a fixed allocation of shared memory. This tended to result in wasted space due to undersizing, as the core system's storage manager needlessly extended relations.
[peter@peterlaptop 12935]$ ls -l -h -a -rw-------. 1 peter peter 8.0K Sep 28 00:00 12910 -rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm -rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm ***SNIP***
The FSM is structured as a binary tree [1]. There is one leaf node per heap page, with non-leaf nodes stores the maximum amount of free space for any of its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [2].
Crash safety, recovery and the visibility map
This involves WAL-logging setting a bit within the visibility map during VACUUM, and taking various special measures during recovery.
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also intergral to the built-in Hot Standby/Streaming replication feature.
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.
Covering indexes
Covering indexes are indexes creating for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of particular expensive, frequently executed query's selectlist. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's predicate, so covering indexes need not be completely useless for regular index scans.
Interaction with HOT
HOT (Heap-only tuples) is a major performance feature that was added in Postgres 8.3. This allowed UPDATES to rows (which, owing to Postgres's MVCC architecture, are implemented with a deletion and insertion of physical tuples) to only have to create a new physical heap tuple when inserting, and not a new index tuple, if and only if the update did not affect indexed columns.
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at tuple that is current according to the query's snapshot.
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.
You can monitor the total proportion of HOT updates for each relation using this query.
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;
What types of queries may be satisfied by an index-only scan?
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.
It is not necessary for all bits to be set; index-only scans may "visit the heap" if that is necessary. Index-only scans are something of a misnomer, in fact - index mostly scans might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice.
postgres=# explain analyze select count(*) from categories; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1) -> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1) Heap Fetches: 16 Total runtime: 0.108 ms (4 rows)
As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.
Is "count(*)" much faster now?
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyIsam storage engine, which doesn't use MVCC) has been "count(*) is slow". Index-only scans *can* be used to satisfy these queries without there being any predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.
Why isn't my query using an index-only scan?
VACUUM does not have any particular tendency to behave more aggressively to facilitate using index-only scans more frequently. While VACUUM can be set to behave more aggressively in various ways, it's far from clear that to do so specifically to make index-only scans occur more frequently represents a sensible course of action.
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (however, the executor does maintain a running tally, which is visible in explain analyze output). However, the planner does naturally weigh the proportion of pages which are known visible to all.
In Postgres 9.2, statistics are gathered about the proportion of pages that are known all-visible. The pg_class.relallvisible column indicates how many pages are visible (the proportion can be obtained by calculating it as a proportion of pg_class.relpages). These statistics are updated when VACUUM is run. It is advisable to run VACUUM ANALYZE immediately after upgrading to PostgreSQL 9.2, in order to ensure that relallvisible roughly accords with reality.
Note that it is possible to examine the number of index scans (including index-only scans and bitmap index scans) by examining pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're essentially paying for the overhead of maintaining it during writes with no benefit in return. Drop the index!
Summary
It is possible for index-only scans to greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly queries that are characteristic of data warehousing (i.e. relatively large amounts of static, infrequently-updated data where reports on historic data is frequently required), they can considerably improve performance. Such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:
- Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs where it happens to be possible to elide heap access. However, the server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make index-only scans occur more frequently, except to define covering indexes in response to a measured need (For example, when pg_stat_statements indicates that a disproportionate amount of I/O is being used to execute a query against fairly static data, with a smallish subset of table columns retrieved).
- When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because the number of HOT updates is a reasonably good proxy for just how static a table is (i.e. the total number of inserts, updates and deletes), and therefore how likely it is that most heap pages are known to be all-visible at any given time.
- Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle.