PostgreSQL wiki - Index-only scans (revision of 2012-12-13 by Sternocera; edit summary: /* Why isn't my query using an index-only scan? */)
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, in a manner similar to any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must have a total order satisfying the trichotomy property: for any two values a and b, exactly one of a &lt; b, a = b or a &gt; b holds, and the comparison must be transitive. Those laws accord with our intuitive understanding of how an orderable type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums makes them a hard requirement. Btree indexes contain what are technically redundant copies of the indexed column data.<br />
<br />
PostgreSQL indexes do not contain visibility information. That is, it is not possible to ascertain from the index alone whether any given tuple is visible to the current transaction, which is why index-only scans took so long to implement: devising a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. The structure had to reliably (and inexpensively) indicate whether all tuples on a heap page are visible to every transaction - to do any less would imply the possibility of index-only scans producing incorrect results, which would of course be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (first_indexed_col, second_indexed_col):<br />
<br />
select first_indexed_col, second_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may support index-only scans.<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
An SP-GiST opclass may or may not imply that the actual on-disk index is "lossy": full redundant copies of Datums are stored only for certain operator classes, and so index-only scans are only actually supported by some SP-GiST indexes. Support for additional index AMs will probably follow in a future release of PostgreSQL. GiST and GIN operator classes like btree_gist and btree_gin, or (in 9.3) SP-GiST's "quad tree over a range" opclass, are not lossy, and so could in principle support index-only scans. Even with lossy indexes, it is still possible in principle to satisfy "select count(*)" queries; that may also follow in a future release.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork": an ancillary on-disk file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map tracks, at the page level, which heap pages contain only tuples that are visible to all transactions. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. The exact behaviour depends on our transaction isolation level. It is also quite possible for one transaction to see one physical tuple (set of values) for a logical tuple while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transactions has a differing idea of what constitutes "now". This is the core idea of MVCC. Only when there is absolute consensus that all physical tuples (row versions) in a heap page are visible may the page's corresponding bit be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is an FSM for both heap and index relations (with the sole exception of hash indexes, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
<br />
The current freespace map implementation, which made the FSM an on-disk relation fork, was added in PostgreSQL 8.4. The previous implementation lived in a fixed allocation of shared memory, requiring administrators to guesstimate the number of relations, and the required freespace map size for each. Undersizing tended to waste space, because the core system's storage manager would needlessly extend relations whose free space went untracked.<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree [http://www.postgresql.org/docs/9.2/static/storage-fsm.html]. There is one leaf node per heap page, and each non-leaf node stores the maximum amount of free space found among its children. So, unlike the node costs in EXPLAIN output, the values are not cumulative.<br />
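<br />
The max-propagation described above can be sketched as follows - a simplified Python model of the idea, where build_fsm and find_page are illustrative names and the real FSM's on-disk layout differs considerably:<br />
<br />
```python
# Simplified model of the FSM max-tree (not the actual on-disk format):
# leaves hold per-page free space; each parent stores the max of its children.

def build_fsm(free_space):
    """Build a list of tree levels, leaves first."""
    levels = [list(free_space)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # each parent holds the larger of its (up to two) children
        levels.append([max(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels

def find_page(levels, needed):
    """Return the index of a heap page with >= `needed` bytes free, or None."""
    if levels[-1][0] < needed:
        return None  # no page can fit the tuple: the relation must be extended
    idx = 0
    for level in reversed(levels[:-1]):
        idx *= 2
        if level[idx] < needed:  # left subtree cannot satisfy; go right
            idx += 1
    return idx

levels = build_fsm([100, 800, 50, 400])
print(find_page(levels, 300))  # 1 (page 1 has 800 bytes free)
print(find_page(levels, 900))  # None
```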
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
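<br />
Because the map stores a single bit per heap page, it is tiny relative to the heap it describes. A rough sketch of that arithmetic in Python (assuming the default 8KB block size; the helper name is illustrative):<br />
<br />
```python
BLOCK_SIZE = 8192  # default PostgreSQL block size, in bytes

def vm_bits_needed(table_bytes):
    """One visibility map bit per heap page (ceiling division)."""
    return -(-table_bytes // BLOCK_SIZE)

# A 1GB heap is 131072 pages, so its visibility map needs
# 131072 bits, i.e. only 16KB of bit space.
bits = vm_bits_needed(1024**3)
print(bits, bits // 8)  # 131072 16384
```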
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [http://www.postgresql.org/docs/9.2/static/storage-vm.html].<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
Making the visibility map crash-safe involves WAL-logging the setting of a bit within the map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also integral to the built-in Hot Standby/streaming replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes created for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index - typically columns that are known to appear in the select list of a particular expensive, frequently executed query. Since PostgreSQL can use just the first few columns of an index in a regular index scan when only those appear in the query's predicate, covering indexes remain useful for regular index scans too.<br />
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-Only Tuples) is a major performance feature that was added in Postgres 8.3. Owing to Postgres's MVCC architecture, an UPDATE is implemented as a deletion and an insertion of physical tuples; with HOT, the insertion creates only a new physical heap tuple, and not a new index tuple, if and only if the update did not affect any indexed column.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain: it could get from the physical index tuple (which would probably have been created by an original INSERT, and relate to an earlier version of the logical tuple) to the corresponding physical heap tuple. That heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple's ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at the tuple that is current according to the query's snapshot.<br />
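<br />
Conceptually, the traversal above is just following a linked list of row versions until one is visible to our snapshot. A toy Python model (HeapTuple and its fields are illustrative, not PostgreSQL internals):<br />
<br />
```python
class HeapTuple:
    """Toy row version: values, snapshot visibility, and a ctid-like link."""
    def __init__(self, values, visible, next_version=None):
        self.values = values            # this version's column values
        self.visible = visible          # visible to *our* snapshot?
        self.next_version = next_version  # link to the next row version

def follow_hot_chain(tup):
    """Walk version links until we reach the version our snapshot sees."""
    t = tup
    while t is not None:
        if t.visible:
            return t.values
        t = t.next_version
    return None

v3 = HeapTuple({"qty": 3}, visible=True)
v2 = HeapTuple({"qty": 2}, visible=False, next_version=v3)
v1 = HeapTuple({"qty": 1}, visible=False, next_version=v2)  # the index points here
print(follow_hot_chain(v1))  # {'qty': 3}
```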
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It helps that frequently updated columns are often not indexed anyway. However, when considering creating a covering index, the desire to maximise the number of HOT updates should be carefully weighed against it, since covering additional columns means that updates to those columns can no longer be HOT.<br />
<br />
You can monitor the proportion of HOT updates for each relation using this query:<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
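<br />
Given those two counters, the HOT fraction is simply n_tup_hot_upd divided by n_tup_upd; a trivial helper (the function name is ours, while the column names come from pg_stat_user_tables):<br />
<br />
```python
def hot_update_ratio(n_tup_upd, n_tup_hot_upd):
    """Fraction of updates that were HOT; None if there were no updates."""
    if n_tup_upd == 0:
        return None
    return n_tup_hot_upd / n_tup_upd

print(hot_update_ratio(2000, 1500))  # 0.75
```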
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that a query cannot use an index-only scan unless all referenced columns are covered by a single index, the main cost factor is that visiting the heap for tuples not known to be all-visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need for the bulk of the table's heap pages to have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; an index-only scan will simply "visit the heap" where that is necessary. "Index-only scan" is something of a misnomer, in fact - "index-mostly scan" might be a more appropriate appellation. EXPLAIN ANALYZE output for an index-only scan indicates how frequently heap fetches occurred in practice.<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
-> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
Heap Fetches: 16<br />
Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least with the MyISAM storage engine, which doesn't use MVCC), has been that "count(*) is slow". Index-only scans *can* satisfy such queries even without any predicate to limit the number of rows returned, and without forcing an index to be used by ordering on an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.<br />
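<br />
A toy version of that size comparison (the numbers and the bare "scan the smaller relation" rule are illustrative; the real cost model accounts for much more, including heap fetches):<br />
<br />
```python
BLOCK_SIZE = 8192  # default PostgreSQL block size, in bytes

def pages(row_bytes, n_rows):
    """Rough page count for a relation with fixed-width entries."""
    rows_per_page = BLOCK_SIZE // row_bytes
    return -(-n_rows // rows_per_page)  # ceiling division

n_rows = 1_000_000
heap_pages = pages(row_bytes=400, n_rows=n_rows)  # wide heap rows
index_pages = pages(row_bytes=16, n_rows=n_rows)  # narrow index entries
# The index is ~25x smaller than the heap, so reading it costs far less I/O.
print(heap_pages, index_pages)  # 50000 1954
```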
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM has no particular tendency to behave more aggressively in order to facilitate index-only scans. While VACUUM can be made more aggressive in various ways, it is far from clear that doing so specifically to make index-only scans occur more frequently is a sensible course of action.<br />
<br />
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (the executor, however, maintains a running tally of heap fetches, which is visible in EXPLAIN ANALYZE output). The planner does, naturally, weigh the proportion of pages that are known to be all-visible.<br />
<br />
In Postgres 9.2, statistics are gathered about the proportion of pages that are known to be all-visible. The pg_class.relallvisible column records how many pages are all-visible (the proportion can be obtained by dividing it by pg_class.relpages). These statistics are updated when VACUUM is run. It is advisable to run VACUUM ANALYZE immediately after upgrading to PostgreSQL 9.2, to ensure that relallvisible roughly accords with reality.<br />
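<br />
The proportion the planner works from can be computed directly from those two columns; a trivial sketch (in practice you would read relallvisible and relpages from pg_class):<br />
<br />
```python
def all_visible_fraction(relallvisible, relpages):
    """Fraction of heap pages known all-visible, as the planner sees it."""
    if relpages == 0:
        return 0.0
    return relallvisible / relpages

print(all_visible_fraction(950, 1000))  # 0.95
```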
<br />
Note that it is possible to examine the number of index scans (including index-only scans and bitmap index scans) by examining<br />
pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're essentially paying for the overhead of maintaining it during writes with no benefit in return. Drop the index!<br />
<br />
=== Summary ===<br />
<br />
It is possible for index-only scans to greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly those characteristic of data warehousing (i.e. relatively large amounts of static, infrequently updated data, where reports on historic data are frequently required), they can considerably improve performance. Such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs where it happens to be possible to elide heap access. However, the server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make index-only scans occur more frequently, except to define covering indexes in response to a measured need (For example, when pg_stat_statements indicates that a disproportionate amount of I/O is being used to execute a query against fairly static data, with a smallish subset of table columns retrieved).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because overall write activity (the total number of inserts, updates and deletes) is a reasonably good proxy for how static a table is, and therefore how likely it is that most heap pages are known to be all-visible at any given time.<br />
<br />
* Index-only scans are only used when the planner surmises that doing so will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This depends heavily on the visibility of tuples, on whether an index would be used anyway (i.e. how selective the predicate is, and so on), and on whether there is actually an index available that an index-only scan could use in principle.<br />
<br />
[[Category:Indexes]]</div>
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, in a manner similar to any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Btree indexes contain what are technically redundant copies of the column data that is indexed.<br />
<br />
PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):<br />
<br />
select 1st_indexed_col, 2nd_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may support index-only scans.<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
SP-GiST opclasses may or may not imply that the actual on-disk index is "lossy"; there will be full redundant copies of Datums stored only for certain operator classes, and so index-only scans are only actually supported by some SP-GiST indexes. Support for additional index AMs will probably follow in a future release of PostgreSQL - GiST and GIN operator classes like btree_gist and btree_gin, or in 9.3, SP-GiST's "quad tree over a range" opclass, are not lossy, and so could in principle support index-only scans. Also, even with lossy indexes, it is still possible in principle to solve "select count(*)" queries, which may follow in a future release.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork"; an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking which tuples are visible to all transactions at a high level. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact<br />
behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transaction has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples (row versions) in a heap page are visible, the page's corresponding bit may be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is a FSM for both heap and index relations (with the sole exception of hash index relations, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
<br />
In PostgreSQL 8.4, the current freespace map implementation was added. It made the freespace map an on-disk relation fork. The previous implementation required administrators to guestimate the number of relations, and the required freespace map size for each, so that the freespace map existed only in a fixed allocation of shared memory. This tended to result in wasted space due to undersizing, as the core system's storage manager needlessly extended relations.<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree [http://www.postgresql.org/docs/9.2/static/storage-fsm.html]. There is one leaf node per heap page, with non-leaf nodes stores the maximum amount of free space for any of its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.<br />
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [http://www.postgresql.org/docs/9.2/static/storage-vm.html].<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
This involves WAL-logging setting a bit within the visibility map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also intergral to the built-in Hot Standby/Streaming replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes creating for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of particular expensive, frequently executed query's selectlist. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's predicate, so covering indexes need not be completely useless for regular index scans.<br />
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-only tuples) is a major performance feature that was added in Postgres 8.3. This allowed UPDATES to rows (which, owing to Postgres's MVCC architecture, are implemented with a deletion and insertion of physical tuples) to only have to create a new physical heap tuple when inserting, and not a new index tuple, if and only if the update did not affect indexed columns.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at tuple that is current according to the query's snapshot.<br />
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.<br />
<br />
You can monitor the total proportion of HOT updates for each relation using this query.<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; index-only scans may "visit the heap" if that is necessary. Index-only scans are something of a misnomer, in fact - index mostly scans might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice.<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
-> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
Heap Fetches: 16<br />
Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyIsam storage engine, which doesn't use MVCC) has been "count(*) is slow". Index-only scans *can* be used to satisfy these queries without there being any predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.<br />
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM does not have any particular tendency to behave more aggressively to facilitate using index-only scans more frequently. While VACUUM can be set to behave more aggressively in various ways, it's far from clear that to do so specifically to make index-only scans occur more frequently represents a sensible course of action.<br />
<br />
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (however, the executor does maintain a running tally, which is visible in explain analyze output). However, the planner does naturally weigh the proportion of pages which are known visible to all.<br />
<br />
In Postgres 9.2, statistics are gathered about the proportion of pages that are known all-visible. The pg_class.relallvisible column indicates how many pages are visible (the proportion can be obtained by calculating it as a proportion of pg_class.relpages). These statistics are updated during VACUUM and ANALYZE.<br />
<br />
Note that it is possible to examine the number of index scans (including index-only scans and bitmap index scans) by examining<br />
pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're essentially paying for the overhead of maintaining it during writes with no benefit in return. Drop the index!<br />
<br />
=== Summary ===<br />
<br />
It is possible for index-only scans to greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly queries that are characteristic of data warehousing (i.e. relatively large amounts of static, infrequently-updated data where reports on historic data is frequently required), they can considerably improve performance. Such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs in which it happens to be possible to elide heap access. The server makes no particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make them occur more frequently, except to define covering indexes in response to a measured need (for example, when pg_stat_statements indicates that a disproportionate amount of I/O is spent executing a query against fairly static data that retrieves a smallish subset of table columns).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, both because creating an index may prevent HOT updates from occurring, and because the number of HOT updates, relative to the total number of inserts, updates and deletes, is a reasonably good proxy for how static a table is, and therefore for how likely it is that most heap pages are known to be all-visible at any given time.<br />
<br />
* Index-only scans are only used when the planner estimates that they will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This depends heavily on the visibility of tuples, on whether an index would be used anyway (i.e. how selective the predicate is), and on whether there is actually an index available that could be used by an index-only scan in principle.<br />
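<br />
As a starting point for the measurement suggested above, the pg_stat_statements extension (if installed) can rank statements by blocks read; this sketch uses the 9.2-era column names:<br />
<br />
 postgres=# select query, calls, total_time, shared_blks_read from pg_stat_statements order by shared_blks_read desc limit 10;<br />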
<br />
[[Category:Indexes]]</div>
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, in a manner similar to any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Btree indexes contain what are technically redundant copies of the column data that is indexed.<br />
<br />
PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):<br />
<br />
select 1st_indexed_col, 2nd_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may optionally support index-only scans.<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
SP-GiST opclasses may or may not imply that an index is "lossy"; there will be full redundant copies of Datums stored only for certain operator classes, and so index-only scans are only actually supported by some SP-GiST indexes. Support for additional index AMs will probably follow in a future release of PostgreSQL - GiST and GIN operator classes like btree_gist and btree_gin, or in 9.3, SP-GiST's "quad tree over a range" opclass, are not lossy, and so could in principle support index-only scans. Also, even with lossy indexes, it is still possible in principle to solve "select count(*)" queries, which may follow in a future release.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork"; an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking which tuples are visible to all transactions at a high level. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact<br />
behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transaction has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples (row versions) in a heap page are visible, the page's corresponding bit may be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is a FSM for both heap and index relations (with the sole exception of hash index relations, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
<br />
In PostgreSQL 8.4, the current freespace map implementation was added. It made the freespace map an on-disk relation fork. The previous implementation required administrators to guestimate the number of relations, and the required freespace map size for each, so that the freespace map existed only in a fixed allocation of shared memory. This tended to result in wasted space due to undersizing, as the core system's storage manager needlessly extended relations.<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree [http://www.postgresql.org/docs/9.2/static/storage-fsm.html]. There is one leaf node per heap page, with non-leaf nodes stores the maximum amount of free space for any of its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.<br />
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [http://www.postgresql.org/docs/9.2/static/storage-vm.html].<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
This involves WAL-logging setting a bit within the visibility map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also intergral to the built-in Hot Standby/Streaming replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes creating for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of particular expensive, frequently executed query's selectlist. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's predicate, so covering indexes need not be completely useless for regular index scans.<br />
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-only tuples) is a major performance feature that was added in Postgres 8.3. This allowed UPDATES to rows (which, owing to Postgres's MVCC architecture, are implemented with a deletion and insertion of physical tuples) to only have to create a new physical heap tuple when inserting, and not a new index tuple, if and only if the update did not affect indexed columns.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at tuple that is current according to the query's snapshot.<br />
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.<br />
<br />
You can monitor the total proportion of HOT updates for each relation using this query.<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; index-only scans may "visit the heap" if that is necessary. Index-only scans are something of a misnomer, in fact - index mostly scans might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice.<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
-> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
Heap Fetches: 16<br />
Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyIsam storage engine, which doesn't use MVCC) has been "count(*) is slow". Index-only scans *can* be used to satisfy these queries without there being any predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.<br />
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM does not have any particular tendency to behave more aggressively to facilitate using index-only scans more frequently. While VACUUM can be set to behave more aggressively in various ways, it's far from clear that to do so specifically to make index-only scans occur more frequently represents a sensible course of action.<br />
<br />
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (however, the executor does maintain a running tally, which is visible in explain analyze output). However, the planner does naturally weigh the proportion of pages which are known visible to all.<br />
<br />
In Postgres 9.2, statistics are gathered about the proportion of pages that are known all-visible. The pg_class.relallvisible column indicates how many pages are visible (the proportion can be obtained by calculating it as a proportion of pg_class.relpages). These statistics are updated during VACUUM and ANALYZE.<br />
<br />
Note that it is possible to examine the number of index scans (including index-only scans and bitmap index scans) by examining<br />
pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're essentially paying for the overhead of maintaining it during writes with no benefit in return. Drop the index!<br />
<br />
=== Summary ===<br />
<br />
It is possible for index-only scans to greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly queries that are characteristic of data warehousing (i.e. relatively large amounts of static, infrequently-updated data where reports on historic data is frequently required), they can considerably improve performance. Such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs where it happens to be possible to elide heap access. However, the server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make index-only scans occur more frequently, except to define covering indexes in response to a measured need (For example, when pg_stat_statements indicates that a disproportionate amount of I/O is being used to execute a query against fairly static data, with a smallish subset of table columns retrieved).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because the number of HOT updates is a reasonably good proxy for just how static a table is (i.e. the total number of inserts, updates and deletes), and therefore how likely it is that most heap pages are known to be all-visible at any given time.<br />
<br />
* Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle.<br />
<br />
[[Category:Indexes]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Index-only_scans&diff=18661Index-only scans2012-12-01T12:01:13Z<p>Sternocera: /* What types of queries may be satisfied by an index-only scan? */</p>
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, in a manner similar to any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Btree indexes contain what are technically redundant copies of the column data that is indexed.<br />
<br />
PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):<br />
<br />
select 1st_indexed_col, 2nd_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may optionally support index-only scans.<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
SP-GiST opclasses may or may not imply that an index is "lossy"; there will be full redundant copies of Datums stored only for certain operator classes, and so index-only scans are only actually supported by some SP-GiST indexes. Support for additional index AMs will probably follow in a future release of PostgreSQL - GiST and GIN operator classes like btree_gist and btree_gin, or in 9.3, SP-GiST's "quad tree over a range" opclass, are not lossy, and so could in principle support index-only scans. Also, even with lossy indexes, it is still possible in principle to solve "select count(*)" queries, which may follow in a future release.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork"; an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking which tuples are visible to all transactions at a high level. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact<br />
behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transaction has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples (row versions) in a heap page are visible, the page's corresponding bit may be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is a FSM for both heap and index relations (with the sole exception of hash index relations, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
<br />
In PostgreSQL 8.4, the current freespace map implementation was added. It made the freespace map an on-disk relation fork. The previous implementation required administrators to guestimate the number of relations, and the required freespace map size for each, so that the freespace map existed only in a fixed allocation of shared memory. This tended to result in wasted space due to undersizing, as the core system's storage manager needlessly extended relations.<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree [http://www.postgresql.org/docs/9.2/static/storage-fsm.html]. There is one leaf node per heap page, with non-leaf nodes stores the maximum amount of free space for any of its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.<br />
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [http://www.postgresql.org/docs/9.2/static/storage-vm.html].<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
This involves WAL-logging setting a bit within the visibility map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also intergral to the built-in Hot Standby/Streaming replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes creating for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of particular expensive, frequently executed query's selectlist. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's predicate, so covering indexes need not be completely useless for regular index scans.<br />
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-only tuples) is a major performance feature that was added in Postgres 8.3. This allowed UPDATES to rows (which, owing to Postgres's MVCC architecture, are implemented with a deletion and insertion of physical tuples) to only have to create a new physical heap tuple when inserting, and not a new index tuple, if and only if the update did not affect indexed columns.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at tuple that is current according to the query's snapshot.<br />
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.<br />
<br />
You can monitor the total proportion of HOT updates for each relation using this query.<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; index-only scans may "visit the heap" if that is necessary. Index-only scans are something of a misnomer, in fact - index mostly scans might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice.<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
-> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
Heap Fetches: 16<br />
Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyIsam storage engine, which doesn't use MVCC) has been "count(*) is slow". Index-only scans *can* be used to satisfy these queries without there being any predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.<br />
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM does not have any particular tendency to behave more aggressively to facilitate using index-only scans more frequently. While VACUUM can be set to behave more aggressively in various ways, it's far from clear that to do so specifically to make index-only scans occur more frequently represents a sensible course of action.<br />
<br />
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (however, the executor does maintain a running tally, which is visible in explain analyze output). However, the planner does naturally weigh the proportion of pages which are known visible to all.<br />
<br />
In Postgres 9.2, statistics are gathered about the proportion of pages that are known all-visible. The pg_class.relallvisible column indicates how many pages are all-visible (the proportion can be obtained by dividing it by pg_class.relpages). These statistics are updated during VACUUM and ANALYZE.<br />
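The proportion can be computed directly from pg_class (the table name is just an example):<br />
<br />
postgres=# select relallvisible, relpages,<br />
postgres-#        round(relallvisible * 100.0 / greatest(relpages, 1), 2) as pct_all_visible<br />
postgres-# from pg_class where relname = 'categories';<br />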
<br />
Note that the number of index scans (including index-only scans and bitmap index scans) for each index can be seen in<br />
pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're essentially paying the overhead of maintaining it during writes with no benefit in return. Drop the index!<br />
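For example, to check usage of the indexes on a particular table (the table name is just an example):<br />
<br />
postgres=# select indexrelname, idx_scan<br />
postgres-# from pg_stat_user_indexes where relname = 'categories';<br />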
<br />
=== Summary ===<br />
<br />
It is possible for index-only scans to greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly queries characteristic of data warehousing (i.e. relatively large amounts of static, infrequently-updated data, where reports on historic data are frequently required), they can considerably improve performance. Such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs where it happens to be possible to elide heap access. However, the server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make index-only scans occur more frequently, except to define covering indexes in response to a measured need (for example, when pg_stat_statements indicates that a disproportionate amount of I/O is being used to execute a query against fairly static data, with a smallish subset of table columns retrieved).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because the number of HOT updates is a reasonably good proxy for just how static a table is, and therefore how likely it is that most heap pages are known to be all-visible.<br />
<br />
* Index-only scans are only used when the planner surmises that this will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This depends heavily on the visibility of tuples, on whether an index would be used anyway (i.e. how selective a predicate is, etc.), and on whether an index that could in principle support an index-only scan actually exists.<br />
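To find the sort of I/O-heavy queries mentioned above, pg_stat_statements (where the contrib module is installed) can be sorted by blocks read, for instance:<br />
<br />
postgres=# select query, calls, shared_blks_read<br />
postgres-# from pg_stat_statements order by shared_blks_read desc limit 5;<br />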
<br />
[[Category:Indexes]]</div>
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, in a manner similar to any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Btree indexes contain what are technically redundant copies of the column data that is indexed.<br />
<br />
PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):<br />
<br />
select 1st_indexed_col, 2nd_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may optionally support index-only scans.<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
SP-GiST opclasses may or may not imply that an index is "lossy"; there will be full redundant copies of Datums stored only for certain operator classes, and so index-only scans are only actually supported by some SP-GiST indexes. Support for additional index AMs will probably follow in a future release of PostgreSQL - GiST and GIN operator classes like btree_gist and btree_gin, or in 9.3, SP-GiST's "quad tree over a range" opclass, are not lossy, and so could in principle support index-only scans. Also, even with lossy indexes, it is still possible in principle to solve "select count(*)" queries, which may follow in a future release.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork"; an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking which tuples are visible to all transactions at a high level. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact<br />
behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transaction has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples (row versions) in a heap page are visible, the page's corresponding bit may be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is a FSM for both heap and index relations (with the sole exception of hash index relations, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
<br />
In PostgreSQL 8.4, the current freespace map implementation was added. It made the freespace map an on-disk relation fork. The previous implementation required administrators to guestimate the number of relations, and the required freespace map size for each, so that the freespace map existed only in a fixed allocation of shared memory. This tended to result in wasted space due to undersizing, as the core system's storage manager needlessly extended relations.<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree [http://www.postgresql.org/docs/9.2/static/storage-fsm.html]. There is one leaf node per heap page, with non-leaf nodes stores the maximum amount of free space for any of its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.<br />
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [http://www.postgresql.org/docs/9.2/static/storage-vm.html].<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
This involves WAL-logging setting a bit within the visibility map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also intergral to the built-in Hot Standby/Streaming replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes creating for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of particular expensive, frequently executed query's selectlist. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's predicate, so covering indexes need not be completely useless for regular index scans.<br />
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-only tuples) is a major performance feature that was added in Postgres 8.3. This allowed UPDATES to rows (which, owing to Postgres's MVCC architecture, are implemented with a deletion and insertion of physical tuples) to only have to create a new physical heap tuple when inserting, and not a new index tuple, if and only if the update did not affect indexed columns.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at tuple that is current according to the query's snapshot.<br />
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.<br />
<br />
You can monitor the total proportion of HOT updates for each relation using this query.<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; index-only scans may "visit the heap" if that is necessary. Index-only scans are something of a misnomer, in fact - index mostly scans might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice.<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
-> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
Heap Fetches: 16<br />
Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap "visits" goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyIsam storage engine, which doesn't use MVCC) has been "count(*) is slow". Index-only scans *can* be used to satisfy these queries without there being any predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query, and with databases the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table, which typically happens only when the table's rows are much wider than the index entries.<br />
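<br />
One way to gauge whether this applies is to compare the on-disk sizes of the table and a candidate index (the table and index names here are only illustrative):<br />
<br />
postgres=# select pg_size_pretty(pg_relation_size('categories')) as table_size,<br />
postgres-#        pg_size_pretty(pg_relation_size('categories_pkey')) as index_size;<br />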
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM has no particular tendency to behave more aggressively in order to make index-only scans possible more often. While VACUUM can be configured to run more aggressively in various ways, it is far from clear that doing so specifically to make index-only scans occur more frequently is a sensible course of action.<br />
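<br />
If, after measurement, you nevertheless decide that a particular mostly-static table should be vacuumed more often, autovacuum can be tuned per table with storage parameters (the value here is purely illustrative):<br />
<br />
postgres=# alter table categories set (autovacuum_vacuum_scale_factor = 0.02);<br />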
<br />
The planner doesn't directly examine a relation's entire visibility map when considering an index-only scan (though the executor does maintain a running tally of heap fetches, which is visible in EXPLAIN ANALYZE output). It does, however, weigh the proportion of the relation's pages that are known to be all-visible.<br />
<br />
In Postgres 9.2, statistics are gathered about the proportion of pages that are known to be all-visible. The pg_class.relallvisible column records how many pages are all-visible; the proportion can be obtained by dividing it by pg_class.relpages. These statistics are updated during VACUUM and ANALYZE.<br />
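<br />
For example, the all-visible fraction for a given table can be checked like this (substitute your own table name; greatest() merely guards against division by zero on an empty table):<br />
<br />
postgres=# select relname, relallvisible, relpages,<br />
postgres-#        round(100.0 * relallvisible / greatest(relpages, 1), 1) as pct_all_visible<br />
postgres-# from pg_class where relname = 'categories';<br />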
<br />
Note that the number of scans of each index (including index-only scans and bitmap index scans) can be examined via pg_stat_user_indexes.idx_scan. If a covering index isn't actually being used, you're paying the overhead of maintaining it during writes with no benefit in return - drop the index!<br />
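<br />
For instance, to spot candidate indexes on a table that are rarely or never scanned (the table name is illustrative):<br />
<br />
postgres=# select indexrelname, idx_scan<br />
postgres-# from pg_stat_user_indexes<br />
postgres-# where relname = 'categories' order by idx_scan;<br />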
<br />
=== Summary ===<br />
<br />
Index-only scans can greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly those characteristic of data warehousing (relatively large amounts of static, infrequently-updated data, on which reports over historic data are frequently required), they can considerably improve performance; such queries have been observed to execute anywhere from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs in which it happens to be possible to elide heap access. The server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make them occur more frequently, except to define covering indexes in response to a measured need (for example, when pg_stat_statements indicates that a disproportionate amount of I/O is being used to execute a query against fairly static data, retrieving a smallish subset of table columns).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because the number of HOT updates is a reasonably good proxy for how static a table is, and therefore for how likely it is that most heap pages are known to be all-visible.<br />
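<br />
The proportion of updates that were HOT can be checked per table; a ratio near 1 suggests that adding a covering index risks losing many HOT updates:<br />
<br />
postgres=# select relname, n_tup_upd, n_tup_hot_upd<br />
postgres-# from pg_stat_user_tables where n_tup_upd > 0;<br />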
<br />
* Index-only scans are only used when the planner surmises that they will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This depends heavily on the visibility of tuples, on whether an index would be used anyway (i.e. how selective the predicate is, and so on), and on whether there is actually an index available that could in principle be used by an index-only scan.<br />
<br />
[[Category:Indexes]]</div>
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, like any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Btree indexes contain what are technically redundant copies of the column data that is indexed.<br />
<br />
PostgreSQL indexes do not contain visibility information. That is, it is not possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):<br />
<br />
select 1st_indexed_col, 2nd_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may optionally support index-only scans.<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
SP-GiST opclasses may or may not imply that an index is "lossy"; there will be full redundant copies of Datums stored only for certain operator classes, and so index-only scans are only actually supported by some SP-GiST indexes. Support for additional index AMs will probably follow in a future release of PostgreSQL - GiST and GIN operator classes like btree_gist and btree_gin, or in 9.3, SP-GiST's "quad tree over a range" opclass, are not lossy, and so could in principle support index-only scans. Also, even with lossy indexes, it is still possible in principle to solve "select count(*)" queries, which may follow in a future release.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork"; an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking which tuples are visible to all transactions at a high level. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact<br />
behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transaction has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples in a heap page are visible, the page's corresponding bit may be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is a FSM for both heap and index relations (with the sole exception of hash index relations, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
<br />
In PostgreSQL 8.4, the current freespace map implementation was added. It made the freespace map an on-disk relation fork. The previous implementation required administrators to guestimate the number of relations, and the required freespace map size for each, so that the freespace map existed only in a fixed allocation of shared memory. This tended to result in wasted space due to undersizing, as the core system's storage manager needlessly extended relations.<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree [http://www.postgresql.org/docs/9.2/static/storage-fsm.html]. There is one leaf node per heap page, with non-leaf nodes stores the maximum amount of free space for any of its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.<br />
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [http://www.postgresql.org/docs/9.2/static/storage-vm.html].<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
This involves WAL-logging setting a bit within the visibility map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also intergral to the built-in Hot Standby/Streaming replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes creating for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of particular expensive, frequently executed query's selectlist. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's predicate, so covering indexes need not be completely useless for regular index scans.<br />
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-only tuples) is a major performance feature that was added in Postgres 8.3. This allowed UPDATES to rows (which, owing to Postgres's MVCC architecture, are implemented with a deletion and insertion of physical tuples) to only have to create a new physical heap tuple when inserting, and not a new index tuple, if and only if the update did not affect indexed columns.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at tuple that is current according to the query's snapshot.<br />
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.<br />
<br />
You can monitor the total proportion of HOT updates for each relation using this query.<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; index-only scans may "visit the heap" if that is necessary. Index-only scans are something of a misnomer, in fact - index mostly scans might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice.<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
                                                                QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
 Aggregate  (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
   ->  Index Only Scan using categories_pkey on categories  (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
         Heap Fetches: 16<br />
 Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap "visits" goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint about PostgreSQL, generally made when comparing it unfavourably with MySQL (at least with the MyISAM storage engine, which doesn't use MVCC), has been that "count(*) is slow". Index-only scans *can* satisfy such queries even when there is no predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples be ordered by an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query, and with databases, the cost of I/O typically dominates. For that reason, a "count(*) without any predicate" query will only use an index-only scan if the index is significantly smaller than its table, which typically only happens when the table's rows are much wider than the index's entries.<br />
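<br />
To get a sense of whether this could apply, you can compare the on-disk size of a table with that of a candidate index (the relation names here are merely illustrative):<br />
<br />
postgres=# select pg_size_pretty(pg_relation_size('categories')) as table_size, pg_size_pretty(pg_relation_size('categories_pkey')) as index_size;<br />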
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM has no particular tendency to behave more aggressively in order to facilitate index-only scans. While VACUUM can be made more aggressive in various ways, it is far from clear that doing so specifically to make index-only scans occur more frequently is a sensible course of action.<br />
<br />
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (though the executor does maintain a running tally of heap fetches, which is visible in explain analyze output). The planner does, however, weigh the proportion of the relation's pages that are known to be all-visible.<br />
<br />
In Postgres 9.2, statistics are gathered on the proportion of pages that are known to be all-visible. The pg_class.relallvisible column records how many pages are all-visible (the proportion can be obtained by dividing it by pg_class.relpages). These statistics are updated by VACUUM and ANALYZE.<br />
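<br />
For example, a query along these lines (a sketch over the system catalogs) shows the fraction of each ordinary table's pages that are known all-visible; a low fraction makes an index-only scan look expensive to the planner:<br />
<br />
postgres=# select relname, relpages, relallvisible, round(relallvisible::numeric / nullif(relpages, 0), 2) as allvisible_fraction from pg_class where relkind = 'r' order by relpages desc;<br />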
<br />
Note that the number of scans of each index (including index-only scans and bitmap index scans) is available in pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're paying the overhead of maintaining it during writes with no benefit in return. Drop the index!<br />
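<br />
For example, a query along these lines lists indexes that have never been scanned since the statistics were last reset (a sketch - before actually dropping anything, remember that unique and primary key indexes enforce constraints even if they are never scanned):<br />
<br />
postgres=# select schemaname, relname, indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) as index_size from pg_stat_user_indexes where idx_scan = 0 order by pg_relation_size(indexrelid) desc;<br />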
<br />
=== Summary ===<br />
<br />
Index-only scans can greatly decrease the amount of I/O required to execute some queries. For certain workloads, particularly those characteristic of data warehousing (i.e. relatively large amounts of static, infrequently-updated data against which reports on historical data are frequently run), they can considerably improve performance: such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs in which it happens to be possible to elide heap access. The server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make them occur more frequently, except to define covering indexes in response to a measured need (for example, when pg_stat_statements indicates that a disproportionate amount of I/O is spent executing a query against fairly static data that retrieves a smallish subset of table columns).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because the number of HOT updates is a reasonably good proxy for just how static a table is, and therefore how likely it is that most heap pages are known to be all-visible.<br />
<br />
* Index-only scans are only used when the planner surmises that they will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This depends heavily on the visibility of tuples, on whether an index would be used anyway (i.e. how selective a predicate is, etc.), and on whether there is actually an index available that could, in principle, be used by an index-only scan.<br />
<br />
[[Category:Indexes]]</div>
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, like any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Indexes contain what are technically redundant copies of the column data that is indexed.<br />
<br />
Indexes do not contain visibility information. That is, it is not possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):<br />
<br />
select 1st_indexed_col, 2nd_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may optionally support index-only scans (naturally, this is only sensible when the index isn't lossy).<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
Support for additional index AMs may follow in a future release of PostgreSQL.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork"; an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking which tuples are visible to all transactions at a high level. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact<br />
behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transaction has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples in a heap page are visible, the page's corresponding bit may be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is a FSM for both heap and index relations (with the sole exception of hash index relations, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
<br />
In PostgreSQL 8.4, the current freespace map implementation was added. It made the freespace map an on-disk relation fork. The previous implementation required administrators to guestimate the number of relations, and the required freespace map size for each, so that the freespace map existed only in a fixed allocation of shared memory. This tended to result in wasted space due to undersizing, as the core system's storage manager needlessly extended relations.<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree [http://www.postgresql.org/docs/9.2/static/storage-fsm.html]. There is one leaf node per heap page, with non-leaf nodes stores the maximum amount of free space for any of its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.<br />
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [http://www.postgresql.org/docs/9.2/static/storage-vm.html].<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
This involves WAL-logging setting a bit within the visibility map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also intergral to the built-in Hot Standby/Streaming replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes creating for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of particular expensive, frequently executed query's selectlist. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's predicate, so covering indexes need not be completely useless for regular index scans.<br />
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-only tuples) is a major performance feature that was added in Postgres 8.3. This allowed UPDATES to rows (which, owing to Postgres's MVCC architecture, are implemented with a deletion and insertion of physical tuples) to only have to create a new physical heap tuple when inserting, and not a new index tuple, if and only if the update did not affect indexed columns.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at tuple that is current according to the query's snapshot.<br />
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.<br />
<br />
You can monitor the total proportion of HOT updates for each relation using this query.<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; index-only scans may "visit the heap" if that is necessary. Index-only scans are something of a misnomer, in fact - index mostly scans might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice.<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
-> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
Heap Fetches: 16<br />
Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap "visits" goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyISAM storage engine, which doesn't use MVCC), has been "count(*) is slow". Index-only scans *can* be used to satisfy such queries even when there is no predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query, and with databases the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table, which typically only happens when the table's rows are much wider than the index's entries.<br />
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM does not have any particular tendency to behave more aggressively in order to facilitate index-only scans. While VACUUM can be made to behave more aggressively in various ways, it is far from clear that doing so specifically to make index-only scans occur more frequently is a sensible course of action.<br />
<br />
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (though the executor does maintain a running tally of heap fetches, which is visible in explain analyze output). The planner does, however, weigh the proportion of pages that are known to be all-visible.<br />
<br />
In Postgres 9.2, statistics are gathered about the proportion of pages that are known to be all-visible. The pg_class.relallvisible column indicates how many pages are known all-visible (the proportion can be obtained by dividing it by pg_class.relpages). These statistics are updated during VACUUM and ANALYZE.<br />
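That proportion can be computed directly from the catalog; a sketch (skipping relations with relpages = 0 to avoid division by zero):<br />

```sql
-- Fraction of pages known all-visible, per ordinary table.
select relname,
       relallvisible,
       relpages,
       round(100.0 * relallvisible / relpages, 1) as all_visible_pct
from pg_class
where relkind = 'r' and relpages > 0
order by all_visible_pct;
```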
<br />
Note that it is possible to examine the number of index scans (including index-only scans and bitmap index scans) by examining pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're essentially paying for the overhead of maintaining it during writes with no benefit in return. Drop the index!<br />
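A query along these lines can identify such never-scanned indexes (all columns used here are standard pg_stat_user_indexes fields):<br />

```sql
-- Indexes that have not been scanned since the statistics
-- were last reset, largest first.
select schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where idx_scan = 0
order by pg_relation_size(indexrelid) desc;
```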
<br />
=== Summary ===<br />
<br />
It is possible for index-only scans to greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly queries that are characteristic of data warehousing (i.e. relatively large amounts of static, infrequently-updated data on which reports over historic data are frequently required), they can considerably improve performance. Such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs in which it happens to be possible to elide heap access. However, the server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make index-only scans occur more frequently, except to define covering indexes in response to a measured need (for example, when pg_stat_statements indicates that a disproportionate amount of I/O is being used to execute a query against fairly static data, with a smallish subset of table columns retrieved).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because the number of HOT updates is a reasonably good proxy for just how static a table is, and therefore how likely it is that most heap pages are known to be all-visible.<br />
<br />
* Index-only scans are only used when the planner estimates that doing so will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This depends heavily on the visibility of tuples, on whether an index would be used anyway (i.e. how selective a predicate is, etc.), and on whether there is actually an index available that could be used by an index-only scan in principle.<br />
<br />
[[Category:Indexes]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Index-only_scans&diff=18597Index-only scans2012-11-15T22:20:11Z<p>Sternocera: User-level discussion of index-only scans feature</p>
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, like any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Indexes contain what are technically redundant copies of the column data that is indexed.<br />
<br />
Indexes do not contain visibility information. That is, it is not possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):<br />
<br />
select 1st_indexed_col, 2nd_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may optionally support index-only scans (naturally, this is only sensible when the index isn't lossy).<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
Support for additional index AMs may follow in a future release of PostgreSQL.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork": an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking, at a high level, which tuples are visible to all transactions. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transactions has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples in a heap page are visible, the page's corresponding bit may be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is an FSM for both heap and index relations (with the sole exception of hash index relations, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
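The contrib module pg_freespacemap exposes the FSM's contents for inspection; a sketch, assuming the module is installed and that a table named categories exists:<br />

```sql
-- Requires the pg_freespacemap contrib module.
-- Returns one row per heap page, with the amount of free
-- space the FSM has recorded for it.
select * from pg_freespace('categories');
```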
<br />
In PostgreSQL 8.4, the current freespace map implementation was added, making the freespace map an on-disk relation fork. The previous implementation required administrators to guesstimate the number of relations, and the required freespace map size for each, because the freespace map existed only in a fixed allocation of shared memory. Undersizing tended to result in wasted space, as the core system's storage manager needlessly extended relations.<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree [http://www.postgresql.org/docs/9.2/static/storage-fsm.html]. There is one leaf node per heap page, with each non-leaf node storing the maximum amount of free space among its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.<br />
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true [http://www.postgresql.org/docs/9.2/static/storage-vm.html].<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
This involves WAL-logging the setting of a bit within the visibility map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also integral to the built-in Hot Standby/Streaming Replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes created for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of a particular expensive, frequently executed query's select list. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's qual, so covering indexes need not be completely useless for regular index scans.<br />
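As a sketch (the table and column names here are hypothetical), a covering index for a query that filters on one column but also retrieves another might look like this:<br />

```sql
-- customer_id is the search column; balance is indexed only so
-- that the query below can be satisfied from the index alone.
create index accounts_customer_balance_idx
    on accounts (customer_id, balance);

-- This query can, in principle, use an index-only scan:
select customer_id, balance from accounts where customer_id = 42;
```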
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-Only Tuples) is a major performance feature that was added in Postgres 8.3. It allows an UPDATE of a row (which, owing to Postgres's MVCC architecture, is implemented as a deletion and insertion of physical tuples) to create only a new physical heap tuple, and not a new index tuple, if and only if the update does not affect any indexed column.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple's ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at a tuple that is current according to the query's snapshot.<br />
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.<br />
<br />
You can monitor the number of updates and HOT updates for each relation using this query:<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; index-only scans may "visit the heap" if that is necessary. Index-only scans are something of a misnomer, in fact - index mostly scans might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice.<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
-> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
Heap Fetches: 16<br />
Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap "visits" goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyISAM storage engine, which doesn't use MVCC), has been "count(*) is slow". Index-only scans *can* be used to satisfy such queries even when there is no predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any qual" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.<br />
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM does not have any particular tendency to behave more aggressively to facilitate using index-only scans more frequently. While VACUUM can be set to behave more aggressively in various ways, it's far from clear that to do so specifically to make index-only scans occur more frequently represents a sensible course of action.<br />
<br />
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (however, the executor does maintain a running tally, which is visible in explain analyze output). However, the planner does naturally weigh the proportion of pages which are known visible to all.<br />
<br />
In Postgres 9.2, statistics are gathered about the proportion of pages that are known all-visible. The pg_class.relallvisible column indicates how many pages are visible (the proportion can be obtained by calculating it as a proportion of pg_class.relpages). These statistics are updated during VACUUM and ANALYZE.<br />
<br />
Note that it is possible to examine the number of index scans (including index-only scans and bitmap index scans) by examining<br />
pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're essentially paying for the overhead of maintaining it during writes with no benefit in return. Drop the index!<br />
<br />
=== Summary ===<br />
<br />
It is possible for index-only scans to greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly queries that are characteristic of data warehousing (i.e. relatively large amounts of static, infrequently-updated data on which reports over historic data are frequently required), they can considerably improve performance. Such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs where it happens to be possible to elide heap access. However, the server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make index-only scans occur more frequently, except to define covering indexes in response to a measured need (For example, when pg_stat_statements indicates that a disproportionate amount of I/O is being used to execute a query against fairly static data, with a smallish subset of table columns retrieved).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because the number of HOT updates is a reasonably good proxy for just how static a table is, and therefore how likely it is that most heap pages are known to be all-visible.<br />
<br />
* Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a qual is, etc), and if there is actually an index available that could be used by an index-only scan in principle.<br />
<br />
[[Category:Indexes]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Index-only_scans&diff=18596Index-only scans2012-11-15T22:15:09Z<p>Sternocera: </p>
<hr />
<div>Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This can result in a significant reduction in the amount of I/O necessary to satisfy queries.<br />
<br />
During a regular index scan, indexes are traversed, like any other tree structure, by comparing a constant against Datums that are stored in the index. Btree-indexed types must satisfy the trichotomy property; that is, the type must follow the reflexive, symmetric and transitive law. Those laws accord with our intuitive understanding of how a type ought to behave anyway, but the fact that an index's physical structure reflects the relative values of Datums actually mandates that these rules be followed by types. Indexes contain what are technically redundant copies of the column data that is indexed.<br />
<br />
Indexes do not contain visibility information. That is, it is not possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented. Writing an implementation with a cheap but reliable visibility look-aside proved challenging.<br />
<br />
The implementation of the feature disproportionately involved making an existing on-disk structure called the visibility map crash-safe. It was necessary for the structure to reliably (and inexpensively) indicate visibility of index tuples - to do any less would imply the possibility of index-only scans producing incorrect results, which of course would be absolutely unacceptable.<br />
<br />
The fact that indexes only contain data that is actually indexed, and not other unindexed columns, naturally precludes using an index-only scan when the other columns are queried (by appearing in a query select list, for example).<br />
<br />
=== Example queries where index-only scans could be used in principle ===<br />
<br />
Assuming that there is some (non-expression) index on a column (typically a primary key):<br />
<br />
select count(*) from categories;<br />
<br />
Assuming that there is a composite index on (1st_indexed_col, 2nd_indexed_col):<br />
<br />
select 1st_indexed_col, 2nd_indexed_col from categories;<br />
<br />
Postgres 9.2 added the capability of allowing indexed_col op ANY(ARRAY[...]) conditions to be used in plain index scans and index-only scans. Previously, such conditions could only be used in bitmap index scans. For this reason, it is possible to see an index-only scan for these ScalarArrayOpExpr queries:<br />
<br />
select indexed_col from categories where indexed_col in (4, 5, 6);<br />
<br />
=== Index-only scans and index-access methods ===<br />
<br />
Index-only scans are not actually limited to scans on btree indexes. SP-GiST operator classes may optionally support index-only scans (naturally, this is only sensible when the index isn't lossy):<br />
<br />
postgres=# select amname, amcanreturn from pg_am where amcanreturn != 0;<br />
amname | amcanreturn<br />
--------+--------------<br />
btree | btcanreturn<br />
spgist | spgcanreturn<br />
(2 rows)<br />
<br />
Support for additional index AMs may follow in a future release of PostgreSQL.<br />
<br />
=== The Visibility Map (and other relation forks) ===<br />
<br />
The Visibility Map is a simple data structure associated with every heap relation (table). It is a "relation fork": an on-disk ancillary file associated with a particular relation (table or index). Note that index relations (that is, indexes) do not have a visibility map associated with them. The visibility map is concerned with tracking, at a high level, which tuples are visible to all transactions. Tuples from one transaction may or may not be visible to any given other transaction, depending on whether or not their originating transaction actually committed (yet, or ever, if the transaction aborted), and when that occurred relative to our transaction's current snapshot. Note that the exact behaviour depends on our transaction isolation level. Note also that it is quite possible for one transaction to see one physical tuple/set of values for one logical tuple, while another transaction sees other, distinct values for that same logical tuple, because, in effect, each of the two transactions has a differing idea of what constitutes "now". This is the core idea of MVCC. When there is absolute consensus that all physical tuples in a heap page are visible, the page's corresponding bit may be set.<br />
<br />
Another relation fork that you may be familiar with is the freespace map. In contrast to the visibility map, there is an FSM for both heap and index relations (with the sole exception of hash index relations, which have none).<br />
<br />
The purpose of the freespace map is to quickly locate a page with enough free space to hold a tuple to be stored, or to determine if no such page exists and the relation has to be extended.<br />
<br />
In PostgreSQL 8.4, the current freespace map implementation was added, making the freespace map an on-disk relation fork. The previous implementation required administrators to guesstimate the number of relations, and the required freespace map size for each, because the freespace map existed only in a fixed allocation of shared memory. Undersizing tended to result in wasted space, as the core system's storage manager needlessly extended relations:<br />
<br />
[peter@peterlaptop 12935]$ ls -l -h -a<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910<br />
-rw-------. 1 peter peter 24K Sep 28 00:00 12910_fsm<br />
-rw-------. 1 peter peter 8.0K Sep 28 00:00 12910_vm<br />
***SNIP***<br />
<br />
The FSM is structured as a binary tree <ref>[http://www.postgresql.org/docs/9.2/static/storage-fsm.html]</ref>. There is one leaf node per heap page, with each non-leaf node storing the maximum amount of free space among its children. So, unlike EXPLAIN output's node costs, the values are not cumulative.<br />
<br />
The visibility map is a simpler structure. There is one bit for each page in the heap relation that the visibility map corresponds to.<br />
<br />
The primary practical reason for having and maintaining the visibility map is to optimise VACUUM. A set bit indicates that all tuples on the corresponding heap page are known to be visible to all transactions, and therefore that vacuuming the page is unnecessary. Like the new freespace map implementation, the visibility map was added in Postgres 8.4.<br />
<br />
The visibility map is conservative in that a set bit (1) indicates that all tuples are visible on the page, but an unset bit (0) indicates that that condition may or may not be true <ref>[http://www.postgresql.org/docs/9.2/static/storage-vm.html]</ref>.<br />
<br />
=== Crash safety, recovery and the visibility map ===<br />
<br />
This involves WAL-logging setting a bit within the visibility map during VACUUM, and taking various special measures during recovery.<br />
<br />
The Postgres write-ahead log is widely used to ensure crash-safety, but it is also integral to the built-in Hot Standby/Streaming Replication feature.<br />
<br />
Recovery treats marking a page all-visible as a recovery conflict for snapshots that could still fail to see XIDs on that page. PostgreSQL may in the future try to soften this, so that the implementation simply forces index scans to do heap fetches in cases where this may be an issue, rather than throwing a hard conflict.<br />
<br />
=== Covering indexes ===<br />
<br />
Covering indexes are indexes created for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index, typically columns that are known to be part of a particular expensive, frequently executed query's select list. PostgreSQL supports using just the first few columns of the index in a regular index scan if that is in the query's qual, so covering indexes need not be completely useless for regular index scans.<br />
<br />
=== Interaction with HOT ===<br />
<br />
HOT (Heap-Only Tuples) is a major performance feature that was added in Postgres 8.3. It allows an UPDATE of a row (which, owing to Postgres's MVCC architecture, is implemented as a deletion and insertion of physical tuples) to create only a new physical heap tuple, and not a new index tuple, if and only if the update does not affect any indexed column.<br />
<br />
With HOT, it became possible for an index scan to traverse a so-called HOT chain; it could get from the physical index tuple (which would probably have been created by an original INSERT, and related to an earlier version of the logical tuple), to the corresponding physical heap tuple. The heap tuple would itself contain a pointer to the next version of the tuple (that is, the tuple's ctid), which might, in turn, have a pointer of its own. The index scan eventually arrives at a tuple that is current according to the query's snapshot.<br />
<br />
HOT also enables opportunistic mini-vacuums, where the HOT chain is "pruned".<br />
<br />
All told, this performance optimisation has been found to be very valuable, particularly for OLTP workloads. It is quite natural that tuples that are frequently updated are generally not indexed. However, when considering creating a covering index, the need to maximise the number of HOT updates should be carefully weighed.<br />
<br />
You can monitor the number of updates and HOT updates for each relation using this query:<br />
<br />
postgres=# select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables;<br />
<br />
=== What types of queries may be satisfied by an index-only scan? ===<br />
<br />
Aside from the obvious restriction that queries cannot reference columns that are not indexed by a single index in order to use an index-only scan, the need to visit the heap where all tuples are not known to be visible is relatively expensive. The planner weighs this factor heavily when considering an index-only scan, and in general the need to ensure that the bulk of the table's tuples have their visibility map bits set is likely to restrict index-only scans' usefulness to queries against infrequently updated tables.<br />
<br />
It is not necessary for all bits to be set; index-only scans may "visit the heap" when that is necessary. "Index-only scan" is something of a misnomer, in fact - "index-mostly scan" might be a more appropriate appellation. An explain analyze involving an index-only scan will indicate how frequently that occurred in practice:<br />
<br />
postgres=# explain analyze select count(*) from categories;<br />
QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=12.53..12.54 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)<br />
-> Index Only Scan using categories_pkey on categories (cost=0.00..12.49 rows=16 width=0) (actual time=0.018..0.038 rows=16 loops=1)<br />
Heap Fetches: 16<br />
Total runtime: 0.108 ms<br />
(4 rows)<br />
<br />
As the number of heap "visits" goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.<br />
<br />
=== Is "count(*)" much faster now? ===<br />
<br />
A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyISAM storage engine, which doesn't use MVCC), has been that "count(*) is slow". Index-only scans *can* be used to satisfy these queries without there being any predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.<br />
<br />
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any qual" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than that of some of its indexes.<br />
<br />
=== Why isn't my query using an index-only scan? ===<br />
<br />
VACUUM does not have any particular tendency to behave more aggressively in order to facilitate index-only scans. While VACUUM can be made to behave more aggressively in various ways, it is far from clear that doing so specifically to make index-only scans occur more frequently is a sensible course of action.<br />
<br />
The planner doesn't directly examine the entire visibility map of a relation when considering an index-only scan (the executor does, however, maintain a running tally of heap fetches, which is visible in explain analyze output). The planner does naturally weigh the proportion of pages that are known to be all-visible.<br />
<br />
In Postgres 9.2, statistics are gathered about the proportion of pages that are known to be all-visible. The pg_class.relallvisible column indicates how many of a relation's pages are all-visible (the proportion can be obtained by dividing it by pg_class.relpages). These statistics are updated during VACUUM and ANALYZE.<br />
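<br />
As a sketch, that proportion can be queried directly (relkind = 'r' restricts this to ordinary tables; empty tables are skipped):<br />
<br />
 postgres=# select relname, round(100.0 * relallvisible / relpages, 2) as pct_all_visible from pg_class where relkind = 'r' and relpages > 0;<br />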
<br />
Note that it is possible to examine the number of index scans (including index-only scans and bitmap index scans) by examining<br />
pg_stat_user_indexes.idx_scan. If your covering index isn't being used, you're essentially paying for the overhead of maintaining it during writes with no benefit in return. Drop the index!<br />
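<br />
As a sketch, infrequently used indexes can be listed like this (indexes that enforce constraints, such as primary keys, should of course be kept regardless):<br />
<br />
 postgres=# select indexrelname, idx_scan from pg_stat_user_indexes order by idx_scan;<br />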
<br />
=== Summary ===<br />
<br />
It is possible for index-only scans to greatly decrease the amount of I/O required to execute some queries. For certain queries, particularly queries that are characteristic of data warehousing (i.e. relatively large amounts of static, infrequently updated data, where reports on historical data are frequently required), they can considerably improve performance. Such queries have been observed to execute anything from twice to twenty times as fast with index-only scans. However, one should bear in mind that:<br />
<br />
* Index-only scans are opportunistic, in that they take advantage of a pre-existing state of affairs where it happens to be possible to elide heap access. However, the server doesn't make any particular effort to facilitate index-only scans, and it is difficult to recommend a course of action to make index-only scans occur more frequently, except to define covering indexes in response to a measured need (for example, when pg_stat_statements indicates that a disproportionate amount of I/O is being used to execute a query against fairly static data, with a smallish subset of table columns retrieved).<br />
<br />
* When creating a covering index, the likely effect on HOT updates should be weighed heavily. Are there many HOT updates on the table to begin with? This is a general point of concern, because creating an index may prevent HOT updates from occurring, and because the number of HOT updates is a reasonably good proxy for just how static a table is, and therefore how likely it is that most heap pages are known to be all-visible.<br />
<br />
* Index-only scans are only used when the planner surmises that this will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This depends heavily on the visibility of the table's tuples, on whether an index would be used anyway (i.e. how selective a qual is, etc.), and on whether there is actually an index available that could, in principle, be used by an index-only scan.<br />
<br />
[[Category:Indexes]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=PostgreSQL_Conference_Europe_Talks_2012&diff=18509PostgreSQL Conference Europe Talks 20122012-11-01T16:17:50Z<p>Sternocera: </p>
<hr />
<div>= PostgreSQL Conference Europe 2012 Talks =<br />
<br />
== Trainings: Tuesday 23rd October, 2012 ==<br />
<br />
=== Clyde ===<br />
<br />
* Mastering PostgreSQL Administration (Bruce Momjian, Devrim GÜNDÜZ)<br />
<br />
=== Seine ===<br />
<br />
* PostgreSQL Performance Training (Greg Smith, Peter Geoghegan)<br />
* PostgreSQL Replication Training (Dimitri Fontaine, Simon Riggs)<br />
<br />
=== Thames ===<br />
<br />
* [http://www.pgsql.cz/skoleni/skoleni_plpgsql_web.pdf Implementing stored procedures in PostgreSQL] (Pavel Stehule)<br />
* Reading execution plans (Tomas Vondra)<br />
<br />
=== Vltava ===<br />
<br />
* A day of SQL with Celko (Joe Celko)<br />
<br />
== Talks: Wednesday 24th October, 2012 ==<br />
<br />
=== Bellevue ===<br />
<br />
* Opening keynote (Joe Celko)<br />
<br />
=== Seine ===<br />
<br />
* [http://momjian.us/main/presentations/features.html#cte Programming the SQL Way with Common Table Expressions (Bruce Momjian)]<br />
* PostgreSQL on AWS (Christophe Pettus)<br />
* Understanding EXPLAIN's output (Guillaume Lelarge)<br />
* [http://www.sraoss.co.jp/event_seminar/2012/20121024_pgpool-II_pgconfEU2012_sraoss.pdf Boosting Performance and Reliability by using pgpool-II (Tatsuo Ishii)]<br />
* CREATE EXTENSION pgchess; (Gianni Ciolli)<br />
<br />
=== Thames ===<br />
<br />
* [[Media:Pg-fdw.pdf|Writing a foreign data wrapper (Bernd Helmle)]]<br />
* [http://anarazel.de/2ndquadrant/pgconf-eu-2012-10-24 MultiMaster Replication: Applications, Comparison, Implementation ] (Andres Freund, Simon Riggs)<br />
* [[Media:Pgconfeu12-collectd%2Bpsql.pdf|Watch your Elephants -- Using collectd for PostgreSQL performance analysis]] ([http://tokkee.org/ Sebastian 'tokkee' Harl])<br />
* [[Media:Range-types.pdf|Range Types in PostgreSQL 9.2 - Your Life Will Never Be the Same (Jonathan S. Katz)]]<br />
* [[Media:Pgxc_HA_20121024.pdf|High availability in Postgres-XC, the symmetric PostgreSQL cluster (Koichi Suzuki)]]<br />
<br />
=== Vltava ===<br />
<br />
* Running PostgreSQL on AWS (Tomas Vondra)<br />
* [[Media:Plpgsql internals.pdf| PL/pgSQL internals -- some details from PL/pgSQL environment]]<br />
* Migrating from MySQL to PostgreSQL (Tomas Vondra)<br />
* [[Media:Indexy.pdf| Indexy jsou grunt -- basic and advanced use of indexes in PostgreSQL]]<br />
* Loading data into PostgreSQL (Jan Holčapek)<br />
<br />
== Talks: Thursday 25th October, 2012 ==<br />
<br />
=== Bellevue (Lightning Talks) ===<br />
<br />
* [[Media:Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf| Full-text search in PostgreSQL in milliseconds (Oleg Bartunov, Alexander Korotkov)]]<br />
* [[Media:Pgconfeu-2012-docbot-print.pdf|#PostgreSQL pg_docbot (Andreas 'ads' Scherbaum)]]<br />
* [http://tapoueh.org/images/pgq-coop.pdf PGQ Cooperative Consumers] (Dimitri Fontaine & Marko Kreen)<br />
* [http://www.pgexperts.com/document.html?id=60 PostgreSQL Drinking Game] (Josh Berkus)<br />
<br />
=== Seine ===<br />
<br />
* How fast is PostgreSQL? (Cédric Villemain)<br />
* [http://tapoueh.org/images/high-availability.pdf Implementing High Availability] (Dimitri Fontaine)<br />
* [http://momjian.us/main/presentations/internals.html#shared_memory Inside PostgreSQL Shared Memory (Bruce Momjian)]<br />
* [https://plv8-pgconfeu12.herokuapp.com Embracing the Web with JSON and PLV8] ([http://bitfission.com Will Leinweber])<br />
<br />
=== Thames ===<br />
<br />
* [https://github.com/Oslandia/presentations/tree/master/pgconf_eu_2012 Topology and network analysis with PostgreSQL and PostGIS (Vincent Picavet)]<br />
* [[Media:Universal_Data_Access_with_SQL_MED.pdf|Universal Data Access with SQL/MED (David Fetter)]]<br />
* [https://github.com/Oslandia/presentations/tree/master/pgconf_eu_2012 PostGIS 2.0 and beyond (Vincent Picavet)]<br />
* Practical Tips for Better PostgreSQL Applications (Marc Balmer)<br />
* Pacemaker and PostgreSQL: to serve and protect your data (Jehan-Guillaume (ioguix) de Rorthais)<br />
<br />
=== Vltava ===<br />
<br />
* [[Media: Marketing-postgres.pdf|Marketing PostgreSQL (Jonathan S. Katz)]]<br />
* [[Media: Pgconf2012_sprocwrapper.pdf|Java Stored Procedure Wrapper and PGObserver (Jan Mussler)]]<br />
* [http://www.pgexperts.com/document.html?id=59 Elephants and Windmills] (Josh Berkus)<br />
* [[Media: Pgeu2012.pdf|PostgreSQL in Research and Development: Three success stories. (Roland Sonnenschein)]]<br />
* [[Media: Index_support_for_regular_expression_search.pdf |Index support for regular expression search (Alexander Korotkov)]]<br />
<br />
== Talks: Friday 26th October, 2012 ==<br />
<br />
=== Bellevue ===<br />
<br />
* Postgres Adoption at the Tipping Point: Users Around the World and Their Deployment Profile (Ed Boyajian)<br />
* Community PostgreSQL (Simon Riggs, Harald Armin Massa)<br />
* Closing (Dave Page)<br />
<br />
=== Seine ===<br />
<br />
* [http://2ndquadrant.com/media/cms_page_media/59/BeyondQueryLogging.pdf Beyond Query Logging] (Greg Smith, Peter Geoghegan)<br />
* [http://www.hagander.net/talks/Backup_strategies_pgeu.pdf PostgreSQL Backup Strategies] (Magnus Hagander)<br />
* [http://www.gunduz.org/download.php?dlid=196 Maintaining Very Large Databases (VLDs)] (Devrim GÜNDÜZ)<br />
* [http://tapoueh.org/images/fotolog.pdf Large Scale MySQL Migration to PostgreSQL] (Dimitri Fontaine)<br />
<br />
=== Thames ===<br />
<br />
* [[Media: Postbis_pgcon_eu_2012.pdf|PostBIS - A Bioinformatics Booster for PostgreSQL (Michael Schneider, Renzo Kottmann)]]<br />
* Migrating Oracle queries to PostgreSQL (Alexey Klyukin)<br />
* Debugging complex SQL queries with writable CTEs (Gianni Ciolli)<br />
* [http://www.cybertec.at/download/2012_prag_linux_v5.pdf Limiting PostgreSQL resource consumption using the Linux kernel (Hans-Jürgen Schönig)]<br />
<br />
=== Vltava ===<br />
<br />
* [[Media: Pg_xnode_pgconf_2012.pdf|pg_xnode extension (Antonin Houska)]]<br />
[[Category:PostgreSQL Europe]]<br />
* PostgreSQL makes dev happy, a pgAgent + pl/pgsql use case (Julien Rouhaud)<br />
* [[Media: PGconEU2012-KaiGai-PGStrom.pdf|PG-Strom - GPU Accelerated Asynchronous Query Execution Module (KaiGai Kohei)]]<br />
* [http://tokkee.org/talks/pgconfeu12-time-series-data.pdf Using PostgreSQL for storing time-series data] ([http://tokkee.org/ Sebastian 'tokkee' Harl])</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Performance_QA_Testing&diff=17986Performance QA Testing2012-08-06T16:40:04Z<p>Sternocera: pgbench-tools has been maintained on github for some time now.</p>
<hr />
<div>This page centralizes the efforts on performances QA testing: available hardware, available tools, continuous benchmarking effort...<br />
<br />
The PostgreSQL Performance lab is being created to allow community members of the Open Source database [http://www.postgresql.org/ PostgreSQL] to have enterprise class hardware to test on.<br />
<br />
The testing that will occur includes industry-standard workloads such as OLTP, DSS and BI. Furthermore, we will also use the hardware for other practical and customer-oriented testing to improve scalability (processor utilization, I/O, load balancing, etc.) and the management of large data sets (loading, backups, restores, replication, etc.).<br />
<br />
=== Donations ===<br />
<br />
For donation inquiries, please contact [mailto:josh@postgresql.org Josh Berkus <josh @t postgresql.org>] and [mailto:jdrake@postgresql.org Joshua Drake <jdrake @t postgresql.org>].<br />
<br />
=== Mailing List ===<br />
<br />
There is a [http://lists.pgfoundry.org/mailman/listinfo/perflab-general mailing list] available to discuss administrative aspects of community equipment. Please continue to use the -hackers and -performance mailing lists for performance and technical discussions.<br />
<br />
== QA platforms ==<br />
<br />
* [[QA Platform hosted at Command Prompt]] - Portland, Oregon, USA<br />
* [[QA Platform hosted at Open Wide (France)]]<br />
<br />
== Tools ==<br />
<br />
* Former OSDL work: [http://osdldbt.sourceforge.net/ Database Test Suite] and [http://crucible.svn.sourceforge.net/viewvc/crucible/ Web interface]<br />
* [https://github.com/gregs1104/pgbench-tools pgbench-tools from Greg Smith]. See [[Regression Testing with pgbench]].<br />
* [http://bristlecone.continuent.org/HomePage Bristlecone from Continuent]<br />
* [http://tsung.erlang-projects.org/ Tsung load injector] allows you to define sessions (containing queries, think times, etc.) and replay them with a very high-concurrency setup. It can use many loading nodes at a time and supports multiple OSes (written in [http://www.erlang.org/ erlang], extensible in this language)<br />
* [http://dim.tapoueh.org/temp/tsung-plotter/ Tsung Plotter] plots several tsung runs onto the same set of graphs, for easy comparison. Uses python and matplotlib.<br />
* Tsung DBT2 Implementation (tsung module in erlang), WIP, to get published asap.<br />
<br />
== Ideas ==<br />
<br />
* look into [http://sysbench.sourceforge.net/ sysbench] - it has some issues with locking on postgresql but at least read-only it seems to work fine. See [[SysBench]] for more info.<br />
<br />
* collecting all the various small samples and testcases posted over the last few years on -performance, -hackers & -bugs and put them into a test set<br />
<br />
* consider doing tests using pgbench -M (simple|extended|prepared) to catch regressions in one of those modes<br />
<br />
* resurrect Jan Wieck's tpc-w implementation available on [http://pgfoundry.org/projects/tpc-w-php/ pgfoundry]<br />
<br />
* add full-text search benchmarking by using [http://www.sigaev.ru/cvsweb/cvsweb.cgi/ftsbench/ ftsbench] from Teodor<br />
<br />
* XML benchmarking ?<br />
<br />
* investigate [http://advogato.org/person/nconway/diary.html?start=21 QuickCheck] and http://advogato.org/person/nconway/diary/23.html<br />
<br />
* Implement the [http://www.cs.umb.edu/~poneil/StarSchemaB.PDF Star Schema Benchmark].<br />
<br />
== Datasets ==<br />
<br />
[[Sample_Databases|see the sample databases page for some free datasources]]<br />
<br />
== Information ==<br />
* [http://wiki.postgresql.org/wiki/Performance_Optimization In depth performance articles on PostgreSQL]<br />
* [http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide DL380 Tuning Guide]<br />
* [http://www.vimeo.com/channels/postgres Videos on Performance and other topics]<br />
* [http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/ Performance measurements between load and filesystems (Linux)]<br />
<br />
[[Category:Benchmarking]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=What%27s_new_in_PostgreSQL_9.2&diff=17949What's new in PostgreSQL 9.22012-07-19T10:25:11Z<p>Sternocera: /* pg_stat_statements */</p>
<hr />
<div>{{Languages}}<br />
<br />
This document showcases many of the latest developments in PostgreSQL 9.2, compared to the last major release &ndash; PostgreSQL 9.1. There are many improvements in this release, so this wiki page covers many of the more important changes in detail. The full list of changes is itemised in ''Release Notes''.<br />
<br />
'''This page is incomplete!'''<br />
<br />
=Major new features=<br />
<br />
==Index-only scans <!-- Robert Haas, Ibrar Ahmed, Heikki Linnakangas, Tom Lane -->==<br />
<br />
In PostgreSQL, indexes have no "visibility" information. This means that when you access a record by its index, PostgreSQL has to visit the real tuple in the table to be sure it is visible to you: the tuple the index points to may simply be an old version of the record you are looking for.<br />
<br />
This can be a very big performance problem: the index is mostly ordered, so accessing its records is quite efficient, while the records themselves may be scattered all over the place (that's one reason why PostgreSQL has a CLUSTER command, but that's another story). In 9.2, PostgreSQL will use an "Index Only Scan" when possible, and not access the record itself if it doesn't need to.<br />
<br />
There is still no visibility information in the index. So in order to do this, PostgreSQL uses the [http://www.postgresql.org/docs/devel/static/storage-vm.html visibility map], which tells it whether the whole content of a (usually) 8K page is visible to all transactions or not. When the index record points to a tuple contained in an «all visible» page, PostgreSQL won't have to access the tuple; it will be able to build it directly from the index. Of course, all the columns requested by the query must be in the index.<br />
<br />
The visibility map is maintained by VACUUM (it sets the visible bit), and by the backends doing SQL work (they unset the visible bit).<br />
<br />
Here is an example.<br />
<br />
create table demo_ios (col1 float, col2 float, col3 text);<br />
<br />
In this table, we'll put random data, in order to have "scattered" data. We'll put in 100 million records, to have a big recordset that doesn't fit in memory (this is a 4GB RAM machine). This is an ideal case, made for this demo. The gains won't be that big in real life.<br />
<br />
insert into demo_ios select generate_series(1,100000000),random(), 'mynotsolongstring';<br />
<br />
select pg_size_pretty(pg_total_relation_size('demo_ios'));<br />
pg_size_pretty <br />
----------------<br />
6512 MB<br />
<br />
Let's pretend that the query is this:<br />
<br />
SELECT col1,col2 FROM demo_ios where col2 BETWEEN 0.02 AND 0.03<br />
<br />
In order to use an index only scan on this, we need an index on col2,col1 (col2 first, as it is used in the WHERE clause).<br />
<br />
CREATE index idx_demo_ios on demo_ios(col2,col1);<br />
<br />
We vacuum the table, so that the visibility map is up to date:<br />
<br />
VACUUM demo_ios;<br />
<br />
All the timings you'll see below were taken with cold OS and PostgreSQL caches (that's where the gains are, as the purpose of Index Only Scans is to reduce I/O).<br />
<br />
Let's first try without Index Only Scans:<br />
<br />
set enable_indexonlyscan to off;<br />
<br />
explain (analyze,buffers) select col1,col2 from demo_ios where col2 between 0.01 and 0.02;<br />
QUERY PLAN <br />
----------------------------------------------------------------------------------------------------------------------------------------<br />
Bitmap Heap Scan on demo_ios (cost=25643.01..916484.44 rows=993633 width=16) (actual time=763.391..362963.899 rows=1000392 loops=1)<br />
Recheck Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))<br />
Rows Removed by Index Recheck: 68098621<br />
Buffers: shared hit=2 read=587779<br />
-> Bitmap Index Scan on idx_demo_ios (cost=0.00..25394.60 rows=993633 width=0) (actual time=759.011..759.011 rows=1000392 loops=1)<br />
Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))<br />
Buffers: shared hit=2 read=3835<br />
Total runtime: 364390.127 ms<br />
<br />
<br />
With Index Only Scans:<br />
<br />
explain (analyze,buffers) select col1,col2 from demo_ios where col2 between 0.01 and 0.02;<br />
QUERY PLAN <br />
-----------------------------------------------------------------------------------------------------------------------------------------------<br />
Index Only Scan using idx_demo_ios on demo_ios (cost=0.00..35330.93 rows=993633 width=16) (actual time=58.100..3250.589 rows=1000392 loops=1)<br />
Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))<br />
Heap Fetches: 0<br />
Buffers: shared hit=923073 read=3848<br />
Total runtime: 4297.405 ms<br />
<br />
<br />
<br />
As nothing is free, there are a few things to note:<br />
<br />
* Adding indexes for index only scans obviously adds indexes to your table. So updates will be slower.<br />
* You will index columns that weren't indexed before. So there will be fewer opportunities for HOT updates.<br />
* Gains will probably be smaller in real life situations.<br />
<br />
This required making visibility map changes crash-safe, so visibility map bit changes are now WAL-logged.<br />
<br />
==Replication improvements <!-- Fujii Masao, Simon Riggs, Magnus Hagander, Jun Ishizuka -->==<br />
<br />
Streaming Replication is getting even more polished with this release. One of the main remaining gripes about streaming replication is that all the slaves have to be connected to one and the same master, consuming its resources.<br />
<br />
Moreover, in case of a failover, it was very complicated to reconnect all the remaining slaves to the newly promoted master.<br />
<br />
To be on the safe side, it was often easier to re-synchronize the slaves with the new master from scratch, meaning that during the failover only one server was active, and under heavy load, as it was used to rebuild all the slaves.<br />
<br />
* With 9.2, a slave can also be a replication master, allowing for cascading replication.<br />
<br />
Let's build this. We start with an already working 9.2 database.<br />
<br />
We set it up for replication:<br />
<br />
postgresql.conf:<br />
wal_level=hot_standby #(could be archive too)<br />
max_wal_senders=5<br />
hot_standby=on<br />
<br />
You'll probably also want to activate archiving in production; it won't be done here.<br />
<br />
pg_hba.conf (do not use trust in production):<br />
host replication replication_user 0.0.0.0/0 md5<br />
<br />
Create the user:<br />
create user replication_user replication password 'secret';<br />
<br />
Clone the database:<br />
<br />
pg_basebackup -h localhost -U replication_user -D data2<br />
Password:<br />
<br />
We have a brand new cluster in the data2 directory. We'll change the port so that it can start (postgresql.conf):<br />
port=5433<br />
<br />
We add a recovery.conf to tell it how to stream from the master database:<br />
standby_mode = on<br />
primary_conninfo = 'host=localhost port=5432 user=replication_user password=secret' <br />
<br />
pg_ctl -D data2 start<br />
server starting<br />
LOG: database system was interrupted; last known up at 2012-07-03 17:58:09 CEST<br />
LOG: creating missing WAL directory "pg_xlog/archive_status"<br />
LOG: entering standby mode<br />
LOG: streaming replication successfully connected to primary<br />
LOG: redo starts at 0/9D000020<br />
LOG: consistent recovery state reached at 0/9D0000B8<br />
LOG: database system is ready to accept read only connections<br />
<br />
Now, let's add a second slave, which will use this slave:<br />
<br />
<br />
pg_basebackup -h localhost -U replication_user -D data3 -p 5433<br />
Password: <br />
<br />
We edit data3's postgresql.conf to change the port:<br />
port=5434<br />
<br />
We modify the recovery.conf to stream from the slave:<br />
standby_mode = on<br />
primary_conninfo = 'host=localhost port=5433 user=replication_user password=secret' # e.g. 'host=localhost port=5432'<br />
<br />
We start the cluster:<br />
pg_ctl -D data3 start<br />
server starting<br />
LOG: database system was interrupted while in recovery at log time 2012-07-03 17:58:09 CEST<br />
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.<br />
LOG: creating missing WAL directory "pg_xlog/archive_status"<br />
LOG: entering standby mode<br />
LOG: streaming replication successfully connected to primary<br />
LOG: redo starts at 0/9D000020<br />
LOG: consistent recovery state reached at 0/9E000000<br />
LOG: database system is ready to accept read only connections<br />
<br />
Now, everything modified on the master cluster gets streamed to the first slave, and from there to the second slave. This second replication link has to be monitored from the first slave (the master knows nothing about it).<br />
<br />
<br />
* As you may have noticed from the example, pg_basebackup now works from slaves.<br />
<br />
* There is another use case that wasn't covered: what if a user doesn't care for having a full-fledged slave, and only wants to stream the WAL files to another location, to benefit from the reduced data loss without the burden of maintaining a slave?<br />
<br />
pg_receivexlog is provided just for this purpose: it pretends to be a PostgreSQL slave, but only stores the log files as they are streamed, in a directory:<br />
pg_receivexlog -D /tmp/new_logs -h localhost -U replication_user<br />
<br />
will connect to the master (or a slave), and start creating files: <br />
ls /tmp/new_logs/<br />
00000001000000000000009E.partial<br />
<br />
Files are of the segment size, so they can be used for a normal recovery of the database. It's the same as an archive command, but with a much smaller granularity.<br />
<br />
* synchronous_commit has a new value: remote_write. It can be used when there is a synchronous slave (synchronous_standby_names is set), and means that the master doesn't have to wait for the slave to have written the data to disk, only for the slave to have acknowledged receiving it. With this set, data is protected from a crash on the master, but could still be lost if the slave crashed at the same time (i.e. before having written the in-flight data to disk). As this is a quite remote possibility, some people will be interested in this compromise.<br />
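<br />
As a sketch, the master-side settings for this might look as follows ('standby1' is a hypothetical name, which has to match the application_name given in the slave's primary_conninfo):<br />
<br />
 synchronous_standby_names = 'standby1'<br />
 synchronous_commit = remote_write<br />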
<br />
==JSON datatype==<br />
The JSON datatype is meant for storing JSON-structured data. (More info: [http://www.depesz.com/2012/02/12/waiting-for-9-2-json/ depesz blog])<br />
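<br />
As a minimal sketch (the table is made up for illustration; note that input is validated as JSON on the way in, and that 9.2 also provides row_to_json and array_to_json to produce JSON from query results):<br />
<br />
<pre><br />
=# create table events (id serial primary key, payload json);<br />
=# insert into events (payload) values ('{"user": "alice", "action": "login"}');<br />
=# insert into events (payload) values ('{not json');<br />
ERROR:  invalid input syntax for type json<br />
=# select row_to_json(t) from (select 1 as a, 'x'::text as b) t;<br />
   row_to_json<br />
-----------------<br />
 {"a":1,"b":"x"}<br />
</pre><br />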
<br />
== Range Types ==<br />
[[RangeTypes]] are added.<br />
(More info: [http://www.depesz.com/2011/11/07/waiting-for-9-2-range-data-types/])<br />
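<br />
As a small sketch of what they enable (the reservation table is made up; the exclusion constraint needs the btree_gist extension so that = on integers can be used within a GiST index):<br />
<br />
<pre><br />
=# select int4range(10, 20) @> 15 as contains, int4range(10, 20) && int4range(18, 25) as overlaps;<br />
 contains | overlaps<br />
----------+----------<br />
 t        | t<br />
=# create extension btree_gist;<br />
=# create table reservation (room int, during tsrange, exclude using gist (room with =, during with &&));<br />
</pre><br />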
<br />
=Performance improvements=<br />
<br />
This version has performance improvements across a very large range of domains (a non-exhaustive list):<br />
<br />
* The most visible will probably be the Index Only Scans, which has already been introduced in this document.<br />
<br />
* The lock contention of several big locks has been significantly reduced, leading to better multi-processor scalability, mostly for machines with more than 32 cores. <!-- Robert Haas --><br />
<br />
* The performance of in-memory sorts has been improved by up to 25% in some situations, with certain specialized sort functions introduced. <!-- Peter Geoghegan --><br />
<br />
* An idle PostgreSQL server now makes fewer wakeups, leading to lower power consumption <!--Peter Geoghegan-->. This is especially useful in virtualized and embedded environments.<br />
<br />
* COPY has been improved: it generates less WAL volume and takes fewer locks on a table's pages. <!-- Heikki Linnakangas --><br />
<br />
* The system can now track IO durations <!--Ants Aasma --><br />
<br />
This one deserves a little explanation, as it can be a little tricky. Tracking I/O durations means repeatedly asking the operating system for the current time. Depending on the operating system and the hardware, this can be quite cheap or extremely costly. The most important factor here is where the system gets its time from: directly from the processor (TSC), from dedicated hardware such as the HPET, or via an ACPI call. What matters most is that the cost of getting the time can vary by a factor of thousands.<br />
<br />
If you are interested in this timing data, it's better to first check whether your system will support it without too much of a performance hit. PostgreSQL provides the pg_test_timing tool for this:<br />
<br />
<pre><br />
$ pg_test_timing <br />
Testing timing overhead for 3 seconds.<br />
Per loop time including overhead: 28.02 nsec<br />
Histogram of timing durations:<br />
< usec: count percent<br />
32: 41 0.00004%<br />
16: 1405 0.00131%<br />
8: 200 0.00019%<br />
4: 388 0.00036%<br />
2: 2982558 2.78523%<br />
1: 104100166 97.21287%<br />
</pre><br />
<br />
Here, everything is good: getting the time costs around 28 nanoseconds, and has a very small variation. Anything under 100 nanoseconds should be fine for production. If you get higher values, you may still find a way to tune your system; check the [http://www.postgresql.org/docs/9.2/static/pgtesttiming.html documentation].<br />
<br />
Anyway, here is the data you'll be able to collect if your system is ready for this:<br />
<br />
First, you'll get per-database statistics, which will now give accurate information about which database is doing most I/O:<br />
<br />
<pre><br />
=# select * from pg_stat_database where datname = 'mydb';<br />
-[ RECORD 1 ]--+------------------------------<br />
datid | 16384<br />
datname | mydb<br />
numbackends | 1<br />
xact_commit | 270<br />
xact_rollback | 2<br />
blks_read | 1961<br />
blks_hit | 17944<br />
tup_returned | 269035<br />
tup_fetched | 8850<br />
tup_inserted | 16<br />
tup_updated | 4<br />
tup_deleted | 45<br />
conflicts | 0<br />
temp_files | 0<br />
temp_bytes | 0<br />
deadlocks | 0<br />
blk_read_time | 583.774<br />
blk_write_time | 0<br />
stats_reset | 2012-07-03 17:18:54.796817+02<br />
</pre><br />
We see here that mydb has only consumed 583.774 milliseconds of read time.<br />
<br />
Explain will benefit from this too:<br />
<pre><br />
=# explain (analyze,buffers) select count(*) from mots ;<br />
QUERY PLAN <br />
----------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=1669.95..1669.96 rows=1 width=0) (actual time=21.943..21.943 rows=1 loops=1)<br />
Buffers: shared read=493<br />
I/O Timings: read=2.578<br />
-> Seq Scan on mots (cost=0.00..1434.56 rows=94156 width=0) (actual time=0.059..12.933 rows=94156 loops=1)<br />
Buffers: shared read=493<br />
I/O Timings: read=2.578<br />
Total runtime: 22.059 ms<br />
</pre><br />
We now have a separate information about the time taken to retrieve data from the operating system. Obviously, here, the data was in the operating system's cache (2 milliseconds to read 493 blocks).<br />
<br />
And last, if you have enabled pg_stat_statements:<br />
<pre><br />
select * from pg_stat_statements where query ~ 'words';<br />
-[ RECORD 1 ]-------+---------------------------<br />
userid | 10<br />
dbid | 16384<br />
query | select count(*) from words;<br />
calls | 2<br />
total_time | 78.332<br />
rows | 2<br />
shared_blks_hit | 0<br />
shared_blks_read | 986<br />
shared_blks_dirtied | 0<br />
shared_blks_written | 0<br />
local_blks_hit | 0<br />
local_blks_read | 0<br />
local_blks_dirtied | 0<br />
local_blks_written | 0<br />
temp_blks_read | 0<br />
temp_blks_written | 0<br />
blk_read_time | 58.427<br />
blk_write_time | 0<br />
</pre><br />
<br />
* As for every version, the optimizer has received its share of improvements <!-- Tom Lane--><br />
** Prepared statements used to be optimized once, without any knowledge of the parameters' values. With 9.2, the planner generates plans specific to the parameter values sent (the query is planned at execution time), unless the query has already been executed several times and the planner decides that the generic plan is not significantly more expensive than the specific plans.<br />
** A new feature has been added: parameterized paths. Simply put, this means that a sub-part of a query plan can use parameters obtained from a parent node. It fixes several bad plans that could occur, especially when the optimizer couldn't reorder joins to put nested loops where it wanted to.<br />
<br />
This example comes straight from the developers' mailing list <!-- Andres Freund -->:<br />
<br />
<pre><br />
CREATE TABLE a (<br />
a_id serial PRIMARY KEY NOT NULL,<br />
b_id integer<br />
);<br />
CREATE INDEX a__b_id ON a USING btree (b_id);<br />
<br />
<br />
CREATE TABLE b (<br />
b_id serial NOT NULL,<br />
c_id integer<br />
);<br />
CREATE INDEX b__c_id ON b USING btree (c_id);<br />
<br />
<br />
CREATE TABLE c (<br />
c_id serial PRIMARY KEY NOT NULL,<br />
value integer UNIQUE<br />
);<br />
<br />
INSERT INTO b (b_id, c_id)<br />
SELECT g.i, g.i FROM generate_series(1, 50000) g(i);<br />
<br />
INSERT INTO a(b_id)<br />
SELECT g.i FROM generate_series(1, 50000) g(i);<br />
<br />
INSERT INTO c(c_id,value)<br />
VALUES (1,1);<br />
</pre><br />
<br />
So we have a referencing b, b referencing c.<br />
<br />
Here is an example of a query working badly with PostgreSQL 9.1:<br />
<br />
<pre><br />
EXPLAIN ANALYZE SELECT 1 <br />
FROM <br />
c<br />
WHERE<br />
EXISTS (<br />
SELECT * <br />
FROM a<br />
JOIN b USING (b_id)<br />
WHERE b.c_id = c.c_id)<br />
AND c.value = 1;<br />
QUERY PLAN <br />
-----------------------------------------------------------------------------------------------------------------------<br />
Nested Loop Semi Join (cost=1347.00..3702.27 rows=1 width=0) (actual time=13.799..13.802 rows=1 loops=1)<br />
Join Filter: (c.c_id = b.c_id)<br />
-> Index Scan using c_value_key on c (cost=0.00..8.27 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=1)<br />
Index Cond: (value = 1)<br />
-> Hash Join (cost=1347.00..3069.00 rows=50000 width=4) (actual time=13.788..13.788 rows=1 loops=1)<br />
Hash Cond: (a.b_id = b.b_id)<br />
-> Seq Scan on a (cost=0.00..722.00 rows=50000 width=4) (actual time=0.007..0.007 rows=1 loops=1)<br />
-> Hash (cost=722.00..722.00 rows=50000 width=8) (actual time=13.760..13.760 rows=50000 loops=1)<br />
Buckets: 8192 Batches: 1 Memory Usage: 1954kB<br />
-> Seq Scan on b (cost=0.00..722.00 rows=50000 width=8) (actual time=0.008..5.702 rows=50000 loops=1)<br />
Total runtime: 13.842 ms<br />
</pre><br />
<br />
Not that bad: 13 milliseconds. Still, we are doing sequential scans on a and b, when common sense tells us that c.value = 1 should be used to filter rows more aggressively.<br />
<br />
Here's what 9.2 does with this query:<br />
<br />
<pre><br />
QUERY PLAN <br />
----------------------------------------------------------------------------------------------------------------------------<br />
Nested Loop Semi Join (cost=0.00..16.97 rows=1 width=0) (actual time=0.035..0.037 rows=1 loops=1)<br />
-> Index Scan using c_value_key on c (cost=0.00..8.27 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)<br />
Index Cond: (value = 1)<br />
-> Nested Loop (cost=0.00..8.69 rows=1 width=4) (actual time=0.025..0.025 rows=1 loops=1)<br />
-> Index Scan using b__c_id on b (cost=0.00..8.33 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)<br />
Index Cond: (c_id = c.c_id)<br />
-> Index Only Scan using a__b_id on a (cost=0.00..0.35 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)<br />
Index Cond: (b_id = b.b_id)<br />
Total runtime: 0.089 ms<br />
</pre><br />
<br />
The «parameterized path» is:<br />
<pre><br />
-> Nested Loop (cost=0.00..8.69 rows=1 width=4) (actual time=0.025..0.025 rows=1 loops=1)<br />
-> Index Scan using b__c_id on b (cost=0.00..8.33 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)<br />
Index Cond: (c_id = c.c_id)<br />
-> Index Only Scan using a__b_id on a (cost=0.00..0.35 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)<br />
Index Cond: (b_id = b.b_id)<br />
Total runtime: 0.089 ms<br />
</pre><br />
<br />
This part of the plan depends on a parameter obtained from its parent node (c_id = c.c_id), and is executed again for each new value of that parameter.<br />
<br />
This plan is of course much faster: there is no need to fully scan a, nor to fully scan and hash b.<br />
<br />
<br />
=SP-GIST=<br />
TODO<br />
<br />
=pg_stat_statements=<br />
<br />
This contrib module has received a lot of improvements in this version:<br />
<br />
* Queries are normalized: queries that are identical except for their constant values are considered the same, as long as their post-parse-analysis query trees (that is, the internal representation of the query before rule expansion) are the same. This also implies that differences which are not semantically essential to the query, such as variations in whitespace or alias names, or the use of one syntax over another equivalent one, will not differentiate queries.<br />
<br />
<pre><br />
=#select * from words where word= 'foo';<br />
word <br />
------<br />
(0 rows)<br />
<br />
=# select * from words where word= 'bar';<br />
word <br />
------<br />
 bar<br />
(1 row)<br />
<br />
=#select * from pg_stat_statements where query like '%words where%';<br />
-[ RECORD 1 ]-------+-----------------------------------<br />
userid | 10<br />
dbid | 16384<br />
query | select * from words where word= ?;<br />
calls | 2<br />
total_time | 142.314<br />
rows | 1<br />
shared_blks_hit | 3<br />
shared_blks_read | 5<br />
shared_blks_dirtied | 0<br />
shared_blks_written | 0<br />
local_blks_hit | 0<br />
local_blks_read | 0<br />
local_blks_dirtied | 0<br />
local_blks_written | 0<br />
temp_blks_read | 0<br />
temp_blks_written | 0<br />
blk_read_time | 142.165<br />
blk_write_time | 0<br />
<br />
</pre><br />
<br />
The two queries are shown as one in pg_stat_statements.<br />
<br />
* For prepared statements, executions (EXECUTE statements) are now charged to the prepared statement. This makes the statistics easier to use, and avoids the double counting that occurred with PostgreSQL 9.1.<br />
<br />
* pg_stat_statements displays timing in milliseconds, to be consistent with other system views.<br />
<br />
= Explain improvements=<br />
<br />
* Timing can now be disabled with EXPLAIN (analyze on, timing off), leading to lower overhead on platforms where getting the current time is expensive <!--Tomas Vondra--><br />
<br />
<br />
* EXPLAIN ANALYZE now reports the number of rows rejected by filter steps <!-- Marko Tiikkaja --><br />
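<br />
For instance, a filtered sequential scan now shows how many rows the filter discarded. A sketch of what this looks like, on the words table used earlier (the row counts are made up, for illustration only):<br />
<pre><br />
=# EXPLAIN ANALYZE SELECT * FROM words WHERE word LIKE 'a%';<br />
                           QUERY PLAN<br />
-----------------------------------------------------------------<br />
 Seq Scan on words  (...) (actual time=... rows=5823 loops=1)<br />
   Filter: (word ~~ 'a%'::text)<br />
   Rows Removed by Filter: 88333<br />
 Total runtime: ...<br />
</pre><br />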
<br />
=Backward compatibility=<br />
<br />
These changes may incur regressions in your applications.<br />
<br />
==Ensure that xpath() escapes special characters in string values <!-- (Florian Pflug)--> ==<br />
<br />
Before 9.2:<br />
<pre><br />
SELECT (XPATH('/*/text()', '<root>&lt;</root>'))[1];<br />
xpath <br />
-------<br />
<<br />
<br />
'<' isn't valid XML.<br />
</pre><br />
With 9.2:<br />
<pre><br />
SELECT (XPATH('/*/text()', '<root>&lt;</root>'))[1];<br />
xpath <br />
-------<br />
&amp;lt;<br />
</pre><br />
<br />
==Remove hstore's => operator <!-- (Robert Haas)-->==<br />
Up to 9.1, one could use the => operator to create an hstore. hstore is a contrib module, used to store key/value pairs in a column.<br />
<br />
In 9.1:<br />
<pre><br />
=# SELECT 'a'=>'b';<br />
?column? <br />
----------<br />
"a"=>"b"<br />
(1 row)<br />
<br />
=# SELECT pg_typeof('a'=>'b');<br />
pg_typeof <br />
-----------<br />
hstore<br />
(1 row)<br />
</pre><br />
<br />
With 9.2:<br />
<pre><br />
SELECT 'a'=>'b';<br />
ERROR: operator does not exist: unknown => unknown at character 11<br />
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.<br />
STATEMENT: SELECT 'a'=>'b';<br />
ERROR: operator does not exist: unknown => unknown<br />
LINE 1: SELECT 'a'=>'b';<br />
^<br />
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.<br />
</pre><br />
<br />
It doesn't mean one cannot use '=>' in hstores, it just isn't an operator anymore:<br />
<br />
<pre><br />
=# select hstore('a=>b');<br />
hstore <br />
----------<br />
"a"=>"b"<br />
(1 row)<br />
<br />
=# select hstore('a','b');<br />
hstore <br />
----------<br />
"a"=>"b"<br />
(1 row)<br />
</pre><br />
These are still two valid ways to input an hstore.<br />
<br />
"=>" is removed as an operator as it is a reserved keyword in SQL.<br />
<br />
<br />
==Have pg_relation_size() and friends return NULL if the object does not exist <!-- (Phil Sorber)-->==<br />
<br />
Previously, if a relation was dropped by a concurrent session while pg_relation_size() was being run on it, an SQL exception was raised. Now the function merely returns NULL for that relation.<br />
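<br />
For instance, aggregating the size of all relations is now safe against concurrent drops (a simple illustrative query, using only catalog columns):<br />
<pre><br />
-- If a relation is dropped while this runs, pg_relation_size() returns<br />
-- NULL for it (which sum() ignores) instead of raising an error:<br />
=# SELECT sum(pg_relation_size(oid)) FROM pg_class;<br />
</pre><br />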
<br />
<br />
==Remove the spclocation field from pg_tablespace <!-- (Magnus Hagander)-->==<br />
<br />
The spclocation field provided the real location of the tablespace. It was filled in during the CREATE or ALTER TABLESPACE commands, so it could become wrong: somebody just had to shut down the cluster, move the tablespace's directory, re-create the symlink in pg_tblspc, and forget to update the spclocation field. The cluster would still run, as spclocation wasn't actually used.<br />
<br />
So this field has been removed. To get the tablespace's location, use pg_tablespace_location():<br />
<br />
<pre><br />
=# select *, pg_tablespace_location(oid) as spclocation from pg_tablespace;<br />
spcname | spcowner | spcacl | spcoptions | spclocation <br />
------------+----------+--------+------------+----------------<br />
pg_default | 10 | | | <br />
pg_global | 10 | | | <br />
tmptblspc | 10 | | | /tmp/tmptblspc<br />
</pre><br />
<br />
==Have EXTRACT of a non-timezone-aware value measure the epoch from local midnight, not UTC midnight <!-- (Tom Lane) -->==<br />
<br />
<br />
With PostgreSQL 9.1:<br />
<br />
<pre><br />
=#SELECT extract(epoch from '2012-07-02 00:00:00'::timestamp);<br />
date_part <br />
------------<br />
1341180000<br />
(1 row)<br />
<br />
=# SELECT extract(epoch from '2012-07-02 00:00:00'::timestamptz);<br />
date_part <br />
------------<br />
1341180000<br />
(1 row)<br />
</pre><br />
<br />
There is no difference in behaviour between a timestamp with or without time zone.<br />
<br />
With 9.2:<br />
<pre><br />
=#SELECT extract(epoch from '2012-07-02 00:00:00'::timestamp);<br />
date_part <br />
------------<br />
1341187200<br />
(1 row)<br />
<br />
=# SELECT extract(epoch from '2012-07-02 00:00:00'::timestamptz);<br />
date_part <br />
------------<br />
1341180000<br />
(1 row)<br />
</pre><br />
When the timestamp has no time zone, the epoch is now calculated from the «local midnight», meaning January 1st, 1970 at midnight, local time.<br />
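<br />
The difference between the two 9.2 results is exactly the session's UTC offset; the examples above were run in a UTC+2 time zone:<br />
<pre><br />
1341187200 - 1341180000 = 7200 seconds = 2 hours, the UTC+2 offset<br />
</pre><br />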
<br />
<br />
==Fix to_date() and to_timestamp() to wrap incomplete dates toward 2020 <!-- (Bruce Momjian)-->==<br />
<br />
The wrapping was not consistent between 2-digit and 3-digit years: 2-digit years always chose the date closest to 2020, while 3-digit years mapped 100 to 999 onto 1100 to 1999, and 000 to 099 onto 2000 to 2099.<br />
<br />
Now PostgreSQL chooses the date closest to 2020 for both 2-digit and 3-digit years.<br />
<br />
With 9.1:<br />
<pre><br />
=# SELECT to_date('200-07-02','YYY-MM-DD');<br />
to_date <br />
------------<br />
1200-07-02<br />
</pre><br />
<br />
With 9.2:<br />
<pre><br />
SELECT to_date('200-07-02','YYY-MM-DD');<br />
to_date <br />
------------<br />
2200-07-02<br />
</pre><br />
<br />
==pg_stat_activity's definition has changed <!--Magnus Hagander -->==<br />
<br />
The view pg_stat_activity has changed. It's not backward compatible, but let's see what this new definition brings us:<br />
<br />
* current_query disappears and is replaced by two columns:<br />
** state: what the session is currently doing (active, idle, idle in transaction, ...)<br />
** query: the last executed (or still running) query<br />
* The column procpid is renamed to pid, to be consistent with other system views<br />
<br />
The benefit is mostly for tracking «idle in transaction» sessions. Until now, all we could know was that such a session had started a transaction, maybe done some operations, but not yet committed. If the session stayed in this state for a while, there was no way of knowing how it got into this state.<br />
<br />
Here is an example:<br />
<pre><br />
-[ RECORD 1 ]----+---------------------------------<br />
datid | 16384<br />
datname | postgres<br />
pid | 20804<br />
usesysid | 10<br />
usename | postgres<br />
application_name | psql<br />
client_addr | <br />
client_hostname | <br />
client_port | -1<br />
backend_start | 2012-07-02 15:02:51.146427+02<br />
xact_start | 2012-07-02 15:15:28.386865+02<br />
query_start | 2012-07-02 15:15:30.410834+02<br />
state_change | 2012-07-02 15:15:30.411287+02<br />
waiting | f<br />
state | idle in transaction<br />
query | DELETE FROM test;<br />
</pre><br />
<br />
With PostgreSQL 9.1, all we would have would be «idle in transaction».<br />
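<br />
For instance, the new columns make it easy to list the sessions that have been idle in transaction the longest, along with the last query they ran (a simple monitoring query against the new view definition):<br />
<pre><br />
=# SELECT pid, now() - state_change AS idle_for, query<br />
     FROM pg_stat_activity<br />
    WHERE state = 'idle in transaction'<br />
    ORDER BY idle_for DESC;<br />
</pre><br />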
<br />
Since the view's definition had to change incompatibly anyway, the opportunity was also taken to rename procpid to pid, to be more consistent with other system views.<br />
<br />
==Change all SQL-level statistics timing values to float8-stored milliseconds <!-- (Tom Lane) -->==<br />
<br />
pg_stat_user_functions.total_time, pg_stat_user_functions.self_time, pg_stat_xact_user_functions.total_time, pg_stat_xact_user_functions.self_time, and pg_stat_statements.total_time (contrib) are now in milliseconds, to be consistent with the rest of the timing values.<br />
<br />
==postgresql.conf parameters changes <!-- (Heikki Linnakangas, Tom Lane, Peter Eisentraut) -->==<br />
<br />
* silent_mode has been removed. Use pg_ctl -l postmaster.log instead<br />
* wal_sender_delay has been removed, as it is no longer needed<br />
* custom_variable_classes has been removed. All «classes» are now accepted without declaration<br />
* ssl_ca_file, ssl_cert_file, ssl_crl_file, ssl_key_file have been added, meaning you can now specify the locations of the SSL files<br />
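<br />
For instance, the SSL files can now be declared explicitly in postgresql.conf (the file names below are illustrative; relative paths are resolved from the data directory):<br />
<pre><br />
ssl = on<br />
ssl_cert_file = 'server.crt'<br />
ssl_key_file = 'server.key'<br />
ssl_ca_file = 'root.crt'<br />
ssl_crl_file = 'root.crl'<br />
</pre><br />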
<br />
[[Category:PostgreSQL 9.2]]</div>Sternocera<br />
https://wiki.postgresql.org/index.php?title=What%27s_new_in_PostgreSQL_9.2&diff=17948 What's new in PostgreSQL 9.2, 2012-07-19T10:21:12Z<p>Sternocera: /* pg_stat_statements */ Query trees, not plans</p>
<hr />
<div>{{Languages}}<br />
<br />
This document showcases many of the latest developments in PostgreSQL 9.2, compared to the last major release &ndash; PostgreSQL 9.1. There are many improvements in this release, so this wiki page covers many of the more important changes in detail. The full list of changes is itemised in the ''Release Notes''.<br />
<br />
'''This page is incomplete!'''<br />
<br />
=Major new features=<br />
<br />
==Index-only scans <!-- Robert Haas, Ibrar Ahmed, Heikki Linnakangas, Tom Lane -->==<br />
<br />
In PostgreSQL, indexes have no "visibility" information. This means that when you access a record by its index, PostgreSQL has to visit the actual tuple in the table to be sure it is visible to you: the tuple the index points to may simply be an old version of the record you are looking for.<br />
<br />
This can be a very big performance problem: the index is mostly ordered, so accessing its records is quite efficient, while the table's records may be scattered all over the place (that's one reason why PostgreSQL has a CLUSTER command, but that's another story). In 9.2, PostgreSQL will use an "Index Only Scan" when possible, and not access the record itself if it doesn't need to.<br />
<br />
There is still no visibility information in the index. So in order to do this, PostgreSQL uses the [http://www.postgresql.org/docs/devel/static/storage-vm.html visibility map], which tells it whether the whole content of a (usually) 8K page is visible to all transactions or not. When the index record points to a tuple contained in an «all visible» page, PostgreSQL won't have to access the tuple: it can build the result directly from the index. Of course, all the columns requested by the query must be in the index.<br />
<br />
The visibility map is maintained by VACUUM (it sets the visible bit), and by the backends doing SQL work (they unset the visible bit).<br />
<br />
Here is an example.<br />
<br />
create table demo_ios (col1 float, col2 float, col3 text);<br />
<br />
In this table, we'll put random data, in order to have "scattered" data. We'll insert 100 million records, to have a big recordset that doesn't fit in memory (this is a 4GB-RAM machine). This is an ideal case, made for this demo; the gains won't be that big in real life.<br />
<br />
insert into demo_ios select generate_series(1,100000000),random(), 'mynotsolongstring';<br />
<br />
select pg_size_pretty(pg_total_relation_size('demo_ios'));<br />
pg_size_pretty <br />
----------------<br />
6512 MB<br />
<br />
Let's pretend that the query is this:<br />
<br />
SELECT col1,col2 FROM demo_ios where col2 BETWEEN 0.02 AND 0.03<br />
<br />
In order to use an index only scan on this, we need an index on col2,col1 (col2 first, as it is used in the WHERE clause).<br />
<br />
CREATE index idx_demo_ios on demo_ios(col2,col1);<br />
<br />
We vacuum the visibility map to be up-to-date:<br />
<br />
VACUUM demo_ios;<br />
<br />
All the timings you'll see below were measured with cold OS and PostgreSQL caches (that's where the gains are, as the purpose of Index Only Scans is to reduce I/O).<br />
<br />
Let's first try without Index Only Scans:<br />
<br />
set enable_indexonlyscan to off;<br />
<br />
explain (analyze,buffers) select col1,col2 from demo_ios where col2 between 0.01 and 0.02;<br />
QUERY PLAN <br />
----------------------------------------------------------------------------------------------------------------------------------------<br />
Bitmap Heap Scan on demo_ios (cost=25643.01..916484.44 rows=993633 width=16) (actual time=763.391..362963.899 rows=1000392 loops=1)<br />
Recheck Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))<br />
Rows Removed by Index Recheck: 68098621<br />
Buffers: shared hit=2 read=587779<br />
-> Bitmap Index Scan on idx_demo_ios (cost=0.00..25394.60 rows=993633 width=0) (actual time=759.011..759.011 rows=1000392 loops=1)<br />
Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))<br />
Buffers: shared hit=2 read=3835<br />
Total runtime: 364390.127 ms<br />
<br />
<br />
With Index Only Scans:<br />
<br />
explain (analyze,buffers) select col1,col2 from demo_ios where col2 between 0.01 and 0.02;<br />
QUERY PLAN <br />
-----------------------------------------------------------------------------------------------------------------------------------------------<br />
Index Only Scan using idx_demo_ios on demo_ios (cost=0.00..35330.93 rows=993633 width=16) (actual time=58.100..3250.589 rows=1000392 loops=1)<br />
Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))<br />
Heap Fetches: 0<br />
Buffers: shared hit=923073 read=3848<br />
Total runtime: 4297.405 ms<br />
<br />
<br />
<br />
As nothing is free, there are a few things to note:<br />
<br />
* Adding indexes for index only scans obviously adds indexes to your table. So updates will be slower.<br />
* You will index columns that weren't indexed before. So there will be fewer opportunities for HOT updates.<br />
* Gains will probably be smaller in real life situations.<br />
<br />
This required making visibility map changes crash-safe, so visibility map bit changes are now WAL-logged.<br />
<br />
==Replication improvements <!-- Fujii Masao, Simon Riggs, Magnus Hagander, Jun Ishizuka -->==<br />
<br />
Streaming Replication is getting even more polished with this release. One of the main remaining gripes about streaming replication was that all the slaves had to be connected to the same, unique master, consuming its resources.<br />
<br />
Moreover, in case of a failover, it was very complicated to reconnect all the remaining slaves to the newly promoted master.<br />
<br />
To be on the safe side, it was often easier to re-synchronize the slaves to the new master from scratch, meaning that during the failover, only one server was active, and under heavy load, as it was used to rebuild all the slaves.<br />
<br />
* With 9.2, a slave can also be a replication master, allowing for cascading replication.<br />
<br />
Let's build this. We start with an already working 9.2 database.<br />
<br />
We set it up for replication:<br />
<br />
postgresql.conf:<br />
wal_level=hot_standby #(could be archive too)<br />
max_wal_senders=5<br />
hot_standby=on<br />
<br />
You'll probably also want to activate archiving in production; it won't be done here.<br />
<br />
pg_hba.conf (do not use trust in production):<br />
host replication replication_user 0.0.0.0/0 md5<br />
<br />
Create the user:<br />
create user replication_user replication password 'secret';<br />
<br />
Clone the database:<br />
<br />
pg_basebackup -h localhost -U replication_user -D data2<br />
Password:<br />
<br />
We have a brand new cluster in the data2 directory. We'll change the port so that it can start (postgresql.conf):<br />
port=5433<br />
<br />
We add a recovery.conf to tell it how to stream from the master database:<br />
standby_mode = on<br />
primary_conninfo = 'host=localhost port=5432 user=replication_user password=secret' <br />
<br />
pg_ctl -D data2 start<br />
server starting<br />
LOG: database system was interrupted; last known up at 2012-07-03 17:58:09 CEST<br />
LOG: creating missing WAL directory "pg_xlog/archive_status"<br />
LOG: entering standby mode<br />
LOG: streaming replication successfully connected to primary<br />
LOG: redo starts at 0/9D000020<br />
LOG: consistent recovery state reached at 0/9D0000B8<br />
LOG: database system is ready to accept read only connections<br />
<br />
Now, let's add a second slave, which will use this slave:<br />
<br />
<br />
pg_basebackup -h localhost -U replication_user -D data3 -p 5433<br />
Password: <br />
<br />
We edit data3's postgresql.conf to change the port:<br />
port=5434<br />
<br />
We modify the recovery.conf to stream from the slave:<br />
standby_mode = on<br />
primary_conninfo = 'host=localhost port=5433 user=replication_user password=secret' # e.g. 'host=localhost port=5432'<br />
<br />
We start the cluster:<br />
pg_ctl -D data3 start<br />
server starting<br />
LOG: database system was interrupted while in recovery at log time 2012-07-03 17:58:09 CEST<br />
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.<br />
LOG: creating missing WAL directory "pg_xlog/archive_status"<br />
LOG: entering standby mode<br />
LOG: streaming replication successfully connected to primary<br />
LOG: redo starts at 0/9D000020<br />
LOG: consistent recovery state reached at 0/9E000000<br />
LOG: database system is ready to accept read only connections<br />
<br />
Now, everything modified on the master cluster gets streamed to the first slave, and from there to the second slave. This second replication link has to be monitored from the first slave (the master knows nothing about it).<br />
<br />
<br />
* As you may have noticed from the example, pg_basebackup now works from slaves.<br />
<br />
* There is another use case that wasn't covered: what if a user didn't want a full-fledged slave, but only wanted to stream the WAL files to another location, to benefit from the reduced data loss without the burden of maintaining a slave?<br />
<br />
pg_receivexlog is provided just for this purpose: it pretends to be a PostgreSQL slave, but only stores the log files as they are streamed, in a directory:<br />
pg_receivexlog -D /tmp/new_logs -h localhost -U replication_user<br />
<br />
will connect to the master (or a slave), and start creating files: <br />
ls /tmp/new_logs/<br />
00000001000000000000009E.partial<br />
<br />
Files are of the segment size, so they can be used for a normal recovery of the database. It's the same as an archive command, but with a much smaller granularity.<br />
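<br />
The streamed segments can then be used like any WAL archive. For instance, a restoring server's recovery.conf could point at the directory populated by pg_receivexlog (path taken from the example above):<br />
<pre><br />
restore_command = 'cp /tmp/new_logs/%f "%p"'<br />
</pre><br />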
<br />
* synchronous_commit has a new value: remote_write. It can be used when there is a synchronous slave (synchronous_standby_names is set): the master then doesn't have to wait for the slave to have written the data to disk, only for the slave to have acknowledged receiving it. With this setting, data is protected from a crash on the master, but could still be lost if the slave crashed at the same time (i.e. before having written the in-flight data to disk). As this is a quite remote possibility, some people will be interested in this compromise.<br />
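<br />
A minimal configuration sketch for this compromise, on the master ('standby1' is an assumed application_name, the one set in the slave's primary_conninfo):<br />
<pre><br />
# postgresql.conf, on the master<br />
synchronous_standby_names = 'standby1'<br />
synchronous_commit = remote_write<br />
</pre><br />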
<br />
<br />
<br />
<br />
==JSON datatype==<br />
The JSON datatype is meant for storing JSON-structured data. (More info: [http://www.depesz.com/2012/02/12/waiting-for-9-2-json/ depesz blog])<br />
<br />
== Range Types ==<br />
[[RangeTypes]] are added.<br />
(More info: [http://www.depesz.com/2011/11/07/waiting-for-9-2-range-data-types/])<br />
<br />
=Performance improvements=<br />
<br />
This version has performance improvements over a very large range of domains (non-exhaustive list):<br />
<br />
* The most visible will probably be the Index Only Scans, which has already been introduced in this document.<br />
<br />
* The lock contention of several big locks has been significantly reduced, leading to better multi-processor scalability, mostly noticeable on machines with 32 cores or more. <!-- Robert Haas --><br />
<br />
* The performance of in-memory sorts has been improved by up to 25% in some situations, with certain specialized sort functions introduced. <!-- Peter Geoghegan --><br />
<br />
* An idle PostgreSQL server now makes fewer wakeups, leading to lower power consumption <!--Peter Geoghegan-->. This is especially useful in virtualized and embedded environments.<br />
<br />
* COPY has been improved: it generates less WAL volume and takes fewer locks on the table's pages. <!-- Heikki Linnakangas --><br />
<br />
* The system can now track IO durations <!--Ants Aasma --><br />
<br />
This one deserves a little explanation, as it can be a little tricky. Tracking I/O durations means repeatedly asking the operating system for the current time. Depending on the operating system and the hardware, this can be quite cheap, or extremely costly. The most important factor is where the system gets its time from: it may be read directly from the processor (TSC), from dedicated hardware such as the HPET, or through an ACPI call. The cost of getting the time can vary by a factor of thousands between these sources.<br />
<br />
If you are interested in this timing data, first check whether your system supports it without too much of a performance hit. PostgreSQL provides the pg_test_timing tool for this:<br />
<br />
<pre><br />
$ pg_test_timing <br />
Testing timing overhead for 3 seconds.<br />
Per loop time including overhead: 28.02 nsec<br />
Histogram of timing durations:<br />
< usec: count percent<br />
32: 41 0.00004%<br />
16: 1405 0.00131%<br />
8: 200 0.00019%<br />
4: 388 0.00036%<br />
2: 2982558 2.78523%<br />
1: 104100166 97.21287%<br />
</pre><br />
<br />
Here, everything is good: getting the time costs around 28 nanoseconds, with very little variation. Anything under 100 nanoseconds should be fine for production. If you get higher values, you may still find a way to tune your system; check the [http://www.postgresql.org/docs/9.2/static/pgtesttiming.html documentation].<br />
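<br />
Once the overhead is acceptable, the collection itself is controlled by the new track_io_timing parameter, which is off by default; it can be set in postgresql.conf, or per session by a superuser:<br />
<pre><br />
=# SET track_io_timing = on;<br />
</pre><br />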
<br />
Anyway, here is the data you'll be able to collect if your system is ready for this:<br />
<br />
First, you'll get per-database statistics, which will now give accurate information about which database is doing most I/O:<br />
<br />
<pre><br />
=# select * from pg_stat_database where datname = 'mydb';<br />
-[ RECORD 1 ]--+------------------------------<br />
datid | 16384<br />
datname | mydb<br />
numbackends | 1<br />
xact_commit | 270<br />
xact_rollback | 2<br />
blks_read | 1961<br />
blks_hit | 17944<br />
tup_returned | 269035<br />
tup_fetched | 8850<br />
tup_inserted | 16<br />
tup_updated | 4<br />
tup_deleted | 45<br />
conflicts | 0<br />
temp_files | 0<br />
temp_bytes | 0<br />
deadlocks | 0<br />
blk_read_time | 583.774<br />
blk_write_time | 0<br />
stats_reset | 2012-07-03 17:18:54.796817+02<br />
</pre><br />
We see here that mydb has only consumed 583.774 milliseconds of read time.<br />
<br />
Explain will benefit from this too:<br />
<pre><br />
=# explain (analyze,buffers) select count(*) from mots ;<br />
QUERY PLAN <br />
----------------------------------------------------------------------------------------------------------------<br />
Aggregate (cost=1669.95..1669.96 rows=1 width=0) (actual time=21.943..21.943 rows=1 loops=1)<br />
Buffers: shared read=493<br />
I/O Timings: read=2.578<br />
-> Seq Scan on mots (cost=0.00..1434.56 rows=94156 width=0) (actual time=0.059..12.933 rows=94156 loops=1)<br />
Buffers: shared read=493<br />
I/O Timings: read=2.578<br />
Total runtime: 22.059 ms<br />
</pre><br />
We now have separate information about the time taken to retrieve data from the operating system. Obviously, here, the data was already in the operating system's cache (about 2.6 milliseconds to read 493 blocks).<br />
<br />
And last, if you have enabled pg_stat_statements:<br />
<pre><br />
select * from pg_stat_statements where query ~ 'words';<br />
-[ RECORD 1 ]-------+---------------------------<br />
userid | 10<br />
dbid | 16384<br />
query | select count(*) from words;<br />
calls | 2<br />
total_time | 78.332<br />
rows | 2<br />
shared_blks_hit | 0<br />
shared_blks_read | 986<br />
shared_blks_dirtied | 0<br />
shared_blks_written | 0<br />
local_blks_hit | 0<br />
local_blks_read | 0<br />
local_blks_dirtied | 0<br />
local_blks_written | 0<br />
temp_blks_read | 0<br />
temp_blks_written | 0<br />
blk_read_time | 58.427<br />
blk_write_time | 0<br />
</pre><br />
<br />
* As for every version, the optimizer has received its share of improvements <!-- Tom Lane--><br />
** Prepared statements used to be optimized once, without any knowledge of the parameters' values. With 9.2, the planner generates plans specific to the parameter values sent (the query is planned at execution time), unless the query has already been executed several times and the planner decides that the generic plan is not significantly more expensive than the specific plans.<br />
** A new feature has been added: parameterized paths. Simply put, this means that a sub-part of a query plan can use parameters obtained from a parent node. It fixes several bad plans that could occur, especially when the optimizer couldn't reorder joins to put nested loops where it wanted to.<br />
<br />
This example comes straight from the developers' mailing list <!-- Andres Freund -->:<br />
<br />
<pre><br />
CREATE TABLE a (<br />
a_id serial PRIMARY KEY NOT NULL,<br />
b_id integer<br />
);<br />
CREATE INDEX a__b_id ON a USING btree (b_id);<br />
<br />
<br />
CREATE TABLE b (<br />
b_id serial NOT NULL,<br />
c_id integer<br />
);<br />
CREATE INDEX b__c_id ON b USING btree (c_id);<br />
<br />
<br />
CREATE TABLE c (<br />
c_id serial PRIMARY KEY NOT NULL,<br />
value integer UNIQUE<br />
);<br />
<br />
INSERT INTO b (b_id, c_id)<br />
SELECT g.i, g.i FROM generate_series(1, 50000) g(i);<br />
<br />
INSERT INTO a(b_id)<br />
SELECT g.i FROM generate_series(1, 50000) g(i);<br />
<br />
INSERT INTO c(c_id,value)<br />
VALUES (1,1);<br />
</pre><br />
<br />
So we have table a referencing b, and b referencing c.<br />
<br />
Here is an example of a query working badly with PostgreSQL 9.1:<br />
<br />
<pre><br />
EXPLAIN ANALYZE SELECT 1 <br />
FROM <br />
c<br />
WHERE<br />
EXISTS (<br />
SELECT * <br />
FROM a<br />
JOIN b USING (b_id)<br />
WHERE b.c_id = c.c_id)<br />
AND c.value = 1;<br />
QUERY PLAN <br />
-----------------------------------------------------------------------------------------------------------------------<br />
Nested Loop Semi Join (cost=1347.00..3702.27 rows=1 width=0) (actual time=13.799..13.802 rows=1 loops=1)<br />
Join Filter: (c.c_id = b.c_id)<br />
-> Index Scan using c_value_key on c (cost=0.00..8.27 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=1)<br />
Index Cond: (value = 1)<br />
-> Hash Join (cost=1347.00..3069.00 rows=50000 width=4) (actual time=13.788..13.788 rows=1 loops=1)<br />
Hash Cond: (a.b_id = b.b_id)<br />
-> Seq Scan on a (cost=0.00..722.00 rows=50000 width=4) (actual time=0.007..0.007 rows=1 loops=1)<br />
-> Hash (cost=722.00..722.00 rows=50000 width=8) (actual time=13.760..13.760 rows=50000 loops=1)<br />
Buckets: 8192 Batches: 1 Memory Usage: 1954kB<br />
-> Seq Scan on b (cost=0.00..722.00 rows=50000 width=8) (actual time=0.008..5.702 rows=50000 loops=1)<br />
Total runtime: 13.842 ms<br />
</pre><br />
<br />
Not that bad, 13 milliseconds. Still, we are doing sequential scans on a and b, when our common sense tells us that c.value=1 should be used to filter rows more aggressively.<br />
<br />
Here's what 9.2 does with this query:<br />
<br />
<pre><br />
QUERY PLAN <br />
----------------------------------------------------------------------------------------------------------------------------<br />
Nested Loop Semi Join (cost=0.00..16.97 rows=1 width=0) (actual time=0.035..0.037 rows=1 loops=1)<br />
-> Index Scan using c_value_key on c (cost=0.00..8.27 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)<br />
Index Cond: (value = 1)<br />
-> Nested Loop (cost=0.00..8.69 rows=1 width=4) (actual time=0.025..0.025 rows=1 loops=1)<br />
-> Index Scan using b__c_id on b (cost=0.00..8.33 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)<br />
Index Cond: (c_id = c.c_id)<br />
-> Index Only Scan using a__b_id on a (cost=0.00..0.35 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)<br />
Index Cond: (b_id = b.b_id)<br />
Total runtime: 0.089 ms<br />
</pre><br />
<br />
The «parameterized path» is:<br />
<pre><br />
-> Nested Loop (cost=0.00..8.69 rows=1 width=4) (actual time=0.025..0.025 rows=1 loops=1)<br />
-> Index Scan using b__c_id on b (cost=0.00..8.33 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)<br />
Index Cond: (c_id = c.c_id)<br />
-> Index Only Scan using a__b_id on a (cost=0.00..0.35 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)<br />
Index Cond: (b_id = b.b_id)<br />
</pre><br />
<br />
This part of the plan depends on a parent node (the condition c_id = c.c_id). It is executed each time with a different parameter value coming from the parent node.<br />
<br />
This plan is of course much faster: there is no need to fully scan a, nor to fully scan and hash b.<br />
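The other optimizer change above, parameter-specific plans for prepared statements, can also be observed directly. A minimal sketch (the words table is the same hypothetical one used in the pg_stat_statements examples):<br />
<br />
<pre><br />
PREPARE get_word(text) AS SELECT * FROM words WHERE word = $1;<br />
<br />
-- With 9.2, each execution may get a plan tailored to the value of $1;<br />
-- after several executions the planner may settle on a generic plan,<br />
-- if it is not significantly more expensive than the specific ones.<br />
EXPLAIN EXECUTE get_word('foo');<br />
</pre><br />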
<br />
<br />
=SP-GIST=<br />
TODO<br />
<br />
=pg_stat_statements=<br />
<br />
This contrib module has received a lot of improvements in this version:<br />
<br />
* Queries are normalized: queries that are identical except for their constant values are considered the same, as long as their post-parse-analysis query trees (that is, the internal representation of the query before rule expansion) are the same. This means that differences which are not semantically essential to the query, such as variations in whitespace or alias names, or the use of one syntax over another equivalent one, do not differentiate queries.<br />
<br />
<pre><br />
=#select * from words where word= 'foo';<br />
word <br />
------<br />
(0 rows)<br />
<br />
=# select * from words where word= 'bar';<br />
word <br />
------<br />
bar<br />
(1 row)<br />
<br />
=#select * from pg_stat_statements where query like '%words where%';<br />
-[ RECORD 1 ]-------+-----------------------------------<br />
userid | 10<br />
dbid | 16384<br />
query | select * from words where word= ?;<br />
calls | 2<br />
total_time | 142.314<br />
rows | 1<br />
shared_blks_hit | 3<br />
shared_blks_read | 5<br />
shared_blks_dirtied | 0<br />
shared_blks_written | 0<br />
local_blks_hit | 0<br />
local_blks_read | 0<br />
local_blks_dirtied | 0<br />
local_blks_written | 0<br />
temp_blks_read | 0<br />
temp_blks_written | 0<br />
blk_read_time | 142.165<br />
blk_write_time | 0<br />
<br />
</pre><br />
<br />
The two queries are shown as one in pg_stat_statements.<br />
<br />
* For prepared statements, the execution part (the EXECUTE statement) is now credited to the corresponding PREPARE statement. This is easier to use, and avoids the double counting that occurred with PostgreSQL 9.1.<br />
<br />
* pg_stat_statements displays timing in milliseconds, to be consistent with other system views.<br />
<br />
=EXPLAIN improvements=<br />
<br />
* Timing can now be disabled with EXPLAIN (ANALYZE on, TIMING off), leading to lower overhead on platforms where getting the current time is expensive <!--Tomas Vondra--><br />
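For instance (a sketch; the table is hypothetical):<br />
<br />
<pre><br />
-- Per-node timings are omitted; row counts and total runtime are<br />
-- still reported.<br />
EXPLAIN (ANALYZE on, TIMING off) SELECT count(*) FROM words;<br />
</pre><br />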
<br />
<br />
* EXPLAIN ANALYZE now reports the number of rows rejected by filter steps <!-- Marko Tiikkaja --><br />
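The rejected rows appear as «Rows Removed by Filter» lines in the output. A sketch (hypothetical table and figures):<br />
<br />
<pre><br />
=# EXPLAIN ANALYZE SELECT * FROM words WHERE word = 'foo';<br />
 Seq Scan on words  (cost=... rows=...) (actual time=... rows=0 loops=1)<br />
   Filter: (word = 'foo'::text)<br />
   Rows Removed by Filter: 49999<br />
</pre><br />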
<br />
=Backward compatibility=<br />
<br />
These changes may incur regressions in your applications.<br />
<br />
==Ensure that xpath() escapes special characters in string values <!-- (Florian Pflug)--> ==<br />
<br />
Before 9.2:<br />
<pre><br />
SELECT (XPATH('/*/text()', '<root>&lt;</root>'))[1];<br />
xpath <br />
-------<br />
<<br />
<br />
'<' isn't valid XML.<br />
</pre><br />
With 9.2:<br />
<pre><br />
SELECT (XPATH('/*/text()', '<root>&lt;</root>'))[1];<br />
xpath <br />
-------<br />
&amp;lt;<br />
</pre><br />
<br />
==Remove hstore's => operator <!-- (Robert Haas)-->==<br />
Up to 9.1, one could use the => operator to create an hstore. hstore is a contrib module, used to store key/value pairs in a column.<br />
<br />
In 9.1:<br />
<pre><br />
=# SELECT 'a'=>'b';<br />
?column? <br />
----------<br />
"a"=>"b"<br />
(1 row)<br />
<br />
=# SELECT pg_typeof('a'=>'b');<br />
pg_typeof <br />
-----------<br />
hstore<br />
(1 row)<br />
</pre><br />
<br />
With 9.2:<br />
<pre><br />
SELECT 'a'=>'b';<br />
ERROR: operator does not exist: unknown => unknown<br />
LINE 1: SELECT 'a'=>'b';<br />
^<br />
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.<br />
</pre><br />
<br />
It doesn't mean one cannot use '=>' in hstores, it just isn't an operator anymore:<br />
<br />
<pre><br />
=# select hstore('a=>b');<br />
hstore <br />
----------<br />
"a"=>"b"<br />
(1 row)<br />
<br />
=# select hstore('a','b');<br />
hstore <br />
----------<br />
"a"=>"b"<br />
(1 row)<br />
</pre><br />
These are still two valid ways to input an hstore.<br />
<br />
"=>" was removed as an operator because it is a reserved keyword in SQL.<br />
<br />
<br />
==Have pg_relation_size() and friends return NULL if the object does not exist <!-- (Phil Sorber)-->==<br />
<br />
A relation could be dropped by a concurrent session while another session was calling pg_relation_size() on it, leading to an SQL error. Now the function merely returns NULL for that relation.<br />
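A sketch (the OID is hypothetical, standing for a relation dropped by a concurrent session; NULL is displayed as an empty string):<br />
<br />
<pre><br />
=# SELECT pg_relation_size(123456);<br />
 pg_relation_size <br />
------------------<br />
                  <br />
(1 row)<br />
</pre><br />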
<br />
<br />
==Remove the spclocation field from pg_tablespace <!-- (Magnus Hagander)-->==<br />
<br />
The spclocation field contained the location of the tablespace. It was filled in during the CREATE or ALTER TABLESPACE command, so it could become wrong: somebody just had to shut down the cluster, move the tablespace's directory, re-create the symlink in pg_tblspc, and forget to update the spclocation field. The cluster would still run, as spclocation wasn't used.<br />
<br />
So this field has been removed. To get the tablespace's location, use pg_tablespace_location():<br />
<br />
<pre><br />
=# select *, pg_tablespace_location(oid) as spclocation from pg_tablespace;<br />
spcname | spcowner | spcacl | spcoptions | spclocation <br />
------------+----------+--------+------------+----------------<br />
pg_default | 10 | | | <br />
pg_global | 10 | | | <br />
tmptblspc | 10 | | | /tmp/tmptblspc<br />
</pre><br />
<br />
==Have EXTRACT of a non-timezone-aware value measure the epoch from local midnight, not UTC midnight <!-- (Tom Lane) -->==<br />
<br />
<br />
With PostgreSQL 9.2:<br />
<br />
<pre><br />
=#SELECT extract(epoch from '2012-07-02 00:00:00'::timestamp);<br />
date_part <br />
------------<br />
1341180000<br />
(1 row)<br />
<br />
=# SELECT extract(epoch from '2012-07-02 00:00:00'::timestamptz);<br />
date_part <br />
------------<br />
1341180000<br />
(1 row)<br />
</pre><br />
<br />
There is no longer any difference in behaviour between a timestamp with or without time zone.<br />
<br />
With 9.1:<br />
<pre><br />
=#SELECT extract(epoch from '2012-07-02 00:00:00'::timestamp);<br />
date_part <br />
------------<br />
1341187200<br />
(1 row)<br />
<br />
=# SELECT extract(epoch from '2012-07-02 00:00:00'::timestamptz);<br />
date_part <br />
------------<br />
1341180000<br />
(1 row)<br />
</pre><br />
With 9.2, when the timestamp has no time zone, the epoch is measured from "local midnight", that is, the 1st of January 1970 at midnight, local time. The session above runs at UTC+2, hence the 7200-second (two-hour) difference in the 9.1 output.<br />
<br />
<br />
==Fix to_date() and to_timestamp() to wrap incomplete dates toward 2020 <!-- (Bruce Momjian)-->==<br />
<br />
The wrapping was not consistent between 2-digit and 3-digit years: 2-digit years always chose the date closest to 2020, while 3-digit years mapped 100 to 999 onto 1100 to 1999, and 000 to 099 onto 2000 to 2099.<br />
<br />
Now PostgreSQL chooses the date closest to 2020 for both 2- and 3-digit years.<br />
<br />
With 9.1:<br />
<pre><br />
=# SELECT to_date('200-07-02','YYY-MM-DD');<br />
to_date <br />
------------<br />
1200-07-02<br />
</pre><br />
<br />
With 9.2:<br />
<pre><br />
SELECT to_date('200-07-02','YYY-MM-DD');<br />
to_date <br />
------------<br />
2200-07-02<br />
</pre><br />
<br />
==pg_stat_activity's definition has changed <!--Magnus Hagander -->==<br />
<br />
The view pg_stat_activity has changed. It's not backward compatible, but let's see what this new definition brings us:<br />
<br />
* current_query disappears and is replaced by two columns:<br />
** state: what the session is currently doing (running a query, idle, idle in transaction...)<br />
** query: the last executed (or still running) query<br />
* The column procpid is renamed to pid, to be consistent with other system views<br />
<br />
The benefit is mostly for tracking «idle in transaction» sessions. Up until now, all we could know was that such a session was idle in transaction, meaning it had started a transaction, maybe done some operations, but still not committed. If the session stayed in this state for a while, there was no way of knowing how it got into this state.<br />
<br />
Here is an example:<br />
<pre><br />
-[ RECORD 1 ]----+---------------------------------<br />
datid | 16384<br />
datname | postgres<br />
pid | 20804<br />
usesysid | 10<br />
usename | postgres<br />
application_name | psql<br />
client_addr | <br />
client_hostname | <br />
client_port | -1<br />
backend_start | 2012-07-02 15:02:51.146427+02<br />
xact_start | 2012-07-02 15:15:28.386865+02<br />
query_start | 2012-07-02 15:15:30.410834+02<br />
state_change | 2012-07-02 15:15:30.411287+02<br />
waiting | f<br />
state | idle in transaction<br />
query | DELETE FROM test;<br />
</pre><br />
<br />
With PostgreSQL 9.1, all we would have would be «idle in transaction».<br />
<br />
Since the view's definition was already changing incompatibly, the opportunity was taken to rename procpid to pid, to be more consistent with other system views.<br />
<br />
==Change all SQL-level statistics timing values to float8-stored milliseconds <!-- (Tom Lane) -->==<br />
<br />
pg_stat_user_functions.total_time, pg_stat_user_functions.self_time, pg_stat_xact_user_functions.total_time, pg_stat_xact_user_functions.self_time, and pg_stat_statements.total_time (contrib) are now in milliseconds, to be consistent with the rest of the timing values.<br />
<br />
==postgresql.conf parameters changes <!-- (Heikki Linnakangas, Tom Lane, Peter Eisentraut) -->==<br />
<br />
* silent_mode has been removed. Use pg_ctl -l postmaster.log instead<br />
* wal_sender_delay has been removed. It is no longer needed<br />
* custom_variable_classes has been removed. All «classes» are now accepted without declaration<br />
* ssl_ca_file, ssl_cert_file, ssl_crl_file and ssl_key_file have been added, meaning you can now specify the locations of the SSL files<br />
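A postgresql.conf sketch using the new parameters (the file names are hypothetical):<br />
<br />
<pre><br />
ssl = on<br />
ssl_cert_file = 'server.crt'   # server certificate<br />
ssl_key_file  = 'server.key'   # server private key<br />
ssl_ca_file   = 'root.crt'     # trusted certificate authorities<br />
ssl_crl_file  = 'root.crl'     # certificate revocation list<br />
</pre><br />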
<br />
[[Category:PostgreSQL 9.2]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Development_information&diff=17664Development information2012-05-28T20:59:42Z<p>Sternocera: </p>
<hr />
<div>__NOTOC__<br />
This area includes developer-targeted documentation regarding aspects of PostgreSQL development. Please visit the [http://www.postgresql.org/developer developer area] of the PostgreSQL website for more general information about the development of PostgreSQL. You can find most developers in [irc://irc.freenode.net/postgresql #postgresql on freenode]. A list of IRC nick names with their respective real world names can be found [[IRC2RWNames | here]].<br />
<br />
==PostgreSQL 9.2 - Active Development==<br />
* [[PostgreSQL 9.2 Development Plan]]<br />
* [[PostgreSQL 9.2 Open Items]]<br />
<br />
==Development Process==<br />
* [[Todo|Todo list]]<br />
* [[Todo:Contents|Unofficial Todo Detail]]<br />
* [[Submitting a Patch]]<br />
* [[Reviewing a Patch]]<br />
* [[RRReviewers|Round-robin Patch Review]]<br />
* [[Running a CommitFest]]<br />
* [[Committing with Git]]<br />
<br />
== Developer Resources ==<br />
* [[Developer FAQ]]<br />
* [[Regression test authoring]]<br />
* [[HowToBetaTest|HOWTO Alpha and Beta Test PostgreSQL]]<br />
* [[Working with Git]]<br />
* [[Working with CVS]] (obsolete)<br />
* [[Working with Eclipse]] (using CVS)<br />
* [[Fixing shift/reduce conflicts in Bison]]<br />
* [[PL Matrix|Procedural Language Matrix]]<br />
* [http://www.postgresql.org/about/featurematrix Feature Matrix]<br />
* [http://www.postgresql.org/developer/coding PostgreSQL Coding]<br />
* [http://developer.postgresql.org/pgdocs/postgres/index.html Development docs] (updated every 5 minutes)<br />
* [[Project Hosting]]<br />
* [http://www.pgcon.org/2010/schedule/attachments/142_HackingWithUDFs.pdf Exposing PostgreSQL Internals with UDFs (2010)]<br />
<br />
== Projects and Planning ==<br />
* [http://commitfest.postgresql.org/action/commitfest_view/open Open CommitFest] - New patch submissions for 9.2<br />
* [https://commitfest.postgresql.org/ CommitFest]<br />
* [[PostgreSQL 8.4]]<br />
* [[PgCon 2012 Developer Meeting]]<br />
* [[PgCon 2011 Developer Meeting]]<br />
* [[PgCon 2010 Developer Meeting]]<br />
* [[PgCon 2009 Developer Meeting]]<br />
* [[PgCon 2008 Developer Meeting]]<br />
* [[Development projects]] - links to individual projects<br />
<br />
==PostgreSQL Past Development==<br />
* [[PostgreSQL 9.1 Open Items]]<br />
* [[PostgreSQL 9.1 Development Plan]]<br />
* [[PostgreSQL 9.0 Open Items]]<br />
* [[85AlphaFeatures|PostgreSQL 9.0 Alpha Release Feature List]]<br />
<br />
[[Category:CommitFest]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=PgCon_2012_Developer_Meeting&diff=17112PgCon 2012 Developer Meeting2012-05-18T16:10:57Z<p>Sternocera: /* MERGE */</p>
<hr />
<div>A meeting of the most active PostgreSQL developers is being planned for Wednesday 16th May, 2012 near the University of Ottawa, prior to pgCon 2012. In order to keep the numbers manageable, this meeting is '''by invitation only'''. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org). <br />
<br />
Please note that this year the attendee numbers have been cut to try to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.2 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.<br />
<br />
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.<br />
<br />
== Time & Location ==<br />
<br />
The meeting will be from 8:30AM to 5PM, and will be in the "Red Experience" room at:<br />
<br />
Novotel Ottawa<br />
33 Nicholas Street<br />
Ottawa<br />
Ontario<br />
K1N 9M7<br />
<br />
Food and drink will be provided throughout the day, including breakfast from 8AM.<br />
<br />
[http://maps.google.ca/maps?f=q&source=s_q&hl=en&geocode=&q=novotel+ottawa&aq=&sll=49.891235,-97.15369&sspn=36.237851,79.013672&ie=UTF8&hq=novotel+ottawa&hnear=&ll=45.421528,-75.683699&spn=0.036869,0.077162&z=14&iwloc=A&layer=c&cbll=45.425741,-75.689638&panoid=Z4FUGnkZkdHAOkIxyjjS9Q&cbp=12,25.83,,0,-0.6 View on Google Maps]<br />
<br />
== Attendees ==<br />
<br />
The following people have RSVPed to the meeting (in alphabetical order, by surname):<br />
<br />
* Oleg Bartunov<br />
* Josh Berkus (Secretary)<br />
* Jeff Davis<br />
* Andrew Dunstan<br />
* Dimitri Fontaine<br />
* Stephen Frost<br />
* Peter Geoghegan<br />
* Kevin Grittner<br />
* Robert Haas<br />
* Magnus Hagander<br />
* Shigeru Hanada<br />
* Hitoshi Harada<br />
* KaiGai Kohei<br />
* Tom Lane<br />
* Noah Misch<br />
* Bruce Momjian<br />
* Dave Page (Chair)<br />
* Simon Riggs<br />
* Teodor Sigaev<br />
* Greg Smith<br />
<br />
== Proposed Agenda Items ==<br />
<br />
Please list proposed agenda items here:<br />
<br />
* Agree CommitFest schedule for 9.3 (Strawman from Simon)<br />
** CF1 June 15, 2012 - 1 month<br />
** CF2 Sep 15, 2012 - 1 month<br />
** CF3 Nov 15, 2012 - 1 month<br />
** CF4 Jan 15, 2013 - 2 months<br />
* Queuing [Dimitri, Kevin]<br />
** Description: efficient and transactional queuing is a very common need for application using databases, and could help implementing some internal features<br />
** Goals: get an agreement that core is the right place where to solve that problem, and what parts of it we want in core exactly<br />
* Materialized views [Kevin]<br />
** Description: Declarative materialized views are a frequently requested feature, but means many things to many people. It's not likely that an initial implementation will address everything. We need a base set of functionality on which to build.<br />
** Goals: Reach consensus on what a minimum feature set for commit would be.<br />
* Partitioning and Segment Exclusion [Dimitri]<br />
** Description: to solve partitioning, we need to agree on a global approach<br />
** Goals: agreeing on SE as a basis for better partitioning, having a "GO" on working on SE<br />
* MERGE: Challenges and priorities [Peter G]<br />
** Description: Implementing the MERGE statement for 9.3. It is envisaged specifically as an atomic "upsert" operation.<br />
** Goals: To get buy-in on various aspects of the feature's development, and, ideally, to secure reviewer resources or other support. Because of the complexity of the feature, early interest from reviewers is preferable.<br />
* Row-level Access Control and SELinux [KaiGai]<br />
** Security label on user tables<br />
** Dynamic expandable enum data types<br />
** Enforcement of triggers by extension<br />
* Enhancement of FDW at v9.3 [KaiGai]<br />
** Writable foreign tables<br />
** Stuffs to be pushed down (Join, Aggregate, Sort, ...)<br />
** Inheritance of foreign/regular tables<br />
** Constraint (PK/FK) & Trigger support.<br />
* Type registry [Andrew]<br />
** Provide for known OIDs for non-builtin types, and possibly for their IO functions too<br />
** Would make it possible to write code in core or in extension X that handles a type defined in extension Y.<br />
* Ending CommitFests in a timely fashion, especially the last one. Avoiding a crush of massive feature patches at the end of the cycle. Handling big patches that aren't quite ready yet. Getting more people to help with patch review. [Robert]<br />
* What Developers Want [Josh]<br />
** Description: a top-5 list of features and obstacles to developer adoption of PostgreSQL (with slides)<br />
** Goal: to set priorities for some features aimed at application users<br />
* In-Place Upgrades & Checksums [Greg Smith, Simon]<br />
** Description: Revisit in-place upgrades of the page format, now that pg_upgrade is available and multiple checksum implementations needing it have been proposed.<br />
** Goal: Nail down some incremental milestones for 9.3 development to aim at.<br />
* Autonomous Transactions [Simon]<br />
** Overview of idea, relationship to stored procedures<br />
** Feedback, buy-in and/or alternatives<br />
* Parallel Query [Bruce Momjian]<br />
** Hope to get buy-in for what parallel operations we are hoping to add in upcoming releases<br />
* Report from Clustering Meeting [Josh] (10 min)<br />
** Description: to summarize the discussions of the cluster-hackers meeting from the previous day<br />
** Goal: inter-team synchronization. Possibly, decisions requested on specific in-core features.<br />
* Double Write Buffers [Simon]<br />
** Is anyone committing to do that for 9.3?<br />
<br />
* Goals, priorities, and resources for 9.3 [All]<br />
** For roadmap and planning purposes, set expectations and coordinate work schedules for 9.3. Confirm who is doing what, identify interested reviewers at start, and check for gaps.<br />
<br />
== Agenda ==<br />
<br />
{| border="1" cellpadding="4" cellspacing="0"<br />
!Time<br />
!Item<br />
!Presenter<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:00<br />
|Breakfast<br />
|<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:30 - 08:45<br />
|Welcome and introductions<br />
|Dave Page<br />
<br />
|-<br />
|08:45 - 09:15<br />
|Autonomous transactions<br />
|Simon Riggs<br />
<br />
|-<br />
|09:15 - 09:40<br />
|[[Queuing]]<br />
|Dimitri Fontaine/Kevin Grittner<br />
<br />
|-<br />
|09:40 - 09:50<br />
|Report from the Clustering Meeting<br />
|Josh Berkus<br />
<br />
|-<br />
|09:50 - 10:10<br />
|Type registry<br />
|Andrew Dunstan<br />
<br />
|-<br />
|10:10 - 10:30<br />
|Access control and SELinux<br />
|KaiGai Kohei<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|10:30 - 10:45<br />
|Coffee break<br />
|<br />
<br />
|-<br />
|10:45 - 11:15<br />
|Enhancement of FDWs in 9.3<br />
|KaiGai Kohei<br />
<br />
|-<br />
|11:15 - 11:30<br />
|What developers want<br />
|Josh Berkus<br />
<br />
|-<br />
|11:30 - 12:00<br />
|Parallel Query<br />
|Bruce Momjian<br />
<br />
|-<br />
|12:00 - 12:30<br />
|MERGE: Challenges and priorities<br />
|Peter Geoghegan<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|12:30 - 13:30<br />
|Lunch <br />
|<br />
<br />
|-<br />
|13:30 - 14:00<br />
|Materialised views<br />
|Kevin Grittner<br />
<br />
|-<br />
|14:00 - 14:20<br />
|In place upgrades and checksums<br />
|Simon Riggs/Greg Smith<br />
<br />
|-<br />
|14:20 - 14:45<br />
|Partitioning and segment exclusion<br />
|Dimitri Fontaine<br />
<br />
|-<br />
|14:45 - 15:00<br />
|Commitfest Schedule<br />
|All<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|15:00 - 15:15<br />
|Tea break<br />
|<br />
<br />
|-<br />
|15:15 - 15:40<br />
|Commitfest management<br />
|Robert Haas<br />
<br />
|-<br />
|15:40 - 16:45<br />
|Goals, priorities, and resources for 9.3<br />
|All<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|16:45 - 17:00<br />
|Any other business/group photo<br />
|Dave Page<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|17:00<br />
|Finish<br />
| <br />
|}<br />
<br />
==Minutes==<br />
<br />
== 2012 Developer Meeting Minutes ==<br />
<br />
Started with introductions.<br />
<br />
=== Autonomous Transactions ===<br />
<br />
Simon brought this to get some feedback on the idea. Autonomous transactions (ATX) are a transaction inside a transaction ... a new top-level transaction. In Oracle, it's not just one new transaction, it's a whole new context which can submit multiple new transactions. There is no connection between parent and child transactions, which can result in new types of deadlocks.<br />
<br />
Each new transaction context would allocate a new pg_exec from a pg_proc call. Implementation is straightforward, we just have to handle locking. This allows us to implement stored procedures in an interesting way. If we treat a stored procedure as an autonomous transaction, then this solves some problems. We can put COMMIT, ROLLBACK and other things in stored procedures. <br />
<br />
Tom suggested that ATX don't need to conflict with parent transaction locks. Noah pointed out some issues with that. We'd need to have a switch for Stored Procedures in order to indicate they are autonomous, like using CREATE STORED PROCEDURE. We'd be using an additional client slot for each ATX, which could be a problem. Oracle's limit on ATX is 70 per connection, which seems like a lot. Maybe we should try to hold them all to a single session, as if it were a subtransaction. Not sure if we can do this; Simon will need to take a look at it.<br />
<br />
ATX also need to eventually be able to run utility commands, like VACUUM and CREATE INDEX CONCURRENTLY. <br />
<br />
=== Queueing ===<br />
<br />
Ultimately the materialized views will need some kind of queueing. Once we have queueing in core, it could be generally useful. CLUSTER CONCURRENTLY would need it, or application queues will need queueing structure. We might want to have it exposed at the SQL level. You put things in the queue, and at commit, others can see it. LISTEN/NOTIFY is sort of a queue, but is only one item and vanishes if you're not listening.<br />
<br />
Like a table, but access semantics are different. Would need logged/unlogged queues. Some discussion about how queues are different from tables. Haas wondered whether what we need for internal queues is the same as what users need for user-visible queues.<br />
<br />
Queue-tables also need different performance characteristics. We don't need queues so much as we need deferred action. We also need background processes which wake up and check the queue. Queues could be built on top of tables. Discussion about uses, designs for queues ensued.<br />
<br />
We need a really clear design spec for how queues would work. There are specific performance improvements we want for queueing, but they're likely to be just improvements on table performance. The idea is to have a generalized API instead of reinventing a bunch of times.<br />
<br />
Next steps is to collect use cases. [[Queueing|Kevin & Dimitri will collect use cases on a wiki page]], to design an API. Performance optimization needs to look at access pattern. Simon pointed out that this works similar to fact tables where you want to move stuff forward constantly. Users might not use queues as pure FIFO.<br />
<br />
Unlinking segments works for deleting from the beginning of a table but indexes could be a problem. Block numbers could be a practical problem, we might need wraparound, or reset-to-zero.<br />
<br />
=== Report on Clustering Meeting ===<br />
<br />
See [[PgCon2012CanadaClusterSummit|minutes]].<br />
<br />
=== Type Registry ===<br />
<br />
WIP idea. Hstores aren't built in, so they get an arbitrary OID, which causes issues with writing generic code. Looking up the type name is expensive. It would be nice to have a registry for types where people writing extensions are allocated an OID. Andrew gave the example of hacking Postgres to support upgrading from the optional JSON type in 9.1 to the built-in type in 9.2.<br />
<br />
We need to expose the pg_upgrade stuff as well, set_binary_upgrade. Should we use something other than an OID? We need the OIDs for upgrade and for drivers. Driver identicalness isn't the same as pg_upgradability, so we might want two different switches for that. Maybe we should have a new OID if you change the storage of a type?<br />
<br />
What's the criterion for allocating an OID? We'll need some kind of judgement. We'll also need to block off the OID reserved space into sections. People generally found this to be a good idea. Andrew will create a wiki page and follow-up. We could just do this for contrib, but that's not really a good idea.<br />
<br />
We could have CREATE TYPE ... WITH OID = ###, for base types only. The folks who want it for ENUM etc. are just replication/clustering authors. There was discussion of other approaches to handling these problems. Users will create types with OIDs which conflict.<br />
<br />
=== Access Control and SE-Linux ===<br />
<br />
Several components: to add security around user tables. Second, to add additional conditions around user queries. Third, a condition around new tuples which are inserted. Fourth, we should have ENUMs to represent user-defined security labels. Did some performance testing on the last part, having labels as OIDs was much faster and closer to non-SE performance.<br />
<br />
There's concurrency issues around seeing new labels -- we'd have a huge issue with inserting the labels into the system table. Creating a new label could be a downtime event; we can have a utility command, and we can require users to create a new label first manually. But what happens if the new label isn't there? Should error just like a constraint.<br />
<br />
Is there a way to query SE-linux to get all of the security labels? That's hard, because it's four fields. The last field is an issue for prediction. There's a lot of value in having row-level security be completely type-agnostic; we just have a string and we don't care what's in it.<br />
<br />
An SE Label consists of: user, row field, type field, and (something inaudible). That last part is a kind of bitmap. Do we actually need that part, though? What's multi-category security, will we support that? How many different labels would you have on a specific table?<br />
<br />
The idea of row-level security is to force quals on people. Currently it's not transparent. The discussion on labels needs to continue elsewhere.<br />
<br />
Also we need to address FK and PK implementation for security labels.<br />
<br />
=== What Developers want ===<br />
<br />
PostgreSQL is becoming the default for many web applications like Ruby and Django. But there are plenty of users complaints. They don't show up on the PostgreSQL mailing lists. The developer complaints are on stackoverflow, forums for virtual hosting companies, and application specific lists like ORM/framework layers.<br />
<br />
Two categories of developer comments: blockers that cause them to use another tool, and enhancers that would expand the market into new areas. Many of these are available features, but they seem too hard to use.<br />
<br />
==== Blockers ====<br />
<br />
1. Installation onto developer laptops (Windows / OS X)<br />
* Re-installs problematic in Windows<br />
* Installing Redis is the competitor here; it is closer to a true one-click installer.<br />
* People use Redis because it's "easy to install", while PostgreSQL ran into one of multiple problems (reported on lists like pgsql-general)<br />
* postgres.app is aiming at simplifying things for Mac developers, is in beta<br />
* Kevin: has also seen issues with Rails + Rake; lots of questions on Stack Overflow.<br />
2. Complexity of configuring PostgreSQL, i.e. postgresql.conf<br />
* Shared memory issues on the Mac<br />
** Could use POSIX shared memory instead of Sys V<br />
* Need a configuration generator and hints for settings that are set incorrectly<br />
**Example: needing to increase the size of the transaction log given that pg_xlog has X GB of space. The math to determine settings like checkpoint_segments given a GB target is complicated.<br />
3. Better analysis and troubleshooting<br />
* Expose everything via SQL (e.g. autovacuum activity); no parsing logs.<br />
* EXPLAIN needs to be easier to understand, and should suggest what needs to be done when planner mistakes are made.<br />
* Freezing a stable query plan is needed for some apps.<br />
4. Easier to understand replication<br />
* External projects that try to help are often less maintained/robust/documented than core<br />
* Same thing is true for pooling projects<br />
5. Better pg_upgrade<br />
* More trustworthy<br />
* Handle version upgrades across large clusters<br />
* Deliver on the <5 minutes promise. The post-upgrade statistics ANALYZE can take a long time; statistics need to be saved and restored instead.<br />
6. MERGE / UPSERT<br />
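The "complicated math" mentioned under item 2 can be sketched roughly as follows, inverting the peak-WAL formula from the PostgreSQL documentation of this era (peak pg_xlog usage is about (2 + checkpoint_completion_target) × checkpoint_segments + 1 segment files of 16MB each). The 4GB budget is only an example, not a recommendation:<br />

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # each WAL segment file is 16MB

def segments_for_budget(budget_gb, checkpoint_completion_target=0.5):
    """Largest checkpoint_segments keeping peak pg_xlog under budget_gb."""
    max_files = budget_gb * 1024**3 / WAL_SEGMENT_BYTES
    # invert: max_files = (2 + cct) * checkpoint_segments + 1
    return int((max_files - 1) / (2 + checkpoint_completion_target))

# A 4GB pg_xlog budget with the default completion target of 0.5:
print(segments_for_budget(4))  # 102
```

This is exactly the kind of arithmetic a configuration generator could do for users instead of asking them to do it by hand.<br />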
<br />
==== Enabling features to broaden userbase ====<br />
<br />
1. Finish JSON support<br />
* Most popular new feature on news sites (LWN etc.) since 9.0 replication<br />
* Some people want simple document storage like NoSQL, but with PostgreSQL reliability<br />
* Needs indexing performance improvements<br />
* More extract from JSON features<br />
* Schemaless PostgreSQL is possible with JSON or hstore, but it's not obvious that's true.<br />
2. Better extensions<br />
* Packaging for popular extensions on popular <br />
* Extensions should follow replication; move .so to standby? Lots of resistance to that idea.<br />
* Better visibility of extensions, and extension aggregators like PGXN.<br />
3. Client language queries<br />
* Straight from, say, Python to a parse tree<br />
* SQL Server/.Net does move in this direction for C#<br />
* Competition here is the non-relational databases<br />
4. Built-in sharding<br />
* PL/Proxy: must find it, minimal docs, questions around support situation<br />
* Target user base here doesn't like SQL or functions much either<br />
* Base on writable FDW?<br />
* Borrow ideas from notable sharded PostgreSQL deployments?<br />
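As a rough illustration of item 4, the routing core of any sharding layer (whether PL/Proxy-style or built on writable FDWs) is tiny; all the hard parts are elsewhere -- rebalancing, transactions, failure handling. The shard names and modulo scheme below are invented for the example:<br />

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2", "shard3"]  # hypothetical node names

def shard_for(key):
    """Deterministically map a distribution key to one shard."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# Every router maps the same key to the same shard:
print(shard_for(42) == shard_for(42))  # True
```

Note the weakness of plain modulo hashing: adding a shard remaps most keys, which is why real deployments lean on consistent hashing or directory lookups.<br />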
<br />
==== Enhancements of FDW in 9.3 ====<br />
<br />
What do we need for FDW in 9.3? Want discussion of what to implement. Hanada is working on pgsql_fdw. Wants this in the core distribution, to replace dblink. Currently FDWs are read-only so users still need dblink. There is a list of features Hanada wants to implement. <br />
<br />
One issue is naming. We already have postgresql_fdw in core, which is used by dblink. pgsql_fdw was proposed, but that doesn't fit our naming conventions. We should maybe rename the dblink one to dblink_fdw. There is also an issue around options: the FDW should consult libpq on which options are supported. Since the function name conflicts are internal, this would only mess with pg_upgrade. <br />
<br />
Features include:<br />
* writeable FDWs<br />
* aggregation pushdown<br />
* table sorting pushdown<br />
* table inheritance with FDW<br />
* constraint support on foreign tables<br />
<br />
Writeable FDW is the most interesting feature. One issue is transaction control; the suggestion is that it's the responsibility of the FDW module to control transactions, not PostgreSQL. There are two ways to do it: one is that every write to a foreign table is an autocommit transaction. The other option is that the foreign table commits when you commit your local transaction. SQL Server automatically does two-phase commit. But it might be better for a first version not to have any transaction control. <br />
<br />
We will implement the 9.3 version with no remote transaction control. Plus, distributed transactions have lots of interesting failure conditions.<br />
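A toy model of the two transaction-control options discussed above, with an in-memory stand-in for the remote server. The class and method names are invented for illustration; real FDWs use the C FDW API:<br />

```python
class RemoteTable:
    """In-memory stand-in for a foreign table on a remote server."""

    def __init__(self):
        self.committed = []  # rows already durable on the remote
        self.pending = []    # rows buffered until local COMMIT

    def write_autocommit(self, row):
        """Option 1: every write is its own remote transaction."""
        self.committed.append(row)

    def write_deferred(self, row):
        """Option 2: buffer writes; flush when the local transaction commits."""
        self.pending.append(row)

    def local_commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def local_rollback(self):
        # Deferred writes vanish; autocommitted ones are already durable.
        self.pending = []

t = RemoteTable()
t.write_autocommit("a")
t.write_deferred("b")
t.local_rollback()
print(t.committed)  # ['a']
```

The rollback case is the argument against option 1: the autocommitted row survives even though the local transaction aborted, which is one of those "interesting failure conditions".<br />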
<br />
KaiGai plans to get pgsql_fdw into the first CF so that we can play with it.<br />
<br />
=== Parallel Query ===<br />
<br />
Everyone run screaming from the room. First, understand that not everyone is I/O-bound: there are cases where the system is primarily memory- or CPU-constrained, such as a handful of very complex queries which are primarily memory-bound. Since we're not always I/O-constrained, we need to look at ways to parallelize for memory/CPU-constrained systems, and we need to start looking incrementally at how we can do some things in parallel. <br />
<br />
The already-completed parallel pg_dump is an example of this. We need more cases where we can surgically parallelize stuff. Josh brought up the issue of PostGIS queries which need CPU parallelism. Greg brought up a 48-core server with 256GB of RAM for a 100GB database: if we can get 4 CPUs on a query, we get better memory bandwidth. We're sometimes memory-bound because of non-sharable memory bandwidth. Bruce told the story of Informix 6's parallelism disaster.<br />
<br />
We need a task list of individual tasks we could parallelize instead of parallelizing everything. We do need a general "helper process" infrastructure so that we can hand work off to them. Simon is working on the parallel worker tasks now. <br />
<br />
Bruce and Greg discussed Greenplum's history. The way we generate query plans makes this hard, since it's kind of a "pull" basis: "gimme a tuple". If our query plan was a task list it would be easier. MPP systems have plans where they look at which steps can be parallelized and what they cost.<br />
<br />
The hard stuff is in the optimizer. Creating a cost model is really difficult. Peter brought up Intel Threading Building Blocks as a generalized parallelism framework with a dependency graph; it has this thing called "task stealing". The classic parallelism case is video rendering, but our tasks are not like that. We need one-off cases for each task. <br />
<br />
It's like the Windows port in terms of scope and complexity. This is different from the Windows port, in that we can do it piecemeal, but we need to decide to go down the road of additional complexity. Dimitri suggested exposing the executor as a virtual machine. A lot of stuff is different. Josh suggested starting with parallel index build as the easiest single task with solid benefit. Bruce points out the even simpler case is to build several indexes in parallel over the same scan.<br />
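Bruce's several-indexes-over-one-scan case can be sketched in a few lines: a single pass over the heap feeds every index build, instead of one full scan per index. The dict-based "indexes" merely stand in for real btree builds:<br />

```python
def build_indexes(rows, key_funcs):
    """One scan of `rows`; each key function populates its own index."""
    indexes = [{} for _ in key_funcs]
    for rownum, row in enumerate(rows):          # the single shared table scan
        for idx, keyf in zip(indexes, key_funcs):
            idx.setdefault(keyf(row), []).append(rownum)
    return indexes

rows = [("alice", 30), ("bob", 25), ("alice", 41)]
by_name, by_age = build_indexes(rows, [lambda r: r[0], lambda r: r[1]])
print(by_name["alice"])  # [0, 2]
```

The appeal is that for N indexes the dominant I/O cost (the heap scan) is paid once rather than N times, even before any true CPU parallelism is added.<br />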
<br />
Additional items that can be parallelized:<br />
<br />
* Redo<br />
* Vacuum<br />
* Logical dump<br />
* Sorting<br />
* Scans<br />
<br />
=== MERGE ===<br />
<br />
Peter hasn't done as much with this as he expected so far, but plans to get something done for 9.3. What's the best way to solve this problem? Josh spoke about the need for atomic UPSERT, Peter agrees that that's a good version 1 goal.<br />
<br />
There's a fair amount of speculation on how to implement this feature. A lot of people want to use predicate locking, but we need an accessible API and some more features for predicate locking to make it work. We could also have a new kind of lock associated with an index tuple. The UPSERT case requires solving the hard problem, general MERGE beyond that is detail work. One thing we need to do is finish deprecating user-definable RULEs. <br />
<br />
Greg worked with a GSoC project for MERGE, but the concurrency handling completely didn't work. We still have to solve the concurrency issues. Robert remembers that there were intrinsically complex issues without even a possible perfect solution. We need to look back at the thread where we examined the problems; the definition of sensible behavior is in question (thread: http://archives.postgresql.org/message-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com ). We need to define the spec first. We can look at what other databases do.<br />
<br />
We can allow weird things to happen -- corner cases -- with MERGE or UPSERT, and tell people to use SSI to avoid those weird issues. The SQL standard's MERGE doesn't really give us UPSERT, so we should use different syntax. We want INSERT ... ON DUPLICATE KEY UPDATE, not REPLACE INTO. We should ask the MySQL folks about the history of this.<br />
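The UPSERT semantics under discussion amount to the classic loop: try an UPDATE, fall back to an INSERT, and go around again if a concurrent inserter wins the race. A toy version against an in-memory table, where a single lock stands in for the index-level machinery being debated:<br />

```python
import threading

table = {}
table_lock = threading.Lock()

def upsert(key, value):
    """INSERT ... ON DUPLICATE KEY UPDATE, as a retry loop.

    With one global lock the first attempt always wins; in a real engine
    the contention is per index key, and losing the insert race to another
    session is what sends you back around the loop.
    """
    while True:
        with table_lock:
            if key in table:
                table[key] = value   # UPDATE path
                return "updated"
            table[key] = value       # INSERT path
            return "inserted"

print(upsert("k", 1))  # inserted
print(upsert("k", 2))  # updated
```

The hard problem named above is precisely that a real database cannot take one big lock, and the window between "key not found" and "insert" is where the weird corner cases live.<br />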
<br />
Job #1 is building the simple case, UPSERT. We can do SQL-standard MERGE later. Greg wants reviewers to commit for this. This is really a Heikki thing. The Executor part needs expert review (Tom?).<br />
<br />
=== Materialized Views ===<br />
<br />
What's the minimum committable patch, and what direction should we take it in? Kevin has time to work on it, but it's been hard to schedule that time. <br />
<br />
* syntax for create/alter<br />
* new relkind in pg_class<br />
* pg_dump and restore support<br />
* being able to index them<br />
* statement to regenerate contents of matview (concurrently?)<br />
<br />
There will be an option to create a matview without filling it with data; pg_dump would use this. Dealing with the various ways of updating matviews, like incremental updates, will come later. If you wanted incremental updates on a matview which is too complex, it would error. Further down the road: incremental updates via a queueing mechanism. <br />
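A minimal model of that Phase-I feature set -- a stored query whose results are cached until an explicit refresh, and which can be created empty the way pg_dump would want. Names here are illustrative, not proposed syntax:<br />

```python
class MatView:
    """A stored query with cached results and manual full refresh."""

    def __init__(self, query, with_data=True):
        self.query = query                      # callable standing in for SQL
        self.data = query() if with_data else None  # "WITH NO DATA" if False

    def refresh(self):
        self.data = self.query()                # full regeneration only

source = [1, 2, 3]
mv = MatView(lambda: sum(source), with_data=False)
print(mv.data)   # None -- created without data
mv.refresh()
print(mv.data)   # 6
source.append(4)
print(mv.data)   # 6 -- stale until the next explicit refresh
```

The last line is the whole point of the "eventually consistent" discussion: between refreshes the view deliberately lags its base data.<br />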
<br />
Also, there's the optimizer -- substituting matviews for base tables automatically. That would be much later. Josh mentioned that someone had already written code for that. KaiGai asked about SE-Postgres and matviews, and discussed it with Kevin. Josh also asked about eventually doing on-request refresh.<br />
<br />
Simon wants us to call it something different from Materialized Views, because we won't have the optimizer support which Oracle does. Kevin is calling it declarative materialized views. And it's not clear that we want to handle query rewrite the same way Oracle does. We can have synchronous update of matviews, but more useful is queueing updates of the views so that they are "eventually consistent". Kevin talked about cranky judges.<br />
<br />
Phase I is just to do the object type and manual refresh. Incremental update will come later. There are a couple of other things you can do if you can guarantee that the matview data would produce the same result. There was discussion around what to call the feature, given that we'll be implementing matviews over several releases. <br />
<br />
Dimitri suggested that we could use matviews as a working concept for correlation stats. Simon discussed issues of setting acceptable staleness at data request time, both for matviews and for replication. <br />
<br />
=== In place upgrades & Checksums ===<br />
<br />
Where had the page format discussion gone wrong in the past? There are 4 issues:<br />
<br />
* adding more bytes in the header<br />
* having multiple page views<br />
* time required to upgrade<br />
<br />
The whole discussion talked about 32-bit checksums. But with 16-bit checksums, we could borrow pg_tli, and add a checksummed bit. Greg said we bump the page format, Robert said no. Greg wants us to "get practice" in having new page formats. We need to flag whether or not the page is checksummed. Will we ever need 32-bit checksums? If we implement 16-bit, we'll find out. <br />
<br />
Simon analyzed the error rate with 16-bit checksums, and felt that it was enough for an 8K page, but not a 32K page. It's not clear why the page size makes a difference. Plus, we're not expecting an error on just one page.<br />
<br />
What are we going to include in the checksum? Jeff has been looking at issues where whole disk blocks get swapped. He suggested including the relfilenode etc. in the checksum in order to make sure that the page is where it's supposed to be. Would it prevent us from moving data around? Changing tablespaces, etc. might be an issue. Is the table OID better or worse than the relfilenode? There was discussion of what pg_upgrade does; the OID seems better.<br />
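Jeff's point can be illustrated with a toy 16-bit checksum that mixes the block address into the checksummed bytes: two byte-identical pages at different locations then verify differently (up to the 1-in-65536 collision odds inherent in any 16-bit sum). CRC-32 folded to 16 bits is purely for illustration, not the algorithm PostgreSQL would actually pick:<br />

```python
import zlib

def page_checksum(page_bytes, block_no):
    """16-bit checksum over the page contents plus its block address."""
    crc = zlib.crc32(page_bytes + block_no.to_bytes(4, "little"))
    return (crc & 0xFFFF) ^ (crc >> 16)        # fold 32 bits down to 16

page = b"\x00" * 8192
print(page_checksum(page, 7), page_checksum(page, 8))
```

Because the address is part of the input, a block written to the wrong location fails verification even though its contents are intact -- which is exactly the swapped-block failure mode described above.<br />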
<br />
Need to have some way to track what's checksummed and what's not in a table. Each page will have a checksum bit. Add command VACUUM CHECKSUM ON. And we don't really have to implement an "old page reader". <br />
<br />
Hint bits are the biggest implementation issue. Simon's approach was to full-page-write all pages with hint bits once per checkpoint cycle, but there's still some stuff to be worked out there. There's an issue with hint bits being set while the page is being written by another process. Discussed the performance impact of this. <br />
<br />
For the first version, we need to look at whether it's reliable; that is more important than the performance. Bulk loading has a major performance issue. Setting hint bits on the first SELECT of a large table generates a whole bunch of WAL traffic.<br />
<br />
=== Partitioning and Segment Exclusion ===<br />
<br />
Current partitioning is "just good enough" to deter building something better. Dimitri has been thinking about what to do instead. Three problems:<br />
<br />
# when do you create the new partitions<br />
# constraint exclusion has all kinds of issues<br />
# index and constraints -- no primary keys etc.<br />
<br />
We've had several proposals. Declarative partitioning syntax. But as long as we have separate tables, we only solve problem 2. We've had 5 years of partial patches for that problem.<br />
<br />
So how about another idea: the problem is having a table with a huge data set, and addressing only part of that table. We already have table segments -- we could have segments which are determined by ordering. The idea is to have an index which, given the partitioning key, would tell us where the tuples are located -- in which segment. <br />
<br />
At what level in the system should a partition exist? Simon pushed for above-table level. Now we're looking at below-table level, so the system defines partitions, not the user. We can look at a large table of 100 segments as having 100 partitions. If we store metadata about each partition, we can look at that to decide which segments to scan. Josh pointed out that this doesn't solve all or even most of the issues which partitioning is intended to solve. This solution is really a heavily compressed index or a performance optimization for scanning large, time-based tables. It's a sort of lossy index.<br />
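The per-segment metadata idea in miniature: keep the min/max of the candidate partitioning key for each segment, and scan only segments whose range overlaps the query -- the "lossy index" described above. The segment names and key ranges are invented for the example:<br />

```python
# Hypothetical per-segment metadata: min/max of the partitioning key,
# e.g. a timestamp or serial column in a time-based table.
segments = [
    {"name": "seg0", "min": 0,   "max": 99},
    {"name": "seg1", "min": 100, "max": 199},
    {"name": "seg2", "min": 200, "max": 299},
]

def segments_to_scan(lo, hi):
    """Segments whose [min, max] key range overlaps the query range [lo, hi]."""
    return [s["name"] for s in segments if s["min"] <= hi and s["max"] >= lo]

print(segments_to_scan(150, 250))  # ['seg1', 'seg2']
```

The lossiness is visible here: a matching segment still has to be scanned in full, but non-overlapping segments are excluded for the cost of two values of metadata each.<br />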
<br />
Don't get hung up on 1GB segments; we might change that in the future, or we could change it for this. Jeff Davis pictured something different and simpler for constraint exclusion. There was discussion about index scans, which may not be as efficient as they could be. Index-only scans need some optimizations. <br />
<br />
=== CommitFest Schedule ===<br />
<br />
Simon proposed a schedule, which includes the last commitfest being 2 months. Robert would like it to be shorter, not longer. Robert pointed out that the final CFs have been getting longer, not shorter since 9.0. Two issues related to commitfests:<br />
<br />
* works better when lots of people volunteer to review<br />
* last commitfest doesn't end.<br />
<br />
We would all benefit if we ended the CF earlier. Robert thought we should make CF4 shorter, not longer. Josh suggested that we could release every 6 months. A big problem is people still writing patches during CF4. We wait until everyone is exhausted and then decide what to bounce. We should make decisions at the beginning of the commitfest. <br />
<br />
Suggested separating review and commitfest. We should triage at the beginning of the commitfest. Robert brought up Dimitri's patches as an example. Robert wants completion over priority, Simon says the opposite. The problem with a consensus process is that there's no consensus. We could have a release manager. It's the big patches which are the real problem, since people really want them and there's lots of stuff in them. <br />
<br />
The problem with prioritization is that we're promoting a big feature which is not quite there over several other patches which are ready. That's not fair to our contributors. But we could triage at the beginning, because we're arbitrarily bumping stuff anyway, and it's better to do it early than late. You can identify which patches are big or small, and which ones have a certain degree of readiness. Even if you're not correct, it'll help people allocate their time.<br />
<br />
For voting on priorities, we could vote and rate which ones are going to be easy or hard and how important they are for us. Dimitri outlined a system of point allocation and voting. Or we could list the committer on a patch at the beginning of the commitfest. That makes sense for the big patches, but not the small ones. So we should identify them at the beginning of the commitfest.<br />
<br />
Everyone is going to argue for their own stuff, though. People have different priorities. We also can't tell committers what to do, we can only ask. We'd like to get committer signoff early in the process. We might also want to sign off reviewers.<br />
<br />
Triage also needs to flag patches where we don't agree on the spec. <br />
<br />
We need to get better about giving feedback on the design of a patch. The problem with posting a design spec is that there's no formal review process for design specs. After CF3, a week of triage: if we haven't seen a big patch by that triage, it doesn't get into CF4. <br />
<br />
Simon pointed out that it's hard to make rules for big patches because each one is different. <br />
<br />
So, changes to the process:<br />
* Planning week after the 3rd commitfest<br />
* "design spec" flagged submissions to the CF<br />
* write docs about the CF process<br />
* one patch, one review requirement<br />
<br />
=== CommitFest Management ===<br />
<br />
CF1: June 15 - July 15<br />
<br />
CF2: Sept 15 - Oct 15<br />
<br />
CF3: Nov 15 - Dec 15<br />
Planning Week - Dec 8-15<br />
<br />
CF4.1: Jan 15 - Feb 15<br />
Final Triage: Feb 1-7<br />
<br />
=== Goals, Priorities, and Resources for 9.3 ===<br />
<br />
Dave: Installers<br />
<br />
Andrew: Aggregation for JSON, projecting data from JSON, pretty-printing SQL, PL/Perl binary format, binary output for psql, Windows builds for extensions.<br />
<br />
Peter: UPSERT, trying to replace Flex, pg_stat_statements for query plans.<br />
<br />
Simon: Bi-Directional Replication<br />
<br />
Hanada: pgsql_fdw, other FDWs.<br />
<br />
Hitoshi: plv8, JSON support, some windowing function improvements.<br />
<br />
Kevin: Declarative materialized views, SSI performance.<br />
<br />
Jeff: statistics for ranges, range keys, range FKs, and range joins.<br />
<br />
Robert: performance, performance, performance. Reducing latency events. Write performance improvements. Can we optimize vacuum some more, reviewing patches.<br />
<br />
Josh: documentation, advocacy, maybe autoconfiguration. Release notes. <br />
<br />
Magnus: configuration directories. Monitoring. Simplifying replication.<br />
<br />
Dimitri: now working on "event triggers". Next step for extensions. Segment exclusion. Queueing in core design spec.<br />
<br />
Tom: backfilling weak spots in the planner.<br />
<br />
Alvaro: finalize FK locks. Allowing ALTER TABLE to reorder columns. <br />
<br />
Bruce: design spec for some parallel operations.<br />
<br />
Oleg & Teodor: improve SP-GiST. Indexing similarity. Also want to work on spatial join. JSON indexing if they can get sponsorship.<br />
<br />
Noah: global temp tables, local XID space for temp tables, more ALTER TABLE improvements.<br />
<br />
Greg: reviving dead projects: config directory, eliminate recovery.conf, adding instrumentation for timing events inside the database.<br />
<br />
KaiGai: SE row-level access control. <br />
<br />
Stephen Frost: list optimization work. SSL under Windows, supporting engines. <br />
<br />
=== Other Business ===<br />
<br />
Josh will write as-we-go release notes for alphas or whatever.<br />
<br />
We could have a mini-developer meeting in Prague. There was discussion about whether we should move the developer meeting around every year. This is the "main" developer meeting, but we could have another one somewhere else. We could have it at FOSDEM, in February.<br />
<br />
Josh brought up the idea of having an unconference day for Postgres contributors. Robert suggested interest group meetings as a refinement of that.<br />
<br />
<br />
[[Category:PostgreSQL Events]]<br />
[[Category:PostgreSQL 9.3]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=PgCon_2012_Developer_Meeting&diff=16788PgCon 2012 Developer Meeting2012-05-09T17:10:48Z<p>Sternocera: </p>
<hr />
<div>A meeting of the most active PostgreSQL developers is being planned for Wednesday 16th May, 2012 near the University of Ottawa, prior to pgCon 2012. In order to keep the numbers manageable, this meeting is '''by invitation only'''. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org). <br />
<br />
Please note that this year the attendee numbers have been cut to try to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.2 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.<br />
<br />
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.<br />
<br />
== Time & Location ==<br />
<br />
The meeting will be from 9AM to 5PM, and will be in the "Red Experience" room at:<br />
<br />
Novotel Ottawa<br />
33 Nicholas Street<br />
Ottawa<br />
Ontario<br />
K1N 9M7<br />
<br />
Food and drink will be provided throughout the day, including breakfast from 8AM.<br />
<br />
[http://maps.google.ca/maps?f=q&source=s_q&hl=en&geocode=&q=novotel+ottawa&aq=&sll=49.891235,-97.15369&sspn=36.237851,79.013672&ie=UTF8&hq=novotel+ottawa&hnear=&ll=45.421528,-75.683699&spn=0.036869,0.077162&z=14&iwloc=A&layer=c&cbll=45.425741,-75.689638&panoid=Z4FUGnkZkdHAOkIxyjjS9Q&cbp=12,25.83,,0,-0.6 View on Google Maps]<br />
<br />
== Attendees ==<br />
<br />
The following people have RSVPed to the meeting (in alphabetical order, by surname):<br />
<br />
* Oleg Bartunov<br />
* Josh Berkus (Secretary)<br />
* Jeff Davis<br />
* Andrew Dunstan<br />
* Dimitri Fontaine<br />
* Stephen Frost<br />
* Peter Geoghegan<br />
* Kevin Grittner<br />
* Robert Haas<br />
* Magnus Hagander<br />
* Shigeru Hanada<br />
* Hitoshi Harada<br />
* KaiGai Kohei<br />
* Tom Lane<br />
* Noah Misch<br />
* Bruce Momjian<br />
* Dave Page (Chair)<br />
* Simon Riggs<br />
* Teodor Sigaev<br />
* Greg Smith<br />
<br />
== Proposed Agenda Items ==<br />
<br />
Please list proposed agenda items here:<br />
<br />
* Agree CommitFest schedule for 9.3 (Strawman from Simon)<br />
** CF1 June 15, 2012 - 1 month<br />
** CF2 Sep 15, 2012 - 1 month<br />
** CF3 Nov 15, 2012 - 1 month<br />
** CF4 Jan 15, 2013 - 2 months<br />
* Priorities for 9.3 [All]<br />
** Description: discuss what people are working on and what's likely to be in 9.3.<br />
** Goals: set expectations and coordinate work schedules for 9.3.<br />
* Queuing [Dimitri, Kevin]<br />
** Description: efficient and transactional queuing is a very common need for applications using databases, and could help implement some internal features<br />
** Goals: get an agreement that core is the right place where to solve that problem, and what parts of it we want in core exactly<br />
* Materialized views [Kevin]<br />
** Description: Declarative materialized views are a frequently requested feature, but mean many things to many people. It's not likely that an initial implementation will address everything. We need a base set of functionality on which to build.<br />
** Goals: Reach consensus on what a minimum feature set for commit would be.<br />
* Partitioning and Segment Exclusion [Dimitri]<br />
** Description: to solve partitioning, we need to agree on a global approach<br />
** Goals: agreeing on SE as a basis for better partitioning, having a "GO" on working on SE<br />
* The MERGE statement: Challenges and priorities [Peter G]<br />
** Description: Implementing the MERGE statement for 9.3. It is envisaged specifically as an atomic "upsert" operation.<br />
** Goals: To get buy-in on various aspects of the feature's development, and, ideally, to secure reviewer resources or other support. Because of the complexity of the feature, early interest from reviewers is preferable.<br />
* Row-level Access Control and SELinux [KaiGai]<br />
** Security label on user tables<br />
** Dynamic expandable enum data types<br />
** Enforcement of triggers by extension<br />
* Enhancement of FDW at v9.3 [KaiGai]<br />
** Writable foreign tables<br />
** Stuff to be pushed down (Join, Aggregate, Sort, ...)<br />
** Inheritance of foreign/regular tables<br />
** Constraint (PK/FK) & Trigger support.<br />
* Type registry [Andrew]<br />
** Provide for known OIDs for non-builtin types, and possibly for their IO functions too<br />
** Would make it possible to write code in core or in extension X that handles a type defined in extension Y.<br />
* Ending CommitFests in a timely fashion, especially the last one. Avoiding a crush of massive feature patches at the end of the cycle. Handling big patches that aren't quite ready yet. Getting more people to help with patch review. [Robert]<br />
* What Developers Want [Josh]<br />
** Description: a top-5 list of features and obstacles to developer adoption of PostgreSQL (with slides)<br />
** Goal: to set priorities for some features aimed at application users<br />
* In-Place Upgrades & Checksums [Greg Smith, Simon]<br />
** Description: Revisit in-place upgrades of the page format, now that pg_upgrade is available and multiple checksum implementations needing it have been proposed.<br />
** Goal: Nail down some incremental milestones for 9.3 development to aim at.<br />
* Autonomous Transactions [Simon]<br />
** Overview of idea, relationship to stored procedures<br />
** Feedback, buy-in and/or alternatives<br />
* Parallel Query [Bruce Momjian]<br />
** Hope to get buy-in for what parallel operations we are hoping to add in upcoming releases<br />
* Report from Clustering Meeting [Josh] (10 min)<br />
** Description: to summarize the discussions of the cluster-hackers meeting from the previous day<br />
** Goal: inter-team synchronization. Possibly, decisions requested on specific in-core features.<br />
* Double Write Buffers [Simon]<br />
** Is anyone committing to do that for 9.3?<br />
* Summarise Commitments at End of Play [Simon]<br />
** For roadmap and planning purposes, confirm who is doing what, assign interested reviewers at start<br />
** Check gaps, identify priorities early on in the cycle<br />
<br />
== Agenda ==<br />
<br />
{| border="1" cellpadding="4" cellspacing="0"<br />
!Time<br />
!Item<br />
!Presenter<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:00<br />
|Breakfast<br />
|<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:30 - 08:45<br />
|Welcome and introductions<br />
|Dave Page<br />
<br />
|-<br />
|08:45 - 09:10<br />
|Goals for 9.3<br />
|Josh Berkus<br />
<br />
|-<br />
|09:10 - 09:35<br />
|Commitfest management<br />
|Robert Haas<br />
<br />
|-<br />
|09:35 - 09:50<br />
|9.2 commitfest schedule<br />
|Simon Riggs<br />
<br />
|-<br />
|09:50 - 10:10<br />
|Type registry<br />
|Andrew Dunstan<br />
<br />
|-<br />
|10:10 - 10:30<br />
|Access control and SELinux<br />
|KaiGai Kohei<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|10:30 - 10:45<br />
|Coffee break<br />
|<br />
<br />
|-<br />
|10:45 - 11:15<br />
|Enhancement of FDWs in 9.3<br />
|KaiGai Kohei<br />
<br />
|-<br />
|11:15 - 11:40<br />
|Autonomous transactions<br />
|Simon Riggs<br />
<br />
|-<br />
|11:40 - 12:05<br />
|Partitioning and segment exclusion<br />
|Dimitri Fontaine<br />
<br />
|-<br />
|12:05 - 12:30<br />
|Queuing<br />
|Dimitri Fontaine/Kevin Grittner<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|12:30 - 13:30<br />
|Lunch <br />
|<br />
<br />
|-<br />
|13:30 - 14:00<br />
|What developers want<br />
|Josh Berkus<br />
<br />
|-<br />
|14:00 - 14:30<br />
|The MERGE statement: Challenges and priorities<br />
|Peter Geoghegan<br />
<br />
|-<br />
|14:30 - 15:00<br />
|Materialised views<br />
|Kevin Grittner<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|15:00 - 15:15<br />
|Tea break<br />
|<br />
<br />
|-<br />
|15:15 - 15:45<br />
|In place upgrades and checksums<br />
|Simon Riggs/Greg Smith<br />
<br />
|-<br />
|15:45 - 16:15<br />
|Parallel Query<br />
|Bruce Momjian<br />
<br />
|-<br />
|16:15 - 16:25<br />
|Report from the Clustering Meeting<br />
|Josh Berkus<br />
<br />
|-<br />
|16:25 - 16:45<br />
|Summarise commitments and identify priorities<br />
|Simon Riggs<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|16:45 - 17:00<br />
|Any other business/group photo<br />
|Dave Page<br />
<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|17:00<br />
|Finish<br />
| <br />
|}<br />
<br />
==Minutes==</div>Sternocerahttps://wiki.postgresql.org/index.php?title=PgCon_2012_Developer_Meeting&diff=16761PgCon 2012 Developer Meeting2012-05-06T21:55:50Z<p>Sternocera: Updating MERGE agenda item as directed</p>
<hr />
<div>A meeting of the most active PostgreSQL developers is being planned for Wednesday 16th May, 2012 near the University of Ottawa, prior to pgCon 2012. In order to keep the numbers manageable, this meeting is '''by invitation only'''. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org). <br />
<br />
Please note that this year the attendee numbers have been cut to try to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.2 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.<br />
<br />
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.<br />
<br />
== Time & Location ==<br />
<br />
The meeting will be from 9AM to 5PM, and will be in the "Red Experience" room at:<br />
<br />
Novotel Ottawa<br />
33 Nicholas Street<br />
Ottawa<br />
Ontario<br />
K1N 9M7<br />
<br />
Food and drink will be provided throughout the day, including breakfast from 8AM.<br />
<br />
[http://maps.google.ca/maps?f=q&source=s_q&hl=en&geocode=&q=novotel+ottawa&aq=&sll=49.891235,-97.15369&sspn=36.237851,79.013672&ie=UTF8&hq=novotel+ottawa&hnear=&ll=45.421528,-75.683699&spn=0.036869,0.077162&z=14&iwloc=A&layer=c&cbll=45.425741,-75.689638&panoid=Z4FUGnkZkdHAOkIxyjjS9Q&cbp=12,25.83,,0,-0.6 View on Google Maps]<br />
<br />
== Attendees ==<br />
<br />
The following people have RSVPed to the meeting (in alphabetical order, by surname):<br />
<br />
* Oleg Bartunov<br />
* Josh Berkus (Secretary)<br />
* Jeff Davis<br />
* Andrew Dunstan<br />
* Dimitri Fontaine<br />
* Stephen Frost<br />
* Peter Geoghegan<br />
* Kevin Grittner<br />
* Robert Haas<br />
* Magnus Hagander<br />
* Shigeru Hanada<br />
* Hitoshi Harada<br />
* KaiGai Kohei<br />
* Tom Lane<br />
* Noah Misch<br />
* Bruce Momjian<br />
* Dave Page (Chair)<br />
* Simon Riggs<br />
* Teodor Sigaev<br />
* Greg Smith<br />
<br />
== Proposed Agenda Items ==<br />
<br />
Please list proposed agenda items here:<br />
<br />
* Agree CommitFest schedule for 9.3 (Strawman from Simon)<br />
** CF1 June 15, 2012 - 1 month<br />
** CF2 Sep 15, 2012 - 1 month<br />
** CF3 Nov 15, 2012 - 1 month<br />
** CF4 Jan 15, 2013 - 2 months<br />
* Priorities for 9.3 [All]<br />
** Description: discuss what people are working on and what's likely to be in 9.3.<br />
** Goals: set expectations and coordinate work schedules for 9.3.<br />
* Queuing [Dimitri, Kevin]<br />
** Description: efficient and transactional queuing is a very common need for applications using databases, and could help implement some internal features<br />
** Goals: reach agreement that core is the right place to solve that problem, and on exactly what parts of it we want in core<br />
* Materialized views [Kevin]<br />
* Partitioning and Segment Exclusion [Dimitri]<br />
** Description: to solve partitioning, we need to agree on a global approach<br />
** Goals: agree on SE as a basis for better partitioning, and get a "GO" on working on SE<br />
* The MERGE statement: Challenges and priorities [Peter G]<br />
** Description: Implementing the MERGE statement for 9.3. It is envisaged specifically as an atomic "upsert" operation.<br />
** Goals: To get buy-in on various aspects of the feature's development, and, ideally, to secure reviewer resources or other support. Because of the complexity of the feature, early interest from reviewers is preferable.<br />
* Row-level Access Control and SELinux [KaiGai]<br />
** Security label on user tables<br />
** Dynamic expandable enum data types<br />
** Enforcement of triggers by extension<br />
* Enhancement of FDW at v9.3 [KaiGai]<br />
** Writable foreign tables<br />
** Features to be pushed down (Join, Aggregate, Sort, ...)<br />
** Inheritance of foreign/regular tables<br />
** Constraint (PK/FK) & Trigger support.<br />
* Type registry [Andrew]<br />
** Provide for known OIDs for non-builtin types, and possibly for their IO functions too<br />
** Would make it possible to write code in core or in extension X that handles a type defined in extension Y.<br />
* Ending CommitFests in a timely fashion, especially the last one. Avoiding a crush of massive feature patches at the end of the cycle. Handling big patches that aren't quite ready yet. Getting more people to help with patch review. [Robert]<br />
* What Developers Want [Josh]<br />
** Description: a top-5 list of features and obstacles to developer adoption of PostgreSQL (with slides)<br />
** Goal: to set priorities for some features aimed at application users<br />
* In-Place Upgrades & Checksums [Greg Smith, Simon]<br />
** Description: Revisit in-place upgrades of the page format, now that pg_upgrade is available and multiple checksum implementations needing it have been proposed.<br />
** Goal: Nail down some incremental milestones for 9.3 development to aim at.<br />
* Autonomous Subtransactions [Simon]<br />
* Parallel Query [Bruce Momjian]<br />
** Hope to get buy-in on the parallel operations we plan to add in upcoming releases<br />
* Report from Clustering Meeting [Josh] (10 min)<br />
** Description: to summarize the discussions of the cluster-hackers meeting from the previous day<br />
** Goal: inter-team synchronization. Possibly, decisions requested on specific in-core features.<br />
<br />
== Agenda ==<br />
<br />
{| border="1" cellpadding="4" cellspacing="0"<br />
!Time<br />
!Item<br />
!Presenter<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:00<br />
|Breakfast<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:45 - 09:00<br />
|Welcome and introductions<br />
|Dave Page<br />
|-<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|10:30 - 10:45<br />
|Coffee break<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|12:30 - 13:30<br />
|Lunch <br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|15:00 - 15:15<br />
|Tea break<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|16:45 - 17:00<br />
|Any other business/group photo<br />
|Dave Page<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|17:00<br />
|Finish<br />
| <br />
|}<br />
<br />
<br />
==Minutes==</div>Sternocerahttps://wiki.postgresql.org/index.php?title=PgCon2012CanadaInCoreReplicationMeeting&diff=16638PgCon2012CanadaInCoreReplicationMeeting2012-04-27T14:32:21Z<p>Sternocera: /* Attendees (alphabetical) */</p>
<hr />
<div>= PostgreSQL In-Core Replication meeting, pgCon 2012 =<br />
<br />
== Time and Place ==<br />
<br />
Wednesday, May 16th, 6pm to 10pm<br />
<br />
Ottawa somewhere, room TBA<br />
<br />
== Agenda ==<br />
<br />
Draft agenda follows. Please let me know of any contributions/changes to the agenda you have:<br />
<br />
# Discussion of Multi-Master Theory (Simon)<br />
# Demonstration of prototypes (Andres)<br />
# Performance comparisons<br />
# My use case (Keaton)<br />
# Social Media use case (Simon)<br />
<br />
Broad and general discussion throughout. Notes and actions will be taken. Volunteers for tasks welcome.<br />
<br />
The meeting will be from 6pm to 10pm, with various forms of food and possibly a drink or two, sponsored by 2ndQuadrant.<br />
<br />
== Attendees (alphabetical) ==<br />
<br />
* Keaton Adams<br />
* Josh Berkus (prefer vegetarian)<br />
* David Fetter<br />
* Dimitri Fontaine<br />
* Andres Freund<br />
* Peter Geoghegan<br />
* Jim Mlodgenski<br />
* Jim Nasby (plus guest)<br />
* Michael Paquier<br />
* Simon Riggs<br />
* Mark Sloan<br />
* Greg Smith<br />
* Koichi Suzuki<br />
* Peter van Hardenberg<br />
* David Wheeler<br />
...<br />
<br />
Meeting limit: about 20-25 people<br />
<br />
=== Joining the Meeting ===<br />
<br />
If you will be able to attend, please email Simon ([mailto:simon@2ndQuadrant.com simon@2ndQuadrant.com]) with the following:<br />
<br />
* Your Name<br />
* What pizza topping you like<br />
<br />
and please come armed with detailed information about your future replication requirements.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=PgCon_2012_Developer_Meeting&diff=16607PgCon 2012 Developer Meeting2012-04-25T22:59:41Z<p>Sternocera: </p>
<hr />
<div>A meeting of the most active PostgreSQL developers is being planned for Wednesday 16th May, 2012 near the University of Ottawa, prior to pgCon 2012. In order to keep the numbers manageable, this meeting is '''by invitation only'''. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org). <br />
<br />
Please note that this year the attendee numbers have been cut to try to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.2 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.<br />
<br />
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.<br />
<br />
== Time & Location ==<br />
<br />
The meeting will be from 9AM to 5PM, and will be in the "Red Experience" room at:<br />
<br />
Novotel Ottawa<br />
33 Nicholas Street<br />
Ottawa<br />
Ontario<br />
K1N 9M7<br />
<br />
Food and drink will be provided throughout the day, including breakfast from 8AM.<br />
<br />
[http://maps.google.ca/maps?f=q&source=s_q&hl=en&geocode=&q=novotel+ottawa&aq=&sll=49.891235,-97.15369&sspn=36.237851,79.013672&ie=UTF8&hq=novotel+ottawa&hnear=&ll=45.421528,-75.683699&spn=0.036869,0.077162&z=14&iwloc=A&layer=c&cbll=45.425741,-75.689638&panoid=Z4FUGnkZkdHAOkIxyjjS9Q&cbp=12,25.83,,0,-0.6 View on Google Maps]<br />
<br />
== Attendees ==<br />
<br />
The following people have RSVPed to the meeting (in alphabetical order, by surname):<br />
<br />
* Oleg Bartunov<br />
* Josh Berkus (Secretary)<br />
* Jeff Davis<br />
* Andrew Dunstan<br />
* Dimitri Fontaine<br />
* Stephen Frost<br />
* Peter Geoghegan<br />
* Kevin Grittner<br />
* Robert Haas<br />
* Magnus Hagander<br />
* Shigeru Hanada<br />
* Hitoshi Harada<br />
* KaiGai Kohei<br />
* Tom Lane<br />
* Noah Misch<br />
* Bruce Momjian<br />
* Dave Page (Chair)<br />
* Simon Riggs<br />
* Teodor Sigaev<br />
* Greg Smith<br />
<br />
== Proposed Agenda Items ==<br />
<br />
Please list proposed agenda items here:<br />
<br />
* Queuing [Dimitri, Kevin]<br />
* Materialized views [Dimitri, Kevin]<br />
* Partitioning and Segment Exclusion [Dimitri]<br />
* The MERGE statement: Challenges and priorities [Peter G]<br />
* Row-level Access Control and SELinux [KaiGai]<br />
** Security label on user tables<br />
** Dynamic expandable enum data types<br />
** Enforcement of triggers by extension<br />
* Enhancement of FDW at v9.3 [KaiGai]<br />
** Writable foreign tables<br />
** Features to be pushed down (Join, Aggregate, Sort, ...)<br />
** Inheritance of foreign/regular tables<br />
** Constraint (PK/FK) & Trigger support.<br />
* Ending CommitFests in a timely fashion, especially the last one. Avoiding a crush of massive feature patches at the end of the cycle. Handling big patches that aren't quite ready yet. Getting more people to help with patch review. [Robert]<br />
* What Developers Want [Josh]<br />
** a top-5 list of features and obstacles to developer adoption of PostgreSQL (with slides)<br />
* In-Place Upgrades & Checksums [Greg Smith, Simon]<br />
* Future of In-Core Replication [Simon]<br />
* Autonomous Subtransactions [Simon]<br />
<br />
== Agenda ==<br />
<br />
{| border="1" cellpadding="4" cellspacing="0"<br />
!Time<br />
!Item<br />
!Presenter<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:00<br />
|Breakfast<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:45 - 09:00<br />
|Welcome and introductions<br />
|Dave Page<br />
|-<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|10:30 - 10:45<br />
|Coffee break<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|12:30 - 13:30<br />
|Lunch <br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|15:00 - 15:15<br />
|Tea break<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|16:45 - 17:00<br />
|Any other business/group photo<br />
|Dave Page<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|17:00<br />
|Finish<br />
| <br />
|}<br />
<br />
<br />
==Minutes==</div>Sternocerahttps://wiki.postgresql.org/index.php?title=PgCon_2012_Developer_Meeting&diff=16589PgCon 2012 Developer Meeting2012-04-20T22:06:44Z<p>Sternocera: /* Proposed Agenda Items */</p>
<hr />
<div>A meeting of the most active PostgreSQL developers is being planned for Wednesday 16th May, 2012 near the University of Ottawa, prior to pgCon 2012. In order to keep the numbers manageable, this meeting is '''by invitation only'''. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org). <br />
<br />
Please note that this year the attendee numbers have been cut to try to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.2 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.<br />
<br />
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.<br />
<br />
== Time & Location ==<br />
<br />
The meeting will be from 9AM to 5PM, and will be in the "Red Experience" room at:<br />
<br />
Novotel Ottawa<br />
33 Nicholas Street<br />
Ottawa<br />
Ontario<br />
K1N 9M7<br />
<br />
Food and drink will be provided throughout the day, including breakfast from 8AM.<br />
<br />
[http://maps.google.ca/maps?f=q&source=s_q&hl=en&geocode=&q=novotel+ottawa&aq=&sll=49.891235,-97.15369&sspn=36.237851,79.013672&ie=UTF8&hq=novotel+ottawa&hnear=&ll=45.421528,-75.683699&spn=0.036869,0.077162&z=14&iwloc=A&layer=c&cbll=45.425741,-75.689638&panoid=Z4FUGnkZkdHAOkIxyjjS9Q&cbp=12,25.83,,0,-0.6 View on Google Maps]<br />
<br />
== Attendees ==<br />
<br />
The following people have RSVPed to the meeting (in alphabetical order, by surname):<br />
<br />
* Oleg Bartunov<br />
* Josh Berkus (Secretary)<br />
* Jeff Davis<br />
* Andrew Dunstan<br />
* Dimitri Fontaine<br />
* Stephen Frost<br />
* Peter Geoghegan<br />
* Kevin Grittner<br />
* Robert Haas<br />
* Magnus Hagander<br />
* Shigeru Hanada<br />
* Hitoshi Harada<br />
* KaiGai Kohei<br />
* Tom Lane<br />
* Noah Misch<br />
* Bruce Momjian<br />
* Dave Page (Chair)<br />
* Simon Riggs<br />
* Teodor Sigaev<br />
* Greg Smith<br />
<br />
== Proposed Agenda Items ==<br />
<br />
Please list proposed agenda items here:<br />
<br />
* Queuing [Dimitri, Kevin]<br />
* Materialized views [Dimitri, Kevin]<br />
* Partitioning and Segment Exclusion [Dimitri]<br />
* The MERGE statement: Challenges, priorities and implementation [Peter]<br />
* Row-level Access Control and SELinux [KaiGai]<br />
** Security label on user tables<br />
** Dynamic expandable enum data types<br />
** Enforcement of triggers by extension<br />
* Enhancement of FDW at v9.3 [KaiGai]<br />
** Writable foreign tables<br />
** Features to be pushed down (Join, Aggregate, Sort, ...)<br />
** Inheritance of foreign/regular tables<br />
** Constraint (PK/FK) & Trigger support.<br />
* GPU Acceleration [KaiGai]<br />
* Ending CommitFests in a timely fashion, especially the last one. Avoiding a crush of massive feature patches at the end of the cycle. Handling big patches that aren't quite ready yet. Getting more people to help with patch review. [Robert]<br />
<br />
== Agenda ==<br />
<br />
{| border="1" cellpadding="4" cellspacing="0"<br />
!Time<br />
!Item<br />
!Presenter<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:00<br />
|Breakfast<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:45 - 09:00<br />
|Welcome and introductions<br />
|Dave Page<br />
|-<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|10:30 - 10:45<br />
|Coffee break<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|12:30 - 13:30<br />
|Lunch <br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|15:00 - 15:15<br />
|Tea break<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|16:45 - 17:00<br />
|Any other business/group photo<br />
|Dave Page<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|17:00<br />
|Finish<br />
| <br />
|}<br />
<br />
<br />
==Minutes==</div>Sternocerahttps://wiki.postgresql.org/index.php?title=What%27s_new_in_PostgreSQL_9.2&diff=16588What's new in PostgreSQL 9.22012-04-20T15:31:57Z<p>Sternocera: /* Performance improvements */ type-specific specializations didn't make the cut.</p>
<hr />
<div>{{Languages}}<br />
<br />
This document showcases many of the latest developments in PostgreSQL 9.2, compared to the last major release &ndash; PostgreSQL 9.1. There are many improvements in this release; this wiki page covers the more important changes in detail. The full list of changes is itemised in the ''Release Notes''.<br />
<br />
'''This page is incomplete!'''<br />
<br />
=Major new features=<br />
<br />
==Index-only scans==<br />
Index-only scans are a new performance feature whereby PostgreSQL can skip the heap visibility check if the index contains all necessary columns, for pages that are known to be all-visible. This feature is similar to '''covering indexes''' in other database systems, although the implementation is different. (More info: [http://www.depesz.com/2011/10/08/waiting-for-9-2-index-only-scans/ depesz blog])<br />
<br />
In previous PostgreSQL versions, every matching index row in an index scan required a visit to the table heap for visibility information. In version 9.2, an index-only scan first checks the smaller [http://www.postgresql.org/docs/devel/static/storage-vm.html visibility map] to see whether all the rows on the particular page are visible. If so, the table heap fetch can be skipped. VACUUM is responsible for setting the visibility map bits.<br />
<br />
This required making visibility map changes crash-safe, so visibility map bit changes are now WAL-logged.<br />
<br />
==Cascading replication==<br />
Streaming replication slaves can now serve as a source for other slaves. This can be used to reduce the impact of replication on the master server. (More info: [http://www.depesz.com/2011/07/26/waiting-for-9-2-cascading-streaming-replication/ depesz blog])<br />
<br />
In a related change, the pg_basebackup command now also works from slaves (More info: [http://www.depesz.com/2012/02/03/waiting-for-9-2-pg_basebackup-from-slave/ depesz blog])<br />
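A cascaded standby is configured like any other standby: its recovery.conf simply points primary_conninfo at an upstream slave instead of at the master. A minimal sketch follows; the hostname and the scratch-directory fallback are hypothetical, not taken from this page:

```shell
# Hypothetical example: configure a second-tier standby that streams from
# another standby ("slave1.example.com") instead of from the master.
# PGDATA should point at the cascaded standby's data directory; for this
# sketch we fall back to a scratch directory.
PGDATA="${PGDATA:-$(mktemp -d)}"
cat > "$PGDATA/recovery.conf" <<'EOF'
standby_mode = 'on'
# Connect to an upstream standby, not the master - this is what 9.2 allows.
primary_conninfo = 'host=slave1.example.com port=5432 user=replication'
EOF
```

The upstream slave itself needs nothing special beyond the usual streaming-replication setup; it serves WAL to its downstream the same way a master would.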
<br />
==Multi-processor scalability improvements==<br />
The lock contention of several big locks has been significantly reduced, leading to better multi-processor scalability. (More info: [http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html Robert Haas blog])<br />
<br />
==JSON datatype==<br />
The JSON datatype is meant for storing JSON-structured data. (More info: [http://www.depesz.com/2012/02/12/waiting-for-9-2-json/ depesz blog])<br />
<br />
== Range Types ==<br />
[[RangeTypes]] are added.<br />
(More info: [http://www.depesz.com/2011/11/07/waiting-for-9-2-range-data-types/])<br />
<br />
=Performance improvements=<br />
<br />
* The performance of in-memory sorts has been improved by up to 25% in some situations, with certain specialized sort functions introduced. (More info: [http://momjian.us/main/blogs/pgblog/2012.html#February_16_2012 Bruce Momjian's blog])<br />
<br />
* An idle PostgreSQL server now makes fewer wakeups, leading to lower power consumption ([http://pgeoghegan.blogspot.com/2012/01/power-consumption-in-postgres-92.html Peter Geoghegan's blog])<br />
<br />
* Timing can now be disabled with EXPLAIN (analyze on, timing off), leading to lower overhead on platforms where getting the current time is expensive ([http://www.depesz.com/2012/02/13/waiting-for-9-2-explain-timing/ depesz blog])<br />
<br />
<br />
[[Category:PostgreSQL 9.2]]</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16509Valgrind2012-04-07T22:52:46Z<p>Sternocera: /* General testing procedure */</p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
'''Warning: The techniques described here are the subject of ongoing research - your mileage may vary'''<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through a configure option; this is possible because the header file valgrind.h is under a BSD license, unlike the rest of Valgrind, which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. The patch is not as comprehensive as it could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
== The modifications themselves ==<br />
<br />
[[User:sternocera|Peter Geoghegan]] maintains a feature branch that has the necessary modifications to Postgres:<br />
<br />
https://github.com/Peter2ndQuadrant/postgres/tree/valgrind<br />
<br />
Per recommendations in the Valgrind documentation, the modifications just copy valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 5, 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional valgrind hook traffic and are redundant with the testing valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
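On an unpatched tree, those defaults can also be flipped by hand in pg_config_manual.h. The sketch below operates on a fabricated stand-in copy of the header, since the real file's contents differ between versions:

```shell
# Sketch: comment out the CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING
# defines. HDR is a fabricated stand-in for pg_config_manual.h; the real
# header wraps these in other conditionals and differs between versions.
HDR=$(mktemp)
cat > "$HDR" <<'EOF'
#define CLOBBER_FREED_MEMORY
#define MEMORY_CONTEXT_CHECKING
EOF
# Rewrite each define into a C comment, leaving everything else untouched.
sed -e 's,^#define CLOBBER_FREED_MEMORY$,/* #define CLOBBER_FREED_MEMORY */,' \
    -e 's,^#define MEMORY_CONTEXT_CHECKING$,/* #define MEMORY_CONTEXT_CHECKING */,' \
    "$HDR" > "$HDR.tmp" && mv "$HDR.tmp" "$HDR"
```

With the patched branch this step is unnecessary, since the patch already switches the defaults.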
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and set log_min_duration_statement to 0 so that all statements are logged. Since the valgrind logs include timestamps and are split by PID, they can be used to correlate valgrind errors with particular test suite commands. Once the test cases yielding valgrind errors are tracked down, you can rerun the valgrind-ed postmaster with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.<br />
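As a sketch of the correlation step: the PID embedded in each valgrind log's file name (from --log-file=%p.log) can be matched against the PID that log_line_prefix wrote into the postmaster log. The log lines below are fabricated for illustration only:

```shell
# Sketch: correlate a valgrind log (named <pid>.log) with postmaster log
# lines whose log_line_prefix carries a timestamp and PID. All log
# contents here are fabricated for illustration.
DIR=$(mktemp -d)
echo "==00:00:02:15.000 12345== Invalid read of size 4" > "$DIR/12345.log"
cat > "$DIR/postmaster.log" <<'EOF'
2012-04-07 22:52:46 UTC [12345] LOG:  duration: 0.420 ms  statement: SELECT 1;
2012-04-07 22:52:47 UTC [99999] LOG:  duration: 0.100 ms  statement: SELECT 2;
EOF
# Pull the PID out of each valgrind log file name and print the statements
# that backend was running.
for f in "$DIR"/[0-9]*.log; do
    pid=$(basename "$f" .log)
    grep "\[$pid\]" "$DIR/postmaster.log"
done
```

With the timestamps also present on both sides, the match can be narrowed to the statement that was in flight when the error was reported.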
<br />
For reasons that have yet to be ascertained, ''it is necessary to run the regression tests with '''autovacuum = 'off''''''. Otherwise, Postgres will segfault within an autovacuum worker's elog() call.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~<br />
$ mkdir pg-valgrind<br />
$ git clone https://github.com/Peter2ndQuadrant/postgres.git<br />
$ cd ~/postgres<br />
# Check out the branch carrying the valgrind modifications<br />
$ git checkout valgrind<br />
# Building at O1 would probably also be acceptable if O0 proves too slow, but avoid O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# contrib regression tests will be run below, make sure the modules are built<br />
$ cd contrib/<br />
$ make && make install<br />
$ cd ~<br />
# If necessary, initdb. Be sure to modify postgresql.conf as appropriate.<br />
# Start Postmaster - core dumps will not appear in $PGDATA, but in pg-valgrind<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgres/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# From another tty, run tests themselves:<br />
$ make installcheck-world<br />
</source><br />
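Once the run completes, a quick way to find the backends that actually reported problems is to scan the per-PID logs' ERROR SUMMARY lines. A sketch, with fabricated sample logs standing in for the real pg-valgrind directory:

```shell
# Sketch: list the per-backend valgrind logs whose ERROR SUMMARY reports a
# nonzero error count. Sample log contents are fabricated for illustration.
LOGDIR=$(mktemp -d)
echo "==101== ERROR SUMMARY: 0 errors from 0 contexts" > "$LOGDIR/101.log"
echo "==202== ERROR SUMMARY: 3 errors from 2 contexts" > "$LOGDIR/202.log"
# Any nonzero count starts with a digit 1-9, so this names only the logs
# worth investigating.
grep -l "ERROR SUMMARY: [1-9]" "$LOGDIR"/*.log
```

Against a real run, point LOGDIR at pg-valgrind and feed the matching PIDs back into the log-correlation step described above.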
<br />
There are probably other places in the Valgrind branch where specific checks should be injected, because shared_buffers effectively scrubs memory from a valgrind perspective.<br />
<br />
The full installcheck-world run has been found to take something around six hours on a modern machine, but memory consumption is not greatly inflated.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16508Valgrind2012-04-07T22:47:12Z<p>Sternocera: /* General testing procedure */</p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
'''Warning: The techniques described here are the subject of ongoing research - your mileage may vary'''<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through a configure option; this is possible because the header file valgrind.h is under a BSD license, unlike the rest of Valgrind, which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. The patch is not as comprehensive as it could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
== The modifications themselves ==<br />
<br />
[[User:sternocera|Peter Geoghegan]] maintains a feature branch that has the necessary modifications to Postgres:<br />
<br />
https://github.com/Peter2ndQuadrant/postgres/tree/valgrind<br />
<br />
Per recommendations in the Valgrind documentation, the modifications just copy valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 5, 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional valgrind hook traffic and are redundant with the testing valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and set log_min_duration_statement to 0 so that all statements are logged. Since the valgrind logs include timestamps and are split by PID, they can be used to correlate valgrind errors with particular test suite commands. Once the test cases yielding valgrind errors are tracked down, you can rerun the valgrind-ed postmaster with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
For reasons that have yet to be ascertained, ''it is necessary to run the regression tests with '''autovacuum = 'off''''''. Otherwise, Postgres will segfault within an autovacuum worker's elog() call.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at O1 would probably also be acceptable if O0 proves too slow, but avoid O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# contrib regression tests will be run below, make sure the modules are built<br />
$ cd contrib/<br />
$ make && make install<br />
# Start Postmaster - core dumps will not appear in $PGDATA, but in pg-valgrind<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
There are probably other places in the Valgrind branch where specific checks should be injected, because shared_buffers effectively scrubs memory from a valgrind perspective.<br />
<br />
The full installcheck-world run has been found to take something around six hours on a modern machine, but memory consumption is not greatly inflated.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16507Valgrind2012-04-07T21:59:21Z<p>Sternocera: /* General testing procedure */</p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
'''Warning: The techniques described here are the subject of ongoing research - your mileage may vary'''<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through a configure option; this is possible because the header file valgrind.h is under a BSD license, unlike the rest of Valgrind, which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. The patch is not as comprehensive as it could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
== The modifications themselves ==<br />
<br />
[[User:sternocera|Peter Geoghegan]] maintains a feature branch that has the necessary modifications to Postgres:<br />
<br />
https://github.com/Peter2ndQuadrant/postgres/tree/valgrind<br />
<br />
Per recommendations in the Valgrind documentation, the modifications just copy valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 5, 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional valgrind hook traffic and are redundant with the testing valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and set log_min_duration_statement to 0 so that all statements are logged. Since the valgrind logs include timestamps and are split by PID, they can be used to correlate valgrind errors with particular test suite commands. Once the test cases yielding valgrind errors are tracked down, you can rerun the valgrind-ed postmaster with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
For reasons that have yet to be ascertained, ''it is necessary to run the regression tests with '''autovacuum = 'off''''''. Otherwise, Postgres will segfault within an autovacuum worker's elog() call.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at O1 would probably also be acceptable if O0 proves too slow, but avoid O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster - core dumps will not appear in $PGDATA, but in pg-valgrind<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
There are probably other places in the Valgrind branch where specific checks should be injected, because shared_buffers effectively scrubs memory from a valgrind perspective.<br />
<br />
The full installcheck-world run has been found to take something around six hours on a modern machine, but memory consumption is not greatly inflated.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16506Valgrind2012-04-07T21:57:22Z<p>Sternocera: </p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
'''Warning: The techniques described here are the subject of ongoing research - your mileage may vary'''<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through a configure option; this is possible because the header file valgrind.h is under a BSD license, unlike the rest of Valgrind, which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it could be, and there are probably other places where specific checks could be usefully injected.<br />
<br />
== The modifications themselves ==<br />
<br />
[[User:sternocera|Peter Geoghegan]] maintains a feature branch that has the necessary modifications to Postgres:<br />
<br />
https://github.com/Peter2ndQuadrant/postgres/tree/valgrind<br />
<br />
Per recommendations in the Valgrind documentation, the modifications just copy valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 5, 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the testing Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
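<br />
In pg_config_manual.h terms, the Valgrind-friendly defaults amount to leaving both symbols undefined, roughly as follows (a sketch of the intent, not the patch's exact text):<br />
<source lang="c"><br />
/*<br />
 * Left undefined for Valgrind builds: the clobbering and context checks<br />
 * these enable add Valgrind hook traffic and duplicate what Memcheck<br />
 * already detects.<br />
 */<br />
/* #define CLOBBER_FREED_MEMORY */<br />
/* #define MEMORY_CONTEXT_CHECKING */<br />
</source><br />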
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0 so that all statements are logged. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
For reasons that have yet to be ascertained, it is necessary to run the regression tests with '''autovacuum = off'''. Otherwise, Postgres will segfault within an autovacuum worker's elog() call.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster - core dumps will not appear in the CWD (that is, $PGDATA) but in pg-valgrind<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
There are probably other places in the Valgrind branch where specific checks should be injected, because shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16505Valgrind2012-04-07T21:23:21Z<p>Sternocera: </p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
'''Warning: The techniques described here are the subject of ongoing research - your mileage may vary'''<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, unlike the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
== The patch itself ==<br />
<br />
[[User:sternocera|Peter Geoghegan]] maintains a feature branch that has the necessary modifications to Postgres:<br />
<br />
https://github.com/Peter2ndQuadrant/postgres/tree/valgrind<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the testing Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster - core dumps will not appear in the CWD (that is, $PGDATA) but in pg-valgrind<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
There are probably other places in the patch where specific checks should be injected, because shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated.<br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=User:Sternocera&diff=16504User:Sternocera2012-04-07T21:18:42Z<p>Sternocera: </p>
<hr />
<div>Peter Geoghegan - peter at 2ndquadrant dot com</div>Sternocerahttps://wiki.postgresql.org/index.php?title=User:Sternocera&diff=16503User:Sternocera2012-04-07T21:18:23Z<p>Sternocera: Created page with "Peter Geoghegan peter@2ndquadrant.com"</p>
<hr />
<div>Peter Geoghegan<br />
peter@2ndquadrant.com</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16501Valgrind2012-04-07T20:20:18Z<p>Sternocera: /* General testing procedure */</p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
'''Warning: The techniques described here are the subject of ongoing research - your mileage may vary'''<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, unlike the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
== The patch itself ==<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the testing Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster - core dumps will not appear in the CWD (that is, $PGDATA) but in pg-valgrind<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
There are probably other places in the patch where specific checks should be injected, because shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated.<br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16500Valgrind2012-04-07T20:16:13Z<p>Sternocera: </p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
'''Warning: The techniques described here are the subject of ongoing research - your mileage may vary'''<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, unlike the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
== The patch itself ==<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the testing Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
Note that shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated.<br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16499Valgrind2012-04-07T20:14:47Z<p>Sternocera: </p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
{{warning|The techniques described here are the subject of ongoing research - your mileage may vary}}<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, unlike the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
== The patch itself ==<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the testing Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
Note that shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated.<br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16489Valgrind2012-04-04T12:28:37Z<p>Sternocera: </p>
<hr />
<div>== Valgrind and Postgres ==<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, unlike the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
== The patch itself ==<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.<br />
<br />
It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the testing Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
== General testing procedure ==<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
Note that shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated.<br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16488Valgrind2012-04-04T12:25:09Z<p>Sternocera: </p>
<hr />
<div>=== Valgrind and Postgres ===<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, unlike the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
=== The patch itself ===<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.<br />
<br />
=== General testing procedure ===<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
Note that shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated. It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the testing Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16487Valgrind2012-04-04T12:13:53Z<p>Sternocera: </p>
<hr />
<div>[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, unlike the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected.<br />
<br />
shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
=== General testing procedure ===<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated. It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the testing Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
=== The patch itself ===<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16486Valgrind2012-04-04T12:06:08Z<p>Sternocera: </p>
<hr />
<div>[[Valgrind]]<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modification to Postgres source files to instrument memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, unlike the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected.<br />
<br />
shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
=== General testing procedure ===<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, and should set log_min_duration_statement to 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the postmaster under Valgrind with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source; at the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated. It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the checking Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
=== The patch itself ===<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16485Valgrind2012-04-04T12:05:17Z<p>Sternocera: /* The patch itself */</p>
<hr />
<div>[[Valgrind]]<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modifying Postgres source files to instrument the memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, as opposed to the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
=== General testing procedure ===<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, as well as a log_min_duration_statement of 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the Valgrind-ed postmaster with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source. At the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated. It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the checking Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
=== The patch itself ===<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch, as of April 3 2012.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16484Valgrind2012-04-04T12:04:57Z<p>Sternocera: /* The patch itself */</p>
<hr />
<div>[[Valgrind]]<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modifying Postgres source files to instrument the memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, as opposed to the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
=== General testing procedure ===<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, as well as a log_min_duration_statement of 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the Valgrind-ed postmaster with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source. At the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated. It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the checking Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
=== The patch itself ===<br />
<br />
http://wiki.postgresql.org/images/d/d7/Valgrind-hooks-v1.patch.gz<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch as of April 3, 2012.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=File:Valgrind-hooks-v1.patch.gz&diff=16483File:Valgrind-hooks-v1.patch.gz2012-04-04T12:04:35Z<p>Sternocera: valgrind for Postgres patch</p>
<hr />
<div>valgrind for Postgres patch</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Valgrind&diff=16482Valgrind2012-04-04T11:59:24Z<p>Sternocera: Created page with "Valgrind [http://http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect ma…"</p>
<hr />
<div>[[Valgrind]]<br />
<br />
[http://valgrind.org Valgrind] is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile programs in detail. In particular, Valgrind's Memcheck tool is useful for detecting these bugs. However, it is non-trivial to use with Postgres, and requires modifying Postgres source files to instrument the memory allocation and memory context infrastructure with various Valgrind macros.<br />
<br />
It is hoped that at some point in the future, Postgres will directly support Valgrind through the use of a configure option, which is possible due to the fact that the header file valgrind.h is under a BSD license, as opposed to the rest of Valgrind which is under the GPL 2. In the meantime, this wiki page is the place to obtain an unofficial patch that adds the necessary calls. It is not as comprehensive as it possibly could be, and there are probably other places where specific checks could be usefully injected. <br />
<br />
shared_buffers effectively scrubs memory from Valgrind's perspective.<br />
<br />
=== General testing procedure ===<br />
<br />
For general tests, the recommended procedure is:<br />
<source lang="bash"><br />
# Build Postgres with the valgrind patch<br />
$ cd ~/postgresql<br />
$ patch -p1 < valgrind_postgres.patch<br />
# Building at -O1 would probably also be acceptable if -O0 proves too slow, but avoid -O2<br />
$ ./configure --enable-debug CFLAGS="-O0 -g"<br />
$ make && make install<br />
# Start Postmaster<br />
$ valgrind --leak-check=no --gen-suppressions=all --suppressions=postgresql/valgrind.supp --time-stamp=yes --log-file=pg-valgrind/%p.log postgres 2>&1 | tee pg-valgrind/postmaster.log<br />
# run tests<br />
$ make installcheck-world<br />
</source><br />
<br />
=== Co-ordination when running tests ===<br />
<br />
postgresql.conf should include a timestamp and PID in log_line_prefix, as well as a log_min_duration_statement of 0. Since the Valgrind logs include timestamps and are split by PID, they can be used to correlate Valgrind errors with particular test suite commands. Once the test cases yielding Valgrind errors are tracked down, you can rerun the Valgrind-ed postmaster with "--track-origins=yes --read-var-info=yes" to get more specific diagnostics. Valgrind 3.6.0 or later should be used to get good pinpointing of the error source. At the time of writing, version 3.7.0 is the latest stable release.<br />
<br />
The full installcheck-world run has been found to take around six hours on a modern machine, but memory consumption is not greatly inflated. It is recommended that you disable CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING when running Valgrind; they add additional Valgrind hook traffic and are redundant with the checking Valgrind performs. The patch actually switches the pg_config_manual.h defaults for those settings.<br />
<br />
=== The patch itself ===<br />
<br />
Per recommendations in the Valgrind documentation, this patch just copies valgrind.h into the PostgreSQL tree. It is current for the master branch as of April 3, 2012.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=PgCon_2012_Developer_Meeting&diff=16340PgCon 2012 Developer Meeting2012-02-29T01:59:29Z<p>Sternocera: /* Attendees */ - alphabetical order</p>
<hr />
<div>A meeting of the most active PostgreSQL developers is being planned for Wednesday 16th May, 2012 near the University of Ottawa, prior to pgCon 2012. In order to keep the numbers manageable, this meeting is '''by invitation only'''. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org). <br />
<br />
Please note that this year the attendee numbers have been cut to try to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.2 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.<br />
<br />
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.<br />
<br />
== Time & Location ==<br />
<br />
The meeting will be from 9AM to 5PM, and will be in the "Red Experience" room at:<br />
<br />
Novotel Ottawa<br />
33 Nicholas Street<br />
Ottawa<br />
Ontario<br />
K1N 9M7<br />
<br />
Food and drink will be provided throughout the day, including breakfast from 8AM.<br />
<br />
[http://maps.google.ca/maps?f=q&source=s_q&hl=en&geocode=&q=novotel+ottawa&aq=&sll=49.891235,-97.15369&sspn=36.237851,79.013672&ie=UTF8&hq=novotel+ottawa&hnear=&ll=45.421528,-75.683699&spn=0.036869,0.077162&z=14&iwloc=A&layer=c&cbll=45.425741,-75.689638&panoid=Z4FUGnkZkdHAOkIxyjjS9Q&cbp=12,25.83,,0,-0.6 View on Google Maps]<br />
<br />
== Attendees ==<br />
<br />
The following people have RSVPed to the meeting (in alphabetical order, by surname):<br />
<br />
* Oleg Bartunov<br />
* Josh Berkus (Secretary)<br />
* Jeff Davis<br />
* Dimitri Fontaine<br />
* Peter Geoghegan<br />
* Magnus Hagander<br />
* Hitoshi Harada<br />
* KaiGai Kohei<br />
* Dave Page (Chair)<br />
* Simon Riggs<br />
* Teodor Sigaev<br />
* Greg Smith<br />
<br />
== Proposed Agenda Items ==<br />
<br />
Please list proposed agenda items here:<br />
<br />
* Queuing [Dimitri]<br />
* Materialized views [Dimitri, Kevin?]<br />
* Partitioning and Segment Exclusion [Dimitri]<br />
* Row-level Access Control and SELinux [KaiGai]<br />
** Security label on user tables<br />
** Dynamic expandable enum data types<br />
** Enforcement of triggers by extension<br />
* Enhancement of FDW at v9.3 [KaiGai]<br />
** Writable foreign tables<br />
** Things to be pushed down (Join, Aggregate, Sort, ...)<br />
** Inheritance of foreign/regular tables<br />
** Constraint (PK/FK) & Trigger support.<br />
* GPU Acceleration [KaiGai]<br />
<br />
== Agenda ==<br />
<br />
{| border="1" cellpadding="4" cellspacing="0"<br />
!Time<br />
!Item<br />
!Presenter<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:00<br />
|Breakfast<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|08:45 - 09:00<br />
|Welcome and introductions<br />
|Dave Page<br />
|-<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|10:30 - 10:45<br />
|Coffee break<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|12:30 - 13:30<br />
|Lunch <br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|15:00 - 15:15<br />
|Tea break<br />
|<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|16:45 - 17:00<br />
|Any other business/group photo<br />
|Dave Page<br />
|- style="font-style:italic;background-color:lightgray;"<br />
|17:00<br />
|Finish<br />
| <br />
|}<br />
<br />
<br />
==Minutes==</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Group_commit&diff=16194Group commit2012-01-21T01:51:36Z<p>Sternocera: </p>
<hr />
<div>=== Description of feature ===<br />
<br />
''Group commit'' is a feature planned for PostgreSQL 9.2.<br />
<br />
The feature is being developed by Simon Riggs and Peter Geoghegan. The latest -hackers thread on the feature is: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php .<br />
<br />
Broadly speaking, a group commit feature enables PostgreSQL to commit a group of transactions in a batch, amortizing the cost of flushing WAL. The proposed implementation described on this page is heavily based on the existing synchronous replication implementation. It supersedes the commit_siblings "group commit" implementation of prior versions. This earlier implementation was never really considered effective, and its use was weighed down by caveats, so in practice it was used only very infrequently. It is anticipated that the proposed implementation will be turned on by default, and it may not be possible to turn it off.<br />
<br />
=== Benchmark ===<br />
<br />
Benchmarking of this feature has been performed with Greg Smith's pgbench-tools (https://github.com/gregs1104/pgbench-tools) . Here are results for the initial benchmark:<br />
<br />
http://wiki.postgresql.org/images/5/50/Group-commit-pgbench-tools.pdf<br />
<br />
Revised results, with semaphore implementation:<br />
<br />
http://wiki.postgresql.org/images/c/c6/Group-commit-semaphore-results.pdf<br />
<br />
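For reference, a single run of the kind pgbench-tools drives looks roughly like the following. This is a sketch: the scale factor, client count, duration, and database name are illustrative placeholders, not the values behind the graphs above, and the run() wrapper simply lets the commands be printed instead of executed.

```shell
# Sketch of one pgbench run; the default TPC-B-like script commits
# once per transaction, so WAL flush cost dominates and group commit helps.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi
}
run pgbench -i -s 100 bench           # initialize at scale factor 100
run pgbench -c 16 -j 4 -T 300 bench   # 16 clients, 4 threads, 5 minutes
```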
These results were obtained on an ext4 (Linux kernel 3.1) filesystem with LVM. The hard disk used was a WDC WD3200BEKT-08PVMT1 7200 RPM SATA disk, with write caching enabled.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=File:Group-commit-semaphore-results.pdf&diff=16193File:Group-commit-semaphore-results.pdf2012-01-21T01:49:30Z<p>Sternocera: </p>
<hr />
<div></div>Sternocerahttps://wiki.postgresql.org/index.php?title=Group_commit&diff=16170Group commit2012-01-17T14:54:10Z<p>Sternocera: </p>
<hr />
<div>=== Description of feature ===<br />
<br />
''Group commit'' is a feature planned for PostgreSQL 9.2.<br />
<br />
The feature is being developed by Simon Riggs and Peter Geoghegan. The latest -hackers thread on the feature is: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php .<br />
<br />
Broadly speaking, a group commit feature enables PostgreSQL to commit a group of transactions in a batch, amortizing the cost of flushing WAL. The proposed implementation described on this page is heavily based on the existing synchronous replication implementation. It supersedes the commit_siblings "group commit" implementation of prior versions. This earlier implementation was never really considered effective, and its use was weighed down by caveats, so in practice it was used only very infrequently. It is anticipated that the proposed implementation will be turned on by default, and it may not be possible to turn it off.<br />
<br />
=== Benchmark ===<br />
<br />
Benchmarking of this feature has been performed with Greg Smith's pgbench-tools (https://github.com/gregs1104/pgbench-tools) . Here are results for the initial benchmark:<br />
<br />
http://wiki.postgresql.org/images/5/50/Group-commit-pgbench-tools.pdf<br />
<br />
These results were obtained on an ext4 (Linux kernel 3.1) filesystem with LVM. The hard disk used was a WDC WD3200BEKT-08PVMT1 7200 RPM SATA disk, with write caching enabled.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Group_commit&diff=16169Group commit2012-01-17T14:53:33Z<p>Sternocera: </p>
<hr />
<div>== Group commit ==<br />
<br />
''Group commit'' is a feature planned for PostgreSQL 9.2.<br />
<br />
=== Description of feature ===<br />
<br />
The feature is being developed by Simon Riggs and Peter Geoghegan. The latest -hackers thread on the feature is: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php .<br />
<br />
Broadly speaking, a group commit feature enables PostgreSQL to commit a group of transactions in a batch, amortizing the cost of flushing WAL. The proposed implementation described on this page is heavily based on the existing synchronous replication implementation. It supersedes the commit_siblings "group commit" implementation of prior versions. This earlier implementation was never really considered effective, and its use was weighed down by caveats, so in practice it was used only very infrequently. It is anticipated that the proposed implementation will be turned on by default, and it may not be possible to turn it off.<br />
<br />
=== Benchmark ===<br />
<br />
Benchmarking of this feature has been performed with Greg Smith's pgbench-tools (https://github.com/gregs1104/pgbench-tools) . Here are results for the initial benchmark:<br />
<br />
http://wiki.postgresql.org/images/5/50/Group-commit-pgbench-tools.pdf<br />
<br />
These results were obtained on an ext4 (Linux kernel 3.1) filesystem with LVM. The hard disk used was a WDC WD3200BEKT-08PVMT1 7200 RPM SATA disk, with write caching enabled.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Group_commit&diff=16168Group commit2012-01-17T14:49:36Z<p>Sternocera: </p>
<hr />
<div>== Group commit ==<br />
<br />
''Group commit'' is a feature planned for PostgreSQL 9.2.<br />
<br />
=== Description of feature ===<br />
<br />
The feature is being developed by Simon Riggs and Peter Geoghegan. The latest -hackers thread on the feature is: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php .<br />
<br />
Broadly speaking, a group commit feature enables PostgreSQL to commit a group of transactions in a batch, amortizing the cost of flushing WAL. The proposed implementation described on this page is heavily based on the existing synchronous replication implementation. It supersedes the commit_siblings "group commit" implementation of prior versions. This earlier implementation was never really considered effective, and its use was weighed down by caveats, so in practice it was used only very infrequently. It is anticipated that the proposed implementation will be turned on by default, and it may not be possible to turn it off.<br />
<br />
=== Benchmark ===<br />
<br />
Benchmarking of this feature has been performed with Greg Smith's pgbench-tools (https://github.com/gregs1104/pgbench-tools) . Here are results for the initial benchmark:<br />
<br />
http://wiki.postgresql.org/images/5/50/Group-commit-pgbench-tools.pdf<br />
<br />
These results were obtained on an ext4 (Linux kernel 3.1) filesystem with LVM. The hard disk used was a WDC WD3200BEKT-08PVMT1 7200 RPM SATA disk, with write caching enabled.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Group_commit&diff=16167Group commit2012-01-17T13:58:55Z<p>Sternocera: /* Description of feature */</p>
<hr />
<div>== Group commit ==<br />
<br />
''Group commit'' is a feature planned for PostgreSQL 9.2. The feature is being developed by Simon Riggs and Peter Geoghegan. The latest -hackers thread on the feature is: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php .<br />
<br />
=== Description of feature ===<br />
<br />
Broadly speaking, a group commit feature enables PostgreSQL to commit a group of transactions in a batch, amortizing the cost of flushing WAL. The proposed implementation described on this page is heavily based on the existing synchronous replication implementation. It supersedes the commit_siblings "group commit" implementation of prior versions. This earlier implementation was never really considered effective, and its use was weighed down by caveats, so in practice it was used only very infrequently. It is anticipated that the proposed implementation will be turned on by default, and it may not be possible to turn it off.<br />
<br />
=== Benchmark ===<br />
<br />
Benchmarking of this feature has been performed with Greg Smith's pgbench-tools (https://github.com/gregs1104/pgbench-tools) . Here are results for the initial benchmark:<br />
<br />
http://wiki.postgresql.org/images/5/50/Group-commit-pgbench-tools.pdf<br />
<br />
These results were obtained on an ext4 (Linux kernel 3.1) filesystem with LVM. The hard disk used was a WDC WD3200BEKT-08PVMT1 7200 RPM SATA disk, with write caching enabled.</div>Sternocerahttps://wiki.postgresql.org/index.php?title=Group_commit&diff=16166Group commit2012-01-17T13:18:26Z<p>Sternocera: /* Benchmark */</p>
<hr />
<div>== Group commit ==<br />
<br />
''Group commit'' is a feature planned for PostgreSQL 9.2. The feature is being developed by Simon Riggs and Peter Geoghegan. The latest -hackers thread on the feature is: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php .<br />
<br />
=== Description of feature ===<br />
<br />
Broadly speaking, a group commit feature enables PostgreSQL to commit a group of transactions in a batch, amortizing the cost of flushing WAL. The proposed implementation described on this page is heavily based on the existing synchronous replication implementation.<br />
<br />
=== Benchmark ===<br />
<br />
Benchmarking of this feature has been performed with Greg Smith's pgbench-tools (https://github.com/gregs1104/pgbench-tools) . Here are results for the initial benchmark:<br />
<br />
http://wiki.postgresql.org/images/5/50/Group-commit-pgbench-tools.pdf<br />
<br />
These results were obtained on an ext4 (Linux kernel 3.1) filesystem with LVM. The hard disk used was a WDC WD3200BEKT-08PVMT1 7200 RPM SATA disk, with write caching enabled.</div>Sternocera