What's new in PostgreSQL 9.2

This document showcases many of the latest developments in PostgreSQL 9.2, compared to the last major release, PostgreSQL 9.1. There are many improvements in this release, so this wiki page covers the more important changes in detail. The full list of changes is itemised in the Release Notes.


Major new features

Index-only scans

In PostgreSQL, indexes have no "visibility" information. This means that when you access a record by its index, PostgreSQL has to visit the real tuple in the table to be sure it is visible to you: the tuple the index points to may simply be an old version of the record you are looking for.

This can be a big performance problem: the index is mostly ordered, so accessing its entries is quite efficient, while the table rows they point to may be scattered all over the place (that's one reason why PostgreSQL has a CLUSTER command, but that's another story). In 9.2, PostgreSQL will use an "Index Only Scan" when possible, and not access the table row at all if it doesn't need to.

There is still no visibility information in the index. So in order to do this, PostgreSQL uses the visibility map, which tells it whether the whole content of a (usually) 8K page is visible to all transactions or not. When the index entry points to a tuple contained in an «all visible» page, PostgreSQL won't have to access the tuple; it will be able to build the result directly from the index. Of course, all the columns requested by the query must be in the index.

The visibility map is maintained by VACUUM (it sets the visible bit), and by the backends doing SQL work (they unset the visible bit).

If the data has only been read since the last VACUUM, then it is still all visible and the index only scan feature can improve performance.

Here is an example.

 CREATE TABLE demo_ios (col1 float, col2 float, col3 text);

We'll fill this table with random data, in order to have "scattered" rows, and insert 100 million records so that the table doesn't fit in memory (this is a 4GB-RAM machine). This is an ideal case, built for this demo; the gains won't be that big in real life.

 INSERT INTO demo_ios SELECT generate_series(1,100000000),random(), 'mynotsolongstring';
 
 SELECT pg_size_pretty(pg_total_relation_size('demo_ios'));
  pg_size_pretty 
 ----------------
  6512 MB

Let's pretend that the query is this:

 SELECT col1,col2 FROM demo_ios where col2 BETWEEN 0.01 AND 0.02

In order to use an index only scan on this query, we need an index on col2,col1 (col2 first, as it is used in the WHERE clause).

 CREATE index idx_demo_ios on demo_ios(col2,col1);

We vacuum the table, so that the visibility map is up to date:

 VACUUM demo_ios;

All the timings you'll see below are measured with cold OS and PostgreSQL caches (that's where the gains are, as the purpose of Index Only Scans is to reduce I/O).

Let's first try without Index Only Scans:

 SET enable_indexonlyscan to off;
 
 EXPLAIN (analyze,buffers) select col1,col2 FROM demo_ios where col2 between 0.01 and 0.02;
                                                                QUERY PLAN                                                               
 ----------------------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on demo_ios  (cost=25643.01..916484.44 rows=993633 width=16) (actual time=763.391..362963.899 rows=1000392 loops=1)
    Recheck Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
    Rows Removed by Index Recheck: 68098621
    Buffers: shared hit=2 read=587779
    ->  Bitmap Index Scan on idx_demo_ios  (cost=0.00..25394.60 rows=993633 width=0) (actual time=759.011..759.011 rows=1000392 loops=1)
          Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
          Buffers: shared hit=2 read=3835
  Total runtime: 364390.127 ms


With Index Only Scans:

 explain (analyze,buffers) select col1,col2 from demo_ios where col2 between 0.01 and 0.02;
                                                                   QUERY PLAN                                                                   
 -----------------------------------------------------------------------------------------------------------------------------------------------
  Index Only Scan using idx_demo_ios on demo_ios  (cost=0.00..35330.93 rows=993633 width=16) (actual time=58.100..3250.589 rows=1000392 loops=1)
    Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
    Heap Fetches: 0
    Buffers: shared hit=923073 read=3848
  Total runtime: 4297.405 ms


As nothing is free, there are a few things to keep in mind:

  • Adding indexes for index only scans obviously adds indexes to your table. So updates will be slower.
  • You will index columns that weren't indexed before. So there will be fewer opportunities for HOT updates.
  • Gains will probably be smaller in real life situations, especially when data is changed between VACUUMs.

This required making visibility map changes crash-safe, so visibility map bit changes are now WAL-logged.
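
One way to get an idea of how much of a table is currently marked all visible (and therefore how effective an index only scan can be) is to look at the relallvisible column of pg_class, which VACUUM keeps up to date. This is just an illustrative query against the demo table used above:

 SELECT relallvisible, relpages
 FROM pg_class
 WHERE relname = 'demo_ios';

If relallvisible is close to relpages, most index lookups will be able to skip the heap (the "Heap Fetches" counter in the EXPLAIN output above).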

Replication improvements

Streaming Replication becomes more polished with this release.

One of the main remaining gripes about streaming replication was that all the slaves had to be connected to the one and only master, consuming its resources. Moreover, in case of a failover, it could be complicated to reconnect all the remaining slaves to the newly promoted master, unless one was using a tool like repmgr.

With 9.2, a standby can also send replication changes, allowing cascading replication.

Let's build this. We start with an already working 9.2 database.

We set it up for replication:

postgresql.conf:

 wal_level=hot_standby #(could be archive too)
 max_wal_senders=5
 hot_standby=on

You'll probably also want to activate archiving in production; it won't be done here.

pg_hba.conf (do not use trust in production):

 host   replication replication_user          0.0.0.0/0                      md5

Create the user:

 create user replication_user replication password 'secret';

Clone the cluster:

 pg_basebackup -h localhost -U replication_user -D data2
 Password:

We have a brand new cluster in the data2 directory. We'll change the port so that it can start (postgresql.conf), as both clusters are running on the same machine:

 port=5433

We add a recovery.conf to tell it how to stream from the master database:

 standby_mode = on
 primary_conninfo = 'host=localhost port=5432 user=replication_user password=secret'

We start this new cluster:

 pg_ctl -D data2 start
 server starting
 LOG:  database system was interrupted; last known up at 2012-07-03 17:58:09 CEST
 LOG:  creating missing WAL directory "pg_xlog/archive_status"
 LOG:  entering standby mode
 LOG:  streaming replication successfully connected to primary
 LOG:  redo starts at 0/9D000020
 LOG:  consistent recovery state reached at 0/9D0000B8
 LOG:  database system is ready to accept read only connections

Now, let's add a second slave, which will stream from the first one:

 pg_basebackup -h localhost -U replication_user -D data3 -p 5433
 Password: 

We edit data3's postgresql.conf to change the port:

 port=5434

We edit data3's recovery.conf to stream from the first slave:

 standby_mode = on
 primary_conninfo = 'host=localhost port=5433 user=replication_user password=secret'

We start the third cluster:

 pg_ctl -D data3 start
 server starting
 LOG:  database system was interrupted while in recovery at log time 2012-07-03 17:58:09 CEST
 HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
 LOG:  creating missing WAL directory "pg_xlog/archive_status"
 LOG:  entering standby mode
 LOG:  streaming replication successfully connected to primary
 LOG:  redo starts at 0/9D000020
 LOG:  consistent recovery state reached at 0/9E000000
 LOG:  database system is ready to accept read only connections

Now, everything modified on the master cluster gets streamed to the first slave, and from there to the second slave. This second replication has to be monitored from the first slave (the master knows nothing about it).


As you may have noticed from the example, pg_basebackup now works from slaves.

There is another use case that wasn't covered: what if a user didn't care for having a full-fledged slave, and only wanted to stream the WAL files to another location, to benefit from the reduced data loss without the burden of maintaining a slave?

pg_receivexlog is provided just for this purpose: it pretends to be a PostgreSQL slave, but only stores the log files as they are streamed, in a directory:

 pg_receivexlog -D /tmp/new_logs -h localhost -U replication_user

will connect to the master (or a slave), and start creating files:

 ls /tmp/new_logs/
 00000001000000000000009E.partial

The files have the standard segment size, so they can be used for a normal recovery of the database. It's similar to what archive_command provides, but with a much finer granularity.

Remember to rename the last segment to remove the .partial suffix before using it with a PITR restore or any other operation.

The synchronous_commit parameter has a new value: remote_write. It can be used when there is a synchronous slave (synchronous_standby_names is set): the master no longer has to wait for the slave to have written the data to disk, only for the slave to have acknowledged receiving it. With this setting, data is protected from a crash on the master, but could still be lost if the slave crashed at the same time (i.e. before it had written the in-flight data to disk). As this is quite a remote possibility, and the performance improvement is large, some people will be interested in this compromise.
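
For illustration, assuming a synchronous standby whose application_name is standby1, the relevant settings on the master would look something like this (a sketch, not a full configuration):

 synchronous_standby_names = 'standby1'   # the standby's application_name
 synchronous_commit = remote_write        # wait for the standby to receive the WAL, not to flush it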

JSON datatype

The JSON datatype is meant for storing JSON-structured data. It will validate that the input JSON string is correct JSON:

 =# SELECT '{"username":"john","posts":121,"emailaddress":"john@nowhere.com"}'::json;
                                json                                
 -------------------------------------------------------------------
  {"username":"john","posts":121,"emailaddress":"john@nowhere.com"}
 (1 row)
 
 =# SELECT '{"username","posts":121,"emailaddress":"john@nowhere.com"}'::json;
 ERROR:  invalid input syntax for type json at character 8
 DETAIL:  Expected ":", but found ",".
 CONTEXT:  JSON data, line 1: {"username",...
 STATEMENT:  SELECT '{"username","posts":121,"emailaddress":"john@nowhere.com"}'::json;
 ERROR:  invalid input syntax for type json
 LINE 1: SELECT '{"username","posts":121,"emailaddress":"john@nowhere...
                ^
 DETAIL:  Expected ":", but found ",".
 CONTEXT:  JSON data, line 1: {"username",...

You can also convert a row type to JSON:

 =#SELECT * FROM demo ;
  username | posts |    emailaddress     
 ----------+-------+---------------------
  john     |   121 | john@nowhere.com
  mickael  |   215 | mickael@nowhere.com
 (2 rows)
 
 =# SELECT row_to_json(demo) FROM demo;
                                row_to_json                               
 -------------------------------------------------------------------------
  {"username":"john","posts":121,"emailaddress":"john@nowhere.com"}
  {"username":"mickael","posts":215,"emailaddress":"mickael@nowhere.com"}
 (2 rows)

Or an array type:


 =# select array_to_json(array_agg(demo)) from demo;
                                                                 array_to_json                                                                
 ---------------------------------------------------------------------------------------------------------------------------------------------
  [{"username":"john","posts":121,"emailaddress":"john@nowhere.com"},{"username":"mickael","posts":215,"emailaddress":"mickael@nowhere.com"}]
 (1 row)

Range Types

Range types are used to store a range of data of a given type. There are a few pre-defined types. They are integer (int4range), bigint (int8range), numeric (numrange), timestamp without timezone (tsrange), timestamp with timezone (tstzrange), and date (daterange).

Ranges can be made of continuous (numeric, timestamp...) or discrete (integer, date...) data types. They can be open (the bound isn't part of the range) or closed (the bound is part of the range). A bound can also be infinite.
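
As a quick illustration, the range constructors take a third argument describing the bounds, a NULL bound means infinite, and lower()/upper() extract the bounds again:

 SELECT int4range(10, 20, '[)') AS half_open,
        numrange(1.5, NULL)     AS no_upper_bound,
        lower('[10,20)'::int4range) AS lower_bound,
        upper('[10,20)'::int4range) AS upper_bound;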

Without these datatypes, most people solve the range problems by using two columns in a table. These range types are much more powerful, as you can use many operators on them.

Here is the intersection between the 1000 (open) – 2000 (closed) and the 1000 (closed) – 1200 (closed) numeric ranges:

 SELECT '(1000,2000]'::numrange * '[1000,1200]'::numrange;
   ?column?   
 -------------
  (1000,1200]
 (1 row)

So you can query on things like: «give me all ranges that intersect this»:

 =# SELECT * from test_range ;
                        period                        
 -----------------------------------------------------
  ["2012-01-01 00:00:00+01","2012-01-02 12:00:00+01"]
  ["2012-01-01 00:00:00+01","2012-03-01 00:00:00+01"]
  ["2008-01-01 00:00:00+01","2015-01-01 00:00:00+01"]
 (3 rows)
 
 
 =# SELECT * FROM test_range WHERE period && '[2012-01-03 00:00:00,2012-01-03 12:00:00]'; 
                        period                        
 -----------------------------------------------------
  ["2012-01-01 00:00:00+01","2012-03-01 00:00:00+01"]
  ["2008-01-01 00:00:00+01","2015-01-01 00:00:00+01"]
 (2 rows)

This query could use an index defined like this:

 =# CREATE INDEX idx_test_range on test_range USING gist (period);

You can also use these range data types to define exclusion constraints:

 CREATE EXTENSION btree_gist ;
 CREATE TABLE reservation (room_id int, period tstzrange);
 ALTER TABLE reservation ADD  EXCLUDE USING GIST (room_id WITH =, period WITH &&);

This means that it is now forbidden to have two records in this table where room_id is equal and the periods overlap. The btree_gist extension is required to include room_id in a GiST index (it's an integer, usually indexed with a btree).

 =# INSERT INTO reservation VALUES (1,'(2012-08-23 14:00:00,2012-08-23 15:00:00)');
 INSERT 0 1
 =# INSERT INTO reservation VALUES (2,'(2012-08-23 14:00:00,2012-08-23 15:00:00)');
 INSERT 0 1
 =# INSERT INTO reservation VALUES (1,'(2012-08-23 14:45:00,2012-08-23 15:15:00)');
 ERROR:  conflicting key value violates exclusion constraint "reservation_room_id_period_excl"
 DETAIL:  Key (room_id, period)=(1, ("2012-08-23 14:45:00+02","2012-08-23 15:15:00+02")) 
 conflicts with existing key (room_id, period)=(1, ("2012-08-23 14:00:00+02","2012-08-23 15:00:00+02")).
 STATEMENT:  INSERT INTO reservation VALUES (1,'(2012-08-23 14:45:00,2012-08-23 15:15:00)');

One can also declare new range types.
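
For instance, here is the floatrange example from the documentation, a range over float8 (the subtype_diff function helps GiST indexing):

 CREATE TYPE floatrange AS RANGE (
     subtype = float8,
     subtype_diff = float8mi
 );
 
 SELECT '[1.5,3.5)'::floatrange @> 2.0::float8;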

Performance improvements

This version brings performance improvements in a very wide range of areas (a non-exhaustive list):

  • The most visible is probably Index Only Scans, which have already been introduced in this document.
  • The contention on several big locks has been significantly reduced, leading to better multi-processor scalability, mostly for machines with more than 32 cores.
  • The performance of in-memory sorts has been improved by up to 25% in some situations, thanks to specialized sort functions.
  • An idle PostgreSQL server now makes fewer wakeups, leading to lower power consumption. This is especially useful in virtualized and embedded environments.
  • COPY has been improved: it generates less WAL volume and takes fewer locks on a table's pages.
  • Statistics are collected on array contents, allowing for better selectivity estimates on array operations.
  • Text-to-anytype concatenation and the quote_literal/quote_nullable functions are no longer volatile, enabling better optimization in some cases.
  • The system can now track I/O durations.

This last point deserves a little explanation, as it can be a little tricky. Tracking I/O durations means repeatedly asking the operating system for the current time. Depending on the operating system and the hardware, this can be quite cheap or extremely costly. The most important factor here is where the system gets its time from: it could be read directly from the processor (TSC), from dedicated hardware such as the HPET, or through an ACPI call. The point is that the cost of getting the time can vary by a factor of thousands.

If you are interested in this timing data, it's better to first check if your system will support it without too much of a performance hit. PostgreSQL provides you with the pg_test_timing tool:

$ pg_test_timing 
Testing timing overhead for 3 seconds.
Per loop time including overhead: 28.02 nsec
Histogram of timing durations:
   < usec:      count   percent
       32:         41  0.00004%
       16:       1405  0.00131%
        8:        200  0.00019%
        4:        388  0.00036%
        2:    2982558  2.78523%
        1:  104100166 97.21287%

Here, everything is good: getting the time costs around 28 nanoseconds, with very little variation. Anything under 100 nanoseconds should be fine for production. If you get higher values, you may still find a way to tune your system; check the documentation.

Anyway, here is the data you'll be able to collect if your system is ready for this:

First, you'll get per-database statistics, which will now give accurate information about which database is doing the most I/O:

=# SELECT * FROM pg_stat_database WHERE datname = 'mydb';
-[ RECORD 1 ]--+------------------------------
datid          | 16384
datname        | mydb
numbackends    | 1
xact_commit    | 270
xact_rollback  | 2
blks_read      | 1961
blks_hit       | 17944
tup_returned   | 269035
tup_fetched    | 8850
tup_inserted   | 16
tup_updated    | 4
tup_deleted    | 45
conflicts      | 0
temp_files     | 0
temp_bytes     | 0
deadlocks      | 0
blk_read_time  | 583.774
blk_write_time | 0
stats_reset    | 2012-07-03 17:18:54.796817+02

We see here that mydb has only consumed 583.774 milliseconds of read time.

Explain will benefit from this too:

=# EXPLAIN (analyze,buffers) SELECT count(*) FROM mots ;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1669.95..1669.96 rows=1 width=0) (actual time=21.943..21.943 rows=1 loops=1)
   Buffers: shared read=493
   I/O Timings: read=2.578
   ->  Seq Scan on mots  (cost=0.00..1434.56 rows=94156 width=0) (actual time=0.059..12.933 rows=94156 loops=1)
         Buffers: shared read=493
         I/O Timings: read=2.578
 Total runtime: 22.059 ms

We now have separate information about the time taken to retrieve data from the operating system. Obviously, here, the data was in the operating system's cache (about 2.5 milliseconds to read 493 blocks).

And last, if you have enabled pg_stat_statements:

select * from pg_stat_statements where query ~ 'words';
-[ RECORD 1 ]-------+---------------------------
userid              | 10
dbid                | 16384
query               | select count(*) from words;
calls               | 2
total_time          | 78.332
rows                | 2
shared_blks_hit     | 0
shared_blks_read    | 986
shared_blks_dirtied | 0
shared_blks_written | 0
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
temp_blks_read      | 0
temp_blks_written   | 0
blk_read_time       | 58.427
blk_write_time      | 0
  • As for every version, the optimizer has received its share of improvements
    • Prepared statements used to be optimized once, without any knowledge of the parameters' values. With 9.2, the planner will use specific plans depending on the parameters sent (the query will be planned at execution time), unless the query is executed several times and the planner decides that the generic plan is not much more expensive than the specific plans (see the sketch just after this list).
    • A new feature has been added: parameterized paths. Simply put, it means that a sub-part of a query plan can use parameters it has got from a parent node. It fixes several bad plans that could occur, especially when the optimizer couldn't reorder joins to put nested loops where it would have been efficient.
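
Here is a minimal sketch of the new prepared-statement behaviour, reusing the reservation table from the range-type example above:

 PREPARE room_reservations (int) AS
     SELECT * FROM reservation WHERE room_id = $1;
 
 -- planned at execution time, with the real value of the parameter;
 -- PostgreSQL only falls back to a cached generic plan if it is not
 -- significantly more expensive than the custom plans
 EXECUTE room_reservations (1);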

The following parameterized-path example comes straight from the developers' mailing list:

CREATE TABLE a (
    a_id serial PRIMARY KEY NOT NULL,
    b_id integer
);
CREATE INDEX a__b_id ON a USING btree (b_id);


CREATE TABLE b (
    b_id serial NOT NULL,
    c_id integer
);
CREATE INDEX b__c_id ON b USING btree (c_id);


CREATE TABLE c (
    c_id serial PRIMARY KEY NOT NULL,
    value integer UNIQUE
);

INSERT INTO b (b_id, c_id)
    SELECT g.i, g.i FROM generate_series(1, 50000) g(i);

INSERT INTO a(b_id)
    SELECT g.i FROM generate_series(1, 50000) g(i);

INSERT INTO c(c_id,value)
    VALUES (1,1);

So we have a referencing b, b referencing c.

Here is an example of a query working badly with PostgreSQL 9.1:

EXPLAIN ANALYZE SELECT 1
FROM
    c
WHERE
    EXISTS (
        SELECT *
        FROM a
            JOIN b USING (b_id)
        WHERE b.c_id = c.c_id)
    AND c.value = 1;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=1347.00..3702.27 rows=1 width=0) (actual time=13.799..13.802 rows=1 loops=1)
   Join Filter: (c.c_id = b.c_id)
   ->  Index Scan using c_value_key on c  (cost=0.00..8.27 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=1)
         Index Cond: (value = 1)
   ->  Hash Join  (cost=1347.00..3069.00 rows=50000 width=4) (actual time=13.788..13.788 rows=1 loops=1)
         Hash Cond: (a.b_id = b.b_id)
         ->  Seq Scan on a  (cost=0.00..722.00 rows=50000 width=4) (actual time=0.007..0.007 rows=1 loops=1)
         ->  Hash  (cost=722.00..722.00 rows=50000 width=8) (actual time=13.760..13.760 rows=50000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 1954kB
               ->  Seq Scan on b  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.008..5.702 rows=50000 loops=1)
 Total runtime: 13.842 ms

Not that bad, 13 milliseconds. Still, we are doing sequential scans on a and b, when our common sense tells us that c.value=1 should be used to filter rows more aggressively.

Here's what 9.2 does with this query:

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.00..16.97 rows=1 width=0) (actual time=0.035..0.037 rows=1 loops=1)
   ->  Index Scan using c_value_key on c  (cost=0.00..8.27 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
         Index Cond: (value = 1)
   ->  Nested Loop  (cost=0.00..8.69 rows=1 width=4) (actual time=0.025..0.025 rows=1 loops=1)
         ->  Index Scan using b__c_id on b  (cost=0.00..8.33 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)
               Index Cond: (c_id = c.c_id)
         ->  Index Only Scan using a__b_id on a  (cost=0.00..0.35 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)
               Index Cond: (b_id = b.b_id)
 Total runtime: 0.089 ms

The «parameterized path» is:

   ->  Nested Loop  (cost=0.00..8.69 rows=1 width=4) (actual time=0.025..0.025 rows=1 loops=1)
         ->  Index Scan using b__c_id on b  (cost=0.00..8.33 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)
               Index Cond: (c_id = c.c_id)
         ->  Index Only Scan using a__b_id on a  (cost=0.00..0.35 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)
               Index Cond: (b_id = b.b_id)

This part of the plan depends on its parent node (c_id = c.c_id): it is executed repeatedly, each time with a different parameter value coming from the parent node.

This plan is of course much faster, as there is no need to fully scan a, or to fully scan and hash b.

SP-GiST

SP-GiST stands for Space-Partitioned GiST, GiST being the Generalized Search Tree. GiST is an index type that has been available for quite a while in PostgreSQL. It is already very efficient at indexing complex data types, but performance tends to suffer when the source data isn't uniformly distributed. SP-GiST tries to fix that.

Like all the indexing methods available in PostgreSQL, SP-GiST is a generic indexing method: its purpose is to index whatever you throw at it, using operators you provide. This means that if you want to create a new datatype and make it indexable through SP-GiST, you'll have to follow the documented API.

SP-GiST can be used to implement three types of indexes: tries (suffix indexing), quadtrees (data is divided into quadrants), and k-d trees (k-dimensional trees).

For now, SP-GiST is provided with operator families called "quad_point_ops", "kd_point_ops" and "text_ops".

As their names indicate, the first one indexes the point type using a quadtree, the second one indexes the point type using a k-d tree, and the third one indexes text, using suffixes.
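
Creating an SP-GiST index uses the ordinary CREATE INDEX syntax with the spgist access method. A minimal sketch, on hypothetical tables:

 CREATE TABLE points_demo (p point);
 CREATE INDEX points_demo_idx ON points_demo USING spgist (p);   -- uses quad_point_ops by default
 
 CREATE TABLE words_demo (w text);
 CREATE INDEX words_demo_idx ON words_demo USING spgist (w);     -- uses text_ops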

pg_stat_statements

This contrib module has received a lot of improvements in this version:

  • Queries are normalized: queries that are identical except for their constant values will be considered the same, as long as their post-parse-analysis query trees (that is, the internal representation of the query before rule expansion) are the same. This also implies that differences that are not semantically essential to the query, such as variations in whitespace or alias names, or the use of one syntax over another equivalent one, will not differentiate queries.
=# SELECT * FROM words WHERE word= 'foo';
 word 
------
(0 rows)

=# SELECT * FROM words WHERE word= 'bar';
 word 
------
 bar

=#select * from pg_stat_statements where query like '%words where%';
-[ RECORD 1 ]-------+-----------------------------------
userid              | 10
dbid                | 16384
query               | SELECT * FROM words WHERE word= ?;
calls               | 2
total_time          | 142.314
rows                | 1
shared_blks_hit     | 3
shared_blks_read    | 5
shared_blks_dirtied | 0
shared_blks_written | 0
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
temp_blks_read      | 0
temp_blks_written   | 0
blk_read_time       | 142.165
blk_write_time      | 0

The two queries are shown as one in pg_stat_statements.

  • For prepared statements, the execution part (execute statement) is charged to the prepare statement. That makes it easier to interpret, and avoids the double counting there was with PostgreSQL 9.1.
  • pg_stat_statements displays timing in milliseconds, to be consistent with other system views.

Explain improvements

  • Timing can now be disabled with EXPLAIN (analyze on, timing off), leading to lower overhead on platforms where getting the current time is expensive
 =# EXPLAIN (analyze on,timing off) SELECT * FROM reservation ;
                                      QUERY PLAN                                       
 ----------------------------------------------------------------------------------------
  Seq Scan on reservation  (cost=0.00..22.30 rows=1230 width=36) (actual rows=2 loops=1)
  Total runtime: 0.045 ms


  • Have EXPLAIN ANALYZE report the number of rows rejected by filter steps

This new feature makes it much easier to know how many rows are removed by a filter (and spot potential places to put indexes):

 =# EXPLAIN ANALYZE SELECT * FROM test WHERE a ~ 'tra';
                                                   QUERY PLAN                                                   
 ---------------------------------------------------------------------------------------------------------------
  Seq Scan on test  (cost=0.00..106876.56 rows=2002 width=11) (actual time=2.914..8538.285 rows=120256 loops=1)
    Filter: (a ~ 'tra'::text)
    Rows Removed by Filter: 5905600
  Total runtime: 8549.539 ms
 (4 rows)

Backward compatibility

These changes may incur regressions in your applications.

Ensure that xpath() escapes special characters in string values

Before 9.2:

SELECT (XPATH('/*/text()', '<root>&lt;</root>'))[1];
 xpath 
-------
 <

'<' isn't valid XML.

With 9.2:

SELECT (XPATH('/*/text()', '<root>&lt;</root>'))[1];
 xpath 
-------
 &lt;

Remove hstore's => operator

Up to 9.1, one could use the => operator to create an hstore. hstore is a contrib module, used to store key/value pairs in a column.

In 9.1:

=# SELECT 'a'=>'b';
 ?column? 
----------
 "a"=>"b"
(1 row)

=# SELECT pg_typeof('a'=>'b');
 pg_typeof 
-----------
 hstore
(1 row)

With 9.2:

SELECT 'a'=>'b';
ERROR:  operator does not exist: unknown => unknown at character 11
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
STATEMENT:  SELECT 'a'=>'b';
ERROR:  operator does not exist: unknown => unknown
LINE 1: SELECT 'a'=>'b';
                  ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

This doesn't mean one cannot use '=>' in hstores; it just isn't an operator anymore:

=# select hstore('a=>b');
  hstore  
----------
 "a"=>"b"
(1 row)

=# select hstore('a','b');
  hstore  
----------
 "a"=>"b"
(1 row)

are still two valid ways to input a hstore.

"=>" is removed as an operator as it is a reserved keyword in SQL.


Have pg_relation_size() and friends return NULL if the object does not exist

A relation could be dropped by a concurrent session while one was running pg_relation_size() on it, leading to an SQL error. Now the function merely returns NULL for such a record.


Remove the spclocation field from pg_tablespace

The spclocation field provided the real location of the tablespace. It was filled in during the CREATE or ALTER TABLESPACE command, so it could be wrong: somebody just had to shut down the cluster, move the tablespace's directory, re-create the symlink in pg_tblspc, and forget to update the spclocation field. The cluster would still run, as spclocation wasn't actually used.

So this field has been removed. To get the tablespace's location, use pg_tablespace_location():

=# SELECT *, pg_tablespace_location(oid) AS spclocation FROM pg_tablespace;
  spcname   | spcowner | spcacl | spcoptions |  spclocation   
------------+----------+--------+------------+----------------
 pg_default |       10 |        |            | 
 pg_global  |       10 |        |            | 
 tmptblspc  |       10 |        |            | /tmp/tmptblspc

Have EXTRACT of a non-timezone-aware value measure the epoch from local midnight, not UTC midnight

With PostgreSQL 9.1:

=# SELECT extract(epoch FROM '2012-07-02 00:00:00'::timestamp);
 date_part  
------------
 1341180000
(1 row)

=# SELECT extract(epoch FROM '2012-07-02 00:00:00'::timestamptz);
 date_part  
------------
 1341180000
(1 row)

There is no difference in behaviour between a timestamp with or without time zone.

With 9.2:

=# SELECT extract(epoch FROM '2012-07-02 00:00:00'::timestamp);
 date_part  
------------
 1341187200
(1 row)

=# SELECT extract(epoch FROM '2012-07-02 00:00:00'::timestamptz);
 date_part  
------------
 1341180000
(1 row)

When the timestamp has no time zone, the epoch is now calculated from "local midnight", meaning January 1st, 1970 at midnight, local time.

Fix to_date() and to_timestamp() to wrap incomplete dates toward 2020

The wrapping was not consistent between 2-digit and 3-digit years: 2-digit years always chose the date closest to 2020, while 3-digit years mapped 100 to 999 onto 1100 to 1999, and 000 to 099 onto 2000 to 2099.

Now PostgreSQL chooses the date closest to 2020 for both 2-digit and 3-digit years.

With 9.1:

=# SELECT to_date('200-07-02','YYY-MM-DD');
  to_date   
------------
 1200-07-02

With 9.2:

SELECT to_date('200-07-02','YYY-MM-DD');
  to_date   
------------
 2200-07-02


pg_stat_activity and pg_stat_replication's definitions have changed

The view pg_stat_activity has changed. It's not backward compatible, but let's see what this new definition brings us:

  • current_query disappears and is replaced by two columns:
    • state: what the session is currently doing (active, idle, idle in transaction…)
    • query: the last query run (or the one still running, if state is "active")
  • The column procpid is renamed to pid, to be consistent with other system views

The benefit is mostly for tracking «idle in transaction» sessions. Until now, all we could know was that one of these sessions was idle in transaction, meaning it had started a transaction, maybe done some operations, but still not committed. If that session stayed in this state for a while, there was no way of knowing how it got into this state.

Here is an example:

-[ RECORD 1 ]----+---------------------------------
datid            | 16384
datname          | postgres
pid              | 20804
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2012-07-02 15:02:51.146427+02
xact_start       | 2012-07-02 15:15:28.386865+02
query_start      | 2012-07-02 15:15:30.410834+02
state_change     | 2012-07-02 15:15:30.411287+02
waiting          | f
state            | idle in transaction
query            | DELETE FROM test;

With PostgreSQL 9.1, all we would have would be «idle in transaction».

As this change already broke backward compatibility, procpid was renamed to pid at the same time, to be more consistent with other system views. The view pg_stat_replication has changed in the same way: its procpid column is also renamed to pid.

Change all SQL-level statistics timing values to float8-stored milliseconds

pg_stat_user_functions.total_time, pg_stat_user_functions.self_time, pg_stat_xact_user_functions.total_time, pg_stat_xact_user_functions.self_time, and pg_stat_statements.total_time (contrib) are now in milliseconds, to be consistent with the rest of the timing values.

postgresql.conf parameters changes

  • silent_mode has been removed. Use pg_ctl -l postmaster.log
  • wal_sender_delay has been removed. It is no longer needed
  • custom_variable_classes has been removed. All «classes» are accepted without declaration now
  • ssl_ca_file, ssl_cert_file, ssl_crl_file, ssl_key_file have been added, meaning you can now specify the locations of the SSL files (see the example below)
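
For example, the SSL-related part of a postgresql.conf could now look like this (the file names shown are the historical defaults, relative to the data directory):

 ssl = on
 ssl_cert_file = 'server.crt'
 ssl_key_file = 'server.key'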

Other new features

DROP INDEX CONCURRENTLY

The regular DROP INDEX command takes an exclusive lock on the table. Most of the time, this isn't a problem, because this lock is short-lived. The problem usually occurs when:

  • A long-running transaction is running, and has a (shared) lock on the table
  • A DROP INDEX is run on this table in another session, asking for an exclusive lock (and waiting for it, as it won't be granted until the long-running transaction ends)

At this point, all other transactions needing to take a shared lock on the table (for a simple SELECT for instance) will have to wait too: their lock acquisition is queued after the DROP INDEX's one.


DROP INDEX CONCURRENTLY works around this and won't block normal DML statements, just like CREATE INDEX CONCURRENTLY. The limitations are also the same: you can only drop one index at a time with the CONCURRENTLY option, and the CASCADE option is not supported.
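
A minimal sketch, dropping the index created in the index only scan example above:

 DROP INDEX CONCURRENTLY idx_demo_ios;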

NOT VALID CHECK constraints

PostgreSQL 9.1 introduced «NOT VALID» foreign keys. This has been extended to CHECK constraints: adding a «NOT VALID» constraint on a table means that the existing data isn't validated; only new and updated rows are checked.

 =# CREATE TABLE test (a int); 
 CREATE TABLE
 =# INSERT INTO test SELECT generate_series(1,100);
 INSERT 0 100
 =# ALTER TABLE test ADD CHECK (a>100) NOT VALID;
 ALTER TABLE
 =# INSERT INTO test VALUES (99);
 ERROR:  new row for relation "test" violates check constraint "test_a_check"
 DETAIL:  Failing row contains (99).
 =# INSERT INTO test VALUES (101);
 INSERT 0 1

Then, later, we can validate the whole table:

 =# ALTER TABLE test VALIDATE CONSTRAINT test_a_check ;
 ERROR:  check constraint "test_a_check" is violated by some row

Domains, which are types with added constraints, can also be declared as not valid, and validated later.
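
Here is a minimal sketch of the domain case (the names are made up for this illustration):

 CREATE DOMAIN positive_int AS int CHECK (VALUE > 0);
 -- the new constraint is only checked for values supplied from now on...
 ALTER DOMAIN positive_int ADD CONSTRAINT smaller_than_1000 CHECK (VALUE < 1000) NOT VALID;
 -- ...until it is validated against everything already stored with this domain
 ALTER DOMAIN positive_int VALIDATE CONSTRAINT smaller_than_1000;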

Check constraints can also be renamed now:

 =# ALTER TABLE test RENAME CONSTRAINT test_a_check TO validate_a;
 ALTER TABLE

NO INHERIT constraints

Here is another improvement about constraints: they can be declared as not inheritable, which will be useful in partitioned environments. Let's take the partitioning example from the PostgreSQL documentation, and see how this improves the situation:

 CREATE TABLE measurement (
     city_id         int not null,
     logdate         date not null,
     peaktemp        int,
     unitsales       int,
     CHECK (logdate IS NULL) NO INHERIT
 );
 
 CREATE TABLE measurement_y2006m02 (
     CHECK ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
 ) INHERITS (measurement);
 CREATE TABLE measurement_y2006m03 (
     CHECK ( logdate >= DATE '2006-03-01' AND logdate < DATE '2006-04-01' )
 ) INHERITS (measurement);
 
 
 INSERT INTO measurement VALUES (1,'2006-02-20',1,1);
 ERROR:  new row for relation "measurement" violates check constraint "measurement_logdate_check"
 DETAIL:  Failing row contains (1, 2006-02-20, 1, 1).
 INSERT INTO measurement_y2006m02 VALUES (1,'2006-02-20',1,1);
 INSERT 0 1

Until now, every check constraint created on measurement would have been inherited by its child tables, so adding a constraint on the parent table forbidding inserts (or allowing only some of them) was impossible.

Reduce ALTER TABLE rewrites

A table won't get rewritten anymore during an ALTER TABLE when changing the type of a column in the following cases (see the example after the list):

  • varchar(x) to varchar(y) when y>=x. It works too if going from varchar(x) to varchar or text (no size limitation)
  • numeric(x,z) to numeric(y,z) when y>=x, or to numeric without specifier
  • varbit(x) to varbit(y) when y>=x, or to varbit without specifier
  • timestamp(x) to timestamp(y) when y>=x or timestamp without specifier
  • timestamptz(x) to timestamptz(y) when y>=x or timestamptz without specifier
  • interval(x) to interval(y) when y>=x or interval without specifier
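
For example, with a hypothetical table, both of these statements now only touch the catalog, without rewriting the table:

 CREATE TABLE rewrite_demo (code varchar(20), amount numeric(8,2));
 ALTER TABLE rewrite_demo ALTER COLUMN code TYPE varchar(50);     -- no rewrite: the size only grows
 ALTER TABLE rewrite_demo ALTER COLUMN amount TYPE numeric(12,2); -- no rewrite: same scale, larger precision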

Security barriers and Leakproof

This new feature has to do with view security. First, let's explain the problem with a very simplified example:

 =# CREATE TABLE all_data (company_id int, company_data varchar);
 CREATE TABLE
 =# INSERT INTO all_data VALUES (1,'secret_data_for_company_1');
 INSERT 0 1
 =# INSERT INTO all_data VALUES (2,'secret_data_for_company_2');
 INSERT 0 1
 =# CREATE VIEW company1_data AS SELECT * FROM all_data WHERE company_id = 1;
 CREATE VIEW

This is a quite classical way of giving a user access to only part of a table: we'll create a user for company_id 1, grant them the right to access company1_data, and deny them the right to access all_data.

The plan for this query is the following:

 =# explain SELECT * FROM company1_data ;
                         QUERY PLAN                        
 ----------------------------------------------------------
  Seq Scan on all_data  (cost=0.00..25.38 rows=6 width=36)
    Filter: (company_id = 1)

Even if there were more data, a sequential scan could still be forced: just "SET enable_indexscan TO off" and the like.

So this query reads all the records from all_data, filters them, and returns to the user only the matching rows. There is a way to display scanned records before they are filtered: just create a function with a very low cost, and call it while doing the query:

 CREATE OR REPLACE FUNCTION peek(text) RETURNS boolean LANGUAGE plpgsql AS
 $$
 BEGIN
   RAISE NOTICE '%',$1;
   RETURN true;
 END
 $$
 COST 0.1;

This function just has to cost less than the = operator, which costs 1, to be executed first.

The result is this:


 =# SELECT * FROM company1_data WHERE peek(company1_data.company_data);
 NOTICE:  secret_data_for_company_1
 NOTICE:  secret_data_for_company_2
  company_id |       company_data        
 ------------+---------------------------
           1 | secret_data_for_company_1
 (1 row)

We got access to the record from the second company (in the NOTICE messages).

So this is the first new feature: the view can be declared as implementing "security barriers":


 =# CREATE VIEW company1_data WITH (security_barrier) AS SELECT * FROM all_data WHERE company_id = 1;
 CREATE VIEW
 =# SELECT * FROM company1_data WHERE peek(company1_data.company_data);
 NOTICE:  secret_data_for_company_1
  company_id |       company_data        
 ------------+---------------------------
           1 | secret_data_for_company_1
 (1 row)

The view is not leaking anymore. The problem, of course, is that there is a performance impact: maybe the "peek" function could have made the query faster, by filtering lots of rows early in the plan. The rows are now filtered in two separate plan steps:

 =# explain SELECT * FROM company1_data WHERE peek(company1_data.company_data);
                              QUERY PLAN                             
 --------------------------------------------------------------------
  Subquery Scan on company1_data  (cost=0.00..25.44 rows=2 width=36)
    Filter: peek((company1_data.company_data)::text)
    ->  Seq Scan on all_data  (cost=0.00..25.38 rows=6 width=36)
          Filter: (company_id = 1)

This leads to the complementary feature: a function may be declared "LEAKPROOF", meaning that it won't leak the data it is passed into error or notice messages.

Declaring our peek function as LEAKPROOF is a very bad idea, but let's do it just to demonstrate how it's used:

 CREATE OR REPLACE FUNCTION peek(text) RETURNS boolean LEAKPROOF LANGUAGE plpgsql AS
 $$
 BEGIN
   RAISE NOTICE '%',$1;
   RETURN true;
 END
 $$
 COST 0.1;

A LEAKPROOF function is executed «normally»:

 =# SELECT * FROM company1_data WHERE peek(company1_data.company_data);
 NOTICE:  secret_data_for_company_1
 NOTICE:  secret_data_for_company_2
  company_id |       company_data        
 ------------+---------------------------
           1 | secret_data_for_company_1
 (1 row)

Of course, in our case, peek isn't leakproof and shouldn't have been declared as such. Only superusers have the permission to declare a LEAKPROOF function.

New options for pg_dump

Until now, one could ask pg_dump to dump a table's data, or a table's meta-data (DDL statements for creating the table's structure, indexes, constraints). Some meta-data is better restored before the data (the table's structure, check constraints), some is better after the data (indexes, unique constraints, foreign keys…), for performance reasons mostly.

So there are now a few more options (examples follow the list):

  • --section=pre-data: dump what's needed before restoring the data. Of course, this can be combined with a -t for instance, to specify only one table
  • --section=post-data: dump what's needed after restoring the data.
  • --section=data: dump the data
  • --exclude-table-data: dump everything, except THIS table's data. It means pg_dump will still dump other tables' data.
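
For example (database and table names are hypothetical), a single table can be dumped in three steps, or a large log table's contents can be skipped entirely:

 pg_dump --section=pre-data  -t my_table mydb > my_table_pre.sql
 pg_dump --section=data      -t my_table mydb > my_table_data.sql
 pg_dump --section=post-data -t my_table mydb > my_table_post.sql
 
 pg_dump --exclude-table-data=big_log_table mydb > mydb_without_logs.sql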