UPSERT

From PostgreSQL wiki

Jump to: navigation, search

This page summarizes the INSERT ... ON CONFLICT UPDATE patch. This feature is popularly known as "UPSERT".

The patch has been committed [1], and will appear in PostgreSQL 9.5. This Wiki page was only maintained until a few weeks before commit, where the patch further evolved in some minor aspects (most notably, the syntax became ON CONFLICT DO UPDATE/NOTHING). It is now only of historic interest.

CommitFest entry: https://commitfest.postgresql.org/3/35/

Committed version's INSERT documentation: http://www.postgresql.org/docs/devel/static/sql-insert.html

Contents

"UPSERT" definition

"UPSERT" is a DBMS feature that allows a DML statement's author to atomically either insert a row, or on the basis of the row already existing, UPDATE that existing row instead, while safely giving little to no further thought to concurrency. One of those two outcomes must be guaranteed, regardless of concurrent activity, which has been called "the essential property of UPSERT". Examples include MySQL's INSERT...ON DUPLICATE KEY UPDATE, or VoltDB's UPSERT statement.

The absence of this feature from Postgres has been a long-standing complaint from Postgres users [2] [3] [4] [5].

Goals for implementation

Primary goals:

  • In general, guarantee insert-or-update "atomicity" for the simple cases - guarantee one or the other of those two outcomes ("the essential property of UPSERT"). Don't be inferior in any way to existing, subxact-looping approach to UPSERT.
    • Avoid having to invent "READ COMMITTED serialization failures" [6]. These seem like a cop out. This goal is a special case of our first primary goal.
    • Avoid "unprincipled deadlocks" - deadlocks that the user has no reasonable way of avoiding [7]. These too seem like a cop out. Again, this goal is a special case of our first primary goal.
  • Advance usability over the status quo - don't introduce a feature with significant foot-guns, unnecessary subtle behaviors, or big performance surprises

As outlined below, SQL MERGE seemingly doesn't meet this standard (principally because it lacks "the essential property of UPSERT", which appears to be in tension with supporting a fully flexible join).

Secondary goal:

  • Avoid duplicate key violations that the implementation must trap and handle. Don't burn through XIDs with expensive subtransactions - the existing, subxact-looping approach to UPSERT implies that XIDs are consumed at an intractably high rate (intractable for at least some use cases, like the multi-master replication conflict resolution use case).

Tertiary goals:

  • Support for UPSERT of multiple values in one operation is desirable. The current looping approach really needs to loop over single values, making UPSERT of significant numbers of rows very slow.
  • It would be nice to offer an INSERT IGNORE - the ability to not proceed with insertion in respect of certain slots/rows proposed for insertion when doing so implies a unique violation must be thrown. This is primarily important for ETL-type use cases.

Current Status of patch

Important: Note that git format-patch has been used to generate cumulative patch set revisions. The code is structured to be cumulative, and has extensive commit message commentary, to make its integration into PostgreSQL as straightforward as possible. These commit messages are thought to be a useful way of gaining an understanding of how the proposed ON CONFLICT UPDATE/IGNORE patch fits together.

Note that approach #1 and #2 to Value locking are being maintained in parallel. When "V1.X" is referred to, there are actually two cumulative patchsets [8] of "V1.X" (one for each approach to Value locking).

Update: As of V2.0, only approach #2 to value locking is being maintained.

Revisions

Committing ON CONFLICT IGNORE first is now considered to be the best way of committing the code incrementally [9].

Most recently, V4.0 posted. Featured further revamp of RLS, and integrated the recent clean-up work of Andres, Peter and Heikki.

Before that, V3.4 posted. Featured new approach to RLS, and somewhat refined handling of WAL-logging and logical decoding (essentially, a hybrid of earlier approaches).Before that, V3.3 posted. Featured full support for ON CONFLICT IGNORE with updatable views (UPDATE remains unsupported), and a new, refactored approach to logical decoding. Before that, V3.2 posted. Features full inheritance support, new approach to WAL logging to suit logical decoding. Also, multiple indexes can be inferred at once. Collations and opclasses can now be specified in inference clause, in case they are ever semantically significant. Before that, V3.1 posted. This featured refinements to value locking scheme, so that tokens were stored directly in t_ctid field in tuples. Before that, V3.0 posted. This featured a new way of breaking out the code (ON CONFLICT IGNORE is the first in a series of cumulative patches ending in ON CONFLICT UPDATE), and logical decoding fixes.

Before that, V2.3 posted. This featured an overhaul of tqual.c visibility routine's handling of "super deleted" tuples, livelock insurance for exclusion constraints, plus some miscellaneous polishing. Before that, v2.2 posted. This featured extensive refactoring, following feedback from Andres Freund. Before that, V2.1 posted. This had a minor bug fix for an issue with user-defined rules, as well as fixing some bit-rot. Before that, V2.0 posted, adding support for row-level security. This was very significant because it closes out all open issues with support for/by interrelated features (e.g. inheritance, updatable views). Issues around semantics now seem all but settled. Now, the core mechanism and code should be our focus.

Before that, V1.8 posted, which has support for partial index inference, making it possible to use partial unique indexes with the ON CONFLICT UPDATE variant. Before that, V1.7 posted, which featured minor clean-up. Before that, V1.6 posted, which had slight tweaks to per-statement trigger execution order, as well as minor documentation and comment fix-ups. Before that, V1.5 posted, establishing new RETURNING behavior (updated tuples projected). Before that, V1.4 posted, which includes support for postgres_fdw, costing of arbiter unique indexes where there are multiple alternatives, and a pseudo-alias syntax (which makes aliases TARGET.* and EXCLUDED.* visible in UPDATE auxiliary query only - compare OLD.* and NEW.* in the context of user-defined rules). Controversy over "failure to play nice" with other features has died down, as has controversy over syntax, and to a large extent controversy over value locking.

When V1.3 was posted, it added "unique index inference", requiring users to always specify which unique index they'd like to use as an arbiter of whether on not the update path should be taken (this is inferred in the sense that a unique index isn't directly named, but rather the implementation figures that out based on a set of columns/expressions). This is mandatory for the ON CONFLICT UPDATE variant, but optional for the ON CONFLICT IGNORE variant. We're currently still looking at approaches to value locking. Approach #2 has been integrated into a revised patch with the same revised syntax, as a stepping stone to settling questions around value locking. We are now well equipped to compare at least approach #1 and #2 to value locking.

Documentation

A mirror of pre-built patched user-visible documentation is maintained for the feature by its principal author, Peter Geoghegan, and is accessible on the web.

The following pages are likely to be of particular interest:

  • INSERT command documentation - overview of changes to syntax, how unique index inference works, operator class/collation support, etc.
  • Transaction isolation - changes concern READ COMMITTED mode's qualitatively new "MVCC violation", which is also discussed below.
  • Trigger documentation - behavior of per-statement and per row triggers with new INSERT variant.
  • Updatable Views - Updatable views have ON CONFLICT IGNORE support only (full support, though - an inference specification may be used, as of V3.3).
  • Inheritance - Inheritance parents and children are not special cases (i.e. they are supported without needing to be specially considered, including with ON CONFLICT UPDATE). In general, support for inheritance is quite comprehensive (although fundamental limitations remain - unique constraints still do not work across inheritance relations - that's a much bigger project).
  • Message formats - updated to discuss new UPSERT command tag added in V1.5.
  • CREATE POLICY - RLS related changes.

User visible documentation will be maintained and uploaded here as new revisions of the patch are posted.

Note also that the plain ON CONFLICT IGNORE documentation is mirrored (in line with how it is anticipated UPSERT can be incrementally committed, documentation changes are cumulative, just like the code). The major difference between this and the full UPSERT documentation linked to above is the INSERT reference page:

Testing

Apart from the main regression and isolation tests included with the patch set, there has also been extensive use of stress-testing as part of the development of the patch. A stress-testing suite in maintained here: https://github.com/petergeoghegan/upsert.

Jeff Janes wrote a variant of his general-purpose stress-testing suite that was very useful for flushing out bugs. It is maintained separately here: https://github.com/petergeoghegan/jjanes_upsert.

PgCon Talk

A reasonable overview of the trade-off involved in implementing UPSERT (i.e. the tension between linking value locking and row locking) was a focus of Peter Geoghegan's pgCon talk, "Why UPSERT is weird". Note that this relates to an obsolete version of the syntax, but that should mostly not matter.

UPSERT as implemented in practice

Discussion of how users of RDBMS systems in general deal with the UPSERT problem today. This includes all currently released versions of PostgreSQL, and a number of other SQL-based systems, too.

PostgreSQL (today)

To correctly UPSERT in PostgreSQL today, without any dedicated, native support, one must use a retry loop in READ COMMITTED mode. The ad-hoc implementation should attempt an INSERT, and if that fails, UPDATE. The implementation must loop until one of those two outcomes occurs, since is general INSERTs and UPDATEs may be hindered by concurrent activity in a way that makes neither an INSERT or UPDATE occur (e.g. a duplicate violation can occur, or the UPDATE may not affect any rows).

See:

Many users attempt to use naïve approaches, like an UPDATE followed by an INSERT ... IF NOT EXISTS that fall down in the face of concurrent operation. Another common though incorrect approach in PostgreSQL is to use data-modifying CTEs. The correct solution is slow and clumsy to use, and is unsuitable for significant amounts of data. It also potentially burns through a lot of subtransaction IDs - avoiding burning XIDs is an explicit goal of the current "native UPSERT in PostgreSQL" effort.

Oracle, MS-SQL, DB2: SQL MERGE

Users of MS-SQL and Oracle frequently use MERGE to implement an upsert operation. However, users often incorrectly assume that MERGE is immune to concurrency effects, e.g:

The apparent confusion on what problem MERGE is or is not intended to solve seems unlikely, but in fact there is plenty of evidence for it. Although it has been a problem for as long as MERGE has existed, people continue to be confused by it. Many of the above links are quite recent.

Documentation for Oracle and MS SQL makes no reference to concurrency in MERGE, as noted in the Syntax section below.

Also, MERGE can do a lot more than a simple upsert, and has somewhat complex/verbose syntax as a result.

Bizarrely, in addition to MERGE, Oracle offers a "hint" that has the executor IGNORE would-be duplicate violations [10].

MySQL: INSERT ... ON DUPLICATE KEY UPDATE

MySQL offers INSERT ... ON DUPLICATE KEY UPDATE, which is widely used and understood. Although it has some "gotchas" that we hope to avoid (e.g. "If more than one unique index is matched, only the first is updated. It is not recommended to use this statement on tables with more than one unique index."[11]), it does make guarantees around concurrency, and so offers something broadly comparable to the ON CONFLICT UPDATE patch.

SQLite: ... ON CONFLICT ... / INSERT/UPDATE ... OR ...

SQLite has an ON CONFLICT clause for INSERT.

SQLite's ON CONFLICT clause may apply to constraints (other than foreign key constraints). It isn't just an UPSERT feature, though that's one of its roles.

It's logical to consider PostgreSQL's constraints as if they are unconditionally defined as ON CONFLICT ROLLBACK.

For UPSERT, SQLite offers the shortened syntax OR instead of ON CONFLICT, but the behavior is the same.

Teradata

Teradata offers a non-standard UPSERT (which they call "UPSERT", or occasionally "atomic UPSERT") [12], as well as SQL MERGE [13]. The former appeared in an earlier version of Teradata than the latter. This suggests that having both features at once may make sense. The UPSERT implementation also appears to offer guarantees around concurrency, and so is broadly comparable to the ON CONFLICT UPDATE feature proposed for PostgreSQL.

VoltDB

VoltDB recently added a new UPSERT feature:

http://voltdb.com/blog/new-powerful-way-do-upserts-voltdb

This feature is invoked via a non-standard INSERT variant syntax. This implementation also appears to offer guarantees around concurrency, and so is broadly comparable to the ON CONFLICT UPDATE feature proposed for PostgreSQL.

Syntax discussion

Discussion of syntax we could use, and syntax of proposed patch. Includes limitations implied by syntax of proposed patch.

SQL MERGE syntax

The SQL standard provides a MERGE statement, and says:

If a table T, as well as being updatable, is insertable-into, then rows can be inserted into it (subject to applicable Access Rules and Conformance Rules). The primary effect of an <insert statement> on T is to insert into T each of the zero or more rows contained in a specified table. The primary effect of a <merge statement> on T is to replace zero or more rows in T with specified rows and/or to insert into T zero or more specified rows, depending on the result of a <search condition> and on whether one or both of <merge when matched clause> and <merge when not matched clause> are specified.

The syntax specified by the standard (leaving out <override clause> for the INSERT case, since we don't support that for plain old INSERT) is:

 MERGE INTO table_name [ [ AS ] alias ]
   USING { table_name | ( subquery ) } [ [ AS ] alias ]
   ON condition
   [ WHEN MATCHED THEN
     UPDATE SET { column_name = expression } [, ...] ]
   [ WHEN NOT MATCHED THEN
     INSERT [ ( column_name [, ...] ) ] VALUES ( expression [, ...] ) ]

(Technically, the WHEN clauses can be in either order, but each is only allowed once.)

Some database products have extended the standard syntax; for example, adding the option to say "AND expression" after WHEN or allowing MATCHED or NOT MATCHED to be present more than once. In some cases a DELETE option is allowed as a product-specific extension.

The standard does not generally address implementation techniques, so it contains no mention of indexes, 2PL, or MVCC regarding any statement. To use the techniques being discussed for UPSERT, the ON condition would need to allow matching from the second table name (or the subquery) to the target table on a unique index. Kevin Grittner has suggested requiring such a match for the initial implementation.

MERGE advantages

  • In the SQL Standard
  • Already used (if not necessarily correctly) by many users and products
  • JOIN can be optimized usefully. For example, in systems with support for SQL MERGE, it appears to be possible for the optimizer to vary its choice of join algorithm (often it's a nestloop join, but it doesn't have to be - there doesn't even have to be an index on the target table). This is particularly important for the bulk loading/datawarehousing use case that MERGE addresses very well - the general-purpose merge join algorithm can be used for big bulk operations.

MERGE disadvantages

  • The SQL standard doesn't specify concurrency behaviour for MERGE any more than it does for INSERT, UPDATE, or DELETE; so (like other DML) our behavior with concurrent access may not be the same as other database products. In particular, the SQL standard does not require that MERGE be atomic in the sense of atomically providing either an INSERT or UPDATE, and other products do not provide any such guarantees with their MERGE statement.
  • The SQL-standard MERGE doesn't provide for choosing an index, so use of a unique index would need to be conditioned on equality quals against indexed columns in order to provide the behavior being discussed for UPSERT.
  • Implementing an extended subset of MERGE support for UPSERT will impact a future implementation of full MERGE support:
    • Different concurrency and locking semantics will be needed for a MERGE without equality quals matching a unique index of the target table.
    • Users may be very surprised when failure to compare for equality on the columns of a unique index results in slower execution or less graceful interaction with other transactions, such as deadlocks.
    • The design for MERGE when the conditions don't allow joining using a unique index has not yet been done. It may require different locking, allow deadlocks or other serialization failures, or allow the same anomalies which can occur with other DML statements, and which will not be possible when the unique index is used.
  • If a subquery (rather than a direct table reference) is used for the second relation reference, there is no restriction on what that subquery may join on. Delineating the cases where we must use a dirty snapshot, and where we must use an MVCC snapshot seems very difficult and subtle - Peter Geoghegan [14] [15].
  • The ON expression will need to be evaluated to see whether it properly compares to a unique index on the target table. Initially this will need to be done to determine whether the MERGE is allowed at all; later it will determine which sort of plan is allowed. It would be easier to match an index name, provided the index name is known and stable.
  • MERGE will probably need to fire before row insert triggers after deciding to insert, since in general it cannot reverse the triggers having been fired - the implementation should probably have decided on insertion by then. In contrast, UPSERT will have to fire before row triggers before deciding to INSERT or UPDATE (by value locking the value extracted from the affected tuple) [16]. With UPSERT, we are deciding to take the alternative path based on the modified tuple. With MERGE, we may prefer not to.

Detailed comments and concerns

The SQL standard does not and cannot really comment on concurrency for MERGE any more than it does for INSERT, UPDATE, and DELETE. It never mentions indexes or any particular concurrency control mechanism; instead it describes the the minimum requirements at each isolation level (allowing any product to provide additional guarantees as it sees fit and is able to do so). So, the minimum requirement for the MERGE statement at any isolation level is the same as for other DML. We are considering an UPSERT implementation which would be providing additional guarantees; so for the case where matching on a unique index is possible, there would be guarantees beyond other cases, which we would need to document. There are concerns that:

  1. The different guarantees with and without the index may confuse users.
  2. Users may sometimes fail to match an index when they think they are matching; resulting in slower performance and different handling of concurrency problems.
  3. The different plans used with and without the index may result in messy code.

Peter Geoghegan gave a reasonable overview of why MERGE is quite different from a pragmatic UPSERT implementation in the pgsql-hackers thread SQL MERGE is quite distinct from UPSERT. It seems quite reasonable to suppose that one day PostgreSQL will have both "UPSERT" and SQL MERGE, since each addresses distinct use-cases, even if that isn't well understood by the user community at large.

Adopting an existing non-standard syntax

We could potentially gain better uptake by using an existing vendor's syntax, but would be bound to conform more closely to expectations about the behavior and capabilities of that vendor's feature.

MySQL's INSERT ... ON DUPLICATE KEY UPDATE

Does not provide a mechanism for specifying an arbiter unique index. DML query author must know ahead of time that a conflict is only possible when it originates from one particular unique index. MySQL docs advise against using the statement when there is more than one unique index to begin with, which seems very restrictive (this seems to have something to do with it breaking their statement-based/logical replication - the order that unique indexes are considered is apparently storage engine defined).

MySQL's INSERT IGNORE is not equivalent to INSERT...ON CONFLICT IGNORE, in that it ignores not-NULL violations. This seems undesirable, and for the purposes of ON CONFLICT UPDATE, a conflict is always a duplicate violation (or possibly an exclusion constraint violation for ON CONFLICT IGNORE). NULL values can always be filtered with things like an IS NULL in a query predicate. ON CONFLICT UPDATE/IGNORE concerns conflicts that are fundamentally unpredictable, in that they relate to the state of the entire table (including state not visible to the command's MVCC snapshot).

SQLite's INSERT ... ON CONFLICT REPLACE

Similarly, does not accept an arbiter unique index. Same objections apply with the SQLite command as apply to MySQL's UPSERT (the INSERT ... ON DUPLICATE KEY UPDATE command).

Teradata's Atomic UPSERT (UPDATE ... ELSE INSERT ...)

Little used, not much advantage over rolling our own. Does seem to mandate UPDATE predicate that implies one particular unique index, and does make concurrency guarantees, though.

VoltDB's UPSERT

VoltDB's new UPSERT command [17] allows the user to UPSERT, with it only being possible to UPDATE using the excluded-from-insert tuple as a whole (without specifying which attributes to UPDATE when the alternative path is taken at all). This simplicity does have a certain appeal [18].

Adopting VoltDB's UPSERT command for PostgreSQL (or any non-standard syntax that has the implementation simply do a whole-row UPDATE in the event of a would-be conflict) is thought to imply usability issues that are unlikely to be acceptable [19].

Syntax as proposed in the patch - INSERT ... ON CONFLICT {UPDATE | IGNORE}

The proposed INSERT ... ON CONFLICT UPDATE patch is most comparable to Teradata's UPSERT, or VoltDB's UPSERT, or MySQL's INSERT ... ON DUPLICATE UPDATE, or SQLite's INSERT ... ON CONFLICT REPLACE. It is essentially an entirely custom syntax, though.

Custom syntax advantages

  • Total freedom to define semantics
  • Quicker and easier to implement

Custom syntax disadvantages

  • Users of ORMs etc need to add support to their tools, wait for support to be added, or break out into "native SQL" and do direct queries. Given the huge adoption of ORMs like JPA (Hibernate/EclipseLink/etc), ActiveRecord, etc, this is a serious concern; it will take longer for the feature to be accessible to users. On the other hand, there doesn't seem to be any evidence of these ORMs currently supporting SQL MERGE, possibly because of its failure to guarantee atomicity. Also, the Django people have expressed interest in the new syntax. Django core team member and notable Postgres-feature-implementer Marc Tamlyn said of the new syntax: "So in my opinion, yes we would support it in a future version of Django, potentially not that far in the future" [20].
  • Barrier to adoption/understanding
  • We risk inventing new problems along with our new solutions, then being stuck with them
  • Non-standard
  • To the extent that a non-standard syntax represents a "true UPSERT", with strict guarantees around concurrency that meet our goals for UPSERT in PostgreSQL, the custom syntax also implies that there is no useful optimization leeway. There is no conventional join involved, and so bulk loading/data warehousing use cases will probably be better off with SQL MERGE.

Overview of ON CONFLICT syntax itself

The proposed patch modifies the INSERT statement such that its syntax matches this description:

 [ WITH [ RECURSIVE ] with_query [, ...] ]
 INSERT INTO table_name [ ( column_name [, ...] ) ]
     { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
     [ ON CONFLICT [ ( { column_name_index | ( expression_index ) } [ COLLATE collation ] [ opclass ] [, ...] [ WHERE index_predicate ] ) ]
       { IGNORE | UPDATE
         SET { column_name = { expression | DEFAULT } |
               ( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] )
             } [, ...]
         [ WHERE condition ]
       }
     ]
     [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]

Example of the use of the feature's syntax (as currently proposed) follow:

 -- These examples assume the following table:
 CREATE TABLE upsert (key int4 PRIMARY KEY, val text, is_active boolean default true);
 
 
 -- IGNORE variant (does nothing on conflict, no
 -- row locks taken, mostly an ETL thing):
 INSERT INTO upsert(key, val) VALUES(1, 'insert')
   ON CONFLICT IGNORE;
 
 
 -- Can have had system infer unique index from
 -- inference expression, and make that arbiter
 -- of whether or not to IGNORE, though - just ask:
 INSERT INTO upsert(key, val) VALUES(1, 'insert')
   ON CONFLICT (key) -- optional for "IGNORE" variant
   IGNORE;
 
 
 -- Predicate within UPDATE auxiliary statement
 -- (row is still locked when the UPDATE predicate
 -- isn't satisfied):
 INSERT INTO upsert(key, val) VALUES(1, 'insert')
   ON CONFLICT (key) UPDATE -- inference expression "(key)" mandatory for UPDATE variant
   -- EXCLUDED.* carries forward effects of
   -- INSERT BEFORE row-level triggers:
   SET val = EXCLUDED.val 
   WHERE TARGET.val != 'delete'; 
 
 
 -- Multi-row insert-or-update, with reference
 -- to rejected tuples using special EXCLUDED.*
 -- expression:
 INSERT INTO upsert(key, val)
   VALUES(1, 'Foo'), (2, 'Bar'), (3, 'Baz')
   ON CONFLICT (key) UPDATE SET val = EXCLUDED.val;
 
 
 -- RETURNING will project successfully inserted
 -- rows, *and* successfully updated rows:
 WITH t AS(
   INSERT INTO upsert(key, val)
     VALUES(11, 'Foo'), (12, 'Bar')
     ON CONFLICT (key) UPDATE SET val = EXCLUDED.val
     RETURNING TARGET.key, TARGET.val -- TARGET.* alias must be used (if the RETURNING columns are qualified at all), even here
 )
 SELECT key, val FROM t;
 
 
 ALTER TABLE upsert DROP CONSTRAINT upsert_pkey;
 CREATE UNIQUE INDEX text_index ON upsert (val);
 
   
 -- Infer only particular opclass (fails):
 INSERT INTO upsert(key, val)
   VALUES(1, 'Foo') ON CONFLICT (val text_pattern_ops) UPDATE SET val = EXCLUDED.val;
 
 -- Infer only particular collation (fails):
 INSERT INTO upsert(key, val)
   VALUES(1, 'Foo') ON CONFLICT (val collate "C") UPDATE SET val = EXCLUDED.val;
 
 DROP INDEX text_index;
 
 
 -- Partial unique index support:
 CREATE UNIQUE INDEX ON upsert (key) WHERE is_active;
 
 INSERT INTO upsert(key, val) VALUES(10, 'Wombat')
   -- Works with only partial index on "key",
   -- covering "WHERE is_active" (this would also
   -- work if we didn't drop "upsert_pkey" and
   -- create the partial unique index)
   ON CONFLICT (key WHERE is_active)
   UPDATE SET val = EXCLUDED.val;

The syntax does not accept subqueries within either the UPDATE targetlist, or the UPDATE predicate; only the row being updated (via the TARGET.* alias), and the row originally proposed for insertion (via the EXCLUDED.* pseudo-alias expression) may be referenced. But the INSERT is not further restricted in any way - it may accept tuples from any source, have a VALUES() with subqueries within its value list, have CTEs, appear in data-modifying CTEs, etc. While somewhat restricted, the UPDATE may still make use of operators and functions in its targetlist and WHERE clause freely.

All of these restrictions are identical to restrictions on the structure of SQL MERGE query WHEN MATCHED THEN UPDATE "handlers"; those handlers have their own similarly restricted WHERE clauses and targetlists, that can only reference the 2 tuples on each side of the JOIN driving the MERGE -- ON CONFLICT UPDATE only allows referencing of the TARGET.* and EXCLUDED.* pseudo-aliases. The big restriction with INSERT with ON CONFLICT UPDATE as compared to MERGE is that the user must always be happy with an INSERT as one possible outcome - this provides the implementation with a useful way to terminate the retry loop, which appears necessary in order to offer users the "essential UPSERT property". Only a would-be duplicate violation can arbitrate whether or not the alternative path is taken (and not a totally flexible JOIN finding or failing to find an existing tuple in the target, which is far more ticklish). The ON CONFLICT UPDATE feature is about guarantees, whereas MERGE makes few or no guarantees around concurrency (e.g. never getting a duplicate violation in simple cases), either as described by the standard, or as described by the documentation of any real-world implementation, or as actually implemented in other systems.

In the future, it will be quite feasible (if not necessarily desirable) to modify the ON CONFLICT UPDATE implementation to have multiple possible "handlers", evaluated in sequence like SQL MERGE, including a DELETE-based handler. Provided the INSERT cannot accept a WHERE clause, and continues to "drive" the UPSERT, that should work fine, while preserving the "essential UPSERT property". This could be simulated with the current implementation, by just locking (and not updating) a row, and then deleting the row with a new READ COMMITTED command (so as to be sure to have a new snapshot, just in case the row locked is not visible to the original MVCC snapshot); subtleties with visibility make that approach something that probably shouldn't be widely advised. Still, provided a distinct, new command/snapshot is used, this workaround to the lack of "DELETE handlers" should be safe.

Command tag

The feature uses a dedicated "UPSERT" command tag, reporting the number of rows inserted or updated. The IGNORE variant always reports rows using the existing "INSERT" command tag, though. For example:

 postgres=# create table upsert (key int4 primary key, val text);
 CREATE TABLE
 postgres=# INSERT INTO upsert values(1, 'Foo'), (2, 'Bar') ON CONFLICT (key) UPDATE SET val = EXCLUDED.val;
 UPSERT 0 2
 postgres=# INSERT INTO upsert values(2, 'Baz'), (3, 'Fizz') ON CONFLICT (key) UPDATE SET val = EXCLUDED.val;
 UPSERT 0 2
 postgres=# INSERT INTO upsert values(1, 'Tres'), (2, 'Mono') ON CONFLICT IGNORE;
 INSERT 0 0

Restrictions on query structure in detail

As already mentioned, these restrictions affect targetlist and predicate, just as with SQL MERGE:

 INSERT INTO UPSERT VALUES (17, 'Woody') ON CONFLICT (key) UPDATE SET val = (SELECT 'f');
 ERROR:  0A000: cannot use subquery in ON CONFLICT UPDATE
 LINE 1: ...VALUES (17, 'Woody') ON CONFLICT (key) UPDATE SET val = (SELECT 'f...
                                                                    ^
   
 INSERT INTO UPSERT VALUES (18, 'Carl') ON CONFLICT (key) UPDATE SET val = (SELECT val FROM upsert);
 ERROR:  0A000: cannot use subquery in ON CONFLICT UPDATE
 LINE 1: ... VALUES (18, 'Carl') ON CONFLICT (key) UPDATE SET val = (SELECT va...
                                                                    ^
 
 INSERT INTO UPSERT VALUES (19, 'Trevor') ON CONFLICT (key) UPDATE SET val = 'f' WHERE TARGET.key IN (SELECT 1);
 ERROR:  0A000: cannot use subquery in ON CONFLICT UPDATE
 LINE 1: ...evor') ON CONFLICT (key) UPDATE SET val = 'f' WHERE TARGET.key IN (SELECT...
                                                                           ^

Note that the syntax does not restrict the "INSERT part" of these queries in any way. To give a not particularly practical example of how the "INSERT part" of the statement is no less flexible than INSERTs in general, the following is possible:

 postgres=# select * from upsert;
  key | val 
 -----+-----
    1 | a
    2 | b
    3 | c
 (3 rows)
 
 postgres=# insert into upsert select key, val from upsert on conflict (key) update set val = 'new';
 UPSERT 0 3
 postgres=# select * from upsert;
  key | val 
 -----+-----
    1 | new
    2 | new
    3 | new
 (3 rows)

"Cardinality violation" errors in detail

The SQL standard requires that MERGE raise a "cardinality violation" error [21] when more than one row is affected by an SQL MERGE command. The following examples show questionable usage of the ON CONFLICT UPDATE feature, which will cause PostgreSQL to raise an analogous cardinality violation error (unless otherwise noted):

 -- Raises "cardinality violation" error.
 --
 -- Doesn't matter if there is or is not existing row with
 -- value of 1 within "key"
 INSERT INTO upsert(key, val)
   VALUES(1, 'Foo'), (1, 'Bar')
   ON CONFLICT (key) UPDATE SET val = EXCLUDED.val;
 ERROR:  21000: ON CONFLICT UPDATE command could not lock/update self-inserted tuple
 HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
 
 
 -- Also raises same error, since row locked within
 -- ExecLockUpdateTuple() a second time after first update:
 INSERT INTO upsert(key, val)
   VALUES(5, 'Fizz'), (5, 'Arr')
   ON CONFLICT (key) UPDATE SET val = EXCLUDED.val
   -- XXX: Maybe only raise error when UPDATE required?
   WHERE EXCLUDED.val != 'Arr';
 ERROR:  21000: ON CONFLICT UPDATE command could not lock/update self-inserted tuple
 HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
 
   
 -- Does not raise error, provided row with PK value
 -- 5  already exists (and so aren't inserted and then
 -- locked by same command, as might have happened in
 -- previous examples).
 --
 -- We merely lock the one existing row twice, which does
 -- *not* result in a cardinality violation error:
 INSERT INTO upsert(key, val)
   VALUES(5, 'Fizz'), (5, 'Arr')
   ON CONFLICT (key) UPDATE SET val = EXCLUDED.val
   WHERE 1 = 2;

The idea of raising "cardinality violation" errors is to ensure that any one row is affected no more than once per statement executed. In the lexicon of the SQL standard's discussion of SQL MERGE, the SQL statement is "deterministic". The user ought to be confident that a row will not be affected more than once - if that isn't the case, then it isn't predictable what the final value of a row affected multiple times will be.

The implementation figures out that a cardinality violation occurred due to HeapTupleSatisfiesUpdate() returning HeapTupleInvisible (and the command ID matches the current command's, which it should, since a HeapTupleInvisible return value is traditionally a "can't happen" error).

Note that the possibility of a cardinality violation is considered after row locking, but before updating. So in last example above (with predicate of "WHERE 1 = 2"), an existing row needs to be there in order for there to be no cardinality violation - if the predicate passed for one proposed row, and it was successfully updated (or maybe inserted), whereas it did not pass for another proposed row (linked to the same existing row as the first proposed row) but still had to be locked, we'd still get a cardinality violation, even though the other row would not go on to be updated if we didn't get an error.

This restrictions imposed might be more severe than strictly necessary to only prevent the SQL standard's idea of a "cardinality violation", but this seems unlikely to matter.

The following example shows an arguably spurious "cardinality violation" that is actually pretty inconsequential in practice:

 postgres=# select * from upsert;
  key | val 
 -----+-----
 (0 rows)
 
 postgres=# WITH aa AS (
   INSERT INTO upsert VALUES (1, 'Foo')
   RETURNING *)
 INSERT INTO upsert SELECT * FROM aa
  ON CONFLICT (key) UPDATE SET val = EXCLUDED.val;
 ERROR:  21000: ON CONFLICT UPDATE command could not lock/update self-inserted tuple
 HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.

Note that if the UPSERT's values don't come from the data-modifying CTE, we'll just get a duplicate violation instead, due to the implementation-defined ordering of DML statement execution within the command:

 postgres=# select * from upsert;
  key | val 
 -----+-----
 (0 rows)
 
 postgres=# WITH aa AS (
   INSERT INTO upsert VALUES (1, 'Foo')
   RETURNING *)
 INSERT INTO upsert values(1, 'Bar')
  ON CONFLICT (key) UPDATE SET val = EXCLUDED.val;
 ERROR:  23505: duplicate key value violates unique constraint "upsert_pkey"
 DETAIL:  Key (key)=(1) already exists.
 SCHEMA NAME:  public
 TABLE NAME:  upsert
 CONSTRAINT NAME:  upsert_pkey

It seems quite unlikely that this theoretical risk of what are arguably spurious "cardinality violations" actually matters. It is only noted here for completeness. There is, however, arguably a difference with historic behavior here. To illustrate:

 postgres=# select * from upsert;
  key |  val   
 -----+--------
   20 | Before
   21 | Before
 (2 rows)
 
 postgres=# with aaa as (UPDATE upsert set val = 'After' returning key, val) update upsert set val = 'Outer Value' from aaa where upsert.key = aaa.key;
 UPDATE 0
 postgres=# select * from upsert;
  key |  val  
 -----+-------
   20 | After
   21 | After
 (2 rows)

Note that the outer UPDATE does not affect any rows (their "val" remains 'After', and is never 'Outer Value'), because the nested data-modifying CTE updates first. This is because ExecUpdate() (fed here by a CTE Scan) handles a HeapTupleSelfUpdated return value from heap_update() as follows:

 /*
  * The target tuple was already updated or deleted by the
  * current command, or by a later command in the current
  * transaction.  The former case is possible in a join UPDATE
  * where multiple tuples join to the same target tuple. This
  * is pretty questionable, but Postgres has always allowed it:
  * we just execute the first update action and ignore
  * additional update attempts.

So the top-level UPDATE finds the tuple modified by the data-modifying CTE it joins on. Each ExecUpdate() call outside the CTE becomes a no-op.

Simply ignoring additional update attempts (as ExecUpdate() already does here) is an alternative to throwing a cardinality violation for ON CONFLICT UPDATE that isn't outrageous, but is thought to be more surprising given the behavior of SQL MERGE, and the increased likelihood of this kind of thing with UPSERT. The comparison between the new HeapTupleInvisible handler, and the existing ExecUpdate() handler for HeapTupleSelfUpdated is more or less an "apples to apples" comparison; in contrast to the ExecUpdate() call to heap_update(), a HeapTupleSelfUpdated return value from heap_lock_tuple() is impossible from the new ExecLockUpdateTuple() call site. The comparison is more or less "apples to apples" because a DirtySnapshot is used by the index scan that finds conflicting TIDs, and never an MVCC snapshot as in the case of ExecUpdate(). The exact observed return codes (ultimately originating from HeapTupleSatisfiesUpdate()) are different, but the basic issue of a situation arising where we detect that a command affects the same row multiple times is the same.

Challenges, issues (with ON CONFLICT patch)

Current open items (items that are either points of contention, or that clearly must be fixed) are listed in this category. These are distinct from interesting/odd behavior that should be discussed ("Miscellaneous odd properties") that the patch exhibits, which is also discussed.

Open Items

Open Items (items which definitely must be fixed, not just discussed) are:

Value Locking mechanism

See dedicated value locking page for a definition of value locking, and a full explanation.

Update: Controversy has more or less died down. Approach #2 is likely to be used in final patch, but maintaining approach #1 in parallel with approach #2 seems useful for testing, since approach #1 has the most "conceptually pure" implementation of Value locking; it will continue to be a useful basis of comparison. For example, Jeff Janes found race conditions within approach #2 only at one point [22], and the knowledge that #1 was apparently unaffected aided debugging.

RLS

Due to the timing of the patch, we have yet to consider row-level security (per-column privileges are considered, and have tests, however). Note, however, that the proposed ON CONFLICT UPDATE patch correctly preserved the guarantees of per-column privileges.

The current revision has a clear security bug pertaining to RLS. Example session demonstrating how to leak data that should be protected by "USING ( expression )" security qual, but isn't:

 postgres=> select * from upsert;
  key | val 
 -----+-----
    2 | Bar
 (1 row)
 
 postgres=> update upsert set val = val where key = 1;
 UPDATE 0
 postgres=> INSERT INTO upsert values (1, 'Random String') on conflict (key) update set val = TARGET.val;
 ERROR:  44000: new row violates WITH CHECK OPTION for "upsert"
 DETAIL:  Failing row contains (1, Secret).
 LOCATION:  ExecWithCheckOptions, execMain.c:1719

Note that the system leaked the existing tuple values (that is, "1, Secret)"), in violation of the guarantees made by RLS. The CREATE POLICY docs do state: "An example of this is attempting to insert a duplicate value into a column which is the primary key or has a unique constraint. If the insert fails then the user can infer that the value already exists (this example assumes that the user is permitted by policy to insert records which they are not allowed to see)". However, this seems like something considerably more serious than that, since non-constrained values are reported, too.

In this example, the query written by a malicious party does not INSERT because there was an existing row (if there was no existing row, then there'd be principled, standard enough RLS check error, but with reporting of values supplied by the user). When it goes to UPDATE, security quals are not added to the UPDATE, and so the UPDATE actually is at least attempted. The check does fail at the UPDATE stage, so the malicious party is denied the opportunity to write data spuriously, but as things stand ON CONFLICT effectively gives attackers a way to avoid having security quals added to the UPDATE.

Update: Following discussion with Stephen Frost [23], it isn't immediately clear that the solution here is to make sure that the auxiliary UPDATE has an appropriate security qual. More discussion is required.

Update: Following further discussion with Stephen Frost [24], V2.0 was written which defines semantics that are thought to be acceptable, and avoids the leak outlined above. See CREATE POLICY documentation (linked to above) for more information.

Update: A finer-grained approach is probably needed, after all. UPDATE-related tuples will probably end up never having INSERT-related with check options enforced, for example. V3.4 takes this approach.

Partial unique indexes

It seems like we should probably have partial unique index support from the first committed version. This is important for the UPDATE variant in particular, where currently we cannot infer a unique index from any legal specification, even though we must in order to use such indexes -- so currently we can't use them. This is something that a future revision will hopefully natively support without naming the unique index directly, which remains unacceptable.

Note that the IGNORE variant always supported partial unique indexes, provided the user didn't require that ignored would-be unique violations were restricted to some particular partial unique index.

Update: V1.8 has the ability to infer partial unique indexes via its new, extended unique index inference specification clause. Importantly, since the UPDATE variant mandates an inference specification, this addition makes it possible to use the feature to UPSERT with partial unique indexes. Note that any tuple proposed for insertion must ultimately be covered by the partial index - otherwise an error is thrown within the executor.

Non-default opclasses

There should be a way of supporting non-default opclasses within the first committed version. Non-default opclasses seldom or never introduce an alternative notion of equality as compared to the default opclass (that is, there "equals" operator is in practice (almost?) invariably the same as that of the default opclass, conventionally "operator =(type, type)"; the non-default opclasses shipped with Postgres introduce alternative sort orders, not alternative notions of equality).

Note that this implies creating a new note under the doc's "System Dependencies on Operator Classes" subsection [25]. Like DISTINCT or GROUP BY, the idea of equality for the purposes of ON CONFLICT UPDATE/IGNORE unique index inference is likely to be formalized as the idea of equality implied by the default opclass "equals" operator (which in practice gives the implementation leeway to use at least all shipped non-default opclasses).

Maybe the unique index inference process would be better off formally not caring about the use of non-default opclasses [26].

Update: V1.8 formally documents that opclasses that support distinct notions of "equals" (which is inconsistent with established PostgreSQL convention) might theoretically be problematic. It formally becomes a matter of whatever unique indexes happen to be available on the expressions/attributes under consideration driving that definition of "equals" (by having one operator class or another). All shipped non-default operator classes have identical "equals" operators to their type's default anyway, with the sole exception of the internal record non-default opclass [27] (this is only used to implement materialized views, and so is deliberately obfuscated with non-standard operator spellings).

Update: V3.2 allows the user to specify an opclass (or collation), should that be semantically significant. Even where it is, a difference in opclass across actually defined unique indexes is required for failing to account for such a difference to be a real problem.

RETURNING behavior

In the ON CONFLICT UPDATE/IGNORE patch, RETURNING currently only return tuples actually inserted - as when a BEFORE ROW INSERT trigger returns NULL, RETURNING returns no tuple when an insert does not actually proceed. Original discussion of this point [28] shows the consensus in now that the implementation should also return/project updated rows in the event of a RETURNING clause. This is expected in the next revision (V1.5).

Update: V1.5 establishes this behavior for RETURNING. Adopting this behavior occurred in consultation with the Django community [29].

Note that RETURNING does not make visible the "EXCLUDED.*" alias from the UPDATE (just the generic "TARGET.*" alias is visible there). Doing so is thought to create annoying ambiguity for the simple, common cases [30] for little to no benefit. At some point in the future, we may pursue a way of exposing if RETURNING-projected tuples were inserted and updated, but this probably doesn't need to make it into the first committed iteration of the feature [31].

postgres_fdw

There is no support for postgres_fdw. This is probably a fairly straightforward matter of adding new deparsing logic to deparseInsertSql().

Implementation has support for the IGNORE variant (provided no unique index inference specification is provided, because there is no concept of a unique index on a foreign table. There is never anything to infer.). postgres_fdw will support ON CONFLICT UPDATE, purely because that variant mandates an inference specification clause.

Syntax (MERGE, custom, other)

Update: While some people continue to want to use a custom MERGE-like syntax that is inspired by MERGE, the idea of a fully general MERGE that offers UPSERT-like guarantees has fallen out of favor generally. [32] [33] [34]. This just leaves open the details of to what extent a custom syntax, constrained in a way similar to INSERT ... ON CONFLICT UPDATE takes inspiration from MERGE. That is a far less contentious issue, since we're essentially only talking about spelling.

Consensus seems to be that an alias-like referencing to the tuple (in the style of OLD.*/NEW.*) is desirable. This necessitates adding a dummy RTE to the query during parse analysis, which presents its own difficulties.

Update: V1.4 has an pseudo-alias EXCLUDED.*/TARGET.* syntax (these aliases are only visible in the UPDATE auxiliary query - neither alias will be visible in the INSERT's RETURNING clause, if any, for example). Issues around syntax seem to have very much settled down - it seems opinion has converged around the most recent formulation of the ON CONFLICT UPDATE syntax (with pseudo-alias EXCLUDED.* expressions, and new RETURNING behavior).

Plan/EXPLAIN output

As of V1.3, a sequential scan, which is hidden and has certain aspects folded into its parent ModifyTable node in EXPLAIN output, is now guaranteed. So recent versions look like this (note the lack of any relation scan node - there is a hidden sequential scan that will never be executed in the conventional manner, but only selectively, by using it with EvalPlanQual()).

Update: V2.2 now shows TARGET.* alias in all explain nodes (due to parser refactoring):

Update: V3.3 now shows "Conflict Arbiter [unique] Indexes" in EXPLAIN output (all formats):

 postgres=# EXPLAIN ANALYZE INSERT INTO upsert VALUES(1, 'Art')
   ON CONFLICT (key) UPDATE SET val = EXCLUDED.val WHERE TARGET.key = 1;
                                                       QUERY PLAN                                                      
 ----------------------------------------------------------------------------------------------------------------------
  Insert on upsert target  (cost=0.00..0.01 rows=1 width=0) (actual time=0.136..0.136 rows=0 loops=1)
    Conflict Arbiter Indexes: upsert_pkey
    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.003 rows=1 loops=1)
    ->  Conflict Update on upsert target  (cost=0.00..25.88 rows=6 width=36) (actual time=0.052..0.052 rows=0 loops=1)
          Filter: (key = 1)
  Planning time: 0.149 ms
  Execution time: 0.254 ms
 (6 rows)

Inheritance

We should probably figure out a way to support limited cases of table inheritance/partitioning, where the partition key is conflict-updated on.

Update: version V1.3 has good support for inheritance (and updatable views).

Update: version V3.2 removes the final limitation on inheritance: inheritance parents can support unique index inference (and so can support UPSERT). Of course, the general restrictions that inheritance has on enforcing unique constraints across inheritance children limit the usefulness of using UPSERT with inheritance.

Trigger behavior

Trigger behavior for statement-level triggers seems funny (row-level triggers are probably fine, though). Simon Riggs writes of MERGE [35]:

It's unclear whether MERGE should activate statement-level triggers, or not. Few of the above sources are explicit either way on this point. DB2 always runs both UPDATE and INSERT statement-level triggers, whether or not rows have been changed; I would suggest we do that also for before triggers. For after statement triggers I would suggest that we track how many updates and inserts are caused and if updates > 0 then we activate the after statement for update triggers, and if inserts > 0 then we activate the after statement for insert triggers. If a statement level trigger is activated by both update and insert then it would be possible for both TRIGGER_FIRED_BY_UPDATE() and TRIGGER_FIRED_BY_DELETE() to be true (for statement level triggers only), which would be a change from what we do now, even if the old behaviour was not explicitly mutually exclusive. In both cases I suggest we run Update triggers before Insert triggers consistently for both before and after statement triggers.

On the other hand, Rules can't really work with MERGE or UPSERT (note that user-defined rules are completely unsupported by the ON CONFLICT patch) - it's not clear how various types of user-defined rules should affect a hybrid INSERT/UPDATE top level statement, which is effectively what INSERT with an ON CONFLICT UPDATE clause is. Therefore, rather than attempting to expand user-defined rules, the implementation throws an error.

Kevin Grittner points out [36] that the standard requires that all statement-level triggers must be fired regardless of how many rows are affected. Discuss.

Update: V1.3 has the above behavior, which appears to be the consensus. Therefore, the implementation always fires insert and update statement-level triggers (both BEFORE and AFTER, for both UPDATE and INSERT, regardless of whatever else happened during statement execution).

Update: V1.6 standardizes and documents the order in which statement-level triggers are executed [37].

Exposing whether or not trigger functions are called as a result of an ON CONFLICT IGNORE is a concern [38], particularly with INSTEAD OF triggers on views, where the corresponding "redirected" insert (redirected by the user-defined trigger function) should probably also IGNORE, as already happens automatically in the case of updatable views with ON CONFLICT IGNORE. This may be addressed later.

Miscellaneous odd properties of proposed ON CONFLICT patch

This section lists unusual properties of the proposed patch, that are qualitatively different to any existing DML behaviors. These are not considered open items because there is no dispute around the behaviors, and they may well be totally acceptable. They are highlighted here (and not under "Open Items") as possible points of concern in the future, that ought to be discussed sooner rather than later.

Why still lock row when UPDATE predicate isn't satisfied?

The implementation uses heap_lock_tuple() to lock a row ahead of deciding to UPDATE. The row is locked ahead of evaluating the UPDATE's predicate, on conclusively locked, latest tuple version - if the implementation finds that there has been a concurrent UPDATE when row locking, it loops back to the start and retries (it re-finds the tuple using a DirtySnapshot, and may even go on to INSERT if the concurrent UPDATE created a new non-conflicting tuple version). This is convenient to the implementation, because the UPDATE qual need only be evaluated once, on a conclusively locked row version. During a discussion about SQL MERGE several years ago, Simon Riggs considered the possibility of having to do this [39]. It allows us to not have to teach ExecUpdate() to care about the case where there was a conflict -- conflicts at the row level necessitate restart (not the usual follow the update chain EvalPlanQual() thing). In the patch, ExecUpdate() is called with the row already locked by a dedicated heap_lock_tuple()-calling routine.

This seems to also make sense on user-visible grounds. It would be surprising if a REPEATABLE READ transaction had a serialization failure due to the same UPSERT statement being executed twice. After all, that won't happen with a regular UPDATE (although in that case, it won't happen because our xact's snapshot won't see a new version where the UPDATE qual is satisfied). This approach seems less different. Besides, it's quite possible that someone only wants to lock rows, and making that happen with a "WHERE 1 = 2" predicate is perhaps neater, and does not involve UPDATE row-level triggers (unlike a "column = column" targetlist, which isn't quite as close to an "UPDATE no-op").

Note that postgres_fdw also locks rows ahead of updating them [40], in the first of two distinct phases (so the deparsed UPDATE statement executed on the foreign server actually contain a ctid-based qual only - the TID that the query established the right to UPDATE by locking in the "first phase", using SELECT FOR UPDATE). This is because similarly, the postgres_fdw-based UPDATE needs to establish the right to UPDATE ahead of doing so (there could be local and foreign quals).

Visibility issues and the proposed syntax (WHERE clause/predicate stuff)

There are, in general, certain situations in which PostgreSQL's READ COMMITTED isolation level can give results that violate snapshot isolation (sometimes known as an "MVCC violation", or "Snapshot Isolation violation" - this happens using an executor facility known internally as EvalPlanQual()).

Example setup:

CREATE TABLE x
(
    c int
);

INSERT INTO x VALUES(1);
Read Committed mode snapshot isolation violation example
session 1 session 2
BEGIN;
UPDATE x SET c = c + (SELECT max(c) FROM x);
BEGIN;
UPDATE x SET c = c + (SELECT max(c) FROM x);

Session 2 performs the same query concurrently.

COMMIT;
COMMIT;
-- Returns no rows (since value within only row's "c" column is actually 3):
select * from x where c = 4;

This, and other similar cases are better documented in the official documentation [41], or the executor README [42].

The proposed UPSERT patch adds some interesting variation. During an early discussion of SQL MERGE, Robert Haas originally pointed out [43] the necessity of a new "MVCC violation" in order to ensure that the stated goals for UPSERT could be met (in particular, atomicity in the sense of always getting an INSERT or UPDATE in READ COMMITTED mode):

But let's back up and talk about MVCC for a minute. Suppose we have three source tuples, (1), (2), and (3); and the target table contains tuples (1) and (2), of which only (1) is visible to our MVCC snapshot; suppose also an equijoin. Clearly, source tuple (1) should fire the MATCHED rule and source tuple (3) should fire the NOT MATCHED rule, but what in the world should source tuple (2) do? AFAICS, the only sensible behavior is to throw a serialization error, because no matter what you do the results won't be equivalent to a serial execution of the transaction that committed target tuple (2) and the transaction that contains the MERGE.

Even with predicate locks, it's not obvious how to me how to solve this problem. Target tuple (2) may already be there, and its transaction already committed, by the time the MERGE statement gets around to looking at the source data.

UPSERT (or at least, any implementation that meets our goals) potentially UPDATEs a row without any version of it being visible to the command's MVCC snapshot (again, in READ COMMITTED mode only). This is distinct from the prior MVCC violation just illustrated, in that in order for the prior MVCC violation to occur, at least some version of the row being updated must be visible; in general, a relation scan from which a ModifyTable node pulls up tuples expects to pull up tuples that are visible to the MVCC snapshot. Whereas INSERT ... ON CONFLICT UPDATE may UPDATE a tuple only accessible to the command via a DirtySnapshot (in READ COMMITTED mode - higher isolation levels throw a serialization error when they find the tuple isn't visible to their MVCC snapshot following a special check). The row will be visible to new snapshots, but is not necessarily visible to our own.

The UPDATE WHERE clause (predicate) is only evaluated once, on this conclusively-locked tuple, and not any other version of the same tuple. It is therefore possible (in READ COMMITTED mode) that the predicate may "fail to be satisfied" according to the command's MVCC snapshot. As already explained, it might simply be that there is no row version visible, but it's also possible that there is some row version that is visible, but only as a version that doesn't satisfy the predicate. If, however, the conclusively-locked version satisfies the predicate, it is posited that that is good enough and the tuple is UPDATEd. The MVCC-snapshot-visible row version is denied the opportunity to prevent the UPDATE from taking place, because we don't walk the UPDATE chain in the usual way.

Peter Geoghegan writes [44]:

We all seem to be in agreement that we should update at READ COMMITTED if *no* version of the tuple is visible. It seems utterly arbitrary to me to suggest that on the one hand it's okay to introduce one particular "MVCC violation", but not another equivalent one. The first scenario is one in which we update despite our update's (or rather insert's) "predicate" not being satisfied (according to our MVCC snapshot). The second scenario is one in which the same "predicate" is also not satisfied according to our MVCC snapshot, but in a slightly different way. Why bother introducing a complicated distinction, if it's a distinction without a difference? I'd rather have a behavior that is consistent, easy to reason about, and easy to explain. And so, the predicate is considered once, after conclusively locking a conflict tuple.

Update: This issue is directly illustrated by a new isolation tests added in V1.3 (see insert-conflict-update-3).

Theoretical lock starvation hazards

The file src/backend/access/heap/README.tuplock states [45]:

When it is necessary to wait for a tuple-level lock to be released, the basic delay is provided by XactLockTableWait or MultiXactIdWait on the contents of the tuple's XMAX. However, that mechanism will release all waiters concurrently, so there would be a race condition as to which waiter gets the tuple, potentially leading to indefinite starvation of some waiters. The possibility of share-locking makes the problem much worse --- a steady stream of share-lockers can easily block an exclusive locker forever. To provide more reliable semantics about who gets a tuple-level lock first, we use the standard lock manager, which implements the second level mentioned above. The protocol for waiting for a tuple-level lock is really LockTuple() XactLockTableWait() mark tuple as locked by me UnlockTuple()

The current UPSERT patch has lock arbitration that is a little fuzzier than this. Shared lockers don't cause conflicts, and we're using heap_lock_tuple(), so arbitration behaves approximately fairly in practice (we aren't attempting to simply grab the lmgr-controlled row lock, which the README is talking about: we're attempting to lock the row using the higher-level heap_lock_tuple() facility, whose implementation is actually described here). The exact same risk exists when using the "subxact looping" pattern, the existing approach that the docs suggest -- retrying index tuple inserts in the face of possible conflicts (conflicts due to concurrent insertions) also has loose lock arbitration rules with respect to the order that conflicting inserters complete relative to each other (i.e. there is no strict queue fairness). Stress-testing has shown the risk of lock starvation to be vanishingly small, but it's still a theoretical risk.

No one has expressed concern about this, but it's an issue that deserves highlighting as a possible concern. Note that with other snapshot isolation databases, like Oracle, handling of concurrent UPDATEs at READ COMMITTED is quite different to the equivalent handling within Postgres: The entire statement is undone and restarted with a new snapshot, so that snapshot-isolation cannot be violated (which EvalPlanQual() allows). In principle, this can loop forever (possibly, it errors-out after enough retries). Oracle basically considers READ COMMITTED to be what you should use when you want Oracle to restart the statement, rather than the app restarting the transaction (as in SERIALIZABLE mode).

Personal tools