From PostgreSQL wiki
Note: contains items pertinent to ongoing discussion of INSERT ... ON CONFLICT UPDATE/IGNORE patch: https://commitfest.postgresql.org/action/patch_view?id=1564
See also: UPSERT for a broader discussion.
Approaches to "Value Locking"
The term "Value Locking" refers to the idea of locking a value in the abstract. There need be no existing object to lock - the value "5" may need to be locked in the abstract (maybe objects can be created that logically represent this state of affairs, but that's just an implementation detail). (Almost?) invariably, this mechanism involves unique indexes. This page considers various approaches to "value locking" for the purposes of implementing UPSERT. Value locking is a term that isn't well established, but is an idea that appears in various different contexts.
In "Modern B-Tree Techniques" by Goetz Graefe (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.219.7269&rep=rep1&type=pdf), Graefe writes:
The terms key value locking and key range locking are often used interchangeably. The purpose of locking key ranges instead of key values only is to protect one transaction from insertions by another transaction.
The classic example of "Value Locking" is the key-range locking that exists in 2PL systems like DB2 and SQL Server. Range locking has a long history; it has existed since the 1970s, since it forms the basis of legacy approaches to implementing serializabile isolation level. Key-range locking might be considered a specific implementation of "value locking", which is intended as a more general term.
InnoDB adds "Record, Gap, and Next-Key Locks" , despite the fact that it is an MVCC system first and foremost (apparently in READ COMMITTED mode, gap locking just affects things like unique index enforcement and foreign keys ), and SQL Server uses them beyond serializable transactions (Graefe calls these "instant locks"). Clearly, their role is not limited to implementing 2PL-based SERIALIZABLE isolation level.
One could argue that PostgreSQL already has a very limited form of "value locking", required for unique index enforcement. The nbtree AM carefully locks buffers in such a way as to check for the existence of a duplicate ahead of establishing the right to insert. This needs to happen in the ordinary course of inserting into a unique index. Once a "value lock" exclusive buffer lock is taken on "the first leaf page a value could be on", that backend will either insert successfully, or throw an error: other backends inserting the same value will be denied the opportunity to themselves check for duplicates by obtaining that lock until the first backend finishes .
The buffer lock is conceptually a "value lock". Any row that is subsequently inserted isn't really a value lock at all; it's just a row that happens to contain a value. From a high level, the effect is about the same: Other backends must block on either the buffer lock (a conceptual "value lock") for an instant, or the row lock for as long as the potentially conflicting other transaction is still in progress. However, the distinction between value locks and rows locks is important for the purposes of describing how UPSERT must work in general. This distinction is somewhat confused by the fact that some of the proposed approaches to value locking described here (#2 and #3) use special index/heap tuples in various ways.
Value locking is important to UPSERT because it is needed to make the feature work. It needs to be possible to lock a value to reach consensus to proceed with inserting a heap tuple (or to mark an inserted-ahead-of-time heap tuple as being logically *really* inserted, or to mark a nbtree index tuple as actually pointing to some real heap tuple, or maybe even something else equivalent). There are at least 3 approaches to value locking that have been discussed in the context of UPSERT:
Comparison of various approaches
Note that approach #1 and #2 are maintained in parallel, as part of the ongoing development of the INSERT...ON CONFLICT UPDATE/IGNORE patch.
Update: As of V2.0, only approach #2 is maintained.
Main CommitFest entry: https://commitfest.postgresql.org/4/35/
The performance of schemes #1 and #2 are directly compared here: http://www.postgresql.org/message-id/CAM3SWZQvSf+UWpt1YfTqudc4Z2j5dwyBX2fQQfDR-1k-CB6eog@mail.gmail.com
Performance was measured by using the stress test suite (performance is only one aspect exercised by the suite): https://github.com/petergeoghegan/upsert
#1. Heavyweight page locking (Peter Geoghegan)
Basically, this is a generalization of the existing mechanism: Page buffer locks are extended to page heavyweight locks. They are held on to until UPSERT goes to row lock...but not during row locking, because that causes deadlocks (see below). There is some tricks played with page-level flags (private B-Tree flag bits in page special area) to make most inserts not have to care about the possibility of a heavyweight lock on leaf beyond checking the flag.
- Generalization of existing mechanism.
- All in-memory (except use of a single flag on B-Tree leaf pages that doesn't need to survive a hard crash). Can potentially be refined in the future without painting ourselves into a corner with on-disk compatibility.
- Exists, in review-quality form. May perform best, when lock contention on leaf pages isn't a significant issue.
- Has been tested extensively by stress-testing.
- Doesn't bloat. This is not thought to be a major advantage (at least compared to #2).
- Preserves invariant that some index tuple always points to some heap tuple (unlike #3).
- Doesn't deadlock. Has a strategy on making it cheap/easy to *release* value locks to avoid deadlocking (see below).
- Only works with nbtree AM. ON CONFLICT IGNORE might be made to work with exclusion constraints (but not ON CONFLICT UPDATE, which makes little sense).
- Necessitates putting something in the critical path of ordinary insertion. Leaf pages must be checked for heavyweight-lock-indicating flag now. This is probably not a performance problem, but that's unproven and untested.
- Has the potential for bugs. This doesn't seem to have been a focus of criticism, since it isn't clear that this is more or less true than alternative approaches.
Heikki on generalizing value locking to work everywhere (which this approach notably fails to do):
This is somewhat like what Graefe describes in the following:
Microsoft’s SQL Server product employs key range locking based on Lomet’s design , which builds on ARIES/IM and ARIES/KVL [93, 97]. Like ARIES, this design requires “instant locks,” i.e., locks with extremely short duration.
Note that it's assumed that "value locks" cannot ever be usefully held for more than an instant, since they must be released *prior* to row locking, as discussed below.
This is clearly not a reference to buffer locks - those are always referred to as "latches" by Graefe (and B-Tree papers in general).
#2. "Promise" heap tuples (Heikki Linnakangas)
Proposed by Heikki in response to #1 in late 2013 (with refinements coming later). Like #3, this approach has been around as a rough idea for some time . Basically, works by checking for existing tuple with dirty snapshot, and optimistically inserting heap tuple, followed by associated unique index tuples if required. Can get a conflict though, as when there is a race between two upserters that both decided to insert. When value locks need to be released before row locking, and there was a conflict, we must "super delete" speculative heap tuple. If we don't super delete speculative heap tuple, simple cases will deadlock indefensibly (see below).
- Generalized approach - INSERT IGNORE can work with exclusion constraints, too.
- Much less invasive to nbtree code than both #1 and #3. It's particularly complex and subtle code, of fundamental importance, and avoiding modifying it is in general a good thing.
Potentially, we can fix existing race where two inserters into exclusion constraints can both fail while we're at it.Update: This is now not thought to be true - this issue doesn't exist with the existing implementation of exclusion constraints (only concurrent deadlocking is possible when instead one or the other xacts is killed, but that hardly seems like a major issue since at least one inserter can proceed, and the queue ordering is more or less equivalent to a random ordering anyway).
- In later revisions, has a strategy on making it cheap/easy to *release* value locks to avoid deadlocking (see below). This is essential for any design that avoids unprincipled deadlock errors in READ COMMITTED.
- Closer to UPSERT subxact looping pattern than #1, because we all but abort subxact to *release* value locks (basically, special release-early locks are used, where last-until-end-of-xact locks were used before).
- Exists; has been put on an even footing with #1 WRT integration with new syntax, tests, and so on.
- Preserves invariant that some index tuple always points to some heap tuple (unlike #3)
- Not based on designs that are known to work well in some other systems...similarity to existing unique index/exclusion constraint enforcement may not be that useful.
- Seems to require making changes that invasive to several existing places (In contrast to #1, which is only invasive to the btree code). These places include HeapTupleSatisfiesDirty() (#1 doesn't touch tqual.c, except for one narrow case need for REPEATABLE READ/SERIALIZABLE isolation levels). Another example is that it adds a new reason to lock the procarray with a shared lock.
- Must "super delete" tuples that don't work out. We probably can't swap out InvalidTransactionId in xmin, for the same reason we couldn't with relfrozenxid (see commit message of 37484ad2). Might find some free infomask bits instead, but those are valuable, and it hasn't really been investigated at all. Requires thought.
- Performance probably inferior to #1 due to pre-check, which always goes to waste when inserts are the outcome. This could be ameliorated, possibly. However, does not have as many problems with lock contention as #1 when there is a concentration of values in one area of the B-Tree.
- Can cause bloat (albeit not all that much...pre-check tends to work well). Bloat isn't that strong a motivation to pick any particular design, since it seemingly isn't all that significant an issue here, so "no bloat" isn't much of an advantage for #1.
- Not entirely untested, but not as well tested as #1.
- In general, putting the responsibility for handling these issues in the heap may expose the code to a relatively large amount of other code, whose complex interactions could result in bugs. Perhaps it's unfair to hold the patch to always setting tuple xmin to InvalidTransactionId, but if we assume that it must, that *could* interact badly with a fairly high number of other places. This analysis hasn't been done, but must be.
#3. "Promise" index tuples (Andres Freund, Simon Riggs)
Important: Note that there is no actual patch in any form that implements this approach, so these points are based on what little information is available, and even more subject to change than those for #1 and #2.
- Has a passing resemblance to "ghost records", which are sometimes used in 2PL systems in a way that interacts with range locking.
- Objections to #2 wrt putting knowledge of this in heap tuples are unlikely to apply here.
- Doesn't exist today. Entirely unproven.
- Violates existing invariant that every index tuple has a corresponding heap tuple, which goes back to the Berkeley days. Currently, we always create heap tuples first, and physically delete them last.
- Relatedly, concern about breaking VACUUM. VACUUMing of btrees occurs with a list of TIDs to kill in index, built from the heap. Current bulk delete callback cannot be used for this. Super-exclusive lock is needed to delete tuples in btree page (i.e. VACUUM). Normally skipping of LockBufferForCleanup() (as added by bbb6e559) works okay in heap because it tends to imply that list of tids won't be bulk deleted in index, where we currently cannot skip for failure to quickly acquire super exclusive lock. So this could make the problem addressed by bbb6e559 return to some extent.
- "Index-only" bloat becomes a possibility. Implications are not well understood.
- We have to re-find any items using an ordinary index scan, not a tid. Must match our xid to that.
- Doesn't have a strategy for dealing with unprincipled deadlocks, at least at the moment. Presumably some aspects of #2 could be adopted here.
"Unprincipled Deadlocking" and value locking
Goals for UPSERT in Postgres include that the implementation avoids "unprincipled deadlocks" - deadlocks that the user has no reasonable way of avoiding . In order for a deadlock scenario to be considered "legitimate", it has to be due to a user-visible mutual dependency. It cannot be solely due to a mutual dependency that involves non-obvious implementation level locks (like value locks), where the user cannot restructure their code to avoid the mutual dependency. It should continue to be a realistic goal to have client applications that can never deadlock; application developers need to be able to be sure that queries within their transactions acquire locks in a consistent order, without implementation level hindrance that makes that impossible in the general case.
"Unprincipled deadlocks" are a problem that #1 always avoided, and #2 was later made to avoid following some work. Peter Geoghegan pointed out problems with "unprincipled deadlocks" thrown by an early prototype of approach #2 . Basically, the upshot of Geoghegan's complaint is that value locks cannot reasonably be made to persist across row locking. Otherwise, deadlocks are possible and even inevitable with high concurrency, and there is no reasonable way of avoiding that as a user.
The view that we cannot hold value locks, because to do so would risk unprincipled deadlocks has seemingly won acceptance. Heikki accepted that principle from an early stage . Heikki later revised his prototype (implementing approach #2) to fix "unprincipled deadlocks" in late 2013. Robert Haas more recently (September 2014) agreed that "unprincipled deadlocks" are unacceptable .
It is not intuitively the case that "value locks cannot reasonably be made to persist across row locking". To understand why this is true, consider the example of earlier versions of #2, and how that was shown to deadlock in a way deemed unacceptable. Peter Geoghegan writes :
What happens if row locking on upsert finds a conflict update changing uniquely-constrained attributes? Sure, a vanilla non-HOT update will fail on inserting a unique index tuple, but *it* can still cause us a row-level conflict, and *it* can only fail (with a dup violation) when we commit/abort. But now we're obligated to wait on it to get the row lock, and it's obligated to wait on us to get the promise tuple lock, or any other sort of "value lock" that hasn't already been released when we go to row lock. Deadlock. In general, locking a value successfully does not imply that the row will be locked. The implementation must loop. There may be conflicts necessitating restart at both the value locking stage, and row locking stages, although in practice the implementation tends to behave approximately fairly.
Geoghegan's pgCon talk discussed the problems with escalating from value locks to row locks, and how that causes deadlocks .