PgCon 2020 Developer Unconference/Edge cases of synchronous replication in HA solutions
PGCon 2020 Developer Unconference session on - Edge Cases of Synchronous Replication=
Add link here
(Described later in the session)
We generate a commit log record and when a query is cancelled because of statement-timeout or CTRL-C, We then release all the locks and get out of the snapshot. Another query that comes and tries to read the data from the first cancelled query will see the changes from the query. However then you might failover to sync-standby that did not yet get that commit record.
Sync Rep Edge cases
- Sync rep allows many solutions to be built. Works correctly for insert/updates
- Issues with UPSERT(INSERT ON CONFLICT). Running UPSERT was cancelled during switchover, driver ran against failed primary node. Data was reported as committed. Thread in hackers was started.
- Cancelling a query can leave locally committed changes that are not replicated to other nodes.
1. Client cancels query other queries can observe non replicated changes 2. Restart node and non replicated changes are observed by a query 3. Partially replicated changes can be observed by nodes without any failover
- We need more than 1 cancel message. First cancel, 'okay' second 'seriously cancel'
- When you cancel a query the data can be locally committed but not replicated to other sync replicas.
- You need to use 2PC to be absolutely sure to avoid this but the performance of this is not very good
- 2 cases. a) Client cancels query b) The connection looses connection. In case b the locks are maintained so this works much better.
- Other local transactions will see the results of the committed transaction before it is replicated. This is an undesirable property for some goals.
- You could hold back local snapshot generation so the new case is not complete for snapshot purposes until it has replicated.
- You could disregard cancel request locally if the data is not replicated.
- If you commit locally and process crash recovery you might be processing log that has not yet been replicated across.
- After a restart you can wait in the restart hook to wait for remote lsn to be equal to the local lsn to make sure you don't allow user connections to early.
- Maybe this is not fixable because there are non-atomic operations. Maybe the expectations on this are wrong. The initial design was to ensure that way we commit is written on two systems. The design wasn't to prevent observing different states on difference systems. Is this excessive expectations?
- We generate a commit log record and when a query is cancelled because of statement-timeout or CTRL-C, we release all the locks and get out of the snapshot. Another query that comes and tries to read the data from the first cancelled query will see the changes from the query. However then you might failover to sync-standby that did not yet get that commit record.
- Delaying the creation of the new snapshot sounds like a better fix than not allowing cancels, if it can be done correctly.
- The problem with above is crash recovery will still leak the transaction, though this could be dealt with by the client hook(described earlier)
- Is solving this for most cases enough?
- What would happen if we don't honor the cancel at this point? We don't honor cancels at other points(ie flushing). This sounds like a practical band-aid. This could make sense to avoid this issue but could there be other side-effects and this would not fix the underlying issue.
- Sometimes you really do want to cancel the query, ie if a client replica is far behind.
- Clients disconnecting and reconnecting can't be stopped but this actually works fine because the server will continue to hold the locks.
- What about termination? What if you don't allow cancel but someone tries termination. Termination means 'exit or you will be killed' so we shouldn't disallow. We would need the visibility issue solved for this(ie block the creation of snapshots).
- Treat the recently committed transaction as still running for the purpose of snapshots.
- Would this mean a transaction in progress might not have a backend
- Blocking snapshot creation is saying "You can't do anything until the transaction is replicated". Maybe this is the right answer but it will impact all the clients including readers.
- You could maybe implement this by using 2PC internally, never to a 'rollback prepared' just an implicit prepared. This would allow us to distinguish the moment we establish durability vs the moment we establish visibility
- Might this cause issues with naive clients?
- How would this interact with other replication slots? Other replication clients can see these records before the sync one that is behind. A logical slot might receive this record before a sync replica and the logical client might have information sync replica you later failover to.
- Should we require the lsn to be committed to the sync replicas before it can be visible to any other replication slots. We have no mechanism for this edge case.
- Fixing this would be a nice property to have. Maybe a GUC that is 'post_synchronous_standby_names' of standbys that can't get data until the transaction has been sent to the sync standbys. This is the same guarantee that we want to provide to clients(the issue that started this discussion)
- Consider a client that gets an error on a cancel(crashes?), reconnects and sees the data it just sent isn't visible but then becomes visble later. This can happen now because the client and server might notice the disconnection at different points in time.
- Anything that generates an extra WAL record would have performance impact. Anything that can be done just in memory would perform better but what about the crash recovery cases
- A proposed patch is in the July commit fest. This can anchor to moving the discussion to hackers.