Difference between revisions of "Fsync Errors"

From PostgreSQL wiki
(Update 2018 fsync page to reflect current status)
This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. It has sometimes been referred to as "fsyncgate 2018".
== Current status ==
As of [https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1 this PostgreSQL 12 commit], PostgreSQL will now PANIC on fsync() failure. The change was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.

Linux kernel 4.13 improved <code>fsync()</code> error handling, and the [https://linux.die.net/man/2/fsync man page for <code>fsync()</code>] is somewhat improved as well. See:
* [https://kernelnewbies.org/Linux_4.13#Improved_block_layer_and_background_writes_error_handling Kernelnewbies for 4.13]
* Particularly significant 4.13 commits include:
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5660e13d2fd6af1903d4b0b98020af95ca2d638a "fs: new infrastructure for writeback error handling and reporting"]
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6acec592c6bc9a4c3136e46430e14767b07f9f1a "ext4: use errseq_t based error handling for reporting data writeback errors"]
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=acbf3c3452c3729829fdb0e5a52fed3cce556eb2 "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"]
** [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed1e46aaf1bec6a12f4c89637f2c3ef4c70f18e "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"]
Many thanks to Jeff Layton for work done in this area.

Similar changes were made in [https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a#diff-28ed40e7ccec6683ebd46da2ca82d01c InnoDB/MySQL], [https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96 WiredTiger/MongoDB] and no doubt other software as a result of the publicity around this issue.
== Articles and news ==
* [https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com The "fsyncgate 2018" mailing list thread]
* [https://lwn.net/Articles/752063/ LWN.net article "PostgreSQL's fsync() surprise"]
* [https://lwn.net/Articles/724307/ LWN.net article "Improved block-layer error handling"]
* [https://www.usenix.org/system/files/atc20-rebello.pdf Can Applications Recover from fsync Failures?] - a USENIX 2020 paper discussing some of these topics
== Research notes and OS differences ==

Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).
  
What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? As part of its checkpointing protocol, PostgreSQL assumes that a successful call to fsync() means that *all* data for a file is on disk. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss.
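The policy that follows from these questions can be sketched in C. This is only an illustration under stated assumptions, not PostgreSQL's actual code: the helper name <code>write_durably</code> is hypothetical, and the choice to call <code>abort()</code> stands in for the PANIC behaviour described above. The point it shows is that a failed fsync() is treated as fatal and is never retried, since on some operating systems a second call might report success even though the data never reached disk.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): write len bytes to path and
 * flush them with fsync().  Returns 0 on success, or -1 if open(), write()
 * or close() fails.  If fsync() itself fails, the process aborts instead of
 * retrying, because a retried fsync() may report success on some operating
 * systems even though the earlier write-back error lost data.
 */
static int
write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    size_t      done = 0;
    int         fd;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until all bytes are queued. */
    while (done < len)
    {
        ssize_t     n = write(fd, p + done, len - done);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        done += (size_t) n;
    }

    /* Treat fsync() failure as fatal: do not retry and do not carry on. */
    if (fsync(fd) != 0)
    {
        fprintf(stderr, "fsync(\"%s\") failed: %s\n", path, strerror(errno));
        abort();
    }

    /* close() can also report an error, so surface it to the caller. */
    if (close(fd) != 0)
        return -1;
    return 0;
}
```

A caller would use it as <code>write_durably("/tmp/example.dat", "hello", 5)</code>. Errors from open() and write() are returned so the caller can decide what to do; only the fsync() failure is unconditionally fatal in this sketch.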
  
 
If you see a mistake or know something I don't, please update this document with supporting references, or ping thomas.munro@gmail.com!
 
 
 
  
 
Open source kernels:
 
  
 
Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their [https://github.com/weiss/original-bsd/blob/master/sys/kern/vfs_bio.c#L377 common ancestor], but FreeBSD [https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 changed it in 1999], and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX [https://github.com/dspinellis/unix-history-repo/blob/Bell-32V-Snapshot-Development/usr/src/sys/sys/bio.c#L196 conceptually had the same problem], but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), which was developed around the same time and in the same place as POSTGRES (1986), and its man page said it was intended for building transactional facilities. Also, fsync(1) appeared in FreeBSD 4.3 (2001): a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.
== Relevant PostgreSQL commits ==

Revision as of 02:30, 22 September 2020

This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. It has sometimes been referred to as "fsyncgate 2018".

Current status

As of this PostgreSQL 12 commit, PostgreSQL will now PANIC on fsync() failure. It was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.

Linux kernel 4.13 improved fsync() error handling, and the man page for fsync() (https://linux.die.net/man/2/fsync) is somewhat improved as well. See:

 * "fs: new infrastructure for writeback error handling and reporting"
 * "ext4: use errseq_t based error handling for reporting data writeback errors"
 * "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"
 * "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"

Many thanks to Jeff Layton for work done in this area.

Similar changes were made in InnoDB/MySQL, WiredTiger/MongoDB and no doubt other software as a result of the PR around this.

Articles and news

Research notes and OS differences

Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).

What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? As part of its checkpointing protocol, PostgreSQL assumes that a successful call to fsync() means that *all* data for a file is on disk. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss.

If you see a mistake or know something I don't, please update this document with supporting references, or ping thomas.munro@gmail.com!

Open source kernels:

Closed source kernels:

  • AIX: unknown
  • HPUX: unknown
  • Solaris: maybe the same as Illumos, but there was apparently a great VM allocator rewrite after Solaris reverted to closed source
  • Windows: unknown

Note that ZFS is likely to be a special case even on Linux, because it doesn't use the regular page cache and has special handling for failures. More information needed.

Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their common ancestor, but FreeBSD changed it in 1999, and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX conceptually had the same problem, but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), which was developed around the same time and in the same place as POSTGRES (1986), and its man page said it was intended for building transactional facilities. Also, fsync(1) appeared in FreeBSD 4.3 (2001): a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.

Relevant PostgreSQL commits