This article covers the current status, history, and OS and OS version differences relating to the circa 2018 fsync() reliability issues discussed on the PostgreSQL mailing list and elsewhere. These issues have sometimes been referred to as "fsyncgate 2018".
As of this PostgreSQL 12 commit, PostgreSQL will now PANIC on fsync() failure. It was backpatched to PostgreSQL 11, 10, 9.6, 9.5 and 9.4. Thanks to Thomas Munro, Andres Freund, Robert Haas, and Craig Ringer.
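The policy described above can be sketched in a few lines of C. This is a minimal illustration, not PostgreSQL's actual code: the names `try_flush` and `flush_or_panic` are hypothetical, and `abort()` merely stands in for a PANIC followed by crash recovery.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
typedef enum { FLUSH_OK, FLUSH_FAILED } flush_result;

/* Call fsync() exactly once and report the outcome; errno is left as set. */
flush_result try_flush(int fd)
{
    return fsync(fd) == 0 ? FLUSH_OK : FLUSH_FAILED;
}

/*
 * Treat any fsync() failure as fatal instead of retrying. After a failure
 * the kernel may already have marked the dirty buffers clean, so a second
 * fsync() could report success without the data being on disk.
 */
void flush_or_panic(int fd, const char *label)
{
    assert(label != NULL);
    if (try_flush(fd) == FLUSH_FAILED) {
        fprintf(stderr, "PANIC: could not fsync %s: %s\n",
                label, strerror(errno));
        abort(); /* stand-in for PANIC and recovery from the write-ahead log */
    }
}
```

The point of aborting rather than looping is that recovery rewrites the data from a source the program still trusts, instead of relying on page cache contents whose state is unknown after the error.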
Linux kernel 4.13 improved fsync() error handling and the man page for fsync() is somewhat improved as well. See:
- Kernelnewbies for 4.13
- Particularly significant 4.13 commits include:
- "fs: new infrastructure for writeback error handling and reporting"
- "ext4: use errseq_t based error handling for reporting data writeback errors"
- "Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors"
- "mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error"
Many thanks to Jeff Layton for work done in this area.
A proposed follow-up change to PostgreSQL was discussed in the thread "Refactoring the checkpointer's fsync request queue". The patch that was committed did not incorporate the proposed file-descriptor passing changes. Discussion is still open on some additional safeguards that may use file system error counters and/or filesystem-wide flushing.
Articles and news
- The "fsyncgate 2018" mailing list thread
- LWN.net article "PostgreSQL's fsync() surprise"
- LWN.net article "Improved block-layer error handling"
- Can Applications Recover from fsync Failures? - a USENIX 2020 paper discussing some of these topics
Research notes and OS differences
Here is a summary of what we have learned so far about the behaviour of the fsync() system call in the presence of write-back errors on various operating systems of interest to PostgreSQL users (if our build farm is a reliable survey).
What we want to know is: when can write-back errors be forgotten and go unreported to userspace? Arbitrarily, if errors are detected during asynchronous write-back? What about errors that occurred before you opened the file and got a new file descriptor and called fsync()? If fsync() reports failure and then you call fsync() again, can it falsely report success? PostgreSQL believes that a successful call to fsync() means that *all* data for a file is on disk, as part of its checkpointing protocol. Apparently that is not the case on some operating systems, leading to the potential for unreported data loss.
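To make the descriptor question concrete, here is a minimal sketch of the conservative pattern: one descriptor stays open from the first write() until after fsync(), every step is checked, and fsync() is never retried. The helper name `write_and_sync` is hypothetical and this is not how PostgreSQL structures its I/O; it only illustrates the assumption that errors are most reliably reported to a descriptor that was open when they occurred.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and flush them using a
 * single descriptor. Returns 0 on success, or -1 with errno set. On
 * failure the caller should treat the file contents as unknown rather
 * than call fsync() again.
 */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: try again */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }

    /* One fsync() attempt on the descriptor that did the writing. */
    if (fsync(fd) != 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);               /* close() can also report an error */
}
```

A scheme that closes the file after writing and lets another process open and fsync() it later does not have this property, which is why the per-kernel differences listed below matter.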
If you see a mistake or know something I don't, please update this document with supporting references, or ping email@example.com!
Open source kernels
- Darwin/macOS: buffers are invalidated, code similar to NetBSD
- DragonflyBSD: not analysed -- the source of brelse might tell us
- FreeBSD: buffers remain dirty (and from version 11.1 on, they are dropped on failure after the device goes away) so future fsync() calls will try again and presumably fail; see a recent testing report, a 10 year old testing report, and a commit from over 20 years ago fixing the issue
- Illumos: writes are retried, at least in the case of asynchronous write-back. Not yet clear to me whether failure provoked by a synchronous fsync() call leaves buffers valid and dirty.
- Linux < 4.13: fsync() errors can be lost in various ways; also buffers are marked clean after errors, so retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure
- Linux 4.13 and 4.15: fsync() only reports writeback errors that occurred after you called open() so our schemes for closing and opening files LRU-style and handing fsync() work off to the checkpointer process can hide write-back errors; also buffers are marked clean after errors so even if you opened the file before the failure, retrying fsync() can falsely report success and the modified buffer can be thrown away at any time due to memory pressure.
- Linux 4.14 and Linux >= 4.16: the write-back error counter is initialised differently so that somebody gets the inode's first error even if the file was closed and opened in between, but you still only get the error once (so retrying fsync() is not OK) and the error can be forgotten if the inode falls out of the inode cache (unlikely since all file descriptors referencing the inode must be closed first, and close calls fsync); buffers are still thrown away (either immediately or on memory pressure, depending on choice of fs) so you might read back an older version of the page than you most recently wrote
- NetBSD: buffers are invalidated here so future fsync() calls may return success despite data loss; there may also be other problems according to a netbsd.org bug report that was triggered by our discussion
- OpenBSD: buffers are invalidated, code similar to NetBSD; OpenBSD hackers were pinged for comment, and there is a new OpenBSD hackers thread; UPDATE: a recent commit changed the behaviour, analysis needed; the man page was updated to say "To guard against potential inconsistency, future calls will continue failing until all references to the file are closed.", which is good as long as someone holds the file open, but that isn't guaranteed in PostgreSQL (it probably should be)
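The three Linux entries in the list above can be restated as a small lookup. This is only a sketch of the summary as written here: the enum and function names are hypothetical, it considers upstream version numbers only, and it assumes nothing about distribution kernels that backport patches.

```c
#include <assert.h>

/* Hypothetical labels for the three Linux behaviours listed above. */
typedef enum {
    ERRORS_MAY_BE_LOST,         /* Linux < 4.13 */
    ERRORS_SINCE_OPEN_ONLY,     /* Linux 4.13 and 4.15 */
    FIRST_ERROR_REPORTED_ONCE   /* Linux 4.14 and Linux >= 4.16 */
} linux_fsync_class;

/* Map an upstream kernel version to the behaviour class from the list. */
linux_fsync_class classify_linux_fsync(int major, int minor)
{
    assert(major >= 0 && minor >= 0);

    if (major < 4 || (major == 4 && minor < 13))
        return ERRORS_MAY_BE_LOST;
    if (major == 4 && (minor == 13 || minor == 15))
        return ERRORS_SINCE_OPEN_ONLY;
    return FIRST_ERROR_REPORTED_ONCE;
}
```

In none of the three classes is retrying fsync() after a failure safe, since buffers are marked clean or thrown away after an error in every case described.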
Closed source kernels
- AIX: unknown
- HPUX: unknown
- Solaris: maybe the same as Illumos, but there was apparently a great VM allocator rewrite after Solaris reverted to closed source
- Windows: unknown
Note that ZFS is likely to be a special case even on Linux, because it doesn't use the regular page cache and has special handling for failures. More information needed.
There is ongoing discussion regarding flushing and error handling in the Linux kernel, such as that occurring in the fsinfo patch sets.
History and notes
Archeological notes: All BSD-derived systems probably inherited that brelse() logic from their common ancestor, but FreeBSD changed it in 1999, and DragonflyBSD forked from FreeBSD in 2003 but apparently rewrote the bio code significantly. Darwin inherited code directly from ancient BSD via NeXT, and later took more code from FreeBSD but apparently not the behaviour discussed above. Ancient Bell UNIX conceptually had the same problem, but since it didn't have fsync(), that's somewhat moot. According to various man pages, fsync() was introduced by 4.2BSD (1983, not sure if fsync was added a bit later), developed around the same time and same place as POSTGRES (1986), and its man page said it was for building transactional facilities. Also, fsync(1) appeared in FreeBSD 4.3 (2001), a command line tool that lets you sync a named file, which probably only makes sense if you have a certain model of how I/O errors and buffering work.