From PostgreSQL wiki

Some observations about I/O on FreeBSD, discovered while working on porting a new proposed PostgreSQL I/O subsystem. This is about AIO, VFS, UFS, kqueue and how they compare to Linux using io_uring and XFS. You can read a bit about that PostgreSQL project here, but the short version is that we're trying to give PostgreSQL optional direct I/O support and optional native asynchronous I/O support, introducing more concurrency and scatter/gather I/O while we're at it, with a fallback based on worker processes where no native AIO is available.

These are just raw notes, I haven't got much experience with several of these subsystems so if I've got things wrong here, or projects exist to address some of these things, I'd love to know about that. Thomas Munro <tmunro@{postgresql,freebsd}.org>.


Note that this brain dump is primarily about UFS, because it's in the same general family as XFS (I could speculate that SGI made XFS *because* they wanted to solve some of the problems I mention below, but I'm just guessing; they did invent O_DIRECT and tout I/O concurrency as a key feature when I was a student cutting my teeth on IRIX systems...).

ZFS is great for databases and better at some of the things I mention below than UFS (I'll try to note where in parens, in cases that I know about), but it also doesn't have direct I/O (yet, though see PR 10018), and, on paper at least, a simple overwrite system ought to provide the highest possible performance for a database that is already doing some of the same sorts of things as ZFS itself. For example, you can write a transaction log with higher TPS if you have non-overlapping, carefully block-aligned writes going out concurrently, but not if your filesystem is serialising the writes to put them in its own transaction log; maybe this can be done through ZFS too, but it's not obvious.

Furthermore, databases believe that logically sequential blocks are also physically sequential, and include this in their query planner costing (see random_page_cost vs seq_page_cost). That is obviously a partial fiction on modern systems at various levels (extents, flash block relocation, underlying log structured cloud storage, ...), but it remains approximately true that COW systems' sequential scan performance tends to be more affected by random update history. (Admittedly, in-place systems have the inverse problem when writing.)

Consequences of block size

PostgreSQL file I/O is always block aligned (like MySQL, Oracle, ...), but UFS's default block size is much larger. That's just a small matter of asking newfs for 8KB blocks (or for MySQL, IIRC 16KB blocks; a lot of historical filesystems and databases from the 80s-90s seemed to use 8KB, but FreeBSD decided to double it a couple of times; other BSDs doubled it once). I'm not exactly sure what new problems it creates to run with 8KB blocks these days, but it avoids a bunch of unnecessary write amplification and read-before-write activity.
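
To illustrate the alignment invariant, here's a minimal sketch in Python (its os module wraps the same pread/pwrite syscalls; the temp file and block number are demo-only). A matching UFS filesystem would be made with something like newfs -b 8192, with a correspondingly smaller fragment size via -f.

```python
import os
import tempfile

BLKSZ = 8192  # PostgreSQL's block size; newfs would be told to match

fd, path = tempfile.mkstemp()
try:
    # Database reads and writes are always whole blocks at
    # block-aligned offsets: block number 3 lives at byte 3 * BLKSZ,
    # so an 8KB filesystem block is never partially overwritten.
    blockno = 3
    n = os.pwrite(fd, b"\x00" * BLKSZ, blockno * BLKSZ)
finally:
    os.close(fd)
    os.unlink(path)
```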

When XFS moved from IRIX to Linux, it lost the ability to use filesystem blocks larger than the 4KB page size (though it can use smaller ones), a capability traditionally seen in Unix filesystems. FreeBSD obviously retains the ability to use a wide range of block sizes. I have speculated that it might be possible to take advantage of that to pass 8KB block atomicity guarantees through the storage stack, but that's a separate can of worms; the prize there would be the ability for PostgreSQL (and MySQL) to avoid their current need to write out all data twice, sort of (!) (see FreeBSD/AtomicIO for more on that pipe dream).

Cache control

When running with the device write cache enabled (WCE; consumer storage, or some cloud storage options), cache control commands/flags should be used to control write-through. Without that, users may be exposed to data loss, or to lower performance if they disable WCE to compensate. IIUC, BIO_FLUSH is currently used to protect filesystem metadata (I have not studied this code, but I see that in ffs_softdep.c), but not user data (like PostgreSQL calling pwrite() and then fdatasync()). To keep up with changes seen on other systems over the past decade, I suppose FreeBSD might want to consider the following changes:

  • fsync(), fdatasync() could send BIO_FLUSH if the device write cache is enabled and it's supported. This is the default behaviour on Linux these days. (ZFS does this already, UFS doesn't except perhaps as noted.) Proof-of-concept patch
  • ffs_write() with O_SYNC/O_DSYNC could instead set a hypothetical new BIO_FUA flag on BIO_WRITE commands (to be converted to SCSI/ATA/NVMe FUA), or if the storage doesn't support that, a BIO_FLUSH fallback could be used, if the device write cache is enabled. (Googling for B_FUA and BIO_FUA in the hope of finding existing work revealed that Apple has gone this way with their BSD-derived code, but I'm not sure if they're actually using it.) This avoids flushing the whole drive cache, and just flushes the writes carrying the flag.
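
For reference, this is the user-data durability pattern in question, sketched in Python for portability (os.pwrite/os.fdatasync wrap the same syscalls; the temp file is demo-only). On Linux the fdatasync() here also reaches the device cache when WCE is on; the observation above is that on UFS it currently doesn't.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    # The usual database pattern for user data: write it, then ask the
    # OS for durability.  For this to survive power loss with the
    # drive's write cache enabled, fdatasync() needs to issue a cache
    # flush (BIO_FLUSH) or the write needs FUA semantics.
    n = os.pwrite(fd, b"\x01" * 8192, 0)
    os.fdatasync(fd)
finally:
    os.close(fd)
    os.unlink(path)
```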

Synchronous I/O

  • O_SYNC/O_DSYNC causes ffs_write() to write out *individual blocks* synchronously, losing the natural clustering from large pwrite() system calls. This makes O_DSYNC unsuitable for writing database transaction logs in general, even though it might beat pwrite() + fdatasync() in benchmarks that happen to use single block writes, since it skips a system call. For this reason, after I added O_DSYNC support to FreeBSD 13, I also modified PostgreSQL not to use it for its transaction log by default yet (so it defaults to wal_sync_method=fdatasync, though you can still ask for wal_sync_method=open_datasync; that's good for small transactions but terrible for large transactions where we might write many blocks at once).
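
The clustering being lost looks like this, sketched in Python for portability (os.O_DSYNC and os.pwritev wrap the same flags and syscalls; file and contents are demo-only). One call submits four consecutive blocks; the complaint above is that ffs_write() then pushes them to disk synchronously one block at a time.

```python
import os
import tempfile

BLKSZ = 8192
tmpfd, path = tempfile.mkstemp()
os.close(tmpfd)

# O_DSYNC: every write is durable on return, no fdatasync() needed.
fd = os.open(path, os.O_RDWR | os.O_DSYNC)
try:
    # Four consecutive WAL blocks in a single gather-write; ideally
    # this goes to storage as one clustered synchronous I/O.
    bufs = [bytes([i]) * BLKSZ for i in range(4)]
    n = os.pwritev(fd, bufs, 0)
finally:
    os.close(fd)
    os.unlink(path)
```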

Direct I/O

  • ffs_readraw() doesn't work for preadv() with iovcnt > 1, so data is copied through the buffer cache. PostgreSQL would like to be able to make extensive use of scatter/gather I/O for moving data between storage and its own buffer pool via DMA, when running in direct I/O mode.
  • There is no corresponding "raw" path for O_DIRECT writes.
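
A sketch of the scatter/gather shape PostgreSQL wants in direct I/O mode, again in Python for portability (os.preadv/os.pwritev wrap the same syscalls). It falls back to buffered I/O where the filesystem rejects O_DIRECT (e.g. tmpfs on Linux), and uses anonymous mmap regions purely as a portable way to get page-aligned buffers:

```python
import mmap
import os
import tempfile

BLKSZ = 8192
tmpfd, path = tempfile.mkstemp()
os.close(tmpfd)
try:
    # Ask for direct I/O; fall back to buffered if the filesystem
    # refuses at open time.
    try:
        fd = os.open(path, os.O_RDWR | os.O_DIRECT)
    except OSError:
        fd = os.open(path, os.O_RDWR)

    # Direct I/O requires aligned buffers; anonymous mmap regions are
    # page aligned, so they can serve as DMA sources/targets.
    wbufs = [mmap.mmap(-1, BLKSZ) for _ in range(2)]
    for i, b in enumerate(wbufs):
        b[:] = bytes([i + 1]) * BLKSZ
    os.pwritev(fd, wbufs, 0)

    # One scatter read of two blocks into two separate buffers: this is
    # the preadv(iovcnt > 1) case that falls off the ffs_readraw() path
    # and gets copied through the buffer cache instead.
    rbufs = [mmap.mmap(-1, BLKSZ) for _ in range(2)]
    n = os.preadv(fd, rbufs, 0)
    os.close(fd)
finally:
    os.unlink(path)
```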

Concurrent I/O

  • UFS doesn't set MNTK_SHARED_WRITES, so all writes are serialised for each file. Ideally they should run concurrently for non-overlapping block ranges (and we already have range locking to make that happen). This seems to be more important for direct I/O, where you eat a full I/O sleep. Besides whatever (potentially complicated) internal interlocking problems that would pose, perhaps there are POSIX-compliance reasons to serialise writes on a file? I'm not sure about that, but I recall that AIX requires you to opt out of that serialisation explicitly with O_CONCURRENT. XFS (historically on IRIX and today on Linux) allows concurrency at least when you ask for O_DIRECT, thereby exiting POSIX's jurisdiction (if indeed that is the reason for it; I don't know). (ZFS allows it, but doesn't have direct I/O.)
  • Likewise for fsync() and fdatasync() (i.e. when not using O_DIRECT + O_DSYNC); these block writers and readers with a vnode-level LK_EXCLUSIVE lock while the I/O is in progress. (But not on ZFS.)
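
The workload shape that loses out under per-vnode serialisation, sketched in Python (threads stand in for PostgreSQL backends; the file is demo-only). Each thread writes a disjoint block range, so with range locking these could proceed in parallel; with an exclusive vnode lock they queue up:

```python
import os
import tempfile
import threading

BLKSZ = 8192
fd, path = tempfile.mkstemp()

def writer(blockno):
    # Non-overlapping block ranges: safe to run concurrently under
    # range locking, but without MNTK_SHARED_WRITES the UFS vnode
    # lock serialises them (painful when each one sleeps on direct I/O).
    os.pwrite(fd, bytes([blockno]) * BLKSZ, blockno * BLKSZ)

threads = [threading.Thread(target=writer, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
data = os.pread(fd, BLKSZ, 2 * BLKSZ)
os.close(fd)
os.unlink(path)
```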

Asynchronous I/O

  • I wish aio_read() and aio_write() could use the fast path (that is, not use aiod worker threads) for regular UFS files opened with O_DIRECT. Ideally, an aio_readv(iovcnt=16) call on a file opened with O_DIRECT should be asynchronous all the way down to the driver as a single I/O with direct DMA via the ffs_readraw path (as it is with io_uring in qualifying cases). Also, I'd like the moon on a stick.
  • Since they run through the regular VFS interfaces called by aiod threads, aio_fsync() and aio_fwrite() run with the vnode exclusively locked, for UFS (though not for ZFS).
  • DONE: aio_{read,write}v() just landed in FreeBSD 13, which is great (these are non-standard extensions to POSIX; I haven't seen them on any other OS). It'd be nice to be able to use those in lio_listio() too, to start multiple multi-segment I/Os with one system call (as you can with io_uring on Linux).
  • Likewise, aio_fsync() should be listio-able. But that involves making policy decisions about dependencies. Probably easy, but discussion needed.
  • An observation: our fsync() and fdatasync() should actually wait for AIOs on that descriptor to complete, according to POSIX (and there is a comment in vfs_syscalls.c:kern_fsync() saying so). They don't ... but it's actually good to be able to not wait for them too, and in some ways it'd be good to have aio_fsync() that's able to opt out of waiting for AIOs to complete!
  • I want a user space aio_waitcomplete() queue, or alternatively a user space kqueue completion queue. This way, we can keep our ratio of system calls to I/Os closer to 0 than 1 (as they can on Linux). As a by-product, aio_suspend(), aio_error() and aio_return() should be able to avoid entering the kernel in common cases, for users of those interfaces (though aio_waitcomplete() and kevent() are much better, since they don't require polling all outstanding I/Os). XXX I am working on this.
  • kevent() should be able to give you the result, without having to call aio_return() to free kernel resources. I have prototypes of this working for FreeBSD and being used by PostgreSQL, not quite there yet...
  • kqueue() could in theory also have a user space queue in front of it. I had working prototypes of user space completion queues for aio_waitcomplete() being used by PostgreSQL but decided that kevent() would be a more useful and ambitious plan...
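
To make the submit-many/reap-many shape concrete without FreeBSD-only syscalls, here is a toy sketch in Python of the worker-based fallback mentioned at the top (the WorkerAIO name and its methods are invented for this demo): submissions go to a worker, and completions land on a user space queue that can be drained with no system call per completion, which is the ratio-of-syscalls-to-I/Os point above.

```python
import os
import queue
import tempfile
import threading

BLKSZ = 8192

class WorkerAIO:
    """Toy stand-in for native AIO: a worker thread performs each
    scatter/gather read (like aio_readv()) and posts the result to a
    user space completion queue (like aio_waitcomplete() backed by a
    shared-memory queue would)."""

    def __init__(self):
        self.submissions = queue.Queue()
        self.completions = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            tag, fd, bufs, offset = self.submissions.get()
            self.completions.put((tag, os.preadv(fd, bufs, offset)))

    def submit_readv(self, tag, fd, bufs, offset):
        # Submission is queue traffic, not a per-I/O system call.
        self.submissions.put((tag, fd, bufs, offset))

    def wait_complete(self):
        # Cheap when completions are already queued; no polling of all
        # outstanding I/Os as with aio_suspend()/aio_error().
        return self.completions.get()

# Two I/Os in flight at once, reaped from the completion queue.
fd, path = tempfile.mkstemp()
os.pwrite(fd, b"A" * BLKSZ + b"B" * BLKSZ, 0)
aio = WorkerAIO()
buf0, buf1 = bytearray(BLKSZ), bytearray(BLKSZ)
aio.submit_readv("io1", fd, [buf0], 0)
aio.submit_readv("io2", fd, [buf1], BLKSZ)
results = dict(aio.wait_complete() for _ in range(2))
os.close(fd)
os.unlink(path)
```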