From PostgreSQL wiki

Some observations about I/O on FreeBSD, discovered while working on porting a new proposed PostgreSQL I/O subsystem. This is about AIO, VFS, UFS, kqueue and how they compare to Linux using io_uring and XFS. You can read a bit about that PostgreSQL project here, but the short version is that we're trying to give PostgreSQL optional direct I/O support and optional native asynchronous I/O support, introducing more concurrency and scatter/gather I/O while we're at it, with a fallback based on worker processes where no native AIO is available.

These are just raw notes, I haven't got much experience with several of these subsystems so if I've got things wrong here, or projects exist to address some of these things, I'd love to know about that. Thomas Munro <tmunro@{postgresql,freebsd}.org>.


Note that this brain dump is primarily about UFS, because it's in the same general family as XFS (I could speculate that SGI made XFS *because* they wanted to solve some of the problems I mention below, but I'm just guessing; they did invent O_DIRECT and tout I/O concurrency as a key feature when I was a student cutting my teeth on IRIX systems...).

ZFS is great for databases and better at some of the things I mention below than UFS (I'll try to note where in parens, in cases that I know about), but it also doesn't have direct I/O (yet, though see PR 10018), and, on paper at least, a simple overwrite system ought to provide the highest possible performance for a database that is already doing some of the same sorts of things as ZFS itself. For example, you can write a transaction log with higher TPS if you have non-overlapping, carefully block-aligned writes going out concurrently, but not if your filesystem is serialising the writes to put them in its own transaction log; maybe this can be done through ZFS too, but it's not obvious.

Furthermore, databases believe that logically sequential blocks are also physically sequential, and include this in their query planner costing (see random_page_cost vs seq_page_cost). That's obviously a partial fiction on modern systems at various levels (extents, flash block relocation, underlying log-structured cloud storage, ...), but it remains approximately true that COW systems' sequential scan performance tends to be more affected by random update history. (Admittedly, in-place systems have the inverse problem when writing.)

Consequences of block size

PostgreSQL file I/O is always block aligned (like MySQL, Oracle, ...), but UFS's default block size is much larger than PostgreSQL's 8KB blocks. That's just a small matter of asking newfs for 8KB blocks (or, for MySQL, IIRC 16KB blocks; a lot of historical filesystems and databases from the 80s-90s seemed to use 8KB, but FreeBSD decided to double it a couple of times; other BSDs doubled it once). I'm not exactly sure what new problems it creates to run with 8KB blocks these days, but it avoids a bunch of unnecessary write amplification and read-before-write activity.
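For reference, matching the filesystem block size to PostgreSQL's at newfs time might look like this (the device name is a placeholder; -b sets the block size and -f the fragment size, keeping the traditional 8:1 block-to-fragment ratio):

```
# hypothetical device; modern UFS2 defaults are 32KB blocks / 4KB fragments
newfs -b 8192 -f 1024 /dev/ada0p2
```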

When XFS moved from IRIX to Linux, it lost the ability to use block sizes larger than the 4KB page size (though it can use smaller ones), an ability traditionally seen in Unix filesystems. FreeBSD obviously retains the ability to use a wide range of block sizes. I have speculated that it might be possible to take advantage of that to pass 8KB block atomicity guarantees through the storage stack, but that's a separate can of worms; the prize there would be the ability for PostgreSQL (and MySQL) to avoid their current need to write out all data twice, sort of (!) (see FreeBSD/AtomicIO for more on that pipe dream).

Cache control

When running with the device write cache enabled (WCE, typical of consumer storage and some cloud storage options), cache control commands/flags should be used to control write-through. Without that, users are exposed to data loss; the alternative, disabling WCE, costs performance. IIUC, BIO_FLUSH is currently used to protect filesystem metadata (I have not studied this code, but I see that in ffs_softdep.c), but not user data (like PostgreSQL calling pwrite() and then fdatasync()). To keep up with changes seen on other systems over the past decade, I suppose FreeBSD might want to consider the following changes:

  • fsync(), fdatasync() could send BIO_FLUSH if the device write cache is enabled and it's supported. This is the default behaviour on Linux these days. (ZFS does this already, UFS doesn't except perhaps as noted.)
  • ffs_write() with O_SYNC/O_DSYNC could instead set a hypothetical new BIO_FUA flag on BIO_WRITE commands (to be converted to SCSI/ATA/NVMe FUA), or if the storage doesn't support that, a BIO_FLUSH fallback could be used, if the device write cache is enabled. (Googling for B_FUA and BIO_FUA in the hope of finding existing work revealed that Apple has gone this way with their BSD-derived code, but I'm not sure if they're actually using it.) This avoids flushing the whole drive cache, and just flushes the writes carrying the flag.

Synchronous I/O

  • O_SYNC/O_DSYNC causes ffs_write() to write out *individual blocks* synchronously, losing the natural clustering from large pwrite() system calls. This makes O_DSYNC unsuitable for writing database transaction logs in general, even though it might beat pwrite() + fdatasync() in benchmarks that happen to use single-block writes, since it skips a system call. For this reason, after I added O_DSYNC support to FreeBSD 13, I also modified PostgreSQL not to use it by default for its transaction log yet: it defaults to wal_sync_method=fdatasync, though you can still ask for wal_sync_method=open_datasync, which is good for small transactions but terrible for large transactions where we might write many blocks at once.

Direct I/O

  • ffs_readraw() doesn't work for preadv() with iovcnt > 1, so data is copied through the buffer cache. PostgreSQL would like to be able to make extensive use of scatter/gather I/O for moving data between storage and its own buffer pool via DMA, when running in direct I/O mode.
  • There is no corresponding "raw" path for O_DIRECT writes.
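A minimal sketch of the kind of scatter/gather read meant above (the helper name and the assumption that buffers must be block-aligned for direct I/O are mine): with O_DIRECT and iovcnt > 1, UFS today copies through the buffer cache rather than taking the ffs_readraw() path.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Read two 8KB blocks into two separate (block-aligned) buffers with a
 * single preadv() call, optionally with O_DIRECT where available. */
static ssize_t read_two_blocks(const char *path, int use_direct,
                               void *buf1, void *buf2, off_t offset)
{
    int flags = O_RDONLY;
#ifdef O_DIRECT
    if (use_direct)
        flags |= O_DIRECT;
#else
    (void) use_direct;
#endif
    int fd = open(path, flags);
    if (fd < 0)
        return -1;
    struct iovec iov[2] = {
        { .iov_base = buf1, .iov_len = BLCKSZ },
        { .iov_base = buf2, .iov_len = BLCKSZ },
    };
    ssize_t n = preadv(fd, iov, 2, offset);
    close(fd);
    return n;
}
```

The goal would be for this, with O_DIRECT, to reach the device as one scatter/gather I/O with DMA straight into the two buffers.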

Concurrent I/O

  • UFS doesn't set MNTK_SHARED_WRITES, so all writes are serialised. Ideally they should run concurrently for non-overlapping block ranges. This seems to be more important for direct I/O, where you eat a full I/O sleep. Besides whatever (potentially extremely) complicated internal interlocking problems that would pose, perhaps there are POSIX-compliance reasons to serialise writes on a file? I'm not sure about that, but I recall that AIX requires you to opt out of that serialisation explicitly with O_CONCURRENT. XFS (historically on IRIX and today on Linux) allows concurrency at least when you asked for O_DIRECT, thereby exiting POSIX's jurisdiction (if indeed that is the reason for it; I don't know). (ZFS allows it, but doesn't have direct I/O.)
  • Likewise for fsync() and fdatasync() (ie when not using O_DIRECT + O_DSYNC); these block writers and readers with a vnode-level LK_EXCLUSIVE lock while the I/O is in progress. (But not on ZFS.)

Asynchronous I/O

  • I wish aio_read() and aio_write() could use the fast path (that is, not use aiod worker threads) for regular UFS files opened with O_DIRECT. Ideally, an aio_readv(iovcnt=16) call on a file opened with O_DIRECT should be asynchronous all the way down to the driver as a single I/O with direct DMA via the ffs_readraw path (as it is with io_uring in qualifying cases). Also, I'd like the moon on a stick.
  • Since they run through the regular VFS interfaces called by aiod threads, aio_fsync() and aio_fwrite() run with the vnode exclusively locked, for UFS (though not for ZFS).
  • aio_{read,write}v() just landed in FreeBSD 13, which is great (these are non-standard extensions to POSIX; I haven't seen them on any other OS). It'd be nice to be able to use those in lio_listio() too, to start multiple multi-segment I/Os with one system call (as you can with io_uring on Linux). Likewise, aio_fsync() should be listio-able.
  • An observation: our fsync() and fdatasync() should actually wait for AIOs on that descriptor to complete, according to POSIX (and there is a comment in vfs_syscalls.c:kern_fsync() saying so). They don't ... but it's actually good to be able to not wait for them too, and in some ways it'd be good to have aio_fsync() that's able to opt out of waiting for AIOs to complete!
  • I wish aio_error() didn't make a system call. The answer should be readily available from the user space aiocb.
  • I wish aio_return() didn't make a system call. The answer should be readily available from the user space aiocb, but here things are a bit more complicated than for aio_error(): aio_return() frees a kernel object, but if we start freeing them eagerly, then aio_waitcomplete() (a FreeBSD extension) wouldn't work any more. So you'd probably need a new flag when submitting the I/O that makes aio_return() user space-only, but makes the I/O invisible to aio_waitcomplete(), or the inverse.
  • If aio_error() and aio_return() didn't require system calls, that'd imply that all information is in userspace, which might make it possible for AIOs to be initiated by one process and have the completion waited on and consumed by another (io_uring supports that and the proposed AIO feature makes use of it). That said, PostgreSQL really should switch to threads so this is a non-problem... no timeline for that yet...
  • I wish I could read completion information from a kqueue without entering the kernel when it is available. You could imagine a user space ring buffer (like io_uring) of kevent objects, with a new wrapper function kuevent() that would try to read from a user space ring buffer, and enter the kernel only when necessary. I don't really mind entering the kernel just to read kevent objects, as long as I get a decent ratio of actual events to system calls. In particular, I'd like a cheap way to try to poll for new events opportunistically with timeout == 0, that only enters the kernel if there is at least one event ready. One much simpler way to achieve that (ie without a user space kevent ring) might be to have just an event counter in user space. (The reason it's not enough to have a cheap non-syscall aio_error() to achieve cheap I/O completion polling is that I might have many I/Os in flight and I'd have to poll all of them, whereas a ring buffer of output events requires checking just one word.)
  • It might be interesting for kqueue() to allow new I/Os to be submitted, too (replacing lio_listio()). Then I could learn about I/Os that have been completed in the same system call as I submit later ones.
  • An alternative to changing kqueue() would be to introduce a new independent kind of user space queue that AIO requests can be told to write their completion events into. It could be done with atomic integers for the head (advanced by the kernel) and tail (advanced by user space), which could be non-blocking-polled by reading a word, and waited on when appropriate using _umtx_op(UMTX_OP_WAIT) for race-free waits on the counter when entering the kernel. (The do-it-with-kqueue variant, either the "userspace counter" version or the "userspace array of kevents" version, would need something conceptually like that too, of course.)
  • Not FreeBSD's fault, but while I'm getting things off my chest: I wish the other systems that have both AIO and kqueue() had connected them together. Namely macOS and NetBSD. Currently the PostgreSQL AIO prototype is using signals for POSIX AIO mode because it's portable (though it'd also be possible to do non-blocking polls of all outstanding I/Os with aio_suspend(timeout = 0), but that's also really inefficient among other problems). I'd like to use kqueue.
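To illustrate the batching point from the lio_listio() bullet above: today you can submit several single-segment requests with one system call, just not multi-segment (aio_readv-style) ones. A sketch, with the helper name and sizes invented for illustration:

```c
#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ    8192
#define MAXBLOCKS 16

/* Read nblocks consecutive blocks with a single submission system call:
 * one single-segment aiocb per block, then wait for all to complete.
 * With listio-able multi-segment requests, this could be one aiocb. */
static ssize_t batch_read(int fd, char *buf, int nblocks)
{
    struct aiocb cbs[MAXBLOCKS];
    struct aiocb *list[MAXBLOCKS];

    if (nblocks > MAXBLOCKS)
        return -1;
    memset(cbs, 0, sizeof(cbs));
    for (int i = 0; i < nblocks; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = buf + (size_t)i * BLCKSZ;
        cbs[i].aio_nbytes = BLCKSZ;
        cbs[i].aio_offset = (off_t)i * BLCKSZ;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }
    /* One system call submits all of them; LIO_WAIT blocks for the lot. */
    if (lio_listio(LIO_WAIT, list, nblocks, NULL) != 0)
        return -1;
    ssize_t total = 0;
    for (int i = 0; i < nblocks; i++) {
        ssize_t n = aio_return(&cbs[i]);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}
```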
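The user space completion queue idea from a couple of bullets up could look something like this sketch, using C11 atomics for the lock-free fast path. Everything here is hypothetical: the kernel side is played by an ordinary function, and the _umtx_op() slow path for blocking waits is only hinted at in a comment.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 64 /* must be a power of two */

struct completion {
    uintptr_t request; /* identifies the I/O, eg an aiocb pointer */
    long result;       /* what aio_return() would have said */
};

struct completion_ring {
    _Atomic uint32_t head; /* advanced by the producer (the kernel) */
    _Atomic uint32_t tail; /* advanced by the consumer (user space) */
    struct completion slots[RING_SIZE];
};

/* Producer side: in the real design this would be the kernel writing a
 * completion event and waking any _umtx_op() waiter on 'head'. */
static bool ring_push(struct completion_ring *r, struct completion c)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false; /* full */
    r->slots[head % RING_SIZE] = c;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: non-blocking poll costs one atomic load, no system
 * call.  On false, a real implementation would fall back to something
 * like _umtx_op(&r->head, UMTX_OP_WAIT_UINT, head, ...) to sleep
 * race-free until the kernel advances the counter. */
static bool ring_pop(struct completion_ring *r, struct completion *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false; /* empty: nothing completed yet */
    *out = r->slots[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

This is exactly the "check one word" property mentioned above: an empty poll touches only the head counter, no matter how many I/Os are in flight.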