FreeBSD/AtomicIO

From PostgreSQL wiki

Exposing atomic write size

I'd like to allow user space to find out the alignment requirements and maximum size for atomic writes to disk. After a crash including power failure, writes conforming to those parameters should be either entirely present or absent.

Motivation

Databases like PostgreSQL and MySQL typically write out all their data twice (!). That's because their crash recovery systems require database page-level atomicity, but they can't trust the storage to provide that. PostgreSQL typically uses 8kB pages, and MySQL typically uses 16kB pages, but they assume that the atomicity of the storage is only 512 bytes, the size of a sector in ancestral times. The reason they need atomicity at their own page level is that they use "physiological logging" for crash recovery, which requires all pages to be internally consistent as of *some* moment in time, in order to be able to replay changes to them. Therefore, "torn pages" (also known as "fractured blocks" in some other database communities) must be excluded somehow.

By writing the data twice with a synchronisation barrier in between, atomicity can be created: after a power-loss crash that tears a page, at least one of the two copies must be a complete image, and a checksum makes it clear which one. In PostgreSQL, the extra copy of the page goes into the write-ahead log and this feature is known as "full page writes". In MySQL, the extra copy of the page goes into a circular "double write buffer", a small disk file that is separate from the main data files.

You can dramatically increase the performance of these databases by turning off full_page_writes (PostgreSQL) or innodb_doublewrite (MySQL). When is it safe to do that? Some cases where it *might* be, if you configure things just right: ZFS, which has its own scheme for building atomic transactions and has a configurable block (record) size, modern flash storage (see AWUPF for advertised atomic write unit [in case of] power failure), and various cloud storage systems which might have log structured schemes or transactional intent logging schemes under the covers.

If the information is exposed, databases could make that setting automatic, or at least warn if it's set inappropriately.

Expose it the way we expose st_blksize?

Here is one idea for how to achieve that. There are plenty of other ways it could be done with new syscalls, of course, but st_blksize seems like something in the same ballpark, and it's nice to be able to ask on a per file basis with fstat(2), so that user space programs don't need to know about mount points and suchlike. No patches yet; this is just a vapourware idea! Perhaps the fields could be called something like:

   blksize_t st_atomic_buf_align,     /* alignment required for atomic buffered writes */
   blksize_t st_atomic_buf_size,      /* max size for atomic buffered writes */
   blksize_t st_atomic_direct_align,  /* alignment required for atomic direct writes */
   blksize_t st_atomic_direct_size,   /* max size for atomic direct writes */

Initial feedback from kernel hackers I asked: maybe fcntl(2) or a new syscall could be better than making struct stat bigger.

Examples

The values could be calculated by combining information from the filesystem and the underlying storage, as follows:

1. ZFS could return recordsize in all fields.

2. UFS could return the greatest common factor of its own block size and the underlying device's atomic write size, for buffered I/O. For direct I/O it could return the underlying device's values. But... it's not yet clear (to me) whether we'd need a new way to opt into atomicity-preserving behaviour, or whether the properties exist already; see questions section below.

3. NFS could do something similar, combining its own buffer size with the remote host's information, if that is made available somehow.

The device level could expose values through some new interface, as follows:

1. Unmodified drivers could somehow report either 0 (unknown) or 512 bytes (conservative value already assumed by a lot of software, but not governed by any standard).

2. The NVMe driver could perhaps expose the AWUPF (atomic write unit power failure) value, if available (?)

3. Special drivers for cloud storage such as Azure hyperv/storvsc could report appropriate values when they are known. Can hyperv/storvsc recognise when it is talking to Azure block storage (and not, say, a Hyper-V loopback file on a laptop)? What guarantees could it then infer?

Questions

How can we know that requests can't be cut up by some layer in the storage stack, other than the obvious case of block size? Is it possible to know that without having special new flags to say "don't do that", like the proposed Linux O_ATOMIC flag? Given a UFS file system with 8kB logical blocks, can we expect (1) write-back of buffered blocks at the logical block size, and (2) direct writes of correctly aligned logical blocks, to arrive whole in SCSI WRITE commands (possibly clustered with others, but never divided), through GEOM, CAM, SCSI/XXX, driver? What we want to avoid is a write being broken up into multiple write commands, or scattered writes at the device's sector size, or something like that. If that doesn't work today, what would be necessary to make it work?

Related proposal

For several years the Linux community has been considering a new open(2) flag O_ATOMIC, to be used along with O_SYNC or O_DSYNC to request that writes be atomic. The discussions specifically reference MySQL and the ability to turn off double writes. That sounds extremely useful for this purpose, but it's not strictly necessary to solve the immediate problem. For one thing, we don't need arbitrary sized atomic writes; we only need to know that our page size is suitable. For another thing, it doesn't allow for buffered I/O. FreeBSD has an advantage there, because its filesystem block size is not fixed at 4kB, so buffered I/O doesn't necessarily destroy atomicity at the sizes we're interested in. See the Linux Plumbers 2019 video, the mailing list discussion, and the LWN article.

Of course, it would be great to support something like O_ATOMIC too, especially if it becomes a de facto standard due to Linux (though that work seems to have stalled a bit?), but that's much more difficult anyway: the filesystem-level part of it introduces copy-on-write behaviour to *create* atomicity where it doesn't come for free (like ZFS already does). In contrast, the present proposal is about merely reporting the existence of atomicity that is already there, so that databases don't have to be so pessimistic.