Effort to add asynchronous I/O, vectored/multiblock I/O and optional direct I/O to PostgreSQL.

High-level state

Core AIO infrastructure merged, using workers and io_uring
Buffer manager infrastructure for asynchronous reads has been merged
Read streams use AIO for readahead, various places have been converted to use AIO (See read stream users)
Direct I/O can be enabled via the debug_io_direct=data GUC, but will be very slow for many workloads (See DIO reqs)

Sub-Projects required for "minimal real-world use of DIO"

Readahead of the table fetches for ordered index scans; without that, common queries will be slower by orders of magnitude
Asynchronous writes for checkpointer / bgwriter and COPY
Some additional read streams might be required (e.g. ALTER TABLE SET TABLESPACE)

Streamification projects

To benefit from asynchronous I/O and vectored/multiblock I/O, we need to find all the places that pin buffers in a predictable way and teach them to do so via a stream. Code that switches to the streaming API picks up future improvements to the underlying infrastructure without further changes. For now, streamifying gets you vectored/multiblock I/O and systematic POSIX_FADV_WILLNEED hints only, but true asynchronous modes will be proposed for the underlying infrastructure later.
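As a rough illustration of what "vectored/multiblock I/O plus POSIX_FADV_WILLNEED hints" means at the system call level, here is a minimal sketch. It is not the read stream API: the helper names hint_willneed() and read_blocks_vectored() and the MAX_BLOCKS_PER_READ cap are made up for this example, and it assumes a platform that has both posix_fadvise() and preadv() (for example Linux or FreeBSD).

```c
#define _GNU_SOURCE             /* make preadv() visible on glibc in strict modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */
#define MAX_BLOCKS_PER_READ 16  /* hypothetical cap, for this sketch only */

/*
 * Hint that blocks [first, first + nblocks) will be needed soon.
 * Returns 0 on success or an error number (posix_fadvise() convention).
 */
static int
hint_willneed(int fd, off_t first, int nblocks)
{
    return posix_fadvise(fd, first * BLCKSZ, (off_t) nblocks * BLCKSZ,
                         POSIX_FADV_WILLNEED);
}

/*
 * Read nblocks consecutive blocks with a single preadv() call, scattering
 * each block into its own buffer. Returns the number of whole blocks read,
 * or -1 on error or bad arguments.
 */
static int
read_blocks_vectored(int fd, off_t first, int nblocks, char **bufs)
{
    struct iovec iov[MAX_BLOCKS_PER_READ];
    ssize_t     nread;

    assert(bufs != NULL);
    if (nblocks < 1 || nblocks > MAX_BLOCKS_PER_READ)
        return -1;

    for (int i = 0; i < nblocks; i++)
    {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BLCKSZ;
    }

    nread = preadv(fd, iov, nblocks, first * BLCKSZ);
    if (nread < 0)
        return -1;
    return (int) (nread / BLCKSZ);
}
```

A caller that knows it will want blocks 1 to 3 could call hint_willneed(fd, 1, 3) early and read_blocks_vectored(fd, 1, 3, bufs) later. The point of a stream is that this look-ahead decision is made in one place rather than by every caller.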

Users of read streams

Use streaming I/O in sequential scans. -- committed in 17
Use streaming I/O in ANALYZE. -- committed in 17
Use streaming I/O in pg_prewarm. -- committed in 17
Use streaming I/O in VACUUM -- work in progress
Use streaming I/O in bitmap heap scans -- work in progress
btree related thread -- todo
brin -- todo
gist -- todo
gin -- todo
CREATE DATABASE STRATEGY=WAL_LOG
recovery -- todo
...

Users of write streams

There are not as many places that write data out.

CHECKPOINT -- work in progress
CREATE INDEX (buffer pool bypass, extending/replacing the bulk_write.c facility) -- work in progress
VACUUM writeback -- work in progress
COPY writeback -- work in progress
eviction writeback -- work in progress

Infrastructure changes

Sub-Projects for AIO writes

bufmgr.c infrastructure to acquire an exclusive lock on a buffer without races when IO might be in progress
sync.c requests cannot correctly be made in a critical section, but that's required for buffered writes
Don't dirty pages while IO is ongoing (i.e. don't write hint bits while IO is ongoing)

Read stream API

basic buffered read stream, single relation + fork -- committed in 17
multi-relation read stream -- todo

Needed for recovery, but also useful for CREATE DATABASE ... STRATEGY=WAL_LOG.

Write stream API

basic buffered write stream -- todo
driving sync_file_range -- todo
driving WAL write frequency -- todo
buffer pool bypass, needed for CREATE INDEX (bulk writes) -- todo

TODO

General

partition_prune failing in CI due to compiler bug in specific MSVC release
- discussion on -hackers
- ticket to update compiler on CI
- drop commit "XXX Add temporary workaround for partition_prune test on Windows"
how should flush_range op be implemented on non-Linux?
write documentation, per-OS information
can we cut down on the number of places where we do non-blocking drain, for the benefit of implementations where that might be a system call?
replication is slow with wal_sync_method=open_datasync, because we don't call WalSndWakeupRequest(); this explains why eg the subscription tests are super slow on macOS on CI (macOS defaults to open_datasync)
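For the flush_range question in the list above, the following sketch shows one possible shape, not a decision. On Linux it uses sync_file_range() with SYNC_FILE_RANGE_WRITE, which starts writeback for a byte range without waiting for it. Everywhere else it falls back to fsync(), which is much heavier (whole file, and it waits) and is only a placeholder. The function name flush_range() and the fallback policy are assumptions made for this example.

```c
#define _GNU_SOURCE             /* make sync_file_range() visible on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start writing back dirty pages in
 * [offset, offset + nbytes). Returns 0 on success, -1 on error.
 *
 * This is a writeback hint, not a durability guarantee: callers still need
 * fsync()/fdatasync() when the data must be durable.
 */
static int
flush_range(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes > 0);

#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    /* Linux: initiate writeback for just this range, without waiting. */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) == 0)
        return 0;
    if (errno != ENOSYS)
        return -1;
    /* System call not available here: fall through to the fallback. */
#else
    (void) offset;
    (void) nbytes;
#endif

    /* Placeholder fallback: flushes the whole file and waits for it. */
    return fsync(fd);
}
```

Other candidates for the non-Linux branch would need the same scrutiny as this fallback: they should start writeback for a range without blocking the caller for long.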

Larger Issues

Streaming read / write interfaces are "too local" to specific users. A bit more backend-global awareness would likely be a good idea.
Currently PostgreSQL does some prefetching with posix_fadvise(POSIX_FADV_WILLNEED), which is not asynchronous. See Readahead.

io_method=worker

self-adjusting IO worker pool?
more work on the spurious-wakeup vs latency tradeoff

io_method=posix_aio

add an "interruptible" field in shmem that can be used to avoid useless wakeups while the submitter is running synchronous IOs or already draining, with some double-checked flags to avoid races?
detect presence of POSIX AIO automatically so you don't have to build --with-posix-aio
- would be good to pass smoke tests on all known POSIX AIO implementations before we do that; results so far:
  - Successes!
    - FreeBSD
    - Linux with Glibc and Musl
    - illumos and Solaris
    - macOS
      - need to bump up sysctl limits to get decent performance
    - AIX
      - needs shared_memory_type=sysv
      - aio_nwait() would be a better wait/reap interface than aio_suspend(), but sadly it can't wait for aio_fsync()
        possible workaround: could use aio_nwait() whenever there are no aio_fsync calls outstanding
  - Failures, that will probably need to be excluded by default in early versions:
    - NetBSD 9 seems to spin eating 100% CPU in lio_listio() :-(
      - clue: an earlier iteration of this code was able to pass tests on NetBSD, when we were using aio_read(), aio_write() instead of lio_listio(), and when we were using SIGEV_SIGNAL instead of SIGEV_NONE
      - no bug report filed yet, but a NetBSD developer advised me not to try to use this, it's not ready
- if keeping it as a configure option, it should be "enable/disable", not "with/without"
fixed
- crash because aio_suspend() sees EINVAL, because we "activate" the LIO IOCB too soon, commit (fixup to be squashed)
- DONE: can we avoid waiting for a merge head from a later generation? yes
- DONE: can we avoid using atomics in a signal handler? most uses removed but in flight count remains; could do something about this
- DONE: the naming and coding in the baton stuff is weird, needs a rethink -> done, now called "exchange" and shared with iocp
- DONE: kill the array of active IOs if not using aio_suspend (eg FreeBSD)
- DONE: would it be better to have the signal handler give up after a short time so it can get back to doing something useful, and the waiter wake it again after a bit? -> seems to be better, but may need some defences against thundering herds and useless wakeups

io_method=iocp

Acceptance criteria for moving iocp into the main aio branch (from aio-win32):
- "TRAP: FailedAssertion("pgwin32_signal_event != NULL", File: "C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build\src\port\open.c", Line: 78, PID: 3416)"
  - pgwin32_open() currently hacked to comment that out because of unresolved ordering problem, read_nondefault_variables() vs pgwin32_signal_initialize()
- currently pgaio_can_scatter_gather() considers only io_data_direct when deciding; but it applies also to WAL I/O (and potentially, in future, who knows, temporary files etc)
  - this matters primarily for Windows because Windows has a different answer depending on use of direct I/O, but we have at least 3 different GUCs to control that on different subsystems
  - one idea would be for pgaio_can_scatter_gather() to take an IO and check the "scb" to see who is asking, or something like that...
  - another idea is to carry a "direct" flag on every IO so that the merge code doesn't have to concern itself with the details beyond that
- new API: pgaio_impl->opening_fd(int fd, int flags) so that Windows impl can register fd with IOCP if flags & O_OVERLAPPED? currently that's all a bit kludgy
other things
- using GetQueuedCompletionStatusEx() requires Windows Vista, but PostgreSQL currently targets XP+. both are long dead, but the case for bumping it needs to be made in the community
- does FlushFileBuffers() have an async cousin? doesn't look like it
  - but there is an equivalent of fdatasync() in ntdll.dll; might be worth looking further
- pgaio_iocp_closing_fd() should drain only IOs on the given fd, not all IOs issued by this backend
  - compare pgaio_posix_aio_closing_fd(): it only drains results, it does not reap, to avoid deadlock risk! need something like that here too

fixed
- DONE: calling it "iocp" for now ("windows" was too generic; we want to reserve the option to use the new Windows io_uring knockoff API which will probably be a separate method)
- DONE: kill the IOCP thread, and teach pgaio_windows_drain() to drain?
- DONE: solve the resulting deadlock by using the same procsignal trick as posix_aio?
- DONE: we should use GetQueuedCompletionStatusEx() to consume multiple completions in one call, instead of a loop!

io_method=ioring

no code yet, just an idea
ioring (note: no 'u', maybe should be win_ioring or some other name) is a hypothetical future IO method that would use Windows I/O rings, a knock-off of Linux io_uring that is available in the Windows 11 preview but still changing
so far the documentation only describes how to do reads but we know that you can already do writes and flushes with the current preview, so there is enough there right now to write aio_ioring.c and hook it up
the easiest way could be to have one ioring per backend and use aio_exchange.c to deal with cross-process problems (like io_method=posix_aio and io_method=iocp), but it may also be possible to have N iorings that are somehow shared between processes and then use the context system for interlocking (like io_method=io_uring, the Linux one); can you do that, i.e. somehow share the handle + memory mapping for the submission and completion queues + correct wakeups?

Quick start for PostgreSQL hackers/reviewers

Testing the default mode, simulated AIO using io_method=worker (= the default setting)
- Try strace-ing the backend and IO worker processes to see how I/O syscalls are offloaded
- Adjust the number of io_worker processes with io_workers=N
- See the view pg_stat_aios that shows individual IOs
- See the view pg_stat_aio_backends that shows per-backend info
Testing the use of direct I/O instead of PostgreSQL's traditional double buffering
- Set debug_io_direct=data to disable OS buffering of relation data
- Set debug_io_direct=wal to disable OS buffering of WAL data
- Set debug_io_direct=wal,wal_init,data to disable all
Testing OS-specific options for "native" AIO
- Linux io_uring
  - install package liburing-dev (or liburing-devel on some distros)
  - configure with --with-liburing (or if using Meson, it should find it by itself?)
  - run with io_method=io_uring
- POSIX AIO, on macOS, FreeBSD, NetBSD, AIX, illumos, Solaris, Linux
  - configure --with-posix-aio
  - run with io_method=posix_aio
  - Some OS specific notes:
    - Linux POSIX AIO is fake, simulated by glibc and musl with threads, and works well enough for testing but wouldn't be a good choice to actually use
    - Solaris/illumos also has fake POSIX AIO (there is a more native/kernel supported AIO API but we don't support that)
    - macOS has very tight limits on AIO; like every other application that uses AIO (Sybase and VirtualBox, for example), we're going to have to publish recommendations to crank them up; on the bright side, the macOS defaults provide a nice workout for the IO retry code paths: often we try to start an IO and the kernel says EAGAIN, so we wait for one IO to complete and then try again, and you can see this in the pg_stat_aio_backends retry counter column
    - AIX only works if you set shared_memory_type=sysv because AIX's AIO can't access memory we allocate with mmap() (otherwise all IO fails with EFAULT)
    - AIX direct I/O might be unnecessarily serializing per-file, which we could fix with O_CONCURRENT, O_CIO or O_CIOR
    - HPUX probably also needs O_CIO to avoid serializing direct IO (native AIO not working there yet but this applies to worker mode too)
- Windows IOCP
  - run with io_method=iocp
  - (not yet pushed to main aio branch, find it in the aio-win32 branch until it's a little more complete...)
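To make the direct I/O settings in the quick start more concrete, here is a minimal sketch of what bypassing OS buffering looks like for a single block read on a system that has O_DIRECT (Linux, for example). It is an illustration, not PostgreSQL code: the helper names are made up, the 4096-byte alignment is an assumption (the real requirement depends on the device and file system), and the fallback to buffered I/O when the file system rejects O_DIRECT is a choice made only for this example.

```c
#define _GNU_SOURCE             /* make O_DIRECT visible on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */
#define IO_ALIGN 4096           /* assumed buffer alignment for direct I/O */

/*
 * Open a file read-only, with O_DIRECT where possible. If the platform has
 * no O_DIRECT, or the file system rejects it with EINVAL, fall back to
 * ordinary buffered I/O. *is_direct reports which mode was obtained.
 */
static int
open_maybe_direct(const char *path, int *is_direct)
{
    int         fd = -1;

    assert(path != NULL && is_direct != NULL);
    *is_direct = 0;
#ifdef O_DIRECT
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0)
        *is_direct = 1;
    else if (errno != EINVAL)
        return -1;
#endif
    if (fd < 0)
        fd = open(path, O_RDONLY);
    return fd;
}

/* Allocate one block-sized buffer aligned for direct I/O (free() it later). */
static void *
alloc_io_buffer(void)
{
    void       *p = NULL;

    if (posix_memalign(&p, IO_ALIGN, BLCKSZ) != 0)
        return NULL;
    return p;
}

/* Read one block; offset and length are multiples of the block size. */
static ssize_t
read_block(int fd, off_t blockno, void *buf)
{
    return pread(fd, buf, BLCKSZ, blockno * BLCKSZ);
}
```

With direct I/O the kernel does no readahead or write caching on our behalf, which is why the asynchronous I/O and readahead work described on this page matters for performance.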
