AIO
From PostgreSQL wiki
Effort to add asynchronous I/O, vectored/multiblock I/O and optional direct I/O to PostgreSQL.
Streamification projects
In order to benefit from asynchronous I/O and vectored/multiblock I/O, we need to find all the places that pin buffers in a predictable way and teach them to do so via a stream. Switching to the streaming API insulates those callers from future changes to the underlying I/O infrastructure. For now, streamifying gets you vectored/multiblock I/O and systematic POSIX_FADV_WILLNEED hints only, but later true asynchronous modes will be proposed for the underlying infrastructure.
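As a rough illustration of what "pinning buffers via a stream" buys, here is a minimal Python sketch of the idea, not PostgreSQL's actual C API. The names `read_stream`, `next_block` and `max_combine` are hypothetical: the caller supplies a callback that says which block it wants next, and the stream merges runs of consecutive blocks into a single larger read.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def read_stream(fd, next_block, max_combine=16, stats=None):
    """Yield (block_number, data) pairs, merging consecutive blocks into one read.

    next_block is a hypothetical callback that returns the next wanted block
    number, or None when the caller has no more blocks to ask for.
    """
    pending = []  # current run of consecutive block numbers, not yet read

    def flush():
        # One pread() covers the whole run. A real implementation could
        # instead scatter into separate buffers or start an asynchronous read.
        data = os.pread(fd, BLOCK_SIZE * len(pending), pending[0] * BLOCK_SIZE)
        if stats is not None:
            stats["reads"] = stats.get("reads", 0) + 1
        for i, blkno in enumerate(pending):
            yield blkno, data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        pending.clear()

    while (blkno := next_block()) is not None:
        # Start a new run when the block is not adjacent or the run is full.
        if pending and (blkno != pending[-1] + 1 or len(pending) >= max_combine):
            yield from flush()
        pending.append(blkno)
    if pending:
        yield from flush()


def demo_read_stream():
    with tempfile.TemporaryFile() as f:
        for i in range(8):
            f.write(bytes([i]) * BLOCK_SIZE)  # block i is filled with byte i
        f.flush()
        wanted = iter([0, 1, 2, 5, 6, 7])
        stats = {}
        blocks = list(read_stream(f.fileno(), lambda: next(wanted, None), stats=stats))
        return [(blkno, data[0]) for blkno, data in blocks], stats["reads"]
```

In this sketch the six requested blocks are served by two system calls, because they form two consecutive runs. The caller only ever sees one block at a time, which is why the I/O strategy behind the stream can change without touching the caller.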
Users of read streams
- Use streaming I/O in sequential scans. -- committed in 17
- Use streaming I/O in ANALYZE. -- committed in 17
- Use streaming I/O in pg_prewarm. -- committed in 17
- Use streaming I/O in VACUUM -- work in progress
- Use streaming I/O in bitmap heap scans -- work in progress
- btree related thread -- todo
- brin -- todo
- gist -- todo
- gin -- todo
- CREATE DATABASE STRATEGY=WAL_LOG
- recovery -- todo
- ...
Users of write streams
There are not as many places that write data out.
- CHECKPOINT -- work in progress
- CREATE INDEX (buffer pool bypass, extending/replacing the bulk_write.c facility) -- work in progress
- VACUUM writeback -- work in progress
- COPY writeback -- work in progress
- eviction writeback -- work in progress
Infrastructure changes
Read stream API
- basic buffered read stream, single relation + fork -- committed in 17
- multi-relation read stream -- todo
- Needed for recovery, but also useful for CREATE DATABASE ... STRATEGY=WAL_LOG.
Write stream API
- basic buffered write stream -- todo
- driving sync_file_range -- todo
- driving WAL write frequency -- todo
- buffer pool bypass, needed for CREATE INDEX (bulk writes) -- todo
Committed
- O_DIRECT on macOS
- So that Mac hackers can test io_data_direct=on (worker mode or posix_aio mode)
- pg_pwritev() and pg_preadv()
- Portable support for synchronous scatter/gather I/O
- Replace buffer I/O locks with condition variables
- So that backends other than the one that drains an I/O from the kernel can wait for it to complete
- Aligned memory allocation
- As required for buffers used in direct I/O
- Direct I/O GUC
- Released in 16 as debug_io_direct, to avoid attracting too much attention
- Fix DROP TABLESPACE on Windows with ProcSignalBarrier
- This is actually an ancient bug in PostgreSQL on Windows, but it was a bit harder to hit; make check fails every time under io_method=worker because processes hang around holding cached fds.
- Refactor relation extension
- This includes work to allow a backend to have multiple buffers in BM_IO_IN_PROGRESS state.
- Streaming I/O, vectored I/O
- Introduce the concept of streams of buffers, initially for more efficient synchronous I/O
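The scatter/gather I/O mentioned above (pg_preadv() and pg_pwritev()) can be illustrated with the equivalent operating system calls as exposed by Python's `os` module. This is only a sketch of the underlying technique; the helper names `write_blocks` and `read_blocks` are made up for the example, and it assumes a platform where `os.preadv()` and `os.pwritev()` are available (for example Linux or FreeBSD).

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def write_blocks(fd, first_block, buffers):
    """Gather several separate buffers into consecutive blocks with one pwritev()."""
    return os.pwritev(fd, buffers, first_block * BLOCK_SIZE)


def read_blocks(fd, first_block, nblocks):
    """Scatter consecutive blocks into separate buffers with one preadv()."""
    buffers = [bytearray(BLOCK_SIZE) for _ in range(nblocks)]
    nread = os.preadv(fd, buffers, first_block * BLOCK_SIZE)
    return nread, buffers


def demo_scatter_gather():
    with tempfile.TemporaryFile() as f:
        # Three buffers that are not contiguous in memory.
        out = [bytes([b]) * BLOCK_SIZE for b in (1, 2, 3)]
        written = write_blocks(f.fileno(), 2, out)
        nread, buffers = read_blocks(f.fileno(), 2, 3)
        return written, nread, [bytes(b) for b in buffers] == out
```

The point is that the buffers do not need to be adjacent in memory for the blocks to be transferred in one system call, which matters because neighbouring blocks of a file usually land in unrelated buffer pool slots.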
TODO
General
- get synchronous optimisation to work on all implementations
- if we know we're going to submit exactly one IO and then wait for it (see "will_wait parameter"), switch to a regular preadv()/pwritev()/fsync()/fdatasync() syscall, which may be a little more efficient
- implemented initially for io_method=worker; the other methods still need more work
- partition_prune failing in CI due to compiler bug in specific MSVC release
- discussion on -hackers
- ticket to update compiler on CI
- drop commit "XXX Add temporary workaround for partition_prune test on Windows"
- there might be IOs that this backend retried (and thus submitted) that aren't on our issued list, so pgaio_postmaster_before_child_exit() won't wait for them, and some kernels could cancel/forget the IOs
- TODO note added to pgaio_postmaster_before_child_exit()
- TODO the posix_aio implementation has a .postmaster_before_child_exit callback to fix this locally (otherwise Macs fail on CI due to random retries)
- how should flush_range op be implemented on non-Linux?
- write documentation, per-OS information
- can we cut down on the number of places where we do non-blocking drain, for the benefit of implementations where that might be a system call?
- incorporate Melanie's EXPLAIN changes
- replication is slow with wal_sync_method=open_datasync, because we don't call WalSndWakeupRequest(); this explains why eg the subscription tests are super slow on macOS on CI (macOS defaults to open_datasync)
- we're still carrying some obsolete code for dealing with macOS F_NOCACHE, which has been upstreamed (differently)
- allow the AIO interface to be used for temporary tables, to avoid / reduce code duplication (see e.g. RelationUsesLocalBuffers() path in ReadBufferAsync())
- fixed
- DONE: occasional assertion failure "issued_abandoned_count == 0" fixed, squashed
- DONE: "buffer beyond EOF" fixed, squashed
- DONE: reference leak warnings from checkpointer fixed, squashed
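The "synchronous optimisation" item in the list above can be sketched as follows. This is a simplified model, not the real code: `start_read` and its `will_wait` argument are hypothetical names modelled on the "will_wait parameter" mentioned above, and a thread pool stands in for the IO worker processes of io_method=worker.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor


def start_read(pool, fd, length, offset, will_wait=False):
    """Start a read and return (future, path_taken).

    If the caller is about to wait for exactly this one IO anyway, handing it
    to another worker only adds overhead, so do a plain synchronous pread().
    """
    if will_wait:
        fut = Future()
        try:
            fut.set_result(os.pread(fd, length, offset))
        except OSError as e:
            fut.set_exception(e)
        return fut, "sync"
    # Otherwise hand the IO off so the caller can keep doing useful work.
    return pool.submit(os.pread, fd, length, offset), "worker"


def demo_will_wait():
    with tempfile.TemporaryFile() as f, ThreadPoolExecutor(max_workers=3) as pool:
        f.write(b"x" * 100)
        f.flush()
        fut1, path1 = start_read(pool, f.fileno(), 100, 0, will_wait=True)
        fut2, path2 = start_read(pool, f.fileno(), 100, 0, will_wait=False)
        return path1, path2, fut1.result() == fut2.result() == b"x" * 100
```

Both paths return the same kind of handle, so the caller's waiting code does not need to know which one was taken.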
Larger Issues
- local callbacks can lead to too deep recursion - callbacks likely shouldn't be allowed to wait for IO
- ownership tracking of IOs is too complicated and yet not quite good enough
- Streaming read / write interfaces are "too local" to specific users. A bit more backend-global awareness would likely be a good idea
- currently PostgreSQL does some prefetching with posix_fadvise(POSIX_FADV_WILLNEED), which is not asynchronous. See Readahead.
io_method=worker
- self-adjusting IO worker pool?
- more work on the spurious-wakeup vs latency tradeoff
- bincheck failing on CI on Windows/worker
- pg_basebackup: error: could not initiate base backup: ERROR: could not stat file or directory "C:\Users\ContainerAdministrator\AppData\Local\Temp\sTWg233xJQ\tempdir\tblspc1/PG_15_202108031/12762/16388": Permission denied
- Bug in master?
- IO workers cache file descriptors for relations that have been dropped, and then could be confused if the relfilenode is recycled! that's because they don't obtain relation locks or participate in sinval
- the bgwriter's approach to this problem is actually broken in master and certainly wouldn't work here
- perhaps there is a way to rely on the caller's use of sinval and locks, putting extra information that could be used for cache invalidation into the IO
io_method=io_uring
- move method-specific stuff into io->io_method_data.io_uring
- there are a couple more #ifdef USE_LIBURING bits in aio.c that could be kicked out into pgaio_impl->something()?
- locking is too heavyweight
io_method=posix_aio
- add an "interruptible" field in shmem that can be used to avoid useless wakeups while the submitter is running synchronous IOs or already draining, with some double-checked flags to avoid races?
- detect presence of POSIX AIO automatically so you don't have to build --with-posix-aio
- would be good to pass smoke tests on all known POSIX AIO implementations before we do that; results so far:
- Successes!
- FreeBSD
- Linux with Glibc and Musl
- illumos and Solaris
- macOS
- need to bump up sysctl limits to get decent performance
- AIX
- needs shared_memory_type=sysv
- aio_nwait() would be a better wait/reap interface than aio_suspend(), but sadly it can't wait for aio_fsync()
- possible workaround: could use aio_nwait() whenever there are no aio_fsync calls outstanding
- Failures, that will probably need to be excluded by default in early versions:
- NetBSD 9 seems to spin eating 100% CPU in lio_listio() :-(
- clue: an earlier iteration of this code was able to pass tests on NetBSD, when we were using aio_read(), aio_write() instead of lio_listio(), and when we were using SIGEV_SIGNAL instead of SIGEV_NONE
- no bug report filed yet, but a NetBSD developer advised me not to try to use this, it's not ready
- if keeping it as a configure option, it should be "enable/disable", not "with/without"
- fixed
- crash because aio_suspend() sees EINVAL, because "activate" LIO IOCB too soon, commit (fixup to be squashed)
- DONE: can we avoid waiting for a merge head from a later generation? yes
- DONE: can we avoid using atomics in a signal handler? most uses removed but in flight count remains; could do something about this
- DONE: the naming and coding in the baton stuff is weird, needs a rethink -> done, now called "exchange" and shared with iocp
- DONE: kill the array of active IOs if not using aio_suspend (eg FreeBSD)
- DONE: would it be better to have the signal handler give up after a short time so it can get back to doing something useful, and the waiter wake it again after a bit? -> seems to be better, but may need some defences against thundering herds and useless wakeups
io_method=iocp
- Acceptance criteria for moving iocp into the main aio branch (from aio-win32):
- "TRAP: FailedAssertion("pgwin32_signal_event != NULL", File: "C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build\src\port\open.c", Line: 78, PID: 3416)"
- pgwin32_open() currently hacked to comment that out because of unresolved ordering problem, read_nondefault_variables() vs pgwin32_signal_initialize()
- currently pgaio_can_scatter_gather() considers only io_data_direct when deciding; but it applies also to WAL I/O (and potentially, in future, who knows, temporary files etc)
- this matters primarily for Windows because Windows has a different answer depending on use of direct I/O, but we have at least 3 different GUCs to control that on different subsystems
- one idea would be for pgaio_can_scatter_gather() to take an IO and check the "scb" to see who is asking, or something like that...
- another idea is to carry a "direct" flag on every IO so that the merge code doesn't have to concern itself with the details beyond that
- new API: pgaio_impl->opening_fd(int fd, int flags) so that Windows impl can register fd with IOCP if flags & O_OVERLAPPED? currently that's all a bit kludgy
- other things
- using GetQueuedCompletionStatusEx() requires Windows Vista, but PostgreSQL currently targets XP+. both are long dead, but the case for bumping the minimum needs to be made in the community
- does FlushFileBuffers() have an async cousin? doesn't look like it
- but there is an equivalent of fdatasync() in ntdll.dll; might be worth looking further
- pgaio_iocp_closing_fd() should drain only IOs on the given fd, not all IOs issued by this backend
- compare pgaio_posix_aio_closing_fd() -- it only drains results, it does not reap, to avoid deadlock risk! need something like that here too
- fixed
- DONE: calling it "iocp" for now ("windows" was too generic; we want to reserve the option to use the new Windows io_uring knockoff API which will probably be a separate method)
- DONE: kill the IOCP thread, and teach pgaio_windows_drain() to drain?
- DONE: solve the resulting deadlock by using the same procsignal trick as posix_aio?
- DONE: we should use GetQueuedCompletionStatusEx() to consume multiple completions in one call, instead of a loop!
io_method=ioring
- no code yet, just an idea
- ioring (note: no 'u', maybe should be win_ioring or some other name) is a hypothetical future IO method that would use Windows I/O rings, a knock-off of Linux io_uring that is available in the Windows 11 preview but still changing
- so far the documentation only describes how to do reads but we know that you can already do writes and flushes with the current preview, so there is enough there right now to write aio_ioring.c and hook it up
- the easiest way could be to have one ioring per backend and use aio_exchange.c to deal with cross-process problems (like io_method=posix_aio and io_method=iocp), but it may also be possible to have N iorings that are somehow shared between processes and then use the context system for interlocking (like io_method=io_uring, the Linux one); can you do that, i.e. somehow share the handle + memory mapping for the submission and completion queues + get correct wakeups?
Quick start for PostgreSQL hackers/reviewers
- Testing the default mode, simulated AIO using io_method=worker (= the default setting)
- Try strace-ing the backend and IO worker processes to see how I/O syscalls are offloaded
- Adjust the number of io_worker processes with io_workers=N
- See the view pg_stat_aios that shows individual IOs
- See the view pg_stat_aio_backends that shows per-backend info
- Testing the use of direct I/O instead of PostgreSQL's traditional double buffering
- Set io_direct=data to disable OS buffering of relation data
- Set io_direct=wal to disable OS buffering of WAL data
- Set io_direct=wal,wal_init,data to disable all
- Testing OS-specific options for "native" AIO
- Linux io_uring
- install package liburing-dev (or liburing-devel on some distros)
- configure with --with-liburing (or if using Meson, it should find it by itself?)
- run with io_method=io_uring
- POSIX AIO, on macOS, FreeBSD, NetBSD, AIX, illumos, Solaris, Linux
- configure --with-posix-aio
- run with io_method=posix_aio
- Some OS specific notes:
- Linux POSIX AIO is fake, simulated by glibc and musl with threads, and works well enough for testing but wouldn't be a good choice to actually use
- Solaris/illumos also has fake POSIX AIO (there is a more native/kernel supported AIO API but we don't support that)
- macOS has very tight limits on AIO; like every other application that uses AIO (Sybase and VirtualBox, for example), we're going to have to publish recommendations to crank them up. On the bright side, the macOS defaults provide a nice workout for the IO retry code paths: often we try to start an IO and the kernel says EAGAIN, so we wait for one IO to complete and then try again. You can see this in the pg_stat_aio_backends retry counter column
- AIX only works if you set shared_memory_type=sysv because AIX's AIO can't access memory we allocate with mmap() (otherwise all IO fails with EFAULT)
- AIX direct I/O might be unnecessarily serializing per-file, which we could fix with O_CONCURRENT, O_CIO or O_CIOR
- HPUX probably also needs O_CIO to avoid serializing direct IO (native AIO not working there yet but this applies to worker mode too)
- Windows IOCP
- run with io_method=iocp
- (not yet pushed to main aio branch, find it in the aio-win32 branch until it's a little more complete...)
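The EAGAIN retry behaviour described in the macOS note above can be modelled with a small simulation. Everything here is hypothetical and for illustration only: `TinyKernel` stands in for a kernel with a low limit on in-flight AIOs (assumed to be at least 1), and `submit_all` shows the "wait for one IO to complete, then try again" loop together with a retry counter similar in spirit to the one shown in pg_stat_aio_backends.

```python
import errno
from collections import deque


class TinyKernel:
    """Hypothetical stand-in for a kernel with a small in-flight AIO limit."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = deque()

    def start_io(self, io_id):
        if len(self.in_flight) >= self.max_in_flight:
            raise OSError(errno.EAGAIN, "too many IOs in flight")
        self.in_flight.append(io_id)

    def wait_for_one(self):
        # Pretend the oldest IO has completed.
        return self.in_flight.popleft()


def submit_all(kernel, io_ids):
    """Start every IO, retrying after EAGAIN; return (completed, retries)."""
    retries = 0
    completed = []
    for io_id in io_ids:
        while True:
            try:
                kernel.start_io(io_id)
                break
            except OSError as e:
                if e.errno != errno.EAGAIN:
                    raise
                # The kernel is full: wait for one IO to finish, then retry.
                completed.append(kernel.wait_for_one())
                retries += 1
    while kernel.in_flight:
        completed.append(kernel.wait_for_one())
    return completed, retries
```

With a limit of two in-flight IOs, submitting five IOs in this model needs three retries, which is the kind of pattern a tightly limited system produces all the time.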