Syscall Reduction
From PostgreSQL wiki
In the past few releases we've done a lot of work to remove unnecessary system calls made by PostgreSQL, but there are plenty more opportunities. Here is a log and todo list about that.
lseek
- we used to do all disk IO with lseek+read/write, but we switched to pread/pwrite
- 3fd2a7932ef0708dda57369bb20c0499d905cc82 for portable interface (PG12)
- c24dcd0cfd949bdf245814c4c2b3df828ee7db36 for main data files and WAL (PG12)
- 0dc8ead46363fec6f621a12c7e1f889ba73b55a9 for WAL reader (PG13)
- 701a51fd4e01dbbd02067d8f01905a04bc571131 for miscellaneous other places (PG13)
- 2fd2effc50824a8775a088435a13f47b7a6f3b94 for base backup (PG14)
- e2b37d9e7cabc90633c4bd822e1bcfdd1bda44c4 for SLRU files (PG14)
- we use lseek(SEEK_END) to probe the size of relations
- c5315f4f44843c20ada876fdb0d0828795dfbdf5 to cache that instead of checking on every block referenced in recovery (PG14)
- TODO: we need to extend that caching to cover non-recovery paths too, because right now we probe relation sizes every time we plan or begin a sequential scan (see thread)
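To illustrate the lseek+read to pread change listed above, here is a minimal sketch in plain POSIX C. The function names are made up for this example and are not PostgreSQL's; the point is only that the offset travels with the read, so one system call replaces two and the file position is left alone.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Old pattern: position the file, then read. Two system calls. */
ssize_t
read_at_with_lseek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
}

/* New pattern: the offset is an argument. One system call. */
ssize_t
read_at_with_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}

/*
 * Usage example: write ten bytes to a temporary file and check that both
 * paths return the same four bytes from offset 3. Returns 0 on success.
 */
int
compare_read_paths(void)
{
    FILE   *tmp = tmpfile();
    char    a[4];
    char    b[4];
    int     fd;
    int     ok;

    if (tmp == NULL)
        return -1;
    fd = fileno(tmp);
    if (write(fd, "0123456789", 10) != 10)
    {
        fclose(tmp);
        return -1;
    }

    ok = read_at_with_pread(fd, a, sizeof(a), 3) == 4 &&
         read_at_with_lseek(fd, b, sizeof(b), 3) == 4 &&
         memcmp(a, "3456", 4) == 0 &&
         memcmp(a, b, 4) == 0;

    fclose(tmp);
    return ok ? 0 : -1;
}
```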
polling for unexpected postmaster exit
- we used to check if the postmaster had gone away every time through the recovery loop by reading from a pipe
- 9f09529952ac41a10e5874cba745c1c24e67ac79 to use a process exit signal instead, on Linux (PG12)
- f98b8476cd4a19dfc602ab95642ce08e53877d65 to use a process exit signal instead, on FreeBSD (PG12)
- TODO: We could do this on Windows, with help from the signal simulation thread
- TODO: A fallback solution that isn't quite as good but still pretty good for all other operating systems: thread
- we used to include a pipe in the poll() set we generally use for waiting
- 98a64d0bd713cb89e61bef6432befc4b7b5da59e to switch to epoll so that pipe isn't polled inside the kernel every time, on Linux (PG9.6)
- 815c2f0972c8386aba7c606f1ee6690d13b04db2 to switch to kqueue and EVFILT_PROC instead of polling the pipe, on *BSD + macOS (PG13)
epoll/kqueue setup/teardown
- we used to create and destroy temporary epoll/kqueue objects frequently
- 3347c982bab0dd56d5b6cb784521233ba2bbac27 to use long-lived WaitEventSet (PG14)
- TODO: more opportunities (see thread)
shm_open
- for parallel query, we allocate and free large chunks of temporary shared memory using POSIX shmem facilities
- 84b1c63ad41872792d47e523363fce1f0e230022 to preallocate a region up front and recycle that, rather than creating and destroying memory for every parallel query, though it's not enabled by default (PG14)
process titles
- on FreeBSD, we used to call setproctitle() multiple times for every statement
- 1bc180cd2acc55e31b61c4cc9ab4b07670a2566e to switch to setproctitle_fast(), which has no system call (PG12)
socket wait in request/response protocol
- currently we often do an extra non-blocking recvfrom() that fails with EAGAIN, followed by epoll_wait()/kevent() after sending a response and then waiting for the next query to arrive
- can we get rid of that extra system call? go straight to wait, if we predict that is most likely?
- with a local benchmark and a smallish number of threads, often the recvfrom() succeeds due to good timing, but in the real world with many threads and context switches or remote clients it's usually EAGAIN and then sleep -- hence desire for something adaptive
setitimer
- we used to call setitimer() for every statement when using statement_timeout, and other similar timers
- 09cf1d52267644cdbdb734294012cf1228745aaa to switch to a different algorithm that calls it much less frequently (PG14)
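One way to call setitimer() less often is to leave an already-armed timer alone whenever it will fire no later than the new deadline, and to re-arm only when it fires too early. The sketch below simulates that general idea with a counter standing in for the real system call. It is an assumption about the approach, not the code from the commit above, and every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct TimerSketch
{
    bool    armed;      /* is an OS timer currently pending? */
    int64_t fires_at;   /* when the pending OS timer will fire */
    int     arm_calls;  /* how many times we "called setitimer()" */
} TimerSketch;

/* Stand-in for setitimer(): just records the request. */
void
arm_os_timer(TimerSketch *ts, int64_t when)
{
    ts->armed = true;
    ts->fires_at = when;
    ts->arm_calls++;
}

/* Ask to be interrupted no later than 'deadline'. */
void
request_timeout(TimerSketch *ts, int64_t deadline)
{
    /* A pending timer that fires early enough needs no new system call. */
    if (ts->armed && ts->fires_at <= deadline)
        return;
    arm_os_timer(ts, deadline);
}

/*
 * Called when the OS timer fires. 'active_deadline' is the deadline that
 * currently matters, or -1 if none. Returns true if that deadline has
 * really been reached; otherwise re-arms for it if needed.
 */
bool
timer_fired(TimerSketch *ts, int64_t now, int64_t active_deadline)
{
    ts->armed = false;
    if (active_deadline < 0)
        return false;           /* nothing is waiting any more */
    if (active_deadline <= now)
        return true;            /* a genuine timeout */
    arm_os_timer(ts, active_deadline);  /* fired early: schedule again */
    return false;
}
```

With a stream of short statements that each set a slightly later deadline, request_timeout() returns early almost every time, and the system call happens roughly once per timeout period rather than once per statement.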
signals/latches
- can we make SetLatch(), WaitLatch() more efficient?
- c8f3bc2401e7df7b79bae39dd3511c91f825b6a4: don't send signals when the other side isn't even waiting; this avoids many signals (12% on make check)
- 6a2a70a02018d6362f9841cc2f499cc45405e86b, 6148656a0be1c6245fbcfcbbeb87541f1b173162: we don't need the self-pipe trick on modern systems
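The "don't send signals when the other side isn't even waiting" idea can be sketched with two flags and C11 atomics. This is a simplified illustration with hypothetical names, not PostgreSQL's latch code. The setter publishes is_set and then checks whether the owner might be asleep; the owner announces that it may sleep and then re-checks is_set. With sequentially consistent atomics at least one side always notices the other, so no wakeup is lost.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct LatchSketch
{
    atomic_bool is_set;         /* has someone set the latch? */
    atomic_bool maybe_sleeping; /* might the owner be blocked in the kernel? */
    atomic_int  wakeups_sent;   /* counts simulated signals, for illustration */
} LatchSketch;

/* Setter side: only pay for a wakeup if the owner might be asleep. */
void
sketch_set_latch(LatchSketch *latch)
{
    if (atomic_exchange(&latch->is_set, true))
        return;                 /* already set, nothing more to do */
    if (!atomic_load(&latch->maybe_sleeping))
        return;                 /* owner is awake and will see is_set */
    atomic_fetch_add(&latch->wakeups_sent, 1);  /* stand-in for kill() */
}

/* Owner side: returns true if the owner really has to block. */
bool
sketch_prepare_to_wait(LatchSketch *latch)
{
    atomic_store(&latch->maybe_sleeping, true);
    if (atomic_load(&latch->is_set))
    {
        atomic_store(&latch->maybe_sleeping, false);
        return false;           /* latch was set meanwhile, don't sleep */
    }
    return true;
}

/* Owner side: call after waking up, before rechecking for work. */
void
sketch_finish_wait(LatchSketch *latch)
{
    atomic_store(&latch->maybe_sleeping, false);
    atomic_store(&latch->is_set, false);
}
```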
fsync
- we used to call fsync() on SLRU segments whenever we evicted an SLRU (CLOG, ...) page from its mini-buffer pool
- dee663f7843902535a15ae366cede8b4089f1144 to hand the work off to the checkpointer (PG14)
stat
- In various places we walk a directory tree recursively, stating every entry to find out if it's a file or a directory. We should use common extensions to avoid the need for that.
- 861c6e7c8e4dfdd842442dde47cc653764baff4f for Unix (PG14)
- 87e6ed7c8c6abb6dd62119259d2fd89169a04ac0 for Windows (PG14)
- In releases before 14, the coding of RemoveOldXLogFiles() would generate O(n^2) stat() calls while recycling files, which could be a significant storm in some workloads
- 5ae1572993ae8bf1f6c33a933915c07cc9bc0add is the fix (PG14)
sendto
- we used to send stats to the stats collector over a UDP socket!
- too many commits to list here, but PG15 replaced the stats collector with shared memory, removing many system calls and duplicates of statistical data in each process's memory
disk IO
- Synchronous block-at-a-time IO should be replaced with async scatter/gather, but that's a larger architectural project that doesn't belong on this list of micro-optimisation scale improvements. Work is in progress...
- Relation extension is currently done by writing zeroes; would fallocate() be better?
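To make the relation-extension question concrete, here is a small comparison of the two approaches. It uses posix_fallocate(), the portable interface, rather than Linux's fallocate(), and that substitution is a choice made for this sketch. SKETCH_BLCKSZ and the function names are hypothetical. Whether the second form is actually better depends on the filesystem and is exactly the open question above.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* illustrative block size */

/* Approach described in the text: extend by writing a block of zeroes. */
int
extend_by_writing_zeroes(int fd, off_t old_size)
{
    char    zeroes[SKETCH_BLCKSZ];

    memset(zeroes, 0, sizeof(zeroes));
    if (pwrite(fd, zeroes, sizeof(zeroes), old_size) != (ssize_t) sizeof(zeroes))
        return -1;
    return 0;
}

/* Alternative: ask the filesystem to allocate space without copying data. */
int
extend_by_fallocate(int fd, off_t old_size)
{
    /* posix_fallocate() returns 0 or an error number; it does not set errno. */
    return posix_fallocate(fd, old_size, SKETCH_BLCKSZ) == 0 ? 0 : -1;
}

/* Helper for checking the result: current file size, or -1 on error. */
off_t
sketch_file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}
```

Both calls leave the file one block longer, with the new block reading back as zeroes.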