Syscall Reduction

From PostgreSQL wiki

In the past few releases we've done a lot of work to remove unnecessary system calls made by PostgreSQL, but there are plenty more opportunities. This page is a log of what has been done and a to-do list of what remains.

lseek

polling for unexpected postmaster exit

epoll/kqueue setup/teardown

shm_open

  • for parallel query, we allocate and free large chunks of temporary shared memory using POSIX shmem facilities
    • commit 84b1c63ad41872792d47e523363fce1f0e230022 added an option to preallocate a region up front and recycle it, rather than creating and destroying memory for every parallel query, though it's not enabled by default (PG14)
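
The preallocate-and-recycle idea can be sketched roughly as below. This is a deliberately simplified illustration, not PostgreSQL's implementation: the names ShmPool, pool_init, pool_alloc and pool_reset are hypothetical, an anonymous shared mapping stands in for the preallocated region, and it assumes a single user that releases everything at once.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical pool: one region mapped up front, handed out in chunks. */
typedef struct ShmPool
{
    char   *base;               /* start of the preallocated region */
    size_t  size;               /* total bytes in the region */
    size_t  used;               /* bytes handed out since the last reset */
} ShmPool;

/* Map the region once (a single mmap() call). Returns 0 on success. */
int
pool_init(ShmPool *p, size_t size)
{
    void *m = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (m == MAP_FAILED)
        return -1;
    p->base = m;
    p->size = size;
    p->used = 0;
    return 0;
}

/*
 * Hand out a chunk without any system call. Returns NULL when the pool is
 * exhausted; a caller would then fall back to creating a fresh segment.
 */
void *
pool_alloc(ShmPool *p, size_t n)
{
    void *chunk;

    assert(p->base != NULL);
    if (n > p->size)
        return NULL;
    n = (n + 63) & ~(size_t) 63;    /* round up to a multiple of 64 bytes */
    if (n > p->size - p->used)
        return NULL;
    chunk = p->base + p->used;
    p->used += n;
    return chunk;
}

/* Recycle the whole region for the next user, again without a system call. */
void
pool_reset(ShmPool *p)
{
    p->used = 0;
}
```

After pool_init(), each allocate/reset cycle touches only process memory, whereas creating and destroying a POSIX segment each time costs several system calls per cycle.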

process titles

socket wait in request/response protocol

  • currently, after sending a response, we often do an extra non-blocking recvfrom() that fails with EAGAIN, and only then call epoll_wait()/kevent() to wait for the next query to arrive
    • can we get rid of that extra system call? go straight to wait, if we predict that is most likely?
    • with a local benchmark and a smallish number of threads, the recvfrom() often succeeds due to good timing, but in the real world, with many threads and context switches or remote clients, it usually fails with EAGAIN and then we sleep, hence the desire for something adaptive
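
One possible shape for "something adaptive" is sketched below. It is only an illustration of the idea, not a proposed patch: RecvPredictor, should_try_recv_first, record_outcome, recv_next_message and the two tuning constants are all hypothetical, and poll() stands in for epoll_wait()/kevent() to keep the example portable. After a few consecutive EAGAIN results it skips the optimistic read for a while and goes straight to waiting, then probes again.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>

#define MISS_LIMIT   3          /* consecutive EAGAINs before we stop trying */
#define SKIP_ROUNDS  8          /* calls that go straight to the wait */

typedef struct RecvPredictor
{
    int misses;                 /* consecutive optimistic reads that failed */
    int skip_left;              /* remaining calls that skip the optimistic read */
} RecvPredictor;

/* Decide whether this call should attempt a read before waiting. */
bool
should_try_recv_first(RecvPredictor *p)
{
    if (p->skip_left > 0)
    {
        p->skip_left--;
        return false;
    }
    return true;
}

/* Feed back whether the optimistic read found data. */
void
record_outcome(RecvPredictor *p, bool data_was_ready)
{
    if (data_was_ready)
        p->misses = 0;
    else if (++p->misses >= MISS_LIMIT)
    {
        p->skip_left = SKIP_ROUNDS;
        p->misses = MISS_LIMIT - 1;     /* one more miss after probing skips again */
    }
}

/* Read the next chunk from fd, choosing read-first or wait-first adaptively. */
ssize_t
recv_next_message(int fd, void *buf, size_t len, RecvPredictor *p)
{
    ssize_t n;

    if (should_try_recv_first(p))
    {
        do
            n = recv(fd, buf, len, MSG_DONTWAIT);
        while (n < 0 && errno == EINTR);
        if (n >= 0)
        {
            record_outcome(p, true);
            return n;
        }
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;
        record_outcome(p, false);       /* that system call was wasted */
    }
    for (;;)
    {
        struct pollfd pfd = {.fd = fd, .events = POLLIN};

        if (poll(&pfd, 1, -1) < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;
        }
        n = recv(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0)
            return n;
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;
    }
}
```

In wait-first mode this sketch cannot tell whether the data had already arrived, which is why it falls back to periodic probing; a real design might prefer a different feedback signal.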

setitimer

  • we used to call setitimer() for every statement when using statement_timeout, and other similar timers

signals/latches

fsync

  • we used to call fsync() on SLRU segments whenever we evicted an SLRU (CLOG, ...) page from its mini-buffer pool

stat

  • In releases before 14, the coding of RemoveOldXLogFiles() would generate O(n^2) stat() calls while recycling files, which could be a significant storm in some workloads
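
To see how a recycling loop can end up quadratic, consider the toy model below. It is an assumption-laden simulation rather than the actual RemoveOldXLogFiles() code: FakeWalDir, slot_taken, install_segment and the two recycle functions are hypothetical, and each slot_taken() call stands in for one stat(). If every recycled file searches for a free target name starting from the same segment number, the i-th file probes i names that are already taken; carrying the search position forward makes it one probe per file.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SEGS 64

/* Toy stand-in for a WAL directory: which segment names already exist. */
typedef struct FakeWalDir
{
    bool exists[MAX_SEGS];
    long probes;                /* simulated stat() calls */
} FakeWalDir;

/* Stand-in for stat(): counts the call and reports whether the name exists. */
static bool
slot_taken(FakeWalDir *d, int segno)
{
    d->probes++;
    return d->exists[segno];
}

/* Install a recycled file at the first free name at or after *segno. */
static bool
install_segment(FakeWalDir *d, int *segno)
{
    while (*segno < MAX_SEGS)
    {
        if (!slot_taken(d, *segno))
        {
            d->exists[*segno] = true;
            return true;
        }
        (*segno)++;
    }
    return false;
}

/* Restart the search from first_segno for every file: O(n^2) probes. */
long
recycle_restarting(FakeWalDir *d, int first_segno, int nfiles)
{
    d->probes = 0;
    for (int i = 0; i < nfiles; i++)
    {
        int segno = first_segno;

        if (!install_segment(d, &segno))
            break;
    }
    return d->probes;
}

/* Remember where the last file was installed: O(n) probes. */
long
recycle_with_cursor(FakeWalDir *d, int first_segno, int nfiles)
{
    int segno = first_segno;

    d->probes = 0;
    for (int i = 0; i < nfiles; i++)
    {
        if (!install_segment(d, &segno))
            break;
        segno++;                /* the name just used is known to be taken */
    }
    return d->probes;
}
```

In this model, recycling 10 files into an empty range costs 55 probes when restarting and 10 when the position is carried forward.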

sendto

  • we used to send stats to the stats collector over a UDP socket!
  • too many commits to list here, but PG15 replaced the stats collector with shared memory, removing many system calls and duplicates of statistical data in each process's memory

disk IO

  • Synchronous block-at-a-time IO should be replaced with async scatter/gather, but that's a larger architectural project that doesn't belong on this list of micro-optimisation scale improvements. Work is in progress...
  • Relation extension is currently done by writing zeroes; would fallocate() be better?
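
For comparison, the two ways of extending a file can be sketched as follows. This is a standalone illustration, not PostgreSQL code: BLOCK_SIZE, extend_by_writing_zeroes, extend_by_fallocate and file_size are hypothetical names, and the example uses the portable posix_fallocate() rather than Linux's fallocate(). Writing zeroes costs one pwrite() per block; posix_fallocate() asks for the whole range in one call, although on file systems without native support the C library may emulate it by writing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

/* Extend the file by nblocks zero-filled blocks, one pwrite() per block. */
int
extend_by_writing_zeroes(int fd, off_t old_size, int nblocks)
{
    static const char zeroes[BLOCK_SIZE];   /* static storage: all zero bytes */

    for (int i = 0; i < nblocks; i++)
    {
        off_t  off = old_size + (off_t) i * BLOCK_SIZE;
        size_t done = 0;

        while (done < BLOCK_SIZE)
        {
            ssize_t n = pwrite(fd, zeroes + done, BLOCK_SIZE - done,
                               off + (off_t) done);

            if (n < 0)
            {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += (size_t) n;
        }
    }
    return 0;
}

/* Extend the file by nblocks blocks with a single posix_fallocate() call. */
int
extend_by_fallocate(int fd, off_t old_size, int nblocks)
{
    int rc;

    assert(nblocks > 0);
    do
        rc = posix_fallocate(fd, old_size, (off_t) nblocks * BLOCK_SIZE);
    while (rc == EINTR);
    if (rc != 0)
    {
        errno = rc;             /* posix_fallocate returns the error number */
        return -1;
    }
    return 0;
}

/* Current size of the file in bytes, or -1 on error. */
off_t
file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}
```

Both functions leave the file the same length with the new range reading as zeroes; whether the single call is actually cheaper or fragments less depends on the file system, which is the open question in the item above.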