Difference between revisions of "Syscall Reduction"

From PostgreSQL wiki
m (Macdice moved page SyscallRemoval to Syscall Reduction)
(17 intermediate revisions by the same user not shown)
 
In the past few releases we've done a lot of work to remove unnecessary system calls made by PostgreSQL, but there are plenty more opportunities.  Here is a log and todo list about that.
  
= lseek =

* we used to do all disk IO with lseek+read/write, but we switched to pread/pwrite
** {{PgCommitURL|3fd2a7932ef0708dda57369bb20c0499d905cc82}} for portable interface (PG12)
** {{PgCommitURL|c24dcd0cfd949bdf245814c4c2b3df828ee7db36}} for main data files and WAL (PG12)
** {{PgCommitURL|0dc8ead46363fec6f621a12c7e1f889ba73b55a9}} for WAL reader (PG13)
** {{PgCommitURL|701a51fd4e01dbbd02067d8f01905a04bc571131}} for miscellaneous other places (PG13)
** {{PgCommitURL|2fd2effc50824a8775a088435a13f47b7a6f3b94}} for base backup (PG14)
** TODO: Some cases remain in the SLRU code [https://www.postgresql.org/message-id/flat/CA%2BhUKGJ%2BoHhnvqjn3%3DHro7xu-YDR8FPr0FL6LF35kHRX%3D_bUzg%40mail.gmail.com thread]
* we use lseek(SEEK_END) to probe the size of relations
** {{PgCommitURL|c5315f4f44843c20ada876fdb0d0828795dfbdf5}} to cache that instead of checking on every block referenced in recovery (PG14)
** TODO: we need to extend that caching to cover non-recovery paths too, because right now we probe relation sizes every time we plan or begin a sequential scan [https://www.postgresql.org/message-id/flat/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com thread]
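To illustrate the lseek+read versus pread difference, here is a minimal sketch. PostgreSQL itself is written in C; Python is used only because its os module wraps the same system calls on Unix. The function names and the BLOCK_SIZE constant are assumptions made up for this example, not PostgreSQL code.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # illustrative block size, assumed for this sketch


def read_block_seek(fd: int, blockno: int) -> bytes:
    """Old pattern: two system calls per block (lseek, then read)."""
    os.lseek(fd, blockno * BLOCK_SIZE, os.SEEK_SET)
    return os.read(fd, BLOCK_SIZE)


def read_block_pread(fd: int, blockno: int) -> bytes:
    """New pattern: one system call that carries the offset with it."""
    return os.pread(fd, BLOCK_SIZE, blockno * BLOCK_SIZE)


def demo_pread() -> bool:
    """Write three blocks to a temporary file and read block 1 both ways."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        for i in range(3):
            os.pwrite(fd, bytes([i]) * BLOCK_SIZE, i * BLOCK_SIZE)
        expected = b"\x01" * BLOCK_SIZE
        return read_block_seek(fd, 1) == expected and read_block_pread(fd, 1) == expected
```

Both helpers return the same bytes; the pread version simply halves the number of system calls per block and does not depend on the file position.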
 
= polling for unexpected postmaster exit =

* we used to check if the postmaster had gone away every time through the recovery loop by reading from a pipe
** {{PgCommitURL|9f09529952ac41a10e5874cba745c1c24e67ac79}} to use a process exit signal instead, on Linux (PG12)
** {{PgCommitURL|f98b8476cd4a19dfc602ab95642ce08e53877d65}} to use a process exit signal instead, on FreeBSD (PG12)
* we used to include a pipe in the poll() set we generally use for waiting
** {{PgCommitURL|98a64d0bd713cb89e61bef6432befc4b7b5da59e}} to switch to epoll so that pipe isn't polled inside the kernel every time, on Linux (PG9.6)
** {{PgCommitURL|815c2f0972c8386aba7c606f1ee6690d13b04db2}} to switch to kqueue and EVFILT_PROC instead of polling the pipe, on *BSD + macOS (PG13)
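The pipe-based check mentioned above can be sketched roughly as follows, to show why it costs one system call per check. This is a simplified Python model, not PostgreSQL's code; make_death_pipe and parent_is_alive are hypothetical names, and in a real setup the write end would be held open by the parent process rather than by the same process.

```python
import os


def make_death_pipe():
    """Create the pipe; the parent keeps write_fd open for its whole life."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    return read_fd, write_fd


def parent_is_alive(read_fd: int) -> bool:
    """One read() system call per check: EAGAIN means alive, EOF means gone."""
    try:
        data = os.read(read_fd, 1)
    except BlockingIOError:
        return True       # nothing to read, a write end is still open
    return data != b""    # b"" is EOF: every write end has been closed
```

A process exit signal removes this per-iteration read() entirely, because the kernel notifies the child once instead of being asked repeatedly.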
 
= epoll/kqueue setup/teardown =

* we used to create and destroy temporary epoll/kqueue objects frequently
** {{PgCommitURL|3347c982bab0dd56d5b6cb784521233ba2bbac27}} to use long-lived WaitEventSet (PG14)
** TODO: more opportunities [https://www.postgresql.org/message-id/flat/CA+hUKGJAC4Oqao=qforhNey20J8CiG2R=oBPqvfR0vOJrFysGw@mail.gmail.com thread]
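As a rough illustration of the temporary versus long-lived pattern, here is a sketch using Python's selectors module as a stand-in for WaitEventSet. LongLivedWaitSet and wait_readable_temporary are hypothetical names invented for this example.

```python
import selectors
import socket


class LongLivedWaitSet:
    """Create the kernel event object once and reuse it for every wait."""

    def __init__(self, sock: socket.socket):
        # DefaultSelector picks epoll on Linux and kqueue on *BSD/macOS.
        self._selector = selectors.DefaultSelector()
        self._selector.register(sock, selectors.EVENT_READ)

    def wait_readable(self, timeout: float) -> bool:
        """Wait with no per-call setup or teardown of the event object."""
        return bool(self._selector.select(timeout))

    def close(self) -> None:
        self._selector.close()


def wait_readable_temporary(sock: socket.socket, timeout: float) -> bool:
    """The costlier pattern: build and destroy the event object on every wait."""
    with selectors.DefaultSelector() as selector:
        selector.register(sock, selectors.EVENT_READ)
        return bool(selector.select(timeout))
```

On Linux the temporary variant typically costs an epoll_create, an epoll_ctl and a close around every wait, while the long-lived variant pays those once.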
 
= shm_open =

* for parallel query, we allocate and free large chunks of temporary shared memory using POSIX shmem facilities
** {{PgCommitURL|84b1c63ad41872792d47e523363fce1f0e230022}} to preallocate a region up front and recycle that, rather than creating and destroying memory for every parallel query, though it's not enabled by default (PG14)
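The "preallocate and recycle" idea can be sketched as a toy allocator over a single mapping. This is only a model under simplifying assumptions (fixed-size slots, one process, an anonymous mmap instead of POSIX shared memory); PreallocatedRegion and its methods are hypothetical and do not reflect how PostgreSQL actually manages the region.

```python
import mmap


class PreallocatedRegion:
    """Hypothetical fixed-slot allocator over one mapping created up front."""

    def __init__(self, slot_size: int, slot_count: int):
        self.slot_size = slot_size
        # One anonymous mapping for the lifetime of the object.
        self._region = mmap.mmap(-1, slot_size * slot_count)
        self._free_slots = list(range(slot_count))

    def allocate(self):
        """Return a slot number, or None when the region is exhausted."""
        return self._free_slots.pop() if self._free_slots else None

    def view(self, slot: int) -> memoryview:
        """Writable view of one slot inside the shared mapping."""
        start = slot * self.slot_size
        return memoryview(self._region)[start:start + self.slot_size]

    def free(self, slot: int) -> None:
        """Recycle the slot; no memory is unmapped."""
        self._free_slots.append(slot)
```

Allocation and release become list operations in user space, so no mapping is created or destroyed per query. A real design also needs a fallback for when the region is exhausted, which this sketch only signals by returning None.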
 
= process titles =

* on FreeBSD, we used to call setproctitle() multiple times for every statement
** {{PgCommitURL|1bc180cd2acc55e31b61c4cc9ab4b07670a2566e}} to switch to setproctitle_fast(), which has no system call (PG12)
 
= socket wait in request/response protocol =

* currently we often do an extra non-blocking recvfrom() that fails with EAGAIN, followed by epoll_wait()/kevent() after sending a response and then waiting for the next query to arrive
** can we get rid of that extra system call?  go straight to wait, if we predict that is most likely?
** with a local benchmark and a smallish number of threads, often the recvfrom() succeeds due to good timing, but in the real world with many threads and context switches or remote clients it's usually EAGAIN and then sleep -- hence desire for something adaptive
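One possible shape for "something adaptive" is a small predictor that remembers recent outcomes. The sketch below is purely hypothetical: the class name, the window size and the threshold are all assumptions for illustration, and nothing like it is described in the linked discussions.

```python
from collections import deque


class RecvPredictor:
    """Hypothetical heuristic: remember whether recent eager reads found data."""

    def __init__(self, window: int = 16, threshold: float = 0.5):
        self._outcomes = deque(maxlen=window)  # True = eager recv returned data
        self._threshold = threshold

    def record(self, recv_found_data: bool) -> None:
        """Remember one outcome; the oldest falls out once the window is full."""
        self._outcomes.append(recv_found_data)

    def should_try_recv_first(self) -> bool:
        """Try recv first unless recent history says it mostly hits EAGAIN."""
        if not self._outcomes:
            return True  # no history yet, keep the existing behaviour
        hit_rate = sum(self._outcomes) / len(self._outcomes)
        return hit_rate >= self._threshold
```

A weakness of this sketch is that once it starts skipping the eager read it stops collecting evidence, so a real design would need some way to re-probe occasionally.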
 
= itimer =

* the statement_timeout, the deadlock detector and various other things use SIGALRM, but setting that up for every statement is known to cost several percent performance in benchmarks
** TODO: can we skip resetting the timer if there is already a shorter one installed, and then reset it when that one expires?  then perhaps we can call setitimer() very infrequently while using the statement_timeout feature and other things like that [https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru thread]
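The TODO above can be modelled without touching real timers. In this sketch the kernel timer is represented by a counter, so arm_calls stands in for the number of setitimer() calls; LazyTimer and its method names are hypothetical, and times are plain numbers supplied by the caller.

```python
class LazyTimer:
    """Hypothetical model: arm_calls counts would-be setitimer() system calls."""

    def __init__(self):
        self.armed_expiry = None   # when the kernel timer will fire, if armed
        self.deadline = None       # the deadline we actually care about
        self.arm_calls = 0

    def _arm(self, expiry) -> None:
        self.armed_expiry = expiry
        self.arm_calls += 1        # stands in for one setitimer() call

    def set_deadline(self, deadline) -> None:
        """Record the new deadline; only touch the kernel timer if needed."""
        self.deadline = deadline
        if self.armed_expiry is None or self.armed_expiry > deadline:
            self._arm(deadline)
        # Otherwise an earlier expiry is already pending; it fires first
        # and on_timer_fired() re-arms for the real deadline.

    def clear_deadline(self) -> None:
        """Statement finished: forget the deadline without a system call."""
        self.deadline = None

    def on_timer_fired(self, now) -> bool:
        """Called when the timer signal arrives; True means the deadline passed."""
        self.armed_expiry = None
        if self.deadline is None:
            return False           # stale expiry, nothing is waiting on it
        if now >= self.deadline:
            self.deadline = None
            return True
        self._arm(self.deadline)   # fired early: re-arm for the real deadline
        return False
```

With a 10 second timeout and a thousand short statements, this model arms the timer once up front and once more when the stale expiry fires, instead of once per statement. The price is an occasional spurious wakeup.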
 
= signals/latches =

* can we make SetLatch(), WaitLatch() more efficient?
** Andres Freund has described an alternative design that doesn't require the self-pipe trick (writing a byte in the signal handler, reading it later), and doesn't require signals to be sent at all if the recipient isn't sleeping
** It sounds like this idea would involve removing the timing race by blocking SIGUSR1, and then atomically unblocking it on Linux with epoll_pwait(), and handling on *BSD/macOS with EVFILT_SIGNAL; you'd probably need the self-pipe fallback path for other systems?
** SHOW US THE CODE
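For readers unfamiliar with the self-pipe trick that the alternative design would remove, here is a minimal sketch of it. This is not the proposed replacement and not PostgreSQL's latch code; SelfPipeLatch is a hypothetical name, the example assumes a Unix system, and Python runs signal handlers outside true signal context, so it only shows the shape of the write-then-read round trip.

```python
import os
import select
import signal


class SelfPipeLatch:
    """The self-pipe trick: the handler writes a byte, the waiter reads it."""

    def __init__(self):
        self._read_fd, self._write_fd = os.pipe()
        os.set_blocking(self._read_fd, False)
        os.set_blocking(self._write_fd, False)

    def handle_signal(self, signum, frame) -> None:
        """Signal handler body: one write() to wake any sleeping waiter."""
        try:
            os.write(self._write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe already full, so a wakeup is pending anyway

    def install(self, signum: int = signal.SIGUSR1) -> None:
        """Route the given signal to handle_signal (main thread only)."""
        signal.signal(signum, self.handle_signal)

    def wait(self, timeout: float) -> bool:
        """Sleep until woken or timed out, then drain the pipe with read()."""
        readable, _, _ = select.select([self._read_fd], [], [], timeout)
        if not readable:
            return False
        try:
            while os.read(self._read_fd, 4096):
                pass
        except BlockingIOError:
            pass  # drained
        return True
```

Each wakeup therefore costs a write() in the handler and at least one read() in the waiter, on top of the signal delivery itself, which is the overhead the alternative design aims to avoid.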
 
= disk IO =

* Synchronous block-at-a-time IO should be replaced with async scatter/gather, but that's a larger architectural project that doesn't belong on this list of micro-optimisation scale improvements.  Work is in progress...
* Relation extension is currently done by writing zeroes; would fallocate() be better?
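To make the relation extension question concrete, here is a sketch comparing the two ways of growing a file. It uses os.posix_fallocate(), which is the portable POSIX call exposed by Python on Linux and similar systems, as a stand-in for Linux's fallocate(); the two calls are related but not identical. The function names and BLOCK_SIZE are assumptions for this example, and it says nothing about which approach performs better.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # illustrative block size, assumed for this sketch


def extend_by_writing_zeroes(fd: int, old_size: int, nblocks: int) -> None:
    """Extend the file by pushing real zero-filled blocks through pwrite()."""
    zero_block = bytes(BLOCK_SIZE)
    for i in range(nblocks):
        os.pwrite(fd, zero_block, old_size + i * BLOCK_SIZE)


def extend_by_fallocate(fd: int, old_size: int, nblocks: int) -> None:
    """Ask the filesystem to reserve the space in a single call."""
    os.posix_fallocate(fd, old_size, nblocks * BLOCK_SIZE)


def demo_extend() -> tuple:
    """Extend one temporary file both ways and report its size after each."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        extend_by_writing_zeroes(fd, 0, 4)
        first = os.fstat(fd).st_size
        extend_by_fallocate(fd, first, 4)
        return first, os.fstat(fd).st_size
```

The zero-writing version issues one system call per block and copies data through the kernel, while the fallocate version issues one call for the whole range. Whether that is a net win depends on the filesystem.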

Revision as of 03:41, 1 August 2020
