Signals
Main themes: co-operative interrupts, latches as universal wakeup, robust supervisor
As discussed in the Vancouver Multithreading meeting, one of our subtasks will be to remove the dependency on process IDs, kill() and signal handlers.
- PostgreSQL used to use signal handlers to *do* things, but now it only uses them to set flags, to remember to do things later at the next CHECK_FOR_INTERRUPTS() (see the sketch after this list)
- The remaining cases of doing something more in a handler are:
- fast _exit(), but in a multithreaded future that would move out to container-level (backends won't have to exit one at a time)
- timer-related activities, but in a multithreaded future those would presumably move out to a new timer infrastructure
- PostgreSQL used to use signals to interrupt blocking slow system calls, but that was inherently racy, and now everything should be nonblocking
- Latches currently use signals as an implementation detail on Unix (but not always an actual handler), but that could change if required
- The postmaster tries to avoid relying on shared memory contents and is not allowed to use the latch infrastructure, because we want the postmaster to be able to coordinate a crash-restart even if memory is corrupted
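To make the flag-setting style concrete, here's a minimal sketch of the pattern from the first bullet above (names abridged; the real definitions live in miscadmin.h and postgres.c):

```c
#include <signal.h>

extern void ProcessInterrupts(void);

/* Flags set by signal handlers; volatile sig_atomic_t is what plain C
 * guarantees can be safely written from a handler. */
static volatile sig_atomic_t InterruptPending = 0;
static volatile sig_atomic_t QueryCancelPending = 0;

/* The handler only records that something happened... */
static void
cancel_handler(int signo)
{
	QueryCancelPending = 1;
	InterruptPending = 1;
}

/* ...and the actual work happens later, at a point known to be safe. */
#define CHECK_FOR_INTERRUPTS() \
	do { \
		if (InterruptPending) \
			ProcessInterrupts(); \
	} while (0)
```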
Survey of signal usage
Sender→receiver | Signal | Description | Replacement | Patch |
---|---|---|---|---|
administrator→postmaster | SIGTERM | smart shutdown | keep, and add control socket? | |
administrator→postmaster | SIGQUIT | immediate shutdown | keep, and add control socket? | |
administrator→postmaster | SIGINT | fast shutdown | keep, and add control socket? | |
administrator→postmaster | SIGHUP | reload | keep, and add control socket? | |
postmaster→all | SIGQUIT | immediate shutdown, quickdie(), _exit() | keep | |
postmaster→all | SIGTERM | exit at next CFI() | SendInterrupt(INTERRUPT_DIE)? | |
postmaster→avlauncher | SIGUSR2 | start autovacuum | ? | |
postmaster→checkpointer | SIGUSR2 | shutdown | ? | |
postmaster→walsenders | SIGUSR2 | finish and shutdown | ? | |
postmaster→pgarch | SIGUSR2 | shutdown | ? | |
postmaster→startup | SIGUSR2 | promote | ? | |
postmaster→client backend | SIGINT | cancel query | SendInterrupt(INTERRUPT_CANCEL)? | |
postmaster→all | SIGKILL | timed out while waiting for shutdown | keep | |
postmaster→any | SIGUSR1 | bgworker state change notification | SetLatch() (does this need to be 'robust'?) | SendInterrupt() proposal |
postmaster→all | SIGHUP | reload config | SendInterrupt(INTERRUPT_RELOAD_CONFIG)? | |
kernel→postmaster | SIGCHLD | child state change notification | keep, but refactor? | |
kernel→backend | SIGINFO | postmaster exited | keep for now (but see below for MT redesign ideas) | |
kernel→backend | SIGALRM | itimer | keep for now, but change handlers to do RaiseInterrupt(INTERRUPT_XXX) | SendInterrupt() proposal |
backend→backend | SIGURG | latch wakeup | keep for now, but later replace with ? | |
backend→backend | SIGUSR1 | SendProcSignal(pid, PROCSIG_XXX) | SendInterrupt(INTERRUPT_XXX, procno) | SendInterrupt() proposal |
backend→postmaster | SIGUSR1 | SendPostmasterSignal(PMSIGNAL_XXX) | ? | |
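For reference, a rough shape for the proposed API, inferred from the Replacement column above and the steps under "Thoughts on cache coherency" below (the authoritative signatures are whatever the SendInterrupt() patch actually says):

```c
/* One bit per reason, replacing the PROCSIG_XXX/PMSIGNAL_XXX vectors.
 * Values are illustrative. */
typedef enum
{
	INTERRUPT_DIE = 1 << 0,
	INTERRUPT_CANCEL = 1 << 1,
	INTERRUPT_RELOAD_CONFIG = 1 << 2
	/* ... */
} InterruptType;

/* Atomically OR a bit into another backend's pending bitmap, then SetLatch(). */
extern void SendInterrupt(InterruptType reason, int procno);

/* Same, for our own bitmap; safe from a signal handler given real atomics. */
extern void RaiseInterrupt(InterruptType reason);
```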
Thoughts on administrator→postmaster
The signals that pg_ctl and other control programs send are mostly Unix conventions, and it seems OK to keep them. But perhaps we should also have a control pipe?
On Windows, we already have a control socket for pretending to send SIGHUP etc to the postmaster. Maybe we should stop pretending Windows has Unix signals, and support a general control pipe instead? That way we could also implement richer communication, like "what state are you in? what is recovery progress?".
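If we went that route, the protocol could start out trivially line-oriented; a throwaway sketch, with entirely invented command names, framing and stubs:

```c
#include <string.h>
#include <unistd.h>

extern void start_smart_shutdown(void);		/* hypothetical stubs */
extern void start_fast_shutdown(void);
extern void reload_config(void);
extern void report_status(int sock);

/* Hypothetical: one newline-terminated command per message, replacing the
 * overloading of SIGTERM/SIGINT/SIGHUP on the postmaster. */
static void
handle_control_command(int sock)
{
	char		buf[64];
	ssize_t		n = read(sock, buf, sizeof(buf) - 1);

	if (n <= 0)
		return;
	buf[n] = '\0';

	if (strcmp(buf, "shutdown smart\n") == 0)
		start_smart_shutdown();		/* what SIGTERM means today */
	else if (strcmp(buf, "shutdown fast\n") == 0)
		start_fast_shutdown();		/* what SIGINT means today */
	else if (strcmp(buf, "reload\n") == 0)
		reload_config();			/* what SIGHUP means today */
	else if (strcmp(buf, "status\n") == 0)
		report_status(sock);		/* new: richer queries become possible */
}
```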
Thoughts on backend→postmaster
Idea #1: We could teach the postmaster to accept a shared latch. Some say it can't, because that would expose it to shared memory corruption risks, but in fact it already has some exposure through the PMSIGNAL_ vector. Perhaps it could have a "robust" latch mode that always takes the slow path (system call), so that it is no less robust than the current PMSIGNAL_ mechanism. Specifically, it should not be possible for one backend to trash memory in such a way that another backend can no longer wake the postmaster.
Idea #2: We could use a pipe/socketpair to talk to the postmaster and send it richer messages.
In general, a lot of messages to the postmaster would probably go away in a multithreaded model anyway, because it would no longer be in charge of starting new backends.
Thoughts on postmaster→backend
To replace the current SIGUSR1/SIGUSR2 signals, in an intermediate phase, we could decide that it is OK for the postmaster to use SetLatch(). If we are worried about shared memory corruption, we could decide that it has to use a "robust" SetLatch() in the multi-process model, meaning it doesn't check shmem; it just always uses the system call slow path.
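A sketch of what "robust" could mean here (same idea as #1 above), assuming the existing Unix latch implementation, with the struct abridged and memory barriers omitted:

```c
#include <signal.h>

typedef struct Latch
{
	volatile sig_atomic_t is_set;	/* abridged from the real struct Latch */
	int			owner_pid;
} Latch;

/* The real SetLatch() reads is_set and maybe_sleeping from shared memory
 * to skip the syscall when it can.  A "robust" variant trusts nothing it
 * reads there: corrupted shmem can then cause at worst a spurious wakeup,
 * never a lost one. */
static void
SetLatchRobust(Latch *latch)
{
	latch->is_set = 1;				/* best effort, for fast-path readers */
	kill(latch->owner_pid, SIGURG);	/* unconditionally take the slow path */
}
```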
In the multi-threaded future, some of those messages probably go away, replaced by direct communication between backends or by just calling pthread_create() yourself (you don't need to ask the postmaster to start a worker). Some communication is still needed, though. Should there be a single socketpair connecting the postmaster to the backend container process, through which it can coordinate eg promotion and shutdown? Perhaps that implies a special monitor thread inside the backend container process that forwards such communications, ie it receives eg "PROMOTE\n" through the pipe and generates SendInterrupt() and/or raw SetLatch() calls as required?
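A hypothetical sketch of that monitor thread; every name, constant and the wire format here are invented for illustration:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define INTERRUPT_DIE		(1 << 0)	/* illustrative values */
#define INTERRUPT_PROMOTE	(1 << 1)
#define STARTUP_PROC_NO		0			/* hypothetical procno of the startup process */

extern void SendInterrupt(int reason, int procno);	/* per the proposal */
extern void SendInterruptToAll(int reason);			/* hypothetical broadcast */

/* Monitor thread inside the backend container process: turn line-oriented
 * commands arriving on the postmaster's socketpair into in-process wakeups. */
static void *
postmaster_monitor_main(void *arg)
{
	FILE	   *control = arg;		/* fdopen()ed read end of the socketpair */
	char		line[128];

	while (fgets(line, sizeof(line), control))
	{
		if (strcmp(line, "PROMOTE\n") == 0)
			SendInterrupt(INTERRUPT_PROMOTE, STARTUP_PROC_NO);
		else if (strcmp(line, "SHUTDOWN\n") == 0)
			SendInterruptToAll(INTERRUPT_DIE);
	}

	/* EOF means the postmaster is gone: treat like postmaster death. */
	return NULL;
}
```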
Thoughts on latch wakeups with threads
Idea #1: We could do pthread_kill(pthread_t, SIGURG) on the sending side. Then for WAIT_USE_POLL give each backend its own self-pipe, for WAIT_USE_EPOLL it might already work with one shared signalfd or maybe they need one each (?), and for WAIT_USE_KQUEUE no change is needed. (And Windows just works, native events.)
Idea #2: pthread_kill(), then on the receiving side, WAIT_USE_POLL could switch to ppoll() (finally standardised in POSIX 2024; atomic signal masking; Solaris has it, and is currently the only user of WAIT_USE_POLL?), and for WAIT_USE_EPOLL we could switch to epoll_pwait2() (get rid of the signal pipe and use atomic signal masking). Again no change for WAIT_USE_KQUEUE. (And Windows just works, native events.)
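The attraction of idea #2 is that the mask change and the wait become one atomic step, closing the race where SIGURG arrives between checking the latch and blocking. A sketch of the ppoll() side, assuming SIGURG remains the wakeup signal and is kept blocked during normal execution:

```c
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <time.h>

/* ppoll() installs the given mask, waits, and restores the old mask, all
 * atomically, so the wakeup can't slip in between "check the latch" and
 * "go to sleep". */
static int
wait_with_ppoll(struct pollfd *fds, nfds_t nfds, const struct timespec *timeout)
{
	sigset_t	allow_urg;

	pthread_sigmask(SIG_SETMASK, NULL, &allow_urg);	/* copy current mask */
	sigdelset(&allow_urg, SIGURG);					/* ...minus SIGURG */

	return ppoll(fds, nfds, timeout, &allow_urg);	/* -1/EINTR on wakeup */
}
```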
Idea #3: We could give every backend a pipe, and write a byte to it. We could have done that already, but in a multi-process model it might create a MaxBackends^2 explosion of duplicated kernel descriptors. Should be OK for single-process multi-thread mode, and on Linux it replaces the current per-backend signalfd.
Idea #4: We could give every backend a pipe as a fallback, but use better options when available: With kqueue you can send a custom wakeup event directly to someone else's kqueue from inside the same process. For Linux we could replace the current signalfd that is in the epoll with an eventfd, which anyone can write into. For Linux we might eventually want to switch to a per-backend uring, in which case any thread could post a custom wakeup to any other backend's uring directly, or replace latches with futexes (uring can multiplex futex wait; note that latches basically are 1-bit futexes, and are also basically Windows' VMS-style event flags).
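The eventfd part of idea #4 might look like this (assuming one eventfd per backend, registered in that backend's epoll set in place of the current signalfd):

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Any thread can wake the owning backend with a plain write(); no signals. */
static int
make_latch_eventfd(void)
{
	return eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
}

static void
wake_backend(int target_eventfd)
{
	uint64_t	one = 1;		/* eventfd reads/writes are 8 bytes */

	(void) write(target_eventfd, &one, sizeof(one));
}
```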
Thoughts on timers and SIGALRM
Each backend currently has its own separate itimer to manage various timeouts. In an intermediate phase that could continue, but the timer handlers could just call RaiseInterrupt(INTERRUPT_xxx), as shown in the SendInterrupt() patch. (This remaining manipulation of the interrupt bitmap from inside a signal handler is the reason why the SendInterrupt() patch relies on --disable-atomics being dropped, because it's not safe to use lock-based emulation from inside a signal handler.)
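Concretely, the handler shrinks to an atomic OR plus a latch set; a sketch of what RaiseInterrupt() might boil down to inside a SIGALRM handler (the interrupt name and bitmap variable are illustrative):

```c
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>

#define INTERRUPT_STATEMENT_TIMEOUT (1 << 5)	/* illustrative bit */

extern _Atomic uint32_t MyPendingInterrupts;	/* this backend's bitmap */
extern struct Latch *MyLatch;
extern void SetLatch(struct Latch *latch);		/* documented signal-safe */

/* The atomic_fetch_or must be a genuine hardware atomic: lock-based
 * emulation could self-deadlock if SIGALRM interrupted the lock holder,
 * hence dropping --disable-atomics first. */
static void
statement_timeout_handler(int signo)
{
	atomic_fetch_or(&MyPendingInterrupts, INTERRUPT_STATEMENT_TIMEOUT);
	SetLatch(MyLatch);
}
```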
In a multi-threaded future, I think we'd probably need to invent our own timer monitor thread, that would maintain a schedule table and do SendInterrupt(INTERRUPT_XXX, target_procno) at the right times as requested, or something like that? Then SIGALRM would not be needed, but each backend could still configure its own timeout schedule separately. Right?
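A hypothetical shape for that timer thread, with every name invented for illustration (cancellation and rescheduling details omitted):

```c
#include <errno.h>
#include <pthread.h>
#include <time.h>

typedef struct TimerRequest		/* all names hypothetical */
{
	struct timespec deadline;	/* absolute, CLOCK_REALTIME */
	int			reason;
	int			procno;
} TimerRequest;

extern TimerRequest *peek_earliest_request(void);	/* eg from a min-heap */
extern void pop_earliest_request(void);
extern void SendInterrupt(int reason, int procno);

static pthread_mutex_t timer_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t timer_cv = PTHREAD_COND_INITIALIZER;

/* One thread for the whole process: sleep until the earliest registered
 * deadline, fire the requested interrupt, repeat.  Backends add entries
 * under timer_lock and signal timer_cv to make us reevaluate. */
static void *
timer_monitor_main(void *arg)
{
	pthread_mutex_lock(&timer_lock);
	for (;;)
	{
		TimerRequest *next = peek_earliest_request();
		struct timespec deadline;

		if (next == NULL)
		{
			pthread_cond_wait(&timer_cv, &timer_lock);
			continue;
		}
		deadline = next->deadline;	/* copy: the wait drops the lock */
		if (pthread_cond_timedwait(&timer_cv, &timer_lock,
								   &deadline) == ETIMEDOUT)
		{
			SendInterrupt(next->reason, next->procno);
			pop_earliest_request();
		}
		/* otherwise the schedule changed; loop and reevaluate */
	}
	return NULL;				/* not reached */
}
```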
Thoughts on cache coherency
Old way to interrupt another backend for reason XXX using SendProcSignal():
- sender: target_proc->pss_signalFlags[PROCSIG_XXX] = true
- sender: kill(target_proc->pid, SIGUSR1);
- receiver: handle SIGUSR1
- receiver: see pss_signalFlags[PROCSIG_XXX]
- receiver: clear pss_signalFlags[PROCSIG_XXX]
- receiver: XxxPending = true, InterruptPending = true
- receiver: SetLatch(MyLatch)
- receiver: CFI() sees InterruptPending, XxxPending
- receiver: ProcessXxx()
New way to interrupt another backend for reason XXX proposed by SendInterrupt() patch:
- sender: atomic_fetch_or(&target_proc->pending_interrupts, INTERRUPT_XXX) (implies memory_order_seq_cst store)
- sender: SetLatch(target_proc->latch)
- receiver: CFI() sees pending_interrupts != 0 (implies memory_order_relaxed load)
- receiver: atomic_fetch_and(&pending_interrupts, ~INTERRUPT_XXX)
- receiver: ProcessXxx()
(That's the case where the target is running (eg computing a hash table in the executor); if it is sleeping (eg in WaitLatch()) then it also wakes up via the latch slow path, and cache coherency is ensured by the context switches.)
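The receiver side is worth spelling out, because atomic_fetch_and returns the old value, which is what makes consume-and-clear race-free; a self-contained sketch using C11 atomics (the patch itself would use pg_atomic_* equivalents):

```c
#include <stdatomic.h>
#include <stdint.h>

#define INTERRUPT_XXX (1u << 0)		/* illustrative */

extern void ProcessXxx(void);

/* Receiver side, as it might look inside CHECK_FOR_INTERRUPTS();
 * `pending` is this backend's interrupt bitmap in shared memory. */
static void
consume_interrupts(_Atomic uint32_t *pending)
{
	if (atomic_load_explicit(pending, memory_order_relaxed) == 0)
		return;						/* the cheap common case */

	/* fetch_and returns the *old* value, so exactly one consumer sees the
	 * bit even if the sender sets it again concurrently. */
	uint32_t	was = atomic_fetch_and(pending, ~INTERRUPT_XXX);

	if (was & INTERRUPT_XXX)
		ProcessXxx();
}
```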
You might ask how long it takes the receiver to see pending_interrupts != 0 with a relaxed load. I don't think the language standard tells us, but I think under real cache coherency protocols (MESI et al), almost no time, as the cache line will be invalidated in all caches. The same question comes up with the current signal-based system: when will the signal be handled? No standard tells us, and in practice the answer was "at the next scheduler tick or system call" in years gone by when this system was developed, for example 10ms, but on recent kernels it's an inter-process interrupt. Which is faster, MESI invalidation or IPI? Beats me but it seems like the key point is that your core is actively firing electric signals at the target core, so I'm not too worried about that. I think.
On Windows, where the signals are fake, the traditional SIGUSR1 handler doesn't run until pgwin32_dispatch_queued_signals() is reached, probably in the next WaitLatch() call, so they are probably extremely lazy, even lazier than ancient Unix systems that wouldn't run signal handlers until you next reached a system call, and apparently no one has ever complained about that (?).
Thoughts on Windows code cleanup
Instead of giving every backend a named pipe to send fake signals to, we could delete all that fake signal stuff and keep just latches, and give both Unix and Windows a master pipe/control socket at the top level? Instead of generating a fake SIGCHLD, we could maybe add a way to consume WL_PROCESS_EXIT events via WaitEventSet, to abstract over Unix and Windows?
Thoughts on traces of reliance on EINTR in socket calls
There are a couple of leftover bits that still use blocking socket I/O. One I know of: RADIUS authentication. That's racy (one of the main problems latch multiplexing fixed) and we have to get rid of it. Then we can rip out most of the horrible socket wrapper code for Windows, which is known to be buggy.