Signals


Main themes: co-operative interrupts, latches as universal wakeup, robust supervisor

As discussed in the Vancouver Multithreading meeting, one of our subtasks will be to remove the dependency on process IDs, kill() and signal handlers.

  • PostgreSQL used to use signal handlers to *do* things, but now it only uses them to set flags that are acted on later, at the next CHECK_FOR_INTERRUPTS() (see the sketch after this list)
  • The remaining cases of doing something more in a handler are:
    • fast _exit(), but in a multithreaded future that would move out to container-level (backends won't have to exit one at a time)
    • timer-related activities, but in a multithreaded future those would presumably move out to a new timer infrastructure
  • PostgreSQL used to use signals to interrupt blocking slow system calls, but that was inherently racy, and now everything should be nonblocking
  • Latches currently use signals as an implementation detail on Unix (but not always an actual handler), but that could change if required
  • The postmaster tries to avoid relying on shared memory contents and is not allowed to use the latch infrastructure, because we want the postmaster to be able to coordinate a crash-restart even if memory is corrupted
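
To make the first bullet concrete, here is a minimal standalone sketch of the flag-setting pattern; the flag and handler names are invented for illustration, and the real code uses globals like InterruptPending and QueryCancelPending checked by CHECK_FOR_INTERRUPTS():

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Invented flag for illustration; the real globals are InterruptPending & friends. */
static volatile sig_atomic_t cancel_pending = 0;

static void
handle_sigint(int signo)
{
    /* Handlers only record that something happened; no real work here. */
    cancel_pending = 1;
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigint;
    sigaction(SIGINT, &sa, NULL);

    for (;;)
    {
        sleep(1);                   /* ... some long-running work ... */

        /* The flag is acted on at a safe point, a la CHECK_FOR_INTERRUPTS(). */
        if (cancel_pending)
        {
            cancel_pending = 0;
            printf("handling the cancel request now, outside the handler\n");
        }
    }
}
```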

Survey of signal usage

Sender→receiver | Signal | Description | Replacement | Patch
administrator→postmaster | SIGTERM | smart shutdown | keep, and add control socket? |
administrator→postmaster | SIGQUIT | immediate shutdown | keep, and add control socket? |
administrator→postmaster | SIGINT | fast shutdown | keep, and add control socket? |
administrator→postmaster | SIGHUP | reload | keep, and add control socket? |
postmaster→all | SIGQUIT | immediate shutdown, quickdie(), _exit() | keep |
postmaster→all | SIGTERM | exit at next CFI() | SendInterrupt(INTERRUPT_DIE)? |
postmaster→avlauncher | SIGUSR2 | start autovacuum | ? |
postmaster→checkpointer | SIGUSR2 | shutdown | ? |
postmaster→walsenders | SIGUSR2 | finish and shutdown | ? |
postmaster→pgarch | SIGUSR2 | shutdown | ? |
postmaster→startup | SIGUSR2 | promote | ? |
postmaster→client backend | SIGINT | cancel query | SendInterrupt(INTERRUPT_CANCEL)? |
postmaster→all | SIGKILL | timed out while waiting for shutdown | keep |
postmaster→any | SIGUSR1 | bgworker state change notification | SetLatch() (does this need to be 'robust'?) | SendInterrupt() proposal
postmaster→all | SIGHUP | reload config | SendInterrupt(INTERRUPT_RELOAD_CONFIG)? |
kernel→postmaster | SIGCHLD | child state change notification | keep, but refactor? |
kernel→backend | SIGINFO | postmaster exited | keep for now (but see below for MT redesign ideas) |
kernel→backend | SIGALRM | itimer | keep for now, but change handlers to do RaiseInterrupt(INTERRUPT_XXX) | SendInterrupt() proposal
backend→backend | SIGURG | latch wakeup | keep for now, but later replace with ? |
backend→backend | SIGUSR1 | SendProcSignal(pid, PROCSIG_XXX) | SendInterrupt(INTERRUPT_XXX, procno) | SendInterrupt() proposal
backend→postmaster | SIGUSR1 | SendPostmasterSignal(PMSIGNAL_XXX) | ? |

Thoughts on administrator→postmaster

The signals that pg_ctl and other control programs send are mostly Unix conventions and it seems OK to keep them. But perhaps we should also have a control pipe?

On Windows, we already have a control socket for pretending to send SIGHUP etc to the postmaster. Maybe we should stop pretending Windows has Unix signals, and support a general control pipe instead? That way we could also implement richer communication, like "what state are you in? what is recovery progress?".

Thoughts on backend→postmaster

Idea #1: We could teach the postmaster to accept a shared latch. Some say it can't because that would expose it to shared memory corruption risks, but in fact it already has some exposure through the PMSIGNAL_ vector. Perhaps it could have a "robust" latch mode that always takes the slow path (system call), so that it is no less robust than the current PMSIGNAL_ mechanism: it should not be possible for one backend to trash memory in such a way that it prevents another backend from waking the postmaster.

Idea #2: We could use a pipe/socketpair to talk to the postmaster and send it richer messages.
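
A minimal sketch of Idea #2, assuming the postmaster creates a socketpair before forking each backend; the message format here is made up purely for illustration:

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    int     sv[2];
    char    buf[64];

    /* Postmaster creates the pair before forking the backend. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;

    if (fork() == 0)
    {
        /* Backend: send a structured request instead of kill(getppid(), SIGUSR1). */
        const char *msg = "BGWORKER_STATE_CHANGE slot=3\n";     /* made-up format */

        close(sv[0]);
        write(sv[1], msg, strlen(msg));
        _exit(0);
    }

    /* Postmaster: read the message and act on it. */
    close(sv[1]);
    ssize_t n = read(sv[0], buf, sizeof(buf) - 1);

    if (n > 0)
    {
        buf[n] = '\0';
        printf("postmaster received: %s", buf);
    }
    wait(NULL);
    return 0;
}
```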

In general, a lot of messages to the postmaster would probably go away in a multithreaded model anyway, because it would no longer be in charge of starting new backends.

Thoughts on postmaster→backend

To replace the current SIGUSR1/SIGUSR2 signals, in an intermediate phase, we could decide that it is OK for the postmaster to use SetLatch(). If we are worried about shared memory corruption, we could decide that it has to use a "robust" SetLatch() in the multi-process model, meaning it doesn't check shmem, it just always uses the system call slow path.
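
A rough sketch of what a "robust" SetLatch() might look like under that assumption; the Latch fields here are simplified stand-ins for the real ones, and the point is only that the wakeup never depends on trusting a fast-path read of shared flags:

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified stand-in for the real Latch struct; only the fields used below. */
typedef struct Latch
{
    volatile bool is_set;           /* lives in shared memory */
    volatile bool maybe_sleeping;   /* lives in shared memory */
    pid_t       owner_pid;          /* process to wake */
} Latch;

/* Normal SetLatch(): trusts shared memory so it can often skip the syscall. */
void
SetLatch(Latch *latch)
{
    if (latch->is_set)
        return;                     /* fast path: nothing to do */
    latch->is_set = true;
    if (latch->maybe_sleeping)
        kill(latch->owner_pid, SIGURG);
}

/*
 * Hypothetical SetLatchRobust() for use by the postmaster: don't base any
 * decision on shared memory contents; always take the kernel slow path, so a
 * backend that scribbled over the flags can't suppress the wakeup.
 */
void
SetLatchRobust(Latch *latch)
{
    latch->is_set = true;           /* best effort */
    kill(latch->owner_pid, SIGURG); /* the wakeup we actually rely on */
}
```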

In the multi-threaded future, some of those probably go away and are replaced with communication between backends, or with just doing pthread_create() yourself (you don't need to ask the postmaster to start a worker). Some communication is still needed. Should there be a single socketpair connecting the postmaster to the backend container process, through which it can coordinate eg promotion and shutdown? Perhaps that implies a special monitor thread inside the backend container process that forwards such communications, ie it receives eg "PROMOTE\n" through a pipe and generates SendInterrupt() and/or raw SetLatch() calls as required?
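
Here is a minimal sketch of such a monitor thread, assuming a line-oriented command pipe from the postmaster; SendInterruptToAll() is a made-up stub standing in for whatever the real fan-out would be:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define INTERRUPT_PROMOTE  1
#define INTERRUPT_SHUTDOWN 2

/* Stub standing in for the real interrupt fan-out (name is made up). */
static void
SendInterruptToAll(int reason)
{
    printf("waking all backend threads for interrupt %d\n", reason);
}

/* Monitor thread: turn line-oriented postmaster commands into in-process wakeups. */
static void *
postmaster_monitor(void *arg)
{
    FILE   *control = arg;          /* read end of the postmaster pipe */
    char    line[128];

    while (fgets(line, sizeof(line), control) != NULL)
    {
        if (strcmp(line, "PROMOTE\n") == 0)
            SendInterruptToAll(INTERRUPT_PROMOTE);
        else if (strcmp(line, "SHUTDOWN\n") == 0)
            SendInterruptToAll(INTERRUPT_SHUTDOWN);
    }
    return NULL;                    /* EOF: postmaster went away */
}

int
main(void)
{
    int         fds[2];
    pthread_t   monitor;

    if (pipe(fds) != 0)
        return 1;
    pthread_create(&monitor, NULL, postmaster_monitor, fdopen(fds[0], "r"));

    /* Pretend to be the postmaster for the purposes of this demo. */
    write(fds[1], "PROMOTE\n", 8);
    close(fds[1]);

    pthread_join(monitor, NULL);
    return 0;
}
```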

Thoughts on latch wakeups with threads

Idea #1: We could do pthread_kill(pthread_t, SIGURG) on the sending side. Then for WAIT_USE_POLL give each backend its own self-pipe, for WAIT_USE_EPOLL it might already work with one shared signalfd or maybe they need one each (?), and for WAIT_USE_KQUEUE no change is needed. (And Windows just works, native events.)

Idea #2: pthread_kill(), then on the receiving side, WAIT_USE_POLL could switch to ppoll() (finally standardised in POSIX 2024, with atomic signal masking; Solaris has it and is currently the only user of WAIT_USE_POLL?), and WAIT_USE_EPOLL could switch to epoll_pwait2() (get rid of the signal pipe and use atomic signal masking). Again no change for WAIT_USE_KQUEUE. (And Windows just works, native events.)
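
A standalone sketch of the receiving side of Idea #2 with ppoll(): SIGURG stays blocked except during the wait itself, so the "check flag, then sleep" race is closed without a self-pipe. The handler and flag names are invented; real code would integrate this into WaitEventSetWait(), and the sender would do pthread_kill(receiver, SIGURG).

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t latch_set = 0;

static void
handle_sigurg(int signo)
{
    latch_set = 1;                  /* a sender did pthread_kill(us, SIGURG) */
}

int
main(void)
{
    sigset_t    block_urg;
    sigset_t    wait_mask;
    struct sigaction sa;
    struct pollfd pfd = { .fd = 0 /* stand-in for the client socket */, .events = POLLIN };

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigurg;
    sigaction(SIGURG, &sa, NULL);

    /* Keep SIGURG blocked except while actually waiting. */
    sigemptyset(&block_urg);
    sigaddset(&block_urg, SIGURG);
    sigprocmask(SIG_BLOCK, &block_urg, &wait_mask);
    sigdelset(&wait_mask, SIGURG);

    for (;;)
    {
        if (latch_set)
        {
            latch_set = 0;
            printf("latch wakeup\n");
        }

        /*
         * ppoll() installs wait_mask and sleeps atomically, so a SIGURG that
         * arrives between the flag check above and the sleep still interrupts
         * the call with EINTR -- the race the self-pipe trick papers over.
         */
        if (ppoll(&pfd, 1, NULL, &wait_mask) > 0 && (pfd.revents & POLLIN))
        {
            char    buf[128];

            read(pfd.fd, buf, sizeof(buf));     /* consume socket data */
        }
    }
}
```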

Idea #3: We could give every backend a pipe, and write a byte to it. We could have done that already, but in a multi-process model it might create a MaxBackends^2 explosion of duplicated kernel descriptors. Should be OK for single-process multi-thread mode, and on Linux it replaces the current per-backend signalfd.

Idea #4: We could give every backend a pipe as a fallback, but use better options when available: With kqueue you can send a custom wakeup event directly to someone else's kqueue from inside the same process. For Linux we could replace the current signalfd that is in the epoll set with an eventfd, which anyone can write into. For Linux we might eventually want to switch to a per-backend uring, in which case any thread could post a custom wakeup to any other backend's uring directly, or replace latches with a futex (uring can multiplex futex wait; note that latches basically are 1-bit futexes, and are also basically Windows' VMS-style event flags).
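
A Linux-only sketch of the eventfd variant of Idea #4: any thread can write to the target's eventfd, and the owner sees it through its epoll set, much like setting and then resetting a latch:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int
main(void)
{
    /* Each backend would own one eventfd, registered in its epoll set. */
    int         event_fd = eventfd(0, EFD_NONBLOCK);
    int         epoll_fd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = event_fd };

    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &ev);

    /* "SetLatch" from any thread: bump the counter; the owner's epoll wakes up. */
    uint64_t    one = 1;

    write(event_fd, &one, sizeof(one));

    /* Owner side: wait, then drain the counter, like resetting the latch. */
    struct epoll_event out;

    if (epoll_wait(epoll_fd, &out, 1, -1) == 1 && out.data.fd == event_fd)
    {
        uint64_t    counter;

        read(event_fd, &counter, sizeof(counter));
        printf("woken, counter=%llu\n", (unsigned long long) counter);
    }
    return 0;
}
```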

Thoughts on timers and SIGALRM

Each backend currently has its own separate itimer to manage various timeouts. In an intermediate phase that could continue, but the timer handlers could just call RaiseInterrupt(INTERRUPT_xxx), as shown in the SendInterrupt() patch. (This remaining manipulation of the interrupt bitmap from inside a signal handler is the reason why the SendInterrupt() patch relies on --disable-atomics being dropped: it's not safe to use lock-based atomics emulation from inside a signal handler.)
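
A standalone sketch of that intermediate phase, assuming a SIGALRM handler that only ORs a bit into an atomic interrupt bitmap (the bit name is invented); note that the bitmap must be genuinely lock-free for this to be async-signal-safe, which is the --disable-atomics point above:

```c
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define INTERRUPT_STATEMENT_TIMEOUT (1U << 0)   /* made-up bit name */

/* Per-backend interrupt bitmap; must be lock-free to be touched in a handler. */
static _Atomic unsigned int pending_interrupts;

static void
handle_sigalrm(int signo)
{
    /* RaiseInterrupt()-style: just OR in a bit, nothing else in the handler. */
    atomic_fetch_or(&pending_interrupts, INTERRUPT_STATEMENT_TIMEOUT);
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigalrm;
    sigaction(SIGALRM, &sa, NULL);
    alarm(1);                       /* stand-in for the itimer machinery */

    for (;;)
    {
        pause();

        /* CHECK_FOR_INTERRUPTS()-style consumption. */
        unsigned int bits = atomic_fetch_and(&pending_interrupts,
                                             ~INTERRUPT_STATEMENT_TIMEOUT);

        if (bits & INTERRUPT_STATEMENT_TIMEOUT)
        {
            printf("statement timeout fired\n");
            return 0;
        }
    }
}
```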

In a multi-threaded future, I think we'd probably need to invent our own timer monitor thread, that would maintain a schedule table and do SendInterrupt(INTERRUPT_XXX, target_procno) at the right times as requested, or something like that? Then SIGALRM would not be needed, but each backend could still configure its own timeout schedule separately. Right?
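
A rough standalone sketch of such a timer monitor thread, with an invented schedule table and a stub SendInterrupt(); a real version would presumably sleep until the next deadline rather than polling on a fixed tick:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define MAX_TIMEOUTS 16

/* One scheduled timeout: when it fires and which backend/interrupt to hit. */
typedef struct
{
    struct timespec deadline;
    int         target_procno;
    int         interrupt_bit;
    int         armed;
} Timeout;

static Timeout  schedule[MAX_TIMEOUTS];
static pthread_mutex_t schedule_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stub for the real delivery call; the name follows the proposal above. */
static void
SendInterrupt(int interrupt_bit, int target_procno)
{
    printf("interrupt %d -> backend %d\n", interrupt_bit, target_procno);
}

/* Timer monitor thread: wake up once per tick and fire anything that is due. */
static void *
timer_monitor(void *arg)
{
    for (;;)
    {
        struct timespec now;
        struct timespec tick = {0, 10 * 1000 * 1000};   /* 10 ms granularity */

        clock_gettime(CLOCK_MONOTONIC, &now);
        pthread_mutex_lock(&schedule_lock);
        for (int i = 0; i < MAX_TIMEOUTS; i++)
        {
            if (schedule[i].armed &&
                (now.tv_sec > schedule[i].deadline.tv_sec ||
                 (now.tv_sec == schedule[i].deadline.tv_sec &&
                  now.tv_nsec >= schedule[i].deadline.tv_nsec)))
            {
                schedule[i].armed = 0;
                SendInterrupt(schedule[i].interrupt_bit, schedule[i].target_procno);
            }
        }
        pthread_mutex_unlock(&schedule_lock);
        nanosleep(&tick, NULL);
    }
    return NULL;
}

int
main(void)
{
    pthread_t   monitor;

    /* Arm a 1-second "statement timeout" for backend 7 (numbers are invented). */
    clock_gettime(CLOCK_MONOTONIC, &schedule[0].deadline);
    schedule[0].deadline.tv_sec += 1;
    schedule[0].target_procno = 7;
    schedule[0].interrupt_bit = 1;
    schedule[0].armed = 1;

    pthread_create(&monitor, NULL, timer_monitor, NULL);
    sleep(2);                       /* let the demo timeout fire */
    return 0;
}
```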

Thoughts on cache coherency

Old way to interrupt another backend for reason XXX using SendProcSignal():

  • sender: target_proc->pss_signalFlags[PROCSIG_XXX] = true
  • sender: kill(target_proc->pid, SIGUSR1);
  • receiver: handle SIGUSR1
  • receiver: see pss_signalFlags[PROCSIG_XXX]
  • receiver: clear pss_signalFlags[PROCSIG_XXX]
  • receiver: XxxPending = true, InterruptPending = true
  • receiver: SetLatch(MyLatch)
  • receiver: CFI() sees InterruptPending, XxxPending
  • receiver: ProcessXxx()

New way to interrupt another backend for reason XXX proposed by SendInterrupt() patch:

  • sender: atomic_fetch_or(&target_proc->pending_interrupts, INTERRUPT_XXX) (implies memory_order_seq_cst store)
  • sender: SetLatch(target_proc->latch)
  • receiver: CFI() sees pending_interrupts != 0 (implies memory_order_relaxed load)
  • receiver: atomic_fetch_and(&pending_interrupts, ~INTERRUPT_XXX)
  • receiver: ProcessXxx()

(That's the case where the target is running (eg computing a hash table in the executor), but if it is sleeping (eg in WaitLatch()) then it also wakes up due to the latch slow path, and then cache coherency is assured via the context switches.)
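
Here is a small standalone model of the new scheme using two threads and C11 atomics; the interrupt bit and the work loop are invented for illustration, and the SetLatch() call for the sleeping case is only indicated by a comment:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define INTERRUPT_CANCEL (1U << 0)      /* bit name borrowed from the table above */

/* Stand-in for the target backend's per-process interrupt bitmap. */
static _Atomic unsigned int pending_interrupts;

/* Sender side, a la SendInterrupt(INTERRUPT_CANCEL, procno). */
static void *
sender(void *arg)
{
    sleep(1);
    atomic_fetch_or(&pending_interrupts, INTERRUPT_CANCEL);    /* seq_cst by default */
    /* SetLatch(target->latch) would go here, for the case where the target sleeps. */
    return NULL;
}

int
main(void)
{
    pthread_t   tid;

    pthread_create(&tid, NULL, sender, NULL);

    /* Receiver: busy doing "executor work", polling cheaply at each CFI(). */
    for (;;)
    {
        /* ... one unit of work ... */
        if (atomic_load_explicit(&pending_interrupts, memory_order_relaxed) != 0)
        {
            unsigned int bits = atomic_fetch_and(&pending_interrupts,
                                                 ~INTERRUPT_CANCEL);

            if (bits & INTERRUPT_CANCEL)
            {
                printf("query cancelled\n");
                break;
            }
        }
    }
    pthread_join(tid, NULL);
    return 0;
}
```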

You might ask how long it takes the receiver to see pending_interrupts != 0 with a relaxed load. I don't think the language standard tells us, but I think under real cache coherency protocols (MESI et al), almost no time, as the cache line will be invalidated in all caches. The same question comes up with the current signal-based system: when will the signal be handled? No standard tells us, and in practice the answer was "at the next scheduler tick or system call" in years gone by when this system was developed, for example 10ms, but on recent kernels it's an inter-process interrupt. Which is faster, MESI invalidation or IPI? Beats me but it seems like the key point is that your core is actively firing electric signals at the target core, so I'm not too worried about that. I think.

On Windows, where the signals are fake, the traditional SIGUSR1 handler doesn't run until pgwin32_dispatch_queued_signals() is reached, probably in the next WaitLatch() call, so they are probably extremely lazy, even lazier than ancient Unix systems that wouldn't run signal handlers until you next reached a system call, and apparently no one has ever complained about that (?).

Thoughts on Windows code cleanup

Instead of giving every backend a named pipe to send fake signals to, we could delete all that fake signal stuff and keep just latches, and give both Unix and Windows a master pipe/control socket at the top level? Instead of generating a fake SIGCHLD, we could maybe add some way to consume WL_PROCESS_EXIT events through WaitEventSet, to abstract over Unix and Windows?

Thoughts on traces of reliance on EINTR in socket calls

There are a couple of leftover bits that still use blocking socket I/O. One I know of: RADIUS authentication. That's racy (one of the main problems latch multiplexing fixed) and we have to get rid of it. Then we can rip out most of the horrible socket wrapper code for Windows, which is known to be buggy.
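
As a sketch of the direction, here is roughly what an interruptible RADIUS read could look like against the existing latch API, assuming the socket has been put into non-blocking mode; this is not actual patch code, the function name is invented, and timeout handling is omitted:

```c
#include "postgres.h"

#include <errno.h>
#include <sys/socket.h>

#include "miscadmin.h"
#include "storage/latch.h"

/*
 * Hypothetical replacement for a blocking recv() in the RADIUS code: multiplex
 * the socket with the latch, so interrupts (query cancel, postmaster death)
 * are not ignored while we wait for the RADIUS server to answer.
 */
static ssize_t
radius_recv_interruptible(pgsocket sock, char *buf, size_t len)
{
    for (;;)
    {
        ssize_t     n = recv(sock, buf, len, 0);

        if (n >= 0)
            return n;
        if (errno != EWOULDBLOCK && errno != EAGAIN && errno != EINTR)
            return -1;              /* real error; caller reports it */

        (void) WaitLatchOrSocket(MyLatch,
                                 WL_LATCH_SET | WL_SOCKET_READABLE | WL_EXIT_ON_PM_DEATH,
                                 sock, -1,
                                 0 /* no dedicated wait event in this sketch */);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}
```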