Windows
Major differences
Windows
- Three levels: NT kernel, Windows subsystem, C library
- notions such as current locale and encoding may differ depending on Windows vs UCRT APIs
- C library also includes some POSIX-like extensions for a few things
- UTF-16 is the true encoding of arguments, environment, filenames at NT level
- char interfaces for C and minimal POSIX-like APIs
- wchar_t and char interfaces for Windows APIs
- hidden destructive conversions
- UTF-8 support is new and transitional
- most native applications probably use/used wchar_t everywhere, but we don't
- preemptive multi-threading from beginning of NT
- no fork() accessible via Windows API, child processes can only load new images
- multi-root file system namespace
- files are locked by default when you open them
- mapped file regions can't be truncated
- might explain why you so often have to reboot to upgrade things...
- other programs can block you unexpectedly
- I/O model is fundamentally asynchronous, with synchronous interfaces for convenience
- kernel objects referenced by handles
- file descriptors exist in C runtime
- sockets exist in socket library (not in C runtime descriptor table)
- pipes aren't like files or sockets
- asynchronous event handlers run at defined times (no asynchronous preemptive signals)
Unix
- I/O buffering and asynchronicity was originally hidden from user space by design
- ancestor Multics was more like VMS in this respect, but UNIX chopped it all out
- NFS was an interesting collision between the UNIX philosophy and the asynchronous nature of the universe: uninterruptible D state because someone else's computer rebooted
- nonblocking sockets and associated system calls allowed for multiplexing but
- there is no sane way to generalise that API to file I/O
- even for sockets, it didn't allow zero-copy networking
- the VMS/NT people were probably not wrong to criticise Unix I/O
- POSIX AIO essentially failed (commercial Unixen had to extend it to make it useful)
- But finally... io_uring is very good
- to this day, O_DIRECT is not standardised by POSIX
- long period before threading APIs were standardised, traditional fork() based servers
- single tree file system namespace
- names are always char-based with no defined encoding, only minor restrictions on bytes allowed
- in practice UTF-8 is the only interesting encoding
- kernel objects referenced by general descriptor table
- except processes, threads, ...
- having files open never stops unlinking, truncating mapped regions (-> SIGBUS) etc, and locking schemes must be created by advisory means
- ... but I can upgrade pretty much anything except the kernel without rebooting
- hardware interrupts modeled as signal handlers
- powerful low-level abstraction but never-ending source of bugs
Convergence
Due to the order things were developed, on Windows we're currently simulating asynchronous file I/O with worker processes that call synchronous functions, which are themselves emulated by asynchronous operations and waiting. But there is a place carved out to fix that, and some sketch patches that have worked at various times...
Likewise for sockets, though the way forward is a bit less clear (several options mentioned below)....
Along with threads (WIP), we'd be getting closer to the way a 90s Windows engineer would have written a database from scratch, probably, while also driving io_uring efficiently on Linux, minimising stalls and context switches.
(If we additionally implemented large tablespace files then we could boot out all the unlink(), rename(), symlink() stuff, which is probably not ideal on any OS. Not imagined here, I'm just observing that if you don't do that stuff then you don't have to worry about weird semantic differences, but...)
Windows port variants
Windows versions
- v16 required Windows 10+ and set _WIN32_WINNT to 0x0A00 (commit 495ed0ef)
- older versions claimed to run on Windows 8.1, 8, 7 but there is no testing
C runtimes
- v18 required the UCRT C runtime available on Windows 10+ (commit 1758d424)
- older versions theoretically supported the MSVCRT C runtime with various tolerances, but correct behaviour had been unverified for several years
Architectures
- x86_32: in theory 32-bit Windows builds might still work, but this is not tested by the build farm or CI
- x86_64: tested by build farm and CI
- ARM32: never reported to work, and dropped by Visual Studio 2022
- ARM64: never reported to work, but we should fix that
- we need a build farm animal to claim support, and it'll probably need to be allowed to fail for a while
- a small number of changes are known to be needed (TODO: link to threads)
- this is increasingly common as an architecture for laptops
Toolchains
Visual Studio
- Supported by Microsoft:
- Visual Studio 2019 mainstream support has ended
- Visual Studio 2022 mainstream support ends Jan 2027
- Visual Studio 2026
- Supported by PostgreSQL:
- v19-to-be currently requires Visual Studio 2019+ (commit 8fd9bb1d)
- v16 required Visual Studio 2015+ (commit 6203583b) but we were already only testing 2017+
Visual Studio is the primary Windows port, and used for the EDB installer that most PostgreSQL-on-Windows users seem to be using. It uses our own porting layer to emulate many POSIX system calls, hiding UCRT's implementations and in many cases redirecting to underlying Win32 calls or retrying to hide errors. See individual API sections below for gory details.
Up until v16, a separate custom set of scripts was used to drive msbuild (removed in commit 1301c80b2167). This was never able to run the full test suite, hiding many bugs. Since v16, meson + ninja/msbuild is used and is the same build system and test suite used for Unix systems (commit e6927270).
MinGW
MinGW is usually used as part of an MSYS2 environment, which provides a GCC compiler and a universe of free/open source tools and libraries with a package manager. This is effectively just another Windows flavour in most respects as far as we are concerned, and usually takes the same codepaths as Visual Studio, including calling Win32 APIs directly when appropriate. MinGW provides compatible headers, but may occasionally lag or differ in supporting UCRT or Win32 APIs.
Historically there was another MSYS project and MinGW from which the current versions forked.
MinGW does provide its own implementation of *some* libc functionality rather than exposing the UCRT implementations. Notably, it uses a 64-bit off_t when building with meson (but not with configure, and only because we skip AC_SYS_LARGEFILE for PORTNAME=win32). This is vexing: do we want MinGW builds to be as much like Visual Studio builds as possible, or as natural as possible for MinGW/MSYS2 environments? Why is it even OK for off_t to have a different size across libraries built in an MSYS2 environment?
Cygwin
GCC compiler + runtime environment that emulates POSIX/GNU/Linux environment facilities. The environment provides implementations of signals, symlinks, sockets, fork(), etc, so this is effectively just another Unix flavour in most respects. We still carry some special code paths to cope with failure modes due to NT file locking etc.
Before PostgreSQL 8.0, this was the only way to build PostgreSQL on Windows. Since lorikeet was decommissioned in 2024, we are no longer actively testing it. It was very unstable in our build farm until PostgreSQL 16, mostly due to bugs in its implementation of signals (that commit 7389aad6 worked around, though Cygwin was not the motivation), so it's very unlikely that anyone ran production databases.
There is a package, so we would presumably hear from the maintainer if we broke compilation. It's possible that Cygwin users are using the command line tools and client library, either directly or to satisfy requirements of other packages. This would explain why we never heard field reports of the server being effectively unusable before 16.
Directories
Symlinks
Windows has three kinds of links: hard links (we use those via link() in win32link.c), symlinks (a class of reparse points that require unusual privileges so we can't use them) and junction points (another class of reparse points that we do use). We have a lot of code to make junction points appear to be Unix symlinks. This includes symlink(), readlink(), stat(), fstat().
We used symlinks to implement tablespaces on Unix because it was convenient and simple, but it turns out to be inconvenient and unsimple on Windows. So perhaps we should reconsider that, and just create a mapping file instead? Then we could rip out all that symlink emulation code.
Hard links
Windows has completely different locking semantics for files. We have a lot of code doing highly questionable sleep-and-retry loops in our wrappers for open(), stat(), rename(), unlink() etc to cope with this. We also have complex code to force all backends to close all file handles at certain times.
According to the testing done in this thread, we could make all of those problems go away if we enabled POSIX-mode unlink behaviour. But then PostgreSQL would stop working on those other file systems as we'd forget about all the semantics not tested by CI/BF. ReFS sounds potentially quite interesting anyway.
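For readers who haven't seen the pattern, a sleep-and-retry wrapper looks roughly like the following. This is a POSIX-flavoured sketch, not the code in our porting layer: the name unlink_with_retry(), the choice of EACCES/EBUSY as the "someone else has it open" errors, and the attempt and sleep parameters are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: try unlink() up to max_attempts times, sleeping
 * sleep_ms between attempts.  Only errors that plausibly mean "another
 * process still has the file open" are retried; anything else fails
 * immediately with errno left as set by unlink().
 */
static int
unlink_with_retry(const char *path, int max_attempts, long sleep_ms)
{
    assert(max_attempts > 0);

    for (int attempt = 1;; attempt++)
    {
        if (unlink(path) == 0)
            return 0;

        /* Give up on non-retryable errors or when attempts are exhausted. */
        if ((errno != EACCES && errno != EBUSY) || attempt >= max_attempts)
            return -1;

        struct timespec ts = {sleep_ms / 1000, (sleep_ms % 1000) * 1000000L};

        nanosleep(&ts, NULL);
    }
}
```

The weakness is visible in the shape of the loop: there is no way to know how long is long enough, so the caller either waits too long or fails spuriously.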
File I/O
Large files
When the industry tackled large files in the new age of 64-bit computing, the Unixen all either flipped off_t to 64-bit always (eg FreeBSD) or optionally based on a macro. Windows supports large files just fine in the true Windows file APIs, but its C library also has a few POSIX-like functions, and that's where off_t comes from, and it never changed. (See note about MinGW though, which does have optional large off_t.)
We've gradually changed our porting layer to replace all such functions using pgoff_t.
- v19-to-be finally allowed large relation segment files (commit 84fb27511) by passing pgoff_t through more layers
- there may be more places eg buffile.c that could potentially be changed
- older versions supported large files in front-end code only (through accretion of bugfixes)
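To make the problem concrete, here is a small sketch of why a 32-bit off_t is not enough once a file passes 2GB. The names demo_off_t, DEMO_BLCKSZ and the two helpers are hypothetical stand-ins for illustration, not the definitions used in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit offset type such as pgoff_t. */
typedef int64_t demo_off_t;

/* Assumed block size for the example (8KB). */
#define DEMO_BLCKSZ 8192

/* Byte offset of a block; the cast keeps the multiplication in 64 bits. */
static demo_off_t
demo_block_offset(uint32_t blockno)
{
    return (demo_off_t) blockno * DEMO_BLCKSZ;
}

/* Would this offset survive being passed through a signed 32-bit off_t? */
static int
demo_fits_in_32bit_off_t(demo_off_t offset)
{
    assert(offset >= 0);
    return offset <= INT32_MAX;
}
```

With these definitions, block 262143 still fits in a signed 32-bit offset, but block 262144 starts exactly at 2^31 bytes and does not, so any layer that narrows the value on the way down silently corrupts the offset.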
Allocation
Before v16, relations were always extended one block at a time using smgrextend(), which wrote 8KB of zeroes to the file to physically extend it. v16 learned how to extend by multiple blocks at a time, and call posix_fallocate() to reserve disk space in a... hopefully efficient way (hopefully good interaction with extents and space reservation, without having to copy a bunch of zeroes into the page cache). That doesn't exist on Windows. Ideas:
- NTFS doesn't make 'holes' by default, so ftruncate() must be something like posix_fallocate() as long as you only make the file bigger, right?
- therefore when we commit and back-patch file_extend_method=(fallocate|ftruncate|write) and associated threshold GUC, it would be possible to investigate that, and consider adding other options if there are fancier approaches that could help
- COPY performance on Windows is a related report.
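The three strategies named in the proposed setting can be sketched on a POSIX system as follows. This is only an illustration of the idea: the enum, the function demo_extend_file() and its signature are assumptions, and it says nothing about how a Windows implementation would or should look.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical mirror of file_extend_method=(fallocate|ftruncate|write). */
typedef enum
{
    DEMO_EXTEND_FALLOCATE,
    DEMO_EXTEND_FTRUNCATE,
    DEMO_EXTEND_WRITE
} demo_extend_method;

/* Grow the file from old_size by add_bytes; returns 0 or -1 with errno set. */
static int
demo_extend_file(int fd, off_t old_size, off_t add_bytes,
                 demo_extend_method method)
{
    static const char zeroes[8192];

    assert(add_bytes >= 0);

    switch (method)
    {
        case DEMO_EXTEND_FALLOCATE:
            {
                /* posix_fallocate() returns an error number, not -1. */
                int rc = posix_fallocate(fd, old_size, add_bytes);

                if (rc != 0)
                {
                    errno = rc;
                    return -1;
                }
                return 0;
            }
        case DEMO_EXTEND_FTRUNCATE:
            /* Changes the size only; may leave a hole on some file systems. */
            return ftruncate(fd, old_size + add_bytes);
        case DEMO_EXTEND_WRITE:
            /* Physically write zeroes, one chunk at a time. */
            while (add_bytes > 0)
            {
                size_t chunk = add_bytes < (off_t) sizeof(zeroes) ?
                    (size_t) add_bytes : sizeof(zeroes);
                ssize_t n = pwrite(fd, zeroes, chunk, old_size);

                if (n <= 0)
                    return -1;
                old_size += n;
                add_bytes -= n;
            }
            return 0;
    }
    return -1;
}
```

The interesting question for Windows is which of these branches its ftruncate() equivalent really resembles on NTFS: the space-reserving one or the hole-making one.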
Cloning
APIs found in the wild for cloning or copying files:
OS | System call | arguments | subfile? | strategy
Linux, FreeBSD | copy_file_range() | fds | yes | silent best
macOS | copyfile() | paths | no | flags
Windows | CopyFile() | paths | no | silent best
Solaris | reflink() | paths | no | ?
Presumably that means that we get block cloning on ReFS in various code paths that reach CopyFile(), and pushdown on NTFS, SMB, etc. Is that true? There is something a bit inconsistent about all of this though... It seems like on Windows we'll do pushdown/clone when Unixen won't in some places. Are there more Windows functions for this type of thing?
Flushing and O_DSYNC
- We map O_DSYNC to FILE_FLAG_WRITE_THROUGH in our open() wrapper
- The default value of wal_sync_method is open_datasync (= O_DSYNC) on Windows
- That level is known *not* to flush the drive write cache on at least SATA drives
- v16 added support for fdatasync (commit 9430fb40), like fsync without useless mtime flush
- the thread has some drive-by testing on laptops
What should the default be? Is there any way to find out that FILE_FLAG_WRITE_THROUGH is being ignored by the driver? FILE_FLAG_WRITE_THROUGH is obviously better because it is asynchronous (the FUA flag is attached to a single write, assuming we get async writes, see below), while FlushFileBuffers() has no OVERLAPPED variant and flushes the whole device write cache. Unless it doesn't work, in which case it is inf% worse...
Direct I/O
v16 added settings debug_io_direct=data,wal. O_DIRECT is converted to FILE_FLAG_NO_BUFFERING on Windows by our open() wrapper. When enough I/O combining and asynchronicity has been developed, we will eventually be able to remove the "debug_" prefix and it will become a realistic option.
Some interesting per-platform topics to research include:
- what happens when a file is opened in both O_DIRECT and non-O_DIRECT mode by different programs (imagine a virus checker, or a backup program, ...)?
- which file systems actually accept direct I/O? ReFS? SMB? what does it do?
A problem we know of for BTRFS is that if you modify source data while a write is in progress, it gets a bad checksum on disk and later reads fail with EIO. PostgreSQL needs to stop doing that, it's a bad idea, but it leads to the question of what similar file systems do. ZFS (which only just gained direct I/O in 2.3) has some unfortunately expensive behaviour to defend itself against concurrent modification. What about ReFS?
Vectored I/O (scatter/gather)
- how this looks on Unix systems
- readv(), writev() were originally made for socket I/O (non-seekable files)
- pread(), pwrite() were originally made for file I/O (seekable files, multithreading-friendly)
- preadv(), pwritev() are an obvious combination of the two ideas, available on every Unix but not yet in POSIX, and are used by PostgreSQL I/O workers to execute PGAIO_OP_(READV,WRITEV)
- Linux io_uring has corresponding operations IORING_OP_(READV,WRITEV)
- how this looks on Windows
- for now src/port/pg_iovec.h converts "vectored" synchronous I/O calls into loops
- ideally it should use scatter/gather APIs when possible: ReadFileScatter()/WriteFileGather()
- Direct I/O only (FILE_FLAG_NO_BUFFERING)
- Asynchronous I/O only (FILE_FLAG_OVERLAPPED)
- Unlike Unix's struct iovec (iov_base, iov_len), these calls need a list of memory page addresses
- reward for this pain: DMA transfers between PostgreSQL buffer pool and storage without CPU involvement
- this interface was basically made for another database that didn't care about buffered or synchronous I/O
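The loop-based fallback mentioned in the list above amounts to something like this. It is a sketch of the general technique written against POSIX pread(), not a copy of src/port/pg_iovec.h; the name demo_preadv() and its exact error handling are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical emulation of preadv(): issue one pread() per iovec entry,
 * advancing the file offset as we go.  Returns the total number of bytes
 * read, or -1 if the very first transfer fails.
 */
static ssize_t
demo_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    ssize_t total = 0;

    assert(iovcnt >= 0);

    for (int i = 0; i < iovcnt; i++)
    {
        ssize_t n = pread(fd, iov[i].iov_base, iov[i].iov_len, offset);

        if (n < 0)
            return total > 0 ? total : -1;  /* report progress made so far */

        total += n;
        offset += n;

        if ((size_t) n < iov[i].iov_len)
            break;              /* short read: stop, as one call would */
    }
    return total;
}
```

Unlike a real preadv(), this costs one system call per segment and is not a single operation with respect to concurrent writers, which is part of why it is only a stopgap.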
Here's one idea for how we could make vectored I/O work. This doesn't make sense as an end point, since it creates the absurdity of asynchronous I/O being emulated with I/O workers that perform synchronous I/O, which are emulated with asynchronous I/O... TODO
Asynchronous I/O
io_method=worker
io_method=iocp
io_method=ioring
Network I/O
SO_LINGER
Blocking
Nonblocking
Vectored I/O
Asynchronous I/O
Pipe I/O
TAP testing
XXX say more
Subprocess management
Event multiplexing
EVENT-based WaitEventSet
Latches/interrupts
Signals
Postmaster control
Asynchronous signal emulation
Synchronous signals
Processes/threads
Locales
Legacy naming
It is not OK to query and record the default locale using setlocale(""), because that returns "display" names like "Norwegian (Bokmål).1252" that are unsuitable and unstable:
- they contain non-ASCII characters, but we use locale names in shared catalogues that must be ASCII, which occasionally causes strange problems (win32locale.c has some defences against that but they are kludgy and incomplete)
- the names change on Windows OS updates, and then PostgreSQL clusters using the old names break because setlocale(x) fails; this periodically happens because countries change their name ("Czech Republic" → "Czechia", "Turkey" → "Türkiye", ...)
- Bug report (there are many more)
BCP 47
The solution to that seems to be to recommend BCP47 names instead ("tr-TR"), and to teach initdb.c to query that default using a different interface that returns that (patch). However, that raises several questions:
- when choosing a default locale in initdb, should we add ".UTF-8" (or some other "code page" AKA encoding)?
- if we don't, it seems that the encoding is the current Windows "ACP"; apparently that can be changed at any time by some settings window? What does that mean for a database cluster depending on stable behaviour?
- if we do, it is unclear why the ctype support seems not to work (or perhaps our UTF-8 support was simply always broken in this way on Windows?); unclear if this is a pre-existing condition
Versioning
Transitivity
Windows locales don't seem to be fully transitive, as required for database use:
One option would be to drop support for the libc provider on Windows, given the above problems and the lack of interest in addressing them. That would leave users with the new "builtin" provider and the "ICU" collation provider, both of which are actively maintained and designed with database needs in mind; however we'd still need to eradicate old-style display names from other places that use locales other than for string collations (other LC_XXX categories), and PostgreSQL may also be a little inconsistent about when it uses the collation provider and when it uses libc functions for ctype logic.
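For reference, the transitivity property discussed in this section can be written as a check over three strings using strcoll() in the current locale. This is a minimal sketch with a hypothetical function name; a real test would have to try every ordering of a large set of strings, and it makes no claim about which inputs misbehave on Windows.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical check: if a <= b and b <= c under strcoll(), then a <= c
 * must also hold.  Returns false only when that implication is violated.
 */
static bool
demo_collation_is_transitive(const char *a, const char *b, const char *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    if (strcoll(a, b) <= 0 && strcoll(b, c) <= 0)
        return strcoll(a, c) <= 0;

    return true;                /* premise not met, nothing to check */
}
```

Anything that sorts or builds a btree over text assumes this always returns true; a single violating triple is enough to make an index return wrong answers.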