User:Nmisch/Wanted
These are projects I expect to reach no earlier than 2028-01, perhaps never. I encourage others (you, anyone) to own these. Feel free to edit your name next to an item if you're working on it.
My focus has been on defect fixing and defense in depth, because I feel it's under-consumed relative to the other top areas, scalability and features. We should not wait for users to say "I want fewer bugs", which you don't hear unless you let things get awful. This list has that same focus. Each section is in descending order of priority. Easier items tend to appear later. (Easy, high-priority items got done without reaching this list.)
Disclaimer: some list entries are likely already fixed, way harder than they look, cures worse than the disease, etc. If I knew everything about them, they'd be done instead of being here. If you try an item and abandon it, feel free to edit to leave brief notes about the pitfalls. Ideal: reply to the mailing list thread about the pitfalls, then link your reply here.
Other people have similar lists, e.g. User:Andresfreund/Desired_Changes.
Defect fixes
Review Timeline switching with partial WAL records can break replica recovery
- Restore and recovery failures are trust problems, so these defects warrant high priority.
Fix unfixed loss of DDL changes
- grep source tree for "observe loss of at2+at4 changes XXX is an extant bug"
Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData
- Some work-in-progress exists.
Fix PROC_IN_VACUUM scan breakage
- see: git grep 'PROC_IN_VACUUM scan breakage'
- see this from the thread that added that test
- see error message examples and other way to reproduce it
- obstacle: systable scans happen in bulkdelete callbacks
Review other bug fixes awaiting commitfest review
- Priority tiers: (1) wrong query results; recovery failure; persistent corruption (2) crashes (3) the rest
Defense in depth
CI-style execution of PGXN module test suites and client driver test suites
- For back-patched fixes, this would give a straightforward compatibility impact analysis. At any given time, many of these suites will be broken or flaky. The project must account for that. If a test suite changes from N green runs in a row to red, it's a meaningful signal to report.
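A minimal sketch of the "N green runs in a row, then red" signal described above. The function name, the array representation of run history, and the threshold parameter are all hypothetical; nothing here reflects existing buildfarm or CI code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: runs[] holds results oldest-first (true = green).
 * Report a regression only when the newest run is red and the min_green
 * runs immediately before it were all green.  A suite that was already
 * broken or flaky therefore stays quiet.
 */
static bool
is_new_regression(const bool *runs, size_t nruns, size_t min_green)
{
    assert(runs != NULL || nruns == 0);

    /* Need min_green earlier runs plus the newest one, which must be red. */
    if (nruns < min_green + 1 || runs[nruns - 1])
        return false;

    /* Walk backward from the run just before the newest. */
    for (size_t i = 0; i < min_green; i++)
    {
        if (!runs[nruns - 2 - i])
            return false;
    }
    return true;
}
```

Choosing min_green is a tradeoff: a larger value suppresses more noise from flaky suites but delays the first report after a suite becomes stable.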
Add sanitizer or formal verification to catch insufficient memory ordering
- There's a large and growing appetite to replace mutex (LWLock, spinlock) algorithms with lockless/atomics algorithms. It's too easy for missed memory barriers and such to escape review. Our current backstop is ad-hoc chaos testing by tracking test flakes in the buildfarm and CI. Defects can hide. (https://postgr.es/c/3fb5862 took less than a day. The Non-reproducible AIO failure investigation lasted months of involvement from senior hackers, though this project might not catch that one.)
Adding infrastructure to more-reliably detect missing barriers will give us freedom to implement more aggressive use of non-mutex algorithms. For example, I consider this a prudent prerequisite for adopting CSN snapshots to fix snapshot isolation on standby. Also, this may be one step toward a shorter beta period.
I would start with:
- literature review of relevant formal verification technologies
- check the prior art in Kernel Concurrency Sanitizer
- evaluate ThreadSanitizer: can we make it support multiple processes or, more likely, have a quick-and-dirty multithreaded PostgreSQL just for sanitizer runs?
Revive oss-fuzz integration
- It's been failing due to patch bitrot. Consider putting the changes into postgresql.git, so oss-fuzz can just build with e.g. -DOSSFUZZ. Perhaps have a buildfarm member confirm -DOSSFUZZ still compiles.
Automate detection of missing fsync
- https://postgr.es/c/0b6517a fixed such an omission after 23 months in the tree, so we're not set up to detect these quickly. LazyFS reproduced that particular defect, so automating LazyFS testing could be a start. However, LazyFS wouldn't detect a case where a file should have received N syncs but received only N-1 syncs. Years ago, someone (could have been Jeff Janes) ran a series of tests by killing a VM at random moments and running recovery. More ideas may emerge.
Add sanitizer or formal verification to catch memcpy() on vars where we rely on atomicity
- PostgreSQL has various spots assuming "four-bytes, updates are atomic". In other words, even without constraining memory order, we assume we can't see half of a four-byte store. We should not use memcpy on such addresses, because memcpy is entitled to use 1-byte stores. (We could use our own memcpy variant that always uses 4-byte or 8-byte stores.) memcpy implementations have incentives to use larger stores, making this defect hard to detect via chaos testing.
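A minimal sketch of what such a memcpy variant could look like for the four-byte case. The name copy_words_u32 is hypothetical, not an existing PostgreSQL function. It relies on the same platform assumption the item describes (an aligned 32-bit store is a single, untorn access), which the C standard itself does not promise for volatile objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy nwords 4-byte words using one aligned 32-bit
 * load and one aligned 32-bit store per word.  The volatile qualifiers
 * stop the compiler from splitting or merging the accesses, or from
 * replacing the loop with a call to memcpy().  No memory ordering is
 * implied; callers still need whatever barriers the algorithm requires.
 */
static void
copy_words_u32(volatile uint32_t *dst, const volatile uint32_t *src,
               size_t nwords)
{
    /* Misaligned addresses would defeat the single-store assumption. */
    assert(((uintptr_t) dst % sizeof(uint32_t)) == 0);
    assert(((uintptr_t) src % sizeof(uint32_t)) == 0);

    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}
```

An 8-byte variant would look the same with uint64_t, on platforms where 8-byte stores are likewise untorn.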
Fix timing of SimpleLruTruncate() caller WAL records vs. "apparent wraparound" error
- TruncateMultiXact() calls WriteMTruncateXlogRec() to write WAL, then calls PerformOffsetsTruncation() -> SimpleLruTruncate(). SimpleLruTruncate() might fail its "important safety check". At that moment, I suspect recovery bypasses the safety check, because multixact_redo() updates latest_page_number to essentially disable that check. Perhaps we should perform the "important safety check" before writing WAL?
Check invalid pages at the end of recovery to alarm lost data
- This is defense in depth because it catches a user error. Since backup protocol violations are a known class of user mistake, it's nice to catch that mistake if we can do so economically.
Supportability
Make poll_query_until add less test latency
- The 0.1s check interval is too high for fast machines, too low for slow machines. Start lower and use exponential backoff.
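The backoff schedule suggested above can be sketched as follows. poll_query_until itself is Perl; this C version only illustrates the arithmetic, and the function name plus the starting delay and cap used in the usage note are assumptions rather than proposed values.

```c
#include <assert.h>

/*
 * Hypothetical helper: delay in microseconds to sleep before polling
 * attempt number "attempt" (0-based).  The delay starts at initial_us
 * and doubles on each attempt until it reaches max_us.
 */
static long
poll_delay_us(int attempt, long initial_us, long max_us)
{
    long        delay = initial_us;

    assert(attempt >= 0 && initial_us > 0 && max_us >= initial_us);

    for (int i = 0; i < attempt && delay < max_us; i++)
    {
        /* Clamp before doubling so the multiplication cannot overflow. */
        delay = (delay > max_us / 2) ? max_us : delay * 2;
    }
    return delay;
}
```

With an assumed 1 ms start and 100 ms cap, poll_delay_us(n, 1000, 100000) gives 1 ms, 2 ms, 4 ms and so on, reaching the cap at the eighth attempt. Fast machines finish during the short early sleeps, while slow machines settle at the cap instead of polling every 0.1s from the start.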
Prevent GETTEXT_FILES getting out of date
- These makefile vars control which translated strings each binary loads. When we make a message reachable from a given binary without naming the message's C file in GETTEXT_FILES, the user gets an untranslated message. https://postgr.es/c/914ea1c fixed such an omission months after introduction. Add automation to block inaccurate GETTEXT_FILES.
Make PostgreSQL::Test::Utils::run_log() print the full story for pipelines and redirections
- run_log(['ls'], '|', ['cat'], '>', 'filename') prints just the ls, but it should print everything.