User:Nmisch/Wanted

From PostgreSQL wiki

These are projects I expect to reach no earlier than 2028-01, perhaps never. I encourage others (you, anyone) to own these. Feel free to add your name next to an item if you're working on it.

My focus has been on defect fixing and defense in depth, because I feel that work is under-consumed relative to the other top areas, scalability and features. We should not wait for users to say "I want fewer bugs", which you don't hear unless you let things get awful. This list has that same focus. Each section is in descending order of priority. Easier items tend to appear later. (Easy, high-priority items got done without reaching this list.)

Disclaimer: some list entries are likely already fixed, way harder than they look, cures worse than the disease, etc. If I knew everything about them, they'd be done instead of being here. If you try and abandon an item, feel free to edit this page to leave brief notes about the pitfalls. Ideal: reply to the mailing list thread about the pitfalls, then link your reply here.

Other people have similar lists, e.g. User:Andresfreund/Desired_Changes.


Defect fixes

Incomplete item  Review Timeline switching with partial WAL records can break replica recovery
Restore and recovery failures are trust problems, so these defects warrant high priority.
Incomplete item  Fix unfixed loss of DDL changes
grep source tree for "observe loss of at2+at4 changes XXX is an extant bug"
Incomplete item  Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData
Some work-in-progress exists.
Incomplete item  Fix PROC_IN_VACUUM scan breakage
This makes VACUUM's MVCC scans fail to find rows. It's hard to reason about how much could go wrong after that, so I consider this severe despite the lack of a specific corruption story.
Incomplete item  Vacuum full failing xmin check, but vacuum freeze ok on v16
Incomplete item  Bug in recovery of drop database? / redo of dropdb must bump min recovery point
Incomplete item  check_exclusion_or_unique_constraint false negative
Incomplete item  Review other bug fixes awaiting commitfest review
Priority tiers: (1) wrong query results; recovery failure; persistent corruption (2) crashes (3) the rest


Defense in depth

Incomplete item  CI-style execution of PGXN module test suites and client driver test suites
For back-patched fixes, this would give a straightforward compatibility impact analysis. At any given time, many of these suites will be broken or flaky. The project must account for that. If a test suite changes from N green runs in a row to red, it's a meaningful signal to report.
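The "N green runs in a row to red" rule can be sketched as a small predicate over a suite's run history. This is only an illustration: the function name, the array-of-booleans history format, and the min_green threshold are assumptions, not part of any existing buildfarm or CI tooling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: decide whether the latest run of an external test
 * suite is worth reporting.  results[] holds one entry per run, oldest
 * first; true means green.  Report only when the newest run is red and it
 * was immediately preceded by at least min_green consecutive green runs, so
 * chronically broken or flaky suites stay quiet.
 */
static bool
is_reportable_regression(const bool *results, size_t nresults,
						 size_t min_green)
{
	size_t		streak = 0;

	assert(results != NULL || nresults == 0);

	/* Nothing to report if there is no history or the newest run is green. */
	if (nresults == 0 || results[nresults - 1])
		return false;

	/* Count consecutive green runs directly before the newest run. */
	for (size_t i = nresults - 1; i-- > 0;)
	{
		if (!results[i])
			break;
		streak++;
	}

	return streak >= min_green;
}
```

A suite that alternates between green and red never accumulates a long enough streak, so it produces no report under this rule.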
Incomplete item  Add sanitizer or formal verification to catch insufficient memory ordering
There's a large and growing appetite to replace mutex (LWLock, spinlock) algorithms with lockless/atomics algorithms. It's too easy for missed memory barriers and such to escape review. Our current backstop is ad-hoc chaos testing by tracking test flakes in the buildfarm and CI. Defects can hide. (https://postgr.es/c/3fb5862 took less than a day. The Non-reproducible AIO failure investigation took months of involvement from senior hackers, though this project might not have caught that one.)

Adding infrastructure to more reliably detect missing barriers will give us freedom to implement more aggressive use of non-mutex algorithms. For example, I consider this a prudent prerequisite for adopting CSN snapshots to fix snapshot isolation on standby. Also, this may be one step toward a shorter beta period.

I would start with:
  • literature review of relevant formal verification technologies
  • check the prior art in Kernel Concurrency Sanitizer
  • evaluate ThreadSanitizer: can we make it support multiple processes or, more likely, have a quick-and-dirty multithreaded PostgreSQL just for sanitizer runs?
Incomplete item  Revive oss-fuzz integration
It's been failing due to patch bitrot. Consider putting the changes into postgresql.git, so oss-fuzz can just build with e.g. -DOSSFUZZ. Perhaps have a buildfarm member confirm -DOSSFUZZ still compiles.
Incomplete item  Automate detection of missing fsync
https://postgr.es/c/0b6517a fixed such an omission after 23 months in the tree, so we're not set up to detect these quickly. LazyFS reproduced that particular defect, so automating LazyFS testing could be a start. However, LazyFS wouldn't detect a case where a file should have received N syncs but received only N-1 syncs. Years ago, someone (could have been Jeff Janes) ran a series of tests that killed a VM at random moments and then ran recovery. More ideas may emerge.
Incomplete item  Add sanitizer or formal verification to catch memcpy() on vars where we rely on atomicity
PostgreSQL has various spots assuming "four-bytes, updates are atomic". In other words, even without constraining memory order, we assume we can't see half of a four-byte store. We should not use memcpy on such addresses, because memcpy is entitled to use 1-byte stores. (We could use our own memcpy variant that always uses 4-byte or 8-byte stores.) memcpy implementations have incentives to use larger stores, making this defect hard to detect via chaos testing.
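A minimal sketch of such a memcpy variant, written against C11 <stdatomic.h> rather than PostgreSQL's own atomics layer. The name copy_words_atomic is hypothetical, and the sketch assumes the destination is declared as an array of 4-byte atomics and that such atomics are lock-free on the target platform.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical memcpy variant: copy nwords four-byte words, issuing exactly
 * one relaxed atomic store per word.  A concurrent reader that loads a word
 * with atomic_load_explicit() sees either the old or the new value of that
 * word, never a mix of bytes from both.  No ordering between different
 * words is implied; relaxed stores constrain atomicity only.
 */
static void
copy_words_atomic(_Atomic uint32_t *dst, const uint32_t *src, size_t nwords)
{
	assert(dst != NULL && src != NULL);

	for (size_t i = 0; i < nwords; i++)
		atomic_store_explicit(&dst[i], src[i], memory_order_relaxed);
}
```

Unlike plain memcpy, the compiler and C library are not free to split these stores into byte-sized pieces, which is the property the "four-bytes, updates are atomic" spots depend on.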
Incomplete item  Fix timing of SimpleLruTruncate() caller WAL records vs. "apparent wraparound" error
TruncateMultiXact() calls WriteMTruncateXlogRec() to write WAL, then calls PerformOffsetsTruncation() -> SimpleLruTruncate(). SimpleLruTruncate() might fail its "important safety check". At that moment, I suspect recovery bypasses the safety check, because multixact_redo() updates latest_page_number to essentially disable that check. Perhaps we should perform the "important safety check" before writing WAL?
Incomplete item  Check invalid pages at the end of recovery to alarm lost data
This is defense in depth because it catches a user error. Since backup protocol violations are a known class of user mistake, it's nice to catch that mistake if we can do so economically.


Supportability

Incomplete item  Make poll_query_until add less test latency
The 0.1s check interval is too long for fast machines and too short for slow machines. Start with a shorter interval and use exponential backoff.
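The backoff schedule could look roughly like the following. poll_query_until itself is Perl; this C sketch only illustrates the arithmetic, and the function name, the microsecond unit, and the example bounds (1ms start, 100ms cap) are assumptions chosen for illustration.

```c
#include <assert.h>

/*
 * Hypothetical backoff step: given the delay used before the previous poll,
 * return the delay to use before the next one.  The delay doubles each time
 * until it reaches max_us, then stays there.
 */
static long
next_poll_delay_us(long prev_us, long max_us)
{
	assert(prev_us > 0 && max_us >= prev_us);

	/* Compare against max_us / 2 so the doubling cannot overflow. */
	if (prev_us > max_us / 2)
		return max_us;
	return prev_us * 2;
}
```

Starting at 1000us with a 100000us cap, the delays run 1000, 2000, 4000 and so on up to 64000, then hold at 100000. A fast machine finishes within the first few short polls, while a slow machine settles into polling no more often than it does today.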
Incomplete item  Prevent GETTEXT_FILES getting out of date
These makefile vars control which translated strings each binary loads. When we make a message reachable from a given binary without naming the message's C file in GETTEXT_FILES, the user gets an untranslated message. https://postgr.es/c/914ea1c fixed such an omission months after introduction. Add automation to block inaccurate GETTEXT_FILES.
Incomplete item  Make PostgreSQL::Test::Utils::run_log() print the full story for pipelines and redirections
run_log(['ls'], '|', ['cat'], '>', 'filename') prints just the ls command, but it should print the whole pipeline, including the cat and the redirection to filename.