User:Nmisch/Wanted

These are projects I expect to reach no earlier than 2028-01, perhaps never. I encourage others (you, anyone) to own these. Feel free to add your name next to an item if you're working on it.

My focus has been on defect fixing and defense in depth, because I feel it's under-consumed relative to the other top areas, scalability and features. We should not wait for users to say "I want fewer bugs", which you don't hear unless you let things get awful. This list has that same focus. Each section is in descending order of priority. Easier items tend to appear later. (Easy, high-priority items got done without reaching this list.)

Disclaimer: some list entries are likely already fixed, way harder than they look, cures worse than the disease, etc. If I knew everything about them, they'd be done instead of being here. If you try and abandon an item, feel free to edit this page to leave brief notes about the pitfalls. Ideal: reply to the mailing list thread about the pitfalls, then link your reply here.

Other people have similar lists, e.g. User:Andresfreund/Desired_Changes.

Defect fixes

Review "Timeline switching with partial WAL records can break replica recovery": Restore and recovery failures are trust problems, so these defects warrant high priority.

Fix unfixed loss of DDL changes: grep source tree for "observe loss of at2+at4 changes XXX is an extant bug"

"Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData": Some work-in-progress exists.

Fix PROC_IN_VACUUM scan breakage

see: git grep 'PROC_IN_VACUUM scan breakage'
see this from the thread that added that test
see error message examples and other way to reproduce it
obstacle: systable scans happen in bulkdelete callbacks

This makes VACUUM's MVCC scans fail to find rows. It's hard to reason about how much could go wrong after that, so I consider this severe despite the lack of a specific corruption story.

Vacuum full failing xmin check, but vacuum freeze ok on v16

Bug in recovery of drop database? / redo of dropdb must bump min recovery point

check_exclusion_or_unique_constraint false negative

Review other bug fixes awaiting commitfest review: Priority tiers: (1) wrong query results; recovery failure; persistent corruption (2) crashes (3) the rest

Defense in depth

CI-style execution of PGXN module test suites and client driver test suites: For back-patched fixes, this would give a straightforward compatibility impact analysis. At any given time, many of these suites will be broken or flaky. The project must account for that. If a test suite changes from N green runs in a row to red, it's a meaningful signal to report.
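The "N green runs in a row to red" rule could be sketched as below. This is only an illustration: the function name, the array-of-booleans representation of a suite's run history, and the choice of N are all assumptions, not existing buildfarm or CI tooling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when the newest run is red and the min_green runs just before
 * it were all green.  results[] is ordered oldest first; true means green.
 * Hypothetical helper, not part of any existing PostgreSQL tooling.
 */
bool
is_new_regression(const bool *results, size_t nresults, size_t min_green)
{
    size_t      i;

    assert(min_green > 0);

    /* Need the newest run plus min_green earlier runs. */
    if (nresults < min_green + 1)
        return false;

    /* A green newest run is never a regression. */
    if (results[nresults - 1])
        return false;

    /* Any red among the preceding min_green runs means "already flaky". */
    for (i = nresults - 1 - min_green; i < nresults - 1; i++)
    {
        if (!results[i])
            return false;
    }
    return true;
}
```

A suite that was already broken or flaky never reports under this rule, which is the point: only a stable suite turning red is treated as a signal about the back-patched fix.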

Add sanitizer or formal verification to catch insufficient memory ordering

There's a large and growing appetite to replace mutex (LWLock, spinlock) algorithms with lockless/atomics algorithms. It's too easy for missed memory barriers and such to escape review. Our current backstop is ad-hoc chaos testing by tracking test flakes in the buildfarm and CI. Defects can hide. (https://postgr.es/c/3fb5862 took less than a day. The Non-reproducible AIO failure investigation took months of involvement from senior hackers, though this project might not catch that one.)

Adding infrastructure to more-reliably detect missing barriers will give us freedom to implement more aggressive use of non-mutex algorithms. For example, I consider this a prudent prerequisite for adopting CSN snapshots to fix snapshot isolation on standby. Also, this may be one step toward a shorter beta period.

I would start with:

literature review of relevant formal verification technologies
check the prior art in Kernel Concurrency Sanitizer
evaluate ThreadSanitizer: can we make it support multiple processes or, more likely, have a quick-and-dirty multithreaded PostgreSQL just for sanitizer runs?

Revive oss-fuzz integration: It's been failing due to patch bitrot. Consider putting the changes into postgresql.git, so oss-fuzz can just build with e.g. -DOSSFUZZ. Perhaps have a buildfarm member confirm -DOSSFUZZ still compiles.

Automate detection of missing fsync: https://postgr.es/c/0b6517a fixed such an omission after 23 months in the tree, so we're not set up to detect these quickly. LazyFS reproduced that particular defect, so automating LazyFS testing could be a start. However, LazyFS wouldn't detect a case where a file should have received N syncs but received only N-1 syncs. Years ago, someone (could have been Jeff Janes) ran a series of tests by killing a VM at random moments and running recovery. More ideas may emerge.

Add sanitizer or formal verification to catch memcpy() on vars where we rely on atomicity: PostgreSQL has various spots assuming "four-bytes, updates are atomic". In other words, even without constraining memory order, we assume we can't see half of a four-byte store. We should not use memcpy on such addresses, because memcpy is entitled to use 1-byte stores. (We could use our own memcpy variant that always uses 4-byte or 8-byte stores.) memcpy implementations have incentives to use larger stores, making this defect hard to detect via chaos testing.
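A memcpy variant restricted to 4-byte stores might look roughly like the sketch below. The function name is hypothetical, not an existing PostgreSQL API. It also rests on the same assumption the paragraph above describes: that the platform performs an aligned 4-byte store as a single access. The C standard does not promise that, and volatile adds no memory ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy nwords 4-byte words from src to dst, issuing one aligned 4-byte
 * store per word.  Hypothetical helper for illustration only.
 *
 * Assumption: an aligned 4-byte store is a single access on this platform.
 * The volatile qualifier asks the compiler for exactly one access per word,
 * so it should not split a word into byte stores.  It imposes no ordering
 * between words and is not a memory barrier.
 */
void
copy_words_4byte(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    volatile uint32_t *d = dst;
    const volatile uint32_t *s = src;
    size_t      i;

    /* Misaligned addresses would defeat the single-access assumption. */
    assert(((uintptr_t) dst % sizeof(uint32_t)) == 0);
    assert(((uintptr_t) src % sizeof(uint32_t)) == 0);

    for (i = 0; i < nwords; i++)
        d[i] = s[i];
}
```

Taking uint32_t pointers rather than void pointers keeps the word size visible at each call site, which might also make such calls easier for a checker to tell apart from plain memcpy().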

Fix timing of SimpleLruTruncate() caller WAL records vs. "apparent wraparound" error: TruncateMultiXact() calls WriteMTruncateXlogRec() to write WAL, then calls PerformOffsetsTruncation() -> SimpleLruTruncate(). SimpleLruTruncate() might fail its "important safety check". At that moment, I suspect recovery bypasses the safety check, because multixact_redo() updates latest_page_number to essentially disable that check. Perhaps we should perform the "important safety check" before writing WAL?

Check invalid pages at the end of recovery to alarm lost data: This is defense in depth because it catches a user error. Since backup protocol violations are a known class of user mistake, it's nice to catch that mistake if we can do so economically.

Supportability

Make poll_query_until add less test latency: The 0.1s check interval is too high for fast machines, too low for slow machines. Start lower and use exponential backoff.
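The backoff schedule could be sketched as follows. poll_query_until itself is Perl; this C fragment only illustrates the arithmetic. The 1 ms starting delay, the 1000 ms cap, and both function names are assumptions chosen for the example, not values proposed in the item above.

```c
#include <assert.h>

/* Hypothetical tuning values; the source names only the current 0.1s. */
#define POLL_INITIAL_DELAY_MS 1
#define POLL_MAX_DELAY_MS 1000

/* Double the delay after each unsuccessful poll, up to the cap. */
int
next_poll_delay_ms(int current_ms)
{
    assert(current_ms > 0);

    if (current_ms >= POLL_MAX_DELAY_MS / 2)
        return POLL_MAX_DELAY_MS;
    return current_ms * 2;
}

/*
 * Total time slept before poll number npolls_done (0-based), i.e. after
 * npolls_done unsuccessful polls.
 */
int
elapsed_before_poll_ms(int npolls_done)
{
    int         delay = POLL_INITIAL_DELAY_MS;
    int         total = 0;
    int         i;

    assert(npolls_done >= 0);

    for (i = 0; i < npolls_done; i++)
    {
        total += delay;
        delay = next_poll_delay_ms(delay);
    }
    return total;
}
```

With these example values, a condition that becomes true within a few milliseconds is noticed after sleeping 1 + 2 + 4 = 7 ms rather than a full 100 ms, while a slow machine soon settles at one poll per second.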

Prevent GETTEXT_FILES getting out of date: These makefile vars control which translated strings each binary loads. When we make a message reachable from a given binary without naming the message's C file in GETTEXT_FILES, the user gets an untranslated message. https://postgr.es/c/914ea1c fixed such an omission months after introduction. Add automation to block inaccurate GETTEXT_FILES.

Make PostgreSQL::Test::Utils::run_log() print the full story for pipelines and redirections: run_log(['ls'], '|', ['cat'], '>', 'filename') prints just the ls, but it should print everything.
