User:X4m

Join Removal Based on Foreign Key Constraints

Project Description

The project focuses on adding the ability to remove a join when it is performed on a foreign key column and only columns from the referencing (child) table are needed in the query result.

This is a common optimization in enterprise databases (Oracle, DB2) that PostgreSQL currently lacks. When a view joins multiple tables but a query only needs columns from some of them, the optimizer should be able to eliminate unnecessary joins if foreign key constraints guarantee referential integrity.

Example scenario:

-- View joining orders to customers
CREATE VIEW order_summary AS
  SELECT o.* FROM orders o JOIN customers c ON o.customer_id = c.id;

-- Query only uses order columns - join should be eliminated
SELECT order_id, amount FROM order_summary;

The optimizer should recognize that the join to the customers table is unnecessary since:

  • Only columns from orders are selected
  • The foreign key constraint guarantees the join will succeed (provided orders.customer_id is NOT NULL, which the optimization must verify)
  • No filtering on customers columns exists
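
For orientation, PostgreSQL already removes unreferenced LEFT JOINs when a unique index on the join column proves at most one match; this project extends that idea to inner joins backed by foreign keys. A hedged sketch of how the behavior could be verified with EXPLAIN (it assumes customers.id is the primary key and orders.customer_id is NOT NULL):

-- Already eliminated today: an unreferenced LEFT JOIN against a unique column
EXPLAIN SELECT o.order_id, o.amount
FROM orders o LEFT JOIN customers c ON o.customer_id = c.id;
-- The plan scans only "orders"; "customers" does not appear.

-- Goal of this project: the same elimination for the INNER JOIN inside order_summary
EXPLAIN SELECT order_id, amount FROM order_summary;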

Skills needed

  • C programming
  • SQL
  • PostgreSQL

Difficulty level

Hard

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • A working implementation of FK-based join elimination in the query optimizer
  • Updated cost estimation to account for eliminated joins
  • Comprehensive regression tests covering various join patterns
  • Documentation outlining the optimization rules and limitations
  • Performance benchmarks demonstrating query speedups

References

Parallel GiST Index Build

Project Description

The project focuses on enabling parallel workers for GiST (Generalized Search Tree) index construction. PostgreSQL 18 supports parallel builds for B-tree, GIN, and BRIN indexes, but GiST indexes still build using a single process.

GiST indexes are crucial for:

  • Spatial data (PostGIS geometries)
  • Full-text search
  • Range types
  • Custom data types with complex comparison semantics

For large spatial datasets, GiST index creation can take hours. Parallelizing this operation would significantly improve the user experience for PostGIS users and anyone working with large amounts of indexed geometric or range data.

The project requires understanding PostgreSQL's parallel infrastructure, the GiST access method internals, and designing a parallelization strategy that maintains index correctness while maximizing throughput.
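
For benchmarking, the knobs that already govern parallel B-tree builds would presumably carry over. A minimal sketch of a GiST build to time before and after the change (the pts table and its data are illustrative):

SET max_parallel_maintenance_workers = 4;  -- existing GUC capping workers for parallel index builds
CREATE TABLE pts AS
  SELECT point(random(), random()) AS p FROM generate_series(1, 1000000);
CREATE INDEX pts_gist_idx ON pts USING gist (p);  -- today this always runs in a single backend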

Skills needed

  • C programming
  • PostgreSQL

Difficulty level

Hard

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • A working parallel GiST index build implementation
  • Proper coordination between parallel workers during tree construction
  • Performance benchmarks showing speedup on multi-core systems
  • Documentation outlining the parallelization strategy
  • Regression tests ensuring correctness of parallel-built indexes

References

Per-Datatype TOAST Slicing Strategies

Project Description

The project focuses on allowing different datatypes to be sliced differently when stored out of line by TOAST (The Oversized-Attribute Storage Technique).

Currently, PostgreSQL uses fixed ~2KB chunks for all TOAST-able types. This one-size-fits-all approach is suboptimal for different data patterns:

  • JSON documents might benefit from larger chunks to keep objects together
  • BYTEA with small random access patterns might prefer smaller chunks
  • TEXT with typical full-retrieval might prefer maximum compression over chunk granularity
  • Arrays might benefit from alignment with element boundaries
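
The current fixed chunking is easy to observe. A sketch (the docs table is illustrative; the actual TOAST relation name comes from pg_class.reltoastrelid):

CREATE TABLE docs (body text);
ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;  -- force out-of-line, uncompressed storage
INSERT INTO docs VALUES (repeat('x', 100000));
SELECT reltoastrelid::regclass AS toast_table FROM pg_class WHERE relname = 'docs';
-- Querying that TOAST relation, e.g.
--   SELECT chunk_seq, length(chunk_data) FROM pg_toast.pg_toast_NNNNN ORDER BY chunk_seq;
-- shows uniform chunks of just under 2KB regardless of the datatype or access pattern.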

This project would extend the TOAST infrastructure to allow datatypes to specify their preferred slicing strategy, potentially including:

  • Custom chunk sizes
  • Compression preferences per type
  • Access pattern hints (sequential vs random)

Skills needed

  • C programming
  • PostgreSQL

Difficulty level

Hard

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • An extensible API for datatype-specific TOAST strategies
  • Implementation for at least one datatype (e.g., JSONB or BYTEA)
  • Performance benchmarks comparing old vs new strategies
  • Documentation outlining the new API for datatype authors
  • Regression tests ensuring backward compatibility

References

Regression Test Coverage Improvements

Project Description

The project focuses on significantly improving PostgreSQL's regression test coverage. While PostgreSQL is known for its reliability, some code paths have surprisingly low test coverage, with some areas at single-digit percentages.

The goals of this project:

  • Analyze current test coverage using gcov/lcov
  • Identify critical code paths with low coverage
  • Write new regression tests (both SQL and TAP tests)
  • Target improvement from ~73% to 85%+ overall coverage
  • Focus on error handling paths and edge cases

This is crucial work for PostgreSQL's long-term quality. Better test coverage catches bugs earlier, makes refactoring safer, and improves confidence in releases.

Areas typically needing coverage improvements:

  • Error handling code paths
  • Rarely-used SQL features
  • Edge cases in type conversion
  • Recovery and crash scenarios
  • Parallel query edge cases
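
As an illustration of the "edge cases in type conversion" item above, a new regression test is typically just a SQL script plus a hand-reviewed expected-output file. A hypothetical sketch (the file name is invented for illustration):

-- src/test/regress/sql/int8_edge_cases.sql (hypothetical)
SELECT (-9223372036854775808)::int8 * (-1);  -- expected: ERROR: bigint out of range
SELECT ''::int8;                             -- expected: ERROR: invalid input syntax for type bigint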

Skills needed

  • SQL
  • Perl (for TAP tests)
  • Basic C reading skills
  • PostgreSQL

Difficulty level

Moderate

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • Measurable improvement in overall test coverage (target: +5-7%)
  • New SQL regression tests for identified gaps
  • New TAP tests for isolation and recovery scenarios
  • Documentation of coverage analysis methodology
  • Report identifying remaining low-coverage areas for future work

References

ALTER TABLE SET LOGGED/UNLOGGED Performance

Project Description

The project focuses on improving the performance of ALTER TABLE SET LOGGED and ALTER TABLE SET UNLOGGED operations by avoiding a full table rewrite when possible.

Currently, changing a table's persistence status requires rewriting the entire heap, which can be extremely slow for large tables. However, when wal_level = minimal (common during bulk loading), it should be possible to change the persistence flag without touching the actual data pages.

Use case workflow:

-- Bulk loading scenario
CREATE UNLOGGED TABLE staging (...);
COPY staging FROM 'huge_file.csv';
CREATE INDEX ON staging (...);
ALTER TABLE staging SET LOGGED;  -- Currently rewrites entire table!

The project should:

  • Analyze when a rewrite can be avoided
  • Implement the optimization for the safe cases
  • Ensure crash safety and correctness
  • Provide significant speedup for bulk loading workflows
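
One simple way to observe whether the heap was rewritten, and later to verify the optimization, is to compare the relation's filenode before and after the command. A sketch continuing the staging example above:

SELECT pg_relation_filenode('staging');  -- note the value
ALTER TABLE staging SET LOGGED;
SELECT pg_relation_filenode('staging');  -- today this differs: the table was rewritten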

Skills needed

  • C programming
  • PostgreSQL
  • WAL file handling/recovery

Difficulty level

Hard

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • Optimized SET LOGGED/UNLOGGED that avoids rewrite when safe
  • Clear documentation of when optimization applies
  • Performance benchmarks showing improvement
  • Comprehensive tests for correctness and crash safety
  • Documentation outlining the implementation approach

References

Global Temporary Tables

Project Description

The project focuses on implementing SQL-standard global temporary tables where the table definition is shared (visible to all sessions) but the data is session-private.

Current PostgreSQL temporary tables create both metadata and data as session-local, which causes:

  • Catalog bloat from repeated CREATE TEMPORARY TABLE
  • Inability to reference temp tables in views or functions
  • Migration difficulties from Oracle/DB2, which have global temporary tables

SQL-standard global temporary tables would:

  • Have persistent, shared definitions in pg_class
  • Store data in session-private storage
  • Support ON COMMIT DELETE ROWS and ON COMMIT PRESERVE ROWS
  • Allow references in views, functions, and prepared statements
  • Consider/test implementing the session-private storage as a variation of table partitioning (partitioned by session_id, similar to what pg_temp_### schemas do today)

Example:

-- Created once by DBA
CREATE GLOBAL TEMPORARY TABLE session_data (
    key text PRIMARY KEY,
    value jsonb
) ON COMMIT PRESERVE ROWS;

-- Each session sees only its own data
INSERT INTO session_data VALUES ('user_id', '"123"');
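
A sketch of the proposed ON COMMIT DELETE ROWS variant (not valid syntax in PostgreSQL today; shown only to illustrate the intended semantics):

CREATE GLOBAL TEMPORARY TABLE scratch (n int) ON COMMIT DELETE ROWS;
BEGIN;
INSERT INTO scratch VALUES (1);
SELECT count(*) FROM scratch;  -- 1 inside the transaction
COMMIT;
SELECT count(*) FROM scratch;  -- 0: rows are cleared at commit, the shared definition remains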

Skills needed

  • C programming
  • SQL
  • PostgreSQL

Difficulty level

Hard

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • Working implementation of global temporary tables
  • Support for ON COMMIT DELETE ROWS and ON COMMIT PRESERVE ROWS
  • Proper handling in pg_dump/pg_restore
  • Documentation and migration guide from Oracle GTTs
  • Comprehensive regression tests

References

Autonomous Transactions

Project Description

The project focuses on implementing autonomous transactions - transactions that can commit or rollback independently of their parent transaction.

This is a long-standing PostgreSQL TODO item that would enable important use cases:

  • Audit logging: Log entries that persist even if the main transaction rolls back (for example, recording the locks being held when a timeout occurred)
  • Error handling: Record errors in a log table within exception handlers
  • Sequence-like behavior: Custom sequences that don't roll back
  • Progress tracking: Update progress indicators during long operations
  • Background/async API calls: Call out to an API service that could be slow, without blocking, and regardless of whether the main transaction rolls back

Example use case:

CREATE FUNCTION transfer_funds(from_acc int, to_acc int, amount numeric)
RETURNS void AS $$
BEGIN
    -- This should persist even if transfer fails
    AUTONOMOUS
        INSERT INTO audit_log (action, details, ts)
        VALUES ('transfer_attempt', 
                format('from=%s to=%s amount=%s', from_acc, to_acc, amount),
                now());
    END AUTONOMOUS;
    
    -- Main transaction logic
    UPDATE accounts SET balance = balance - amount WHERE id = from_acc;
    UPDATE accounts SET balance = balance + amount WHERE id = to_acc;
END;
$$ LANGUAGE plpgsql;

This is a complex feature requiring deep understanding of PostgreSQL's transaction management.
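
For context, the closest workaround today is to run the statement over a loopback connection with the dblink extension, so it commits in a separate backend. A sketch (assumes dblink is installed and a local connection is permitted):

CREATE EXTENSION IF NOT EXISTS dblink;
SELECT dblink_exec('dbname=' || current_database(),
                   $q$INSERT INTO audit_log (action, details, ts)
                      VALUES ('transfer_attempt', 'from=1 to=2 amount=100', now())$q$);
-- The insert is committed by the helper backend even if the calling transaction later rolls back.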

Skills needed

  • C programming
  • SQL
  • PL/pgSQL
  • PostgreSQL

Difficulty level

Hard

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • Working implementation of autonomous transactions in PL/pgSQL
  • Proper handling of savepoints and nested autonomous transactions
  • Clear semantics for lock inheritance and visibility
  • Documentation and usage examples
  • Comprehensive regression tests including edge cases

References

B-tree Index Bloat Reduction (Page Merge)

Project Description

The project focuses on implementing efficient B-tree index bloat reduction through page merging. B-tree indexes can become severely bloated after heavy UPDATE/DELETE workloads - in production systems, indexes with 90%+ bloat are common.

Current solutions have significant drawbacks:

  • REINDEX: Requires an AccessExclusiveLock on the index for the entire duration (blocks writes and any queries that would use the index)
  • REINDEX CONCURRENTLY: Very long duration, requires 2x disk space
  • pg_repack: External tool, complex setup, requires double space

This project would implement in-core page merging that:

  • Merges sparsely populated leaf pages with their neighbors
  • Uses two-phase locking (light lock for planning, brief exclusive lock for execution)
  • Provides crash safety through WAL logging
  • Achieves 30-50% bloat reduction with minimal disruption

A prototype extension (pg_btree_compact) has been developed demonstrating the approach. This project would work toward integrating similar functionality into core PostgreSQL, potentially as part of VACUUM or as a new REINDEX variant.
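
Bloat and correctness could be measured with existing extensions; a sketch (bloated_idx is a placeholder for a real index name):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('bloated_idx');  -- low density = bloat
-- ... run the page-merge pass being developed ...
SELECT avg_leaf_density FROM pgstatindex('bloated_idx');  -- should rise after merging
SELECT bt_index_check('bloated_idx');                     -- amcheck must still find a valid index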

Skills needed

  • C programming
  • PostgreSQL
  • B-Tree structures

Difficulty level

Hard

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • Working implementation of B-tree page merging
  • Integration with VACUUM or as standalone command
  • Proper WAL logging for crash safety
  • Minimal exclusive lock duration (<1ms per 100 page merges)
  • Performance benchmarks and bloat reduction measurements
  • Comprehensive tests using amcheck for index integrity verification

References

Monitoring Tools Performance: pg_stat_statements and Lock Contention

Project Description

The project focuses on improving the performance of, and reducing lock contention in, PostgreSQL's monitoring infrastructure, particularly pg_stat_statements and related statistics collectors. Keep in mind that every backend executing queries needs to lock this shared structure for writing, while monitoring software needs to lock it for reading. The faster the system, the larger the fraction of time spent locking and unlocking; and as the number of tracked statements grows, reads take longer as well. Sampling could reduce some of this overhead.

pg_stat_statements is essential for query performance analysis, but it has known scalability issues:

  • Lock contention: High-frequency queries cause contention on the shared hash table
  • Memory pressure: Limited entries (default 5000) cause eviction churn on busy systems
  • Reset overhead: pg_stat_statements_reset() can cause significant pauses
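
One concrete signal of this churn can already be observed and is useful as a project baseline. A sketch (requires pg_stat_statements in shared_preload_libraries; pg_stat_statements_info is available in recent releases):

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SHOW pg_stat_statements.max;                  -- default 5000 tracked entries
SELECT count(*) FROM pg_stat_statements;      -- reading takes the shared lock discussed above
SELECT dealloc FROM pg_stat_statements_info;  -- number of entry evictions so far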

Similar issues affect other monitoring tools:

  • pg_stat_activity updates
  • pg_stat_user_tables/indexes statistics
  • Custom statistics collectors

Goals of this project:

  • Analyze and profile current lock contention patterns
  • Implement lock-free or reduced-lock data structures where possible
  • Add partitioned/sharded hash tables for better scalability
  • Consider/test timing out (and skipping the update) when a backend has to wait too long for the lock
  • Consider/test a secondary queue of pending updates to apply when the lock becomes available
  • Consider/test a circular buffer so readers always return the previously completed portion of the buffer without locking
  • Improve statistics aggregation efficiency
  • Reduce impact on production workloads

This work is critical for observability in high-throughput PostgreSQL deployments where monitoring overhead must be minimized.

Skills needed

  • C programming
  • Performance profiling
  • Concurrency/locking patterns
  • PostgreSQL

Difficulty level

Hard

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • Measurable reduction in lock contention for pg_stat_statements
  • Improved scalability under high query rates (target: 100k+ queries/sec)
  • Benchmarks comparing before/after performance
  • Documentation of new architecture and trade-offs
  • Regression tests ensuring correctness of statistics collection

References

Wait Event Coverage Improvements

Project Description

The project focuses on improving PostgreSQL's wait event coverage to provide more accurate observability into what the database is doing at any given moment, and, more importantly, what it is waiting on. While PostgreSQL already records many waits properly, this project is designed to find as many uninstrumented wait points as possible, clarify what is being waited on or for, and add the code required to track those new wait areas properly.

Many monitoring tools that implement wait event analysis (AWS RDS Performance Insights, pgsentinel, pg_wait_sampling, and others) visualize samples where "wait_event IS NULL" as "CPU" time, typically shown in green. This convention originated from Oracle and has become widespread.

However, this assumption can make analysis inaccurate because there are many places in PostgreSQL code that are not covered by wait events but technically should be - and such places cannot accurately be labeled as "CPU". When a backend shows NULL wait_event, it might be:

  • Actually doing CPU work (legitimate)
  • Performing I/O that isn't instrumented
  • Waiting on internal synchronization not covered by wait events
  • Executing extension code without proper instrumentation
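
The sampling pattern these tools rely on is straightforward, which is exactly why uninstrumented waits end up mislabeled. A sketch of such a sampling query:

SELECT coalesce(wait_event_type || ': ' || wait_event, 'CPU (or uninstrumented wait)') AS bucket,
       count(*)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
GROUP BY 1
ORDER BY 2 DESC;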

This project would:

  • Systematically analyze PostgreSQL source code to identify non-instrumented waits
  • Categorize gaps by type (I/O, synchronization, extension hooks, etc.)
  • Propose and implement new wait events for significant gaps
  • Ensure backward compatibility with existing monitoring tools
  • Document the new wait events and their meanings

Initial analysis has identified potential gaps in areas like:

  • Some file I/O operations
  • Internal memory allocation paths
  • Extension hook execution
  • Background worker coordination
  • Certain replication scenarios

Skills needed

  • C programming
  • PostgreSQL
  • Basic understanding of performance analysis

Difficulty level

Moderate

Project size

Large: approximately 350 hours of work.

Mentors

  • Kirk Wolak (Primary)
  • Nik Samokhvalov (Backup)

Expected outcomes

  • Comprehensive analysis of current wait event coverage gaps
  • Implementation of new wait events for significant uninstrumented code paths
  • Improved accuracy of "CPU" vs "waiting" distinction in monitoring tools
  • Documentation of all new wait events
  • Regression tests ensuring wait events fire correctly
  • Backward compatibility with existing monitoring infrastructure
  • Getting and applying feedback (especially on naming) from the pgsql-hackers mailing list

References