GSoC 2016

From PostgreSQL wiki
Jump to navigationJump to search

This page is for collecting ideas for future Summer of Code projects.

Project Ideas

Project ideas are to be added here by community members.

NOTE: Google wants each idea to be supported by a few sentences and then a list of skills/reqs, difficulty level, potential mentors, expected outcomes, etc. Please add sections in this format, removing any ideas from the one-liner list as they are added below.

Core

  • UPDATE ... RETURNING OLD link
  • Add RETURNING to DDL (CREATE, ALTER, DROP) and possibly DCL (GRANT, REVOKE) link
  • Allow different datatypes to be sliced differently, when TOASTed link
  • Support for microvacuum for GiST
  • Parallel GIN build using background workers
  • Implement WAL logging for hash indexes
  • Indexing prolonged geometrical objects (i.e. boxes, circles, polygons, not points) with SP-GiST by mapping to 4d-space.
  • Make index selection sublinear. Useful for index partitioning (vs table partitioning).
  • Optimization- add to the join removal code the ability to remove joins to tables when the join is on a foreign key column (where all entries must exist in the parent anyway)
  • Optimization- Check the set of conditionals on a WHERE clause against CHECK constraints on the table being queried and remove any conditionals which *must* be true due to the CHECK constraints
  • Investigate the HashJoin code- specifically, review cases where the larger table is hashed instead of the smaller, test making NTUP_PER_BUCKET variable, etc.
  • Improve plpgsql error checking- look at existing discussions around GUCs which can be set to increase the warnings that plpgsql throws.
  • per-function temp tables (or some kind of scoping for temp tables when used inside of plpgsql code)
  • date_trunc() supporting intervals
  • Modify the postgres_fdw to allow a server to be defined as "autocommit" or similar- to allow for autonomous transactions
  • Modify plpgsql to complain about unused variables
  • Comments for function parameters
  • Finally clause for pl/pgsql exception handling
  • Invent an async interface for the Append() node to use, allowing a way to parallelize queries across multiple remote FDWs or local tablespaces.
  • Implement a way to store/re-use the constraint exclusion analysis results for an inheiritence-based partition setup
  • Modify snapshot export to also export lock info, to avoid deadlocks between parallel pg_dump and concurrent DDL
  • PL/PgSQL x++; x+= 2; operators.
  • bytea '&' and '|' operators which returns a bytea, written efficiently in C by stepping through the bytea's in chunks, perhaps.

SERIALIZABLE transaction isolation level improvements

Explicitly support predicate locks in index access methods besides btree

Predicate locking is provided using non-blocking "SIRead" locks to support the Serializable Snapshot Isolation used in PostgreSQL. Without explicit support within an access method (AM) any read from an index using that AM takes out an SIRead lock on the index relation, causing a read-write dependency to be created for any insert to the index, potentially (in combination with other circumstances) leading to an unnecessary serialization failure. Page level predicate locking exists in the btree AM, where it required the addition of 17 function calls to implement, but remains unimplemented in the gin, gist, spgist, brin, and hash index AMs.

Skills needed:

  • The ability to read technical papers to understand the serializable implementation used in PostgreSQL
  • The ability to find the appropriate places in each of the index AMs to insert calls to existing functions such as PredicateLockPage(), PredicateLockPageSplit(), CheckForSerializableConflictOut(), and CheckForSerializableConflictIn(). This requires understanding of each AM, and close detail work.
  • The ability to use the existing testing tool to create regression tests involving each AM, similar to what exists for the btree AM.

The implementation style of serializable transactions in PostgreSQL is a fairly new technique (first published in 2008), and has what has been referred to as "a high cognitive load" -- it is hard to get your head around the details of its implementation. That said, this task is fairly narrow and only requires a general understanding of the overall technique; it is more focused on detail work and testing. Basically, the idea is to have each index scan "lock the gaps" in what it read, using page locks, such that any insert into the index which would have caused the index scan to return a different result sees a lock.

Kevin Grittner (one of the original authors of the current serializable support in PostgreSQL, including the btree AM support) can mentor. Kevin is a major contributor and committer.

The expected outcome is to implement page-level predicate locking in the remaining core index AMs, with appropriate comments and regression tests.

Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions

Within predicate.c there are five static functions to provide an API to manage information about read-write dependencies (also referred to as rw-conflicts). This uses a double-linked list in shared memory, as it was assumed the lists would be short. Some benchmarks at high concurrency levels have seen the code in these functions take up to half the CPU time on the test machine, highlighting the O(N^2) nature of the current implementation. Rewrite these five functions to use a technique that scales better. It is likely that the function signatures will not need to change. Locking around these operations is external, which probably does not need to change -- but a review of that would be good.

Skills needed:

  • The ability to analyze and test scaling issues
  • The ability to run careful benchmarks and present them to the community in a clear and understandable format
  • Knowledge of lock-free techniques to manage sets of data would be a big plus

The implementation style of serializable transactions in PostgreSQL is a fairly new technique (first published in 2008), and has what has been referred to as "a high cognitive load" -- it is hard to get your head around the details of its implementation. That said, this task is fairly narrow and only requires reworking a specific internal API which is currently implemented in about 130 lines of code. This is a performance improvement, and the bar for committing those is fairly high in terms of testing and benchmarking, so the majority of the time on this task would be expected to be in constructing and running tests and benchmarks, including careful attention to "worst case" scenarios and what the impact is in those.

Kevin Grittner (one of the original authors of the current serializable support in PostgreSQL, including the current implementation of this API) can mentor. Kevin is a major contributor and committer.

The expected outcome is to produce a patch that shows clear improvement of worst-case high-concurrency loads using SERIALIZABLE transactions without causing unacceptable performance regression for other cases.

Foreign keys for Array elements

Currently, foreign keys can only point from one column to the value of another column. But for certain database designs it is useful to have a table column reference individual elements within an array column in another table. This has been a desired feature for a long time.

Skills needed:

  • ability to update an old patch to codebase that evolved underneath it
  • ability to identify and satisfy performance concerns raised in previous failed efforts to write this feature

Expected outcome is a committable patch to support the feature, closing the holes in patches previously submitted.

Álvaro Herrera (PostgreSQL contributor and committer) can mentor.


Extensions

  • extending JDBC_FDW to support qual pushdown and/or writeable foreign tables API.