GSoC 2017

This page is for collecting ideas for future Summer of Code projects.

Regarding Project Ideas

Project ideas are to be added here by community members.

NOTE: Google wants each idea to be supported by a few sentences and then a list of skills/reqs, difficulty level, potential mentors, expected outcomes, etc. Please add sections in this format.

Mentors

The following individuals have been listed as possible mentors on the below projects, and/or offered to be mentors for student-proposed projects:

  • Stephen Frost
  • David Steele
  • Oleg Bartunov
  • Andrew Borodin
  • Kevin Grittner
  • Álvaro Herrera
  • Konstantin Knizhnik
  • Alexander Korotkov
  • Anastasia Lubennikova
  • Atri Sharma
  • Teodor Sigaev

Explicitly support predicate locks in index access methods besides btree

Project Description

Predicate locking is provided using non-blocking "SIRead" locks to support the Serializable Snapshot Isolation used in PostgreSQL. Without explicit support within an access method (AM), any read from an index using that AM takes out an SIRead lock on the entire index relation, causing a read-write dependency to be created for any insert into the index and potentially (in combination with other circumstances) leading to an unnecessary serialization failure. Page-level predicate locking exists in the btree AM, where it required the addition of 17 function calls to implement, but it remains unimplemented in the gin, gist, spgist, brin, and hash index AMs.

Skills needed
  • The ability to read technical papers to understand the serializable implementation used in PostgreSQL
  • The ability to find the appropriate places in each of the index AMs to insert calls to existing functions such as PredicateLockPage(), PredicateLockPageSplit(), CheckForSerializableConflictOut(), and CheckForSerializableConflictIn(). This requires understanding of each AM, and close detail work.
  • The ability to use the existing testing tool to create regression tests involving each AM, similar to what exists for the btree AM.
Difficulty Level

The implementation style of serializable transactions in PostgreSQL is a fairly new technique (first published in 2008), and has what has been referred to as "a high cognitive load" -- it is hard to get your head around the details of its implementation. That said, this task is fairly narrow and only requires a general understanding of the overall technique; it is more focused on detail work and testing. Basically, the idea is to have each index scan "lock the gaps" in what it read, using page locks, such that any insert into the index which would have caused the index scan to return a different result conflicts with one of those locks.
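
As an illustration, here is a minimal sketch of the pattern the btree AM already follows, transplanted into a hypothetical gistScanPage_sketch() routine. PredicateLockPage() is the real predicate.c entry point; the surrounding function and its body are illustrative assumptions only.

    #include "postgres.h"

    #include "access/relscan.h"
    #include "storage/bufmgr.h"
    #include "storage/predicate.h"

    /*
     * Hypothetical sketch: take the page-level SIRead lock while a
     * gist scan reads a page, mirroring what the btree AM does.
     */
    static void
    gistScanPage_sketch(IndexScanDesc scan, Buffer buffer)
    {
        /*
         * Lock the page (the "gap") being read, so that a later insert
         * landing on this page creates a rw-conflict with this scan
         * instead of silently changing what the scan would have seen.
         */
        PredicateLockPage(scan->indexRelation,
                          BufferGetBlockNumber(buffer),
                          scan->xs_snapshot);

        /* ... the existing page-scanning logic would continue here ... */
    }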

Potential Mentors
  • Kevin Grittner (one of the original authors of the current serializable support in PostgreSQL, including the btree AM support) can mentor. Kevin is a major contributor and committer.
  • Stephen Frost can mentor. Stephen is a major contributor and committer.
Expected Outcomes

The expected outcome is to implement page-level predicate locking in the remaining core index AMs, with appropriate comments and regression tests.

Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions

Project Description

Within predicate.c there are five static functions which provide an API to manage information about read-write dependencies (also referred to as rw-conflicts). The current implementation uses a doubly linked list in shared memory, as it was assumed the lists would be short. Some benchmarks at high concurrency levels have seen the code in these functions take up to half the CPU time on the test machine, highlighting the O(N^2) nature of the current implementation. The task is to rewrite these five functions to use a technique that scales better. It is likely that the function signatures will not need to change. Locking around these operations is external, which probably does not need to change -- but a review of that would be good.
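
One possible direction, shown here purely as a sketch with all names assumed rather than taken from predicate.c: if each serializable transaction's out-conflicts were tracked in a fixed-size bitmap of transaction slots, the O(N) list membership test would become an O(1) bit test, and set-style operations could proceed a word at a time. A real patch would have to keep such a structure in shared memory under the existing external locking.

    #include <stdint.h>

    #define MAX_SXACT_SLOTS 1024
    #define BITMAP_WORDS    (MAX_SXACT_SLOTS / 64)

    /* Hypothetical replacement for the doubly linked conflict list. */
    typedef struct RWConflictBitmap
    {
        uint64_t    out[BITMAP_WORDS];  /* slots this xact conflicts out to */
    } RWConflictBitmap;

    /* Record a rw-conflict out to the transaction in 'slot'. */
    static inline void
    rwconflict_set(RWConflictBitmap *map, int slot)
    {
        map->out[slot / 64] |= UINT64_C(1) << (slot % 64);
    }

    /* O(1) membership test replacing the O(N) list walk. */
    static inline int
    rwconflict_exists(const RWConflictBitmap *map, int slot)
    {
        return (map->out[slot / 64] >> (slot % 64)) & 1;
    }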

Skills needed
  • The ability to analyze and test scaling issues
  • The ability to run careful benchmarks and present them to the community in a clear and understandable format
  • Knowledge of lock-free techniques to manage sets of data would be a big plus
Difficulty Level

The implementation style of serializable transactions in PostgreSQL is a fairly new technique (first published in 2008), and has what has been referred to as "a high cognitive load" -- it is hard to get your head around the details of its implementation. That said, this task is fairly narrow and only requires reworking the implementation of a specific internal API which is currently implemented in about 130 lines of code. This is a performance improvement, and the bar for committing those is fairly high in terms of testing and benchmarking, so the majority of the time on this task would be expected to be in constructing and running tests and benchmarks, including careful attention to "worst case" scenarios and what the impact is in those.

Potential Mentors
  • Kevin Grittner (one of the original authors of the current serializable support in PostgreSQL, including the current implementation of this API) can mentor. Kevin is a major contributor and committer.
  • Stephen Frost can mentor. Stephen is a major contributor and committer.
Expected Outcomes

The expected outcome is to produce a patch that shows clear improvement of worst-case high-concurrency loads using SERIALIZABLE transactions without causing unacceptable performance regression for other cases.

Foreign keys for Array elements

Project Description

Currently, foreign keys can only point from one column to the value of another column. But for certain database designs it is useful to have a table column reference individual elements within an array column in another table. This has been a desired feature for a long time.

Skills needed
  • ability to update an old patch to a codebase that has evolved underneath it
  • ability to identify and address the performance concerns raised in previous failed attempts to implement this feature
Difficulty Level
Potential Mentors
  • Álvaro Herrera (PostgreSQL contributor and committer) can mentor.
  • Stephen Frost can mentor. Stephen is a major contributor and committer.
  • Alexander Korotkov can mentor. Alexander is a major contributor.
Expected Outcomes

The expected outcome is a committable patch supporting the feature and closing the holes in previously submitted patches.

Sorting algorithms benchmark and implementation

Project Description

Currently, PostgreSQL uses an implementation of Hoare's quicksort based on the 1993 work of Bentley and McIlroy [1], while more recent algorithms [2], [3], [4] are actively used in highly optimized code such as the Java and .NET runtimes. Using a more optimized sorting algorithm could yield a general system performance improvement. The use of non-comparison-based algorithms also deserves attention and benchmarking [5]. There is also a related presentation by Gregory Stark [6].

The project has four essential parts:

1. Implementation of a benchmark for sorting, making sure that operations which use sorting are represented in proportion to some "average" use case.

2. Selection of candidate algorithms, based, for example, on scientific papers or community opinions.

3. Benchmarking of the selected algorithms, analysis of the results, and selection of a winner.

4. Industrial-strength implementation for pg_qsort(), pg_qsort_args() and gen_qsort_tuple.pl. The implemented patch is submitted to a commitfest, and the student reviews another patch in return. (See the sketch below for a minimal benchmarking harness.)
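
The following is a minimal, self-contained sketch of part 3, timing the C library qsort() against a textbook LSD radix sort on random 32-bit keys. It is illustrative only: a real submission would drive pg_qsort() and gen_qsort_tuple.pl inside PostgreSQL and model realistic key distributions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>
    #include <time.h>

    #define N 1000000

    static int
    cmp_uint32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *) a;
        uint32_t y = *(const uint32_t *) b;

        return (x > y) - (x < y);
    }

    /* Textbook LSD radix sort, one byte per pass. */
    static void
    radix_sort(uint32_t *a, size_t n)
    {
        uint32_t *buf = malloc(n * sizeof(uint32_t));

        for (int shift = 0; shift < 32; shift += 8)
        {
            size_t count[257] = {0};

            for (size_t i = 0; i < n; i++)
                count[((a[i] >> shift) & 0xFF) + 1]++;
            for (int b = 0; b < 256; b++)
                count[b + 1] += count[b];
            for (size_t i = 0; i < n; i++)
                buf[count[(a[i] >> shift) & 0xFF]++] = a[i];
            memcpy(a, buf, n * sizeof(uint32_t));
        }
        free(buf);
    }

    static void
    qsort_wrapper(uint32_t *a, size_t n)
    {
        qsort(a, n, sizeof(uint32_t), cmp_uint32);
    }

    /* Time one sort routine over n keys, in seconds. */
    static double
    bench(void (*sort) (uint32_t *, size_t), uint32_t *data, size_t n)
    {
        clock_t t0 = clock();

        sort(data, n);
        return (double) (clock() - t0) / CLOCKS_PER_SEC;
    }

    int
    main(void)
    {
        uint32_t *a = malloc(N * sizeof(uint32_t));
        uint32_t *b = malloc(N * sizeof(uint32_t));

        srand(42);
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] = ((uint32_t) rand() << 16) ^ (uint32_t) rand();

        printf("qsort: %.3fs\n", bench(qsort_wrapper, a, N));
        printf("radix: %.3fs\n", bench(radix_sort, b, N));
        return 0;
    }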

Skills needed
  • ability to grasp the features of state-of-the-art sorting algorithms
  • ability to compose a representative benchmark
  • ability to apply micro-optimizations on different platforms
Difficulty Level

Implementing micro-optimizations and benchmarking them reliably can be hard.

Potential Mentors
  • Andrey Borodin can mentor.
Expected Outcomes

Expected outcomes are the benchmark, the benchmark results, the chosen algorithm, and a committable patch implementing it.

References

[1] Bentley, Jon L., and M. Douglas McIlroy. "Engineering a sort function." Software: Practice and Experience 23.11 (1993): 1249-1265.

[2] Musser, David R. "Introspective sorting and selection algorithms." Software: Practice and Experience 27.8 (1997): 983-993.

[3] Auger, Nicolas, Cyril Nicaud, and Carine Pivoteau. "Merge Strategies: from Merge Sort to TimSort." (2015).

[4] Beniwal, Sonal, and Deepti Grover. "Comparison of various sorting algorithms: A review." International Journal of Emerging Research in Management &Technology 2 (2013).

[5] McIlroy, Peter M., Keith Bostic, and M. Douglas McIlroy. "Engineering radix sort." Computing Systems 6.1 (1993): 5-27.

[6] https://wiki.postgresql.org/images/5/59/Sorting_through_the_ages.pdf

GiST API advancement

Project Description

The GiST API was designed in the early 1990s to reduce the boilerplate code around data access methods over a balanced tree. Now, some 25 years later, there are some ideas on improving this API.

An opclass developer must specify four core operations to make a type GiST-indexable (a simplified sketch follows the list):

1. Split: a function to split a set of datatype instances into two parts.

2. Penalty calculation: a function to measure the penalty for unifying two keys.

3. Collision check: a function that determines whether two keys may overlap or are disjoint.

4. Unification: a function to combine two keys into one such that the combined key collides with both input keys.
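
The simplified sketch below shows these four operations as plain C function types. The real interface in src/backend/access/gist/ goes through fmgr with GISTENTRY structs, so the typedefs and names here are illustrative assumptions only.

    #include <stdbool.h>

    typedef struct Key Key;     /* opaque opclass-specific key */

    /* 1. Split a set of keys into two groups. */
    typedef void  (*gist_picksplit_fn) (Key **entries, int nentries,
                                        bool *goes_left);
    /* 2. Penalty of enlarging 'orig' so that it also covers 'add'. */
    typedef float (*gist_penalty_fn) (const Key *orig, const Key *add);
    /* 3. May the subtree under 'key' contain entries matching 'query'? */
    typedef bool  (*gist_consistent_fn) (const Key *key, const Key *query);
    /* 4. Smallest key that collides with both inputs. */
    typedef Key  *(*gist_union_fn) (const Key *a, const Key *b);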

Functions 2 and 3 can be improved.

For example, the insertion algorithm of the Revised R*-tree (RR*-tree) [1] cannot be expressed in terms of penalty-based algorithms. There have been some attempts to bring in parts of RR*-tree insertion, but they come down to ugly hacks [2]. Due to its penalty-based insertion algorithm, the current GiST API does not allow implementing an important feature of the RR*-tree: overlap optimization. As Norbert Beckmann, author of the RR*-tree, put it in the discussion: "Overlap optimization is one of the main elements, if not the main query performance tuning element of the RR*-tree. You would fall back to old R-Tree times if that would be left off."

Collision check currently returns a binary result:

1. The query may collide with the subtree MBR.

2. The query does not collide with the subtree.

This result may be augmented with a third state: the subtree is entirely contained within the query. In that case, a GiST scan can descend the subtree without further key checks.
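
A hypothetical sketch of the augmented, three-valued result; the enum and its names are assumptions, not existing PostgreSQL API:

    typedef enum GistCollisionResult
    {
        GIST_NO_OVERLAP,        /* subtree cannot match: skip it entirely */
        GIST_MAYBE_OVERLAP,     /* descend and keep checking keys         */
        GIST_FULLY_CONTAINED    /* subtree is entirely inside the query:
                                 * scan it without further key checks     */
    } GistCollisionResult;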

The potential effect of these improvements must be benchmarked. Implementing these two changes will probably also spawn more ideas for GiST performance improvements.

Another feature of the RR*-tree is storing an additional MBR in each page. This additional MBR covers all of the entries' MBRs just after page creation but is not updated afterwards. When the page is eventually split, the additional MBR can be used to select a better split ratio and, in turn, achieve better space utilization. Currently GiST does not allow storing an additional datum per page, but it would be nice to have.

Finally, GiST provides no API for bulk loading. During GSoC 2011, Alexander Korotkov implemented buffered GiST build. This makes index construction faster, but it yields an index tree with virtually the same query performance. There are different algorithms aiming to produce a better index tree by exploiting some knowledge of the data, e.g. [3].

Skills needed
  • ability to understand searches within GiST
  • ability to describe and set up datasets for demonstrating performance features
  • ability to produce a committable patch
Difficulty Level

Easy

Potential Mentors
  • Andrey Borodin can mentor.
  • Alexander Korotkov can mentor. Alexander is major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.
Expected Outcomes

Expected outcomes are the benchmark for API features and committable patch implementing them.

References

[1] Beckmann, Norbert, and Bernhard Seeger. "A revised r*-tree in comparison with related index structures." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.

[2] https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg@mail.gmail.com

[3] Achakeev, Daniar, Bernhard Seeger, and Peter Widmayer. "Sort-based query-adaptive loading of r-trees." Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012.

TOAST'ing in slices

Project Description

Currently, an entire individual value is compressed all together and then stored in data chunks in the TOAST table, without any indication of which piece of the original data made it into which chunk of the TOAST table. This means that to get a subset of the TOAST'd value, all of the chunks (or at least all of them up to the point being searched for, if that is known) have to be re-formed and the entire value decompressed.

Project details

The best approach to slicing a given value will likely depend on the data type. A text field, for example, would most likely be naturally split upon character (not byte) boundaries, while an array data type would likely be split upon its element boundary. A JSON document might be split in another way (presumably based on how a JSONB is structured). A possible additional task would be to consider if it's possible to include in an index the information about which keys or values exist in a given TOAST chunk. For the JSONB case, in particular, one can imagine that it would eventually be possible to extract out just the JSON chunks which have the key or value being searched for.
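
As a purely hypothetical illustration, each pre-compressed chunk could carry a small header recording the logical (uncompressed) range it covers, so a substring operation needs to decompress only the chunks that overlap the requested slice. This struct is an assumption for illustration, not the existing TOAST chunk format.

    #include <stdint.h>

    /* Hypothetical per-chunk header for "sliced" TOAST storage. */
    typedef struct ToastSliceChunkHeader
    {
        uint32_t    logical_offset;     /* first uncompressed byte covered */
        uint32_t    logical_length;     /* uncompressed bytes covered      */
        uint32_t    compressed_length;  /* compressed payload that follows */
    } ToastSliceChunkHeader;

With such headers, a substring operation could binary-search the chunk list by logical_offset and decompress just the overlapping chunks, instead of reassembling and decompressing the whole value.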

Skills needed
  • C skills
  • Ability to understand complex on-disk data structures
  • Ability to understand inverted indexing
  • Ability to understand how traditional compression and block-based compression algorithms function
Difficulty Level
  • Moderate-level
Potential Mentors
  • Stephen Frost can mentor. Stephen is a major contributor and committer.
  • Alexander Korotkov can mentor. Alexander is a major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.
Expected Outcomes

Expected outcomes are a patch which implements the ability to split up a single value into known-sized pre-compressed chunks which are then stored in TOAST, and functions able to be optimized based on that knowledge.

Implementing push-based query executor

Project Description

Currently, PostgreSQL uses the traditional Volcano-style [1] query execution model. While it is a simple and flexible model, it behaves poorly on modern superscalar CPUs [2][3] due to poor locality and frequent instruction mispredictions. This becomes a major issue for complex OLAP queries with CPU-heavy workloads. We propose to implement a so-called push-based query executor model, as described in [4][5], which improves code and data locality and cache usage; a push-based executor can also serve as a platform for efficient JIT query compilation.

See [6] for more details.
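
To make the contrast concrete, here is a toy, self-contained comparison of the two models; the names are illustrative and none of this is PostgreSQL executor API. In the pull model the consumer calls down for each tuple, paying an indirect call per row; in the push model the producer drives a tight, fused loop, which is also an easy target for JIT compilation.

    #include <stdio.h>

    #define NROWS 8
    static int rows[NROWS] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Pull (Volcano) model: one "next tuple" call per row. */
    static int pos = 0;

    static int *
    scan_next(void)
    {
        return pos < NROWS ? &rows[pos++] : NULL;
    }

    static void
    run_pull(void)
    {
        int        *row;
        long        sum = 0;

        while ((row = scan_next()) != NULL)
            if (*row % 2 == 0)      /* filter node pulls from the scan */
                sum += *row;        /* aggregate node pulls from the filter */
        printf("pull: %ld\n", sum);
    }

    /* Push model: the scan drives one fused loop with good locality. */
    static void
    run_push(void)
    {
        long        sum = 0;

        for (int i = 0; i < NROWS; i++)
        {
            int         row = rows[i];

            if (row % 2 == 0)       /* filter consumes the pushed tuple */
                sum += row;         /* aggregate consumes it next */
        }
        printf("push: %ld\n", sum);
    }

    int
    main(void)
    {
        run_pull();
        run_push();
        return 0;
    }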

Skills needed
  • The ability to understand and modify the PostgreSQL executor code;
  • The ability to run careful in-memory benchmarks to demonstrate the result;
  • The ability to profile Postgres in order to find slow code;
  • Understanding of modern processor features (pipelining, superscalar execution, branch prediction, etc.) would be very helpful.
Difficulty Level

Moderate-level; however, the micro-optimizations might be hard. It will probably also be hard to keep the whole architecture as clean as it is now.

Potential Mentors
  • It would be great to involve Andres Freund, who has already published a number of patches pushing Postgres in this direction.
  • Alexander Korotkov can mentor. Alexander is a major contributor.
  • Konstantin Knizhnik, author of the IMCS (in-memory columnar store) extension for Postgres and a member of the multimaster team, can mentor.
Expected Outcomes

A patch implementing the push-based query executor; smaller patches that optimize the current query executor; and benchmarks showing query performance on plain Postgres and with the patch applied.

References

[1] Graefe, Goetz. "Volcano - an extensible and parallel query evaluation system." IEEE Transactions on Knowledge and Data Engineering 6.1 (1994): 120-135.

[2] MonetDB/X100: Hyper-Pipelining Query Execution http://cidrdb.org/cidr2005/papers/P19.pdf

[3] Vectorization vs. Compilation in Query Execution, https://pdfs.semanticscholar.org/dcee/b1e11d3b078b0157325872a581b51402ff66.pdf

[4] Efficiently Compiling Efficient Query Plans for Modern Hardware, http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

[5] Compiling Database Queries into Machine Code, http://sites.computer.org/debull/A14mar/p3.pdf

[6] https://www.postgresql.org/message-id/CAFRJ5K3sDGSn%3DpKgnsobYQX4CMTOU%3D0uJ-vt2kF3t1FsVnTCRQ%40mail.gmail.com

Table density estimation for approximate queries

Project Description

Running OLAP queries over big data is challenging because it requires extraordinary I/O and computational resources. At the same time, exact query answers are not always required when analyzing extremely large amounts of data, and many applications can tolerate approximate answers. We propose to use probabilistic models (possibly based on neural networks) to estimate the probability distribution of the data in a single table. This makes it possible to model the distribution of the data in the table and therefore to compute approximate answers to aggregate queries over the table with range or equality clauses.
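
A minimal sketch of the idea, in which a simple equi-width histogram stands in for a fitted neural density model; all names here are illustrative assumptions.

    #include <stdio.h>

    #define NBINS 10

    typedef struct DensityModel
    {
        double      min, max;       /* column value range         */
        double      prob[NBINS];    /* P(value in bin); sums to 1 */
        double      nrows;          /* table cardinality          */
    } DensityModel;

    /*
     * Approximate SELECT count(*) WHERE col BETWEEN lo AND hi as
     * nrows * P(lo <= col <= hi), read off the model.
     */
    static double
    approx_count(const DensityModel *m, double lo, double hi)
    {
        double      width = (m->max - m->min) / NBINS;
        double      p = 0.0;

        for (int i = 0; i < NBINS; i++)
        {
            double      blo = m->min + i * width;
            double      bhi = blo + width;
            double      olo = lo > blo ? lo : blo;  /* overlap of [lo, hi] */
            double      ohi = hi < bhi ? hi : bhi;  /* with this bin       */

            if (ohi > olo)
                p += m->prob[i] * (ohi - olo) / width;
        }
        return m->nrows * p;
    }

    int
    main(void)
    {
        DensityModel m = {0.0, 100.0,
            {0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1}, 1000000.0};

        /* ~ SELECT count(*) FROM t WHERE col BETWEEN 25 AND 75; */
        printf("approx count: %.0f\n", approx_count(&m, 25.0, 75.0));
        return 0;
    }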

Skills needed

  • ability to perform probabilistic inference and to train probabilistic models based on neural networks
  • ability to compose or set up a proper benchmark for the feature
  • ability to carry out research with a clear description of the mathematics used in the feature and of the results of the experimental evaluation on the benchmark

Difficulty Level

The proposed project requires mathematical research to design the algorithm, which can be hard. It probably also requires creating a benchmark.

Potential Mentors

  • Oleg Bartunov can mentor. Oleg is a major contributor.
  • Alexander Korotkov can mentor. Alexander is a major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.

Expected Outcomes

The expected outcomes are the research, the experimental evaluation results, and an extension implementing the described algorithm for approximate queries.

Extract scanning strategy to the separate entity from GiST/GIN/SP-GiST opclasses

Project Description

PostgreSQL offers great indexing extensibility. It supports index access methods, which are templates for particular kinds of indexes. Opclasses specify the application of an access method to a particular data type. There can be multiple opclasses with different capabilities for the same access method and data type. Moreover, since PostgreSQL 9.6, index access methods can themselves be user-defined.

However, there is still one shortcoming in this extensibility system. An opclass specifies a particular kind of index together with its scanning strategies, and the problem is that the scanning strategies of an opclass cannot be extended. For instance, an extension might want to add more spatial operators to the built-in GiST opclasses for geometric datatypes, or fuzzy string matching operators to the built-in SP-GiST opclass for text (a radix tree), but currently cannot. To overcome this shortcoming, we need to make the scanning strategy a separate entity. Such an entity should contain the search/order-by operators themselves as well as the corresponding support functions (a hypothetical sketch of such an entity follows the list), such as:

  • consistent, distance for GiST,
  • inner_consistent and leaf_consistent for SP-GiST,
  • extractQuery, consistent, triConsistent and comparePartial for GIN.
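
A hypothetical sketch of what such a first-class scanning-strategy entity might look like; none of these names exist in PostgreSQL today.

    #include "postgres.h"
    #include "fmgr.h"

    /*
     * Assumed entity bundling a search/order-by operator with the
     * support functions each AM needs, so that an extension could
     * register additional strategies against an existing opclass.
     */
    typedef struct ScanStrategy
    {
        Oid         oproid;                 /* search / order-by operator */

        /* GiST */
        FmgrInfo    consistentFn;
        FmgrInfo    distanceFn;

        /* SP-GiST */
        FmgrInfo    innerConsistentFn;
        FmgrInfo    leafConsistentFn;

        /* GIN */
        FmgrInfo    extractQueryFn;
        FmgrInfo    ginConsistentFn;
        FmgrInfo    triConsistentFn;
        FmgrInfo    comparePartialFn;
    } ScanStrategy;
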
Skills needed
Difficulty Level

The project is difficult from the implementation side, because it requires consistent modification of many places in the source code. The design question doesn't seem to be hard, though.

Potential Mentors
  • Alexander Korotkov can mentor. Alexander is a major contributor.
  • Oleg Bartunov can mentor. Oleg is a major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.
Expected Outcomes

The expected outcome is the ability for extensions to add new scanning strategies to existing opclasses.

Improve PostgreSQL Regression Test Coverage

Project Description

The current regression test coverage for PostgreSQL isn't great, to the point where some areas of the code are covered only at single-digit-percent levels.

Having good regression tests for a project as important as PostgreSQL is key to minimizing the chance that regressions are introduced. While this might seem like a small project, it isn't, particularly when it comes to PostgreSQL: the codebase is over 1.3M lines of code, and some of the code paths can be tricky to reach.

The current regression test coverage can be seen here: https://coverage.postgresql.org

PostgreSQL's build system includes a "make coverage-html" target to generate the report.

Please note that this project involves writing SQL code and Perl code, at a minimum, to implement the tests necessary to increase the code coverage of the PostgreSQL regression tests. This is not a small task, as PostgreSQL is currently at only about 70% LOC coverage.

Skills needed
  • Perl, as many PostgreSQL regression tests are written using the Perl TAP system and new ones will likely need to be.
  • SQL, to craft tests that hit certain code paths
  • Ability to read C code enough to work through a way to get a particular line or lines of code tested.
Difficulty Level

For someone with the skills listed, even at a relatively beginner level, this should be a very straightforward, if tedious, project.

Potential Mentors
  • Stephen Frost can mentor. Stephen is a major contributor and committer and has been working on improving regression tests in PG for quite a while.
Expected Outcomes
  • Significantly improved code coverage for the PostgreSQL regression test suite, e.g. from 71% to 80%.