GSoC 2018


This page is for collecting ideas for future Summer of Code projects.

Regarding Project Ideas

Project ideas are to be added here by community members.

NOTE: Google wants each idea to be supported by a few sentences and then a list of skills/reqs, difficulty level, potential mentors, expected outcomes, etc. Please add sections in this format.

Mentors (2018)

The following individuals have been listed as possible mentors on the below projects, and/or offered to be mentors for student-proposed projects:

  • Stephen Frost
  • David Steele
  • Aleksander Alekseev
  • Andrey Borodin
  • Alexander Korotkov
  • Simone Gotti
  • Anastasia Lubennikova
  • Atri Sharma
  • Teodor Sigaev
  • Oleg Bartunov
  • Mark Wong

High availability / failover based on logical replication for Stolon (2018)

Project Description

Currently there are plenty of HA solutions for PostgreSQL based on physical (streaming) replication - Patroni, Stolon, repmgr, and others. Unfortunately, there is no solution that uses the logical replication introduced in PostgreSQL 10. The main advantage of logical replication over physical replication in this context is the ability to upgrade PostgreSQL without downtime. Logical replication also allows the use of temporary tables on replicas. The idea of this project is to introduce logical replication support in Stolon.

It is worth noting that this tool has the potential to evolve into a full-featured open source NewSQL layer on top of PostgreSQL supporting sharding, rebalancing (which is also impossible with physical replication), Percolator-like distributed transactions, a convenient web panel, etc. These are subjects for future GSoC projects, though.

Skills needed
  • Knowledge of Go programming language
  • Optional - knowledge of Python programming language (for testing)
  • Experience using Consul, etcd, or a similar service discovery application
  • Understanding of how distributed systems work - what a netsplit is, leader election, Jepsen testing, etc.
Difficulty Level

Above average.

Potential Mentors
  • Simone Gotti, Open Source software engineer and architect, Stolon developer
  • Aleksander Alekseev, PostgreSQL contributor, author of zson and pg_protobuf extensions, MSU lecturer.
Expected Outcomes

Augment Stolon to abstract away the data replication technique, so that both streaming replication and logical replication can be used. It should also be noted that logical replication currently has some limitations; most importantly, DDL replication is not supported. Solving those limitations is not part of this project.

Thrift datatype support (2018)

Project Description

One of the advantages of document-oriented databases like MongoDB or Couchbase over RDBMSs is the ability to change the data schema easily, quickly, and often. The traditional approach in the RDBMS world involves expensive ALTER TABLE operations, slow migration of existing data, and so on. This approach is often slow and inconvenient for application developers.

To address this issue PostgreSQL provides the JSON and JSONB datatypes. Unfortunately, JSONB has the disadvantage of storing all keys in every document, which is a lot of redundant data.

One possibility for reducing JSONB redundancy is to use the zson extension. It compresses JSONB documents using a shared dictionary of common strings that appear in all or most documents. This approach has its limitations, though. In particular, since the data schema evolves, the dictionary has to be updated from time to time. Also, zson can interfere with PostgreSQL's built-in PGLZ compression, since that mechanism uses heuristics to recognize data that compresses well. Thus zson can sometimes reduce overall performance.

There is another extension, pg_protobuf, which provides Protobuf support for PostgreSQL. It seems to solve all the issues described above and doesn't share the disadvantages of the zson extension.

The idea of this project is to create a similar extension that provides Thrift support. Some users may prefer Thrift to Protobuf, or simply use it for historical reasons. This project is a bit more complicated than pg_protobuf, since unlike Protobuf, Thrift supports various encoding protocols.
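
As a rough illustration of what such an extension's C-level accessor could look like, here is a minimal sketch for the Thrift binary protocol. The function name, the SQL-level interface, and the decision to handle only boolean fields are assumptions made for this sketch, not part of any existing code:

  #include "postgres.h"
  #include "fmgr.h"

  PG_MODULE_MAGIC;

  PG_FUNCTION_INFO_V1(thrift_binary_get_bool);

  /*
   * Hypothetical accessor: return the boolean field with the given id from
   * a Thrift struct serialized with the binary protocol (1-byte type tag,
   * 2-byte big-endian field id, then the value; type 0 terminates).
   */
  Datum
  thrift_binary_get_bool(PG_FUNCTION_ARGS)
  {
      bytea      *data = PG_GETARG_BYTEA_PP(0);
      int32       field_id = PG_GETARG_INT32(1);
      uint8      *p = (uint8 *) VARDATA_ANY(data);
      uint8      *end = p + VARSIZE_ANY_EXHDR(data);

      while (p < end && *p != 0)      /* 0 is T_STOP */
      {
          uint8       type = *p++;
          int16       id;

          if (p + 2 > end)
              break;
          id = (p[0] << 8) | p[1];
          p += 2;

          if (type != 2)              /* 2 is T_BOOL */
              break;                  /* skipping other field types is omitted here */
          if (p >= end)
              break;
          if (id == field_id)
              PG_RETURN_BOOL(*p != 0);
          p++;
      }

      PG_RETURN_NULL();
  }

At the SQL level such a function might be exposed as thrift_binary_get_bool(bytea, int), mirroring how pg_protobuf exposes per-field accessors.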

Skills needed
  • Basic skills of programming in C
Difficulty Level

Relatively simple.

Potential Mentors
  • Aleksander Alekseev, PostgreSQL contributor, author of zson and pg_protobuf extensions, MSU lecturer.
  • Anastasia Lubennikova, PostgreSQL contributor, MSU lecturer, attended GSoC 2017 as a mentor and GSoC 2014 as a student
Expected Outcomes

Extension pg_thrift, similar to pg_protobuf but for Thrift.

Related Discussions

Sorting algorithms benchmark and implementation (2018)

Project Description

Currently, PostgreSQL uses a Hoare-style quicksort implementation based on the 1993 work of Bentley and McIlroy [1], while there are more recent algorithms [2], [3], and [4] which are actively used by highly optimized code such as the Java and .NET runtimes. The use of a more optimized sorting algorithm could probably yield a general system performance improvement. Non-comparison-based algorithms also deserve attention and benchmarking [5]. There is also a related presentation by Gregory Stark [6].

The project has four essential parts:

1. Implementation of a benchmark for sorting, making sure that operations using sorting are represented in proportion to some "average" use cases (a minimal standalone timing harness is sketched after this list).

2. Selection of algorithms to benchmark, based, for example, on scientific papers or community opinion.

3. Benchmarking of the implementations of the selected algorithms, analysis of the results, and selection of the winner.

4. Production-quality implementation for pg_qsort(), pg_qsort_args() and gen_qsort_tuple.pl. The resulting patch is submitted to a commitfest, and another patch is reviewed by the student.
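
For illustration only, a standalone timing harness along these lines could serve as a first cut of part 1; the data size, key type, and uniform random distribution are arbitrary assumptions, and a real benchmark would also have to exercise PostgreSQL's own tuplesort paths and realistic data skew:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Order two 64-bit keys for qsort(). */
  static int
  cmp_int64(const void *a, const void *b)
  {
      int64_t     x = *(const int64_t *) a;
      int64_t     y = *(const int64_t *) b;

      return (x > y) - (x < y);
  }

  int
  main(void)
  {
      size_t      n = 10 * 1000 * 1000;   /* arbitrary input size */
      int64_t    *keys = malloc(n * sizeof(int64_t));
      struct timespec t0, t1;

      if (keys == NULL)
          return 1;

      srand(42);
      for (size_t i = 0; i < n; i++)
          keys[i] = ((int64_t) rand() << 31) ^ rand();

      clock_gettime(CLOCK_MONOTONIC, &t0);
      qsort(keys, n, sizeof(int64_t), cmp_int64);   /* swap in candidate sorts here */
      clock_gettime(CLOCK_MONOTONIC, &t1);

      printf("sorted %zu keys in %.3f s\n", n,
             (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
      free(keys);
      return 0;
  }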

Also, PostgreSQL currently uses the HTAB structure for associative arrays, an implementation of Larson's linear hashing. Chances are this, too, can be improved by more modern dynamic hashing algorithms.

Skills needed
  • ability to grasp the features of state-of-the-art sorting algorithms
  • ability to compose a representative benchmark
  • ability to apply micro-optimizations on different platforms
Difficulty Level

Implementing micro-optimizations and benchmarking them can be hard.

Potential Mentors
  • Andrey Borodin can mentor.
  • Atri Sharma can mentor.
Expected Outcomes

Expected outcomes are the benchmark, its results, the chosen algorithm, and a committable patch implementing it.

References

[1] Bentley, Jon L., and M. Douglas McIlroy. "Engineering a sort function." Software: Practice and Experience 23.11 (1993): 1249-1265.

[2] Musser, David R. "Introspective sorting and selection algorithms." Softw., Pract. Exper. 27.8 (1997): 983-993.

[3] Auger, Nicolas, Cyril Nicaud, and Carine Pivoteau. "Merge Strategies: from Merge Sort to TimSort." (2015).

[4] Beniwal, Sonal, and Deepti Grover. "Comparison of various sorting algorithms: A review." International Journal of Emerging Research in Management & Technology 2 (2013).

[5] McIlroy, Peter M., Keith Bostic, and M. Douglas McIlroy. "Engineering radix sort." Computing Systems 6.1 (1993): 5-27.

[6] https://wiki.postgresql.org/images/5/59/Sorting_through_the_ages.pdf

WAL-G delta backups with WAL scanning (2018)

Project Description

WAL-G is a simple backup tool for PostgreSQL. Essentially, it supports backups and WAL archiving to cloud storage (see the WAL-G repository). Currently, it has delta backups based on LSN mechanics. Delta backups contain only changed pages, but to find those pages we have to read everything from disk.

At the same time, WAL-G is used to send WAL files to the cloud, and these WAL files contain all the information about which blocks have changed. WAL-G could accumulate a map of changed pages between backups and use it to scan the cluster efficiently.

You will need to:

1. Implement WAL scanning.

2. Design a data structure to store and append information on what was changed in the database (see the sketch below).

3. Integrate everything and test that it works correctly.
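
WAL-G itself is written in Go; the following C-style sketch only illustrates the kind of per-relation changed-block bitmap that step 2 could accumulate from WAL block references. Every name and the exact layout are assumptions for illustration:

  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  /* One map per relation file: a bit per block that WAL marked as changed. */
  typedef struct ChangedPageMap
  {
      uint32_t    relfilenode;    /* hypothetical relation identifier */
      size_t      nbits;          /* number of blocks tracked so far */
      uint8_t    *bitmap;         /* one bit per block */
  } ChangedPageMap;

  /* Record that a block was touched by WAL, growing the bitmap as needed. */
  static void
  mark_block_changed(ChangedPageMap *map, uint32_t blkno)
  {
      size_t      need = (size_t) blkno + 1;

      if (need > map->nbits)
      {
          size_t      oldbytes = (map->nbits + 7) / 8;
          size_t      newbytes = (need + 7) / 8;

          map->bitmap = realloc(map->bitmap, newbytes);
          memset(map->bitmap + oldbytes, 0, newbytes - oldbytes);
          map->nbits = need;
      }
      map->bitmap[blkno / 8] |= (uint8_t) (1 << (blkno % 8));
  }

  /* During a delta backup, only blocks whose bit is set need to be read. */
  static int
  block_is_changed(const ChangedPageMap *map, uint32_t blkno)
  {
      return blkno < map->nbits &&
             (map->bitmap[blkno / 8] & (1 << (blkno % 8))) != 0;
  }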

Skills needed
  • Go language
  • Understanding of database recovery
  • Good data structure design skills
Difficulty Level

Moderate.

Potential Mentors
  • Andrey Borodin can mentor.
Expected Outcomes

Support for WAL scanning and efficient delta backups in WAL-G.

GiST API advancement (2018)

Project Description

The GiST API was designed in the early 1990s to reduce the boilerplate code around data access methods over a balanced tree. Now, more than 25 years later, there are some ideas for improving this API.

An opclass developer must specify four core operations to make a type GiST-indexable:

1. Split: a function to split a set of datatype instances into two parts.

2. Penalty calculation: a function to measure the penalty for unifying two keys.

3. Collision check: a function which determines whether two keys may overlap or definitely do not intersect.

4. Unification: a function to combine two keys into one so that the combined key collides with both input keys.

Functions 2 and 3 can be improved.

For example, the insertion algorithm of the Revised R*-tree [1] cannot be expressed in terms of penalty-based algorithms. There have been attempts to bring in parts of the RR*-tree insertion algorithm, but they come down to ugly hacks [2]. The current GiST API, because of its penalty-based insertion algorithm, does not allow implementing an important feature of the RR*-tree: overlap optimization. As Norbert Beckmann, an author of the RR*-tree, put it in the discussion: "Overlap optimization is one of the main elements, if not the main query performance tuning element of the RR*-tree. You would fall back to old R-Tree times if that would be left off."

The collision check currently returns a binary result:

1. The query may collide with the subtree MBR.

2. The query does not collide with the subtree.

This result could be augmented with a third state: the subtree is entirely within the query. In that case, the GiST scan can descend the subtree without any further key checks, as sketched below.
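
A minimal sketch of the proposed three-state collision check, using plain rectangles; the enum and function are hypothetical and only illustrate the idea (current GiST consistent functions return a boolean plus a recheck flag):

  /* Hypothetical three-state result for a rectangle opclass. */
  typedef struct Rect
  {
      double      low_x, low_y;
      double      high_x, high_y;
  } Rect;

  typedef enum CollisionResult
  {
      COLLISION_NONE,     /* query does not intersect the subtree MBR */
      COLLISION_PARTIAL,  /* may intersect: descend and keep checking keys */
      COLLISION_CONTAINS  /* subtree MBR lies entirely within the query:
                           * descend without any further key checks */
  } CollisionResult;

  static CollisionResult
  rect_collision(const Rect *query, const Rect *mbr)
  {
      if (mbr->high_x < query->low_x || mbr->low_x > query->high_x ||
          mbr->high_y < query->low_y || mbr->low_y > query->high_y)
          return COLLISION_NONE;

      if (mbr->low_x >= query->low_x && mbr->high_x <= query->high_x &&
          mbr->low_y >= query->low_y && mbr->high_y <= query->high_y)
          return COLLISION_CONTAINS;

      return COLLISION_PARTIAL;
  }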

The potential effect of these improvements must be benchmarked. Implementing these two will probably spawn more ideas for GiST performance improvements.

Another feature of the RR*-tree is storing an additional MBR in each page. This additional MBR covers all the entries' MBRs just after page creation but is not updated afterwards. When the page is later split, this additional MBR can be used to select a better split ratio and, in turn, achieve better space utilization. Currently GiST doesn't allow storing an additional datum per page, but it would be nice to have.

Finally, GiST does not provide an API for bulk loading. During GSoC 2011, Alexander Korotkov implemented buffered GiST build. This makes index construction faster, but yields an index tree with virtually the same query performance. There are various algorithms that aim to produce a better index tree by exploiting some knowledge of the data, e.g. [3].

Skills needed
  • ability to understand searches within GiST
  • ability to describe and set up datasets for demonstrating performance features
  • ability to produce a committable patch
Difficulty Level

Easy

Potential Mentors
  • Andrey Borodin can mentor.
  • Alexander Korotkov can mentor. Alexander is a major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.
Expected Outcomes

Expected outcomes are the benchmark for API features and committable patch implementing them.

References

[1] Beckmann, Norbert, and Bernhard Seeger. "A revised r*-tree in comparison with related index structures." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.

[2] https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg@mail.gmail.com

[3] Achakeev, Daniar, Bernhard Seeger, and Peter Widmayer. "Sort-based query-adaptive loading of r-trees." Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012.

TOAST'ing in slices (2018)

Project Description

Currently, an entire individual value is compressed as a whole and then stored in chunks in the TOAST table, with no indication of which piece of the original data went into which chunk. This means that to get a subset of the TOAST'd value, all of the chunks (or at least all of them up to the point being searched for, if that is known) have to be reassembled and the entire value decompressed.

Project details

The best approach to slicing a given value will likely depend on the data type. A text field, for example, would most likely be naturally split upon character (not byte) boundaries, while an array data type would likely be split upon its element boundary. A JSON document might be split in another way (presumably based on how a JSONB is structured). A possible additional task would be to consider if it's possible to include in an index the information about which keys or values exist in a given TOAST chunk. For the JSONB case, in particular, one can imagine that it would eventually be possible to extract out just the JSON chunks which have the key or value being searched for.
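
As a sketch of the text case only: the chunk size and function below are illustrative assumptions (real TOAST chunks are close to 2 kB with the default page size), and the point is simply that a cut is moved back so it never lands inside a multi-byte character:

  #include <stddef.h>

  #define TARGET_CHUNK 1996           /* illustrative chunk size in bytes */

  /* Return the byte length of the chunk starting at data[offset]. */
  static size_t
  next_chunk_len(const unsigned char *data, size_t len, size_t offset)
  {
      size_t      end = offset + TARGET_CHUNK;

      if (end >= len)
          return len - offset;        /* final (short) chunk */

      /*
       * Back up while data[end] is a UTF-8 continuation byte (10xxxxxx),
       * so the cut lands on a character boundary and each chunk can be
       * compressed and fetched independently.
       */
      while (end > offset && (data[end] & 0xC0) == 0x80)
          end--;

      return end - offset;
  }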

Skills needed
  • C skills
  • Ability to understand complex on-disk data structures
  • Ability to understand inverted indexing
  • Ability to understand how traditional compression and block-based compression algorithms function
Difficulty Level
  • Moderate-level
Potential Mentors
  • Stephen Frost can mentor. Stephen is a major contributor and committer.
  • Alexander Korotkov can mentor. Alexander is a major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.
Expected Outcomes

Expected outcomes are a patch which implements the ability to split up a single value into known-sized pre-compressed chunks which are then stored in TOAST, and functions able to be optimized based on that knowledge.



Table density estimation for approximate queries (2018)

Project Description

Running OLAP queries over big data is challenging because it requires extraordinary I/O and computational resources. At the same time, exact query answers are not always required when analyzing extremely large amounts of data. Many applications can tolerate approximate queries. We propose to use probabilistic models (possibly based on neural networks) to estimate the probability distribution of the data in a single table. That makes it possible to model the data distribution in the table and therefore to approximately compute aggregate queries with range or equality clauses over it.
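
As an illustration (this formula is not from the proposal): given a learned density estimate for a column of a table with N rows, a range-restricted count can be approximated without scanning the table:

  \widehat{\mathrm{count}}(a \le x \le b) \;\approx\; N \int_{a}^{b} \hat{f}(x)\,dx,
  \qquad \text{where } \hat{f} \text{ is the estimated density of column } x.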

Skills needed

  • ability to perform probabilistic inference and to learn probabilistic models based on neural networks
  • ability to compose or set up a proper benchmark for the feature
  • ability to perform research with a clear description of the mathematics used in the feature and of the results of the experimental evaluation on the benchmark

Difficulty Level

The proposed project requires mathematical research to design the algorithm, which can be hard. It also probably requires creating a benchmark.

Potential Mentors

  • Oleg Bartunov can mentor. Oleg is a major contributor.
  • Alexander Korotkov can mentor. Alexander is a major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.

Expected outcomes

The expected outcomes are the research, the experimental evaluation results, and an extension which implements the described algorithm for approximate queries.

Extract scanning strategy to the separate entity from GiST/GIN/SP-GiST opclasses (2018)

Project Description

PostgreSQL offers great indexing extensibility. It supports index access methods, which are templates for particular kinds of indexes. Opclasses specify the application of an access method to a particular data type. There can be multiple opclasses with different capabilities for the same access method and data type. Moreover, since PostgreSQL 9.6 index access methods can themselves be user-defined.

However, there is still one shortcoming in this extensibility system. An opclass specifies a particular kind of index and its scanning strategies, and the problem is that the scanning strategies of an opclass can't be extended. For instance, an extension could add more spatial operators to the built-in GiST opclasses for geometric datatypes, or fuzzy string matching operators to the built-in SP-GiST opclass for text (radix tree). To overcome this shortcoming we need to make a scanning strategy a separate entity (see the sketch after the list below). Such an entity should contain the search/order-by operators themselves as well as the corresponding support functions such as:

  • consistent, distance for GiST,
  • inner_consistent and leaf_consistent for SP-GiST,
  • extractQuery, consistent, triConsistent and comparePartial for GIN.
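
Purely as an illustration of the design question, a separate scan-strategy entity might bundle an operator with its support functions along the following lines; every name and field here is speculative and none of it exists in PostgreSQL today:

  #include <stdbool.h>

  /* Speculative sketch of a scan strategy as a standalone entity. */
  typedef struct GistScanStrategy
  {
      const char *operator_name;  /* e.g. a new spatial operator added by an extension */
      bool        is_ordering;    /* ORDER BY (distance) strategy vs. WHERE (search) strategy */

      /* support function hooks; the signatures are placeholders */
      bool        (*consistent) (const void *key, const void *query, bool *recheck);
      double      (*distance) (const void *key, const void *query);
  } GistScanStrategy;
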
Skills needed
Difficulty Level

The project is difficult on the implementation side, because it requires consistent modification of many places in the source code. The design question doesn't seem to be that hard, though.

Potential Mentors
  • Alexander Korotkov can mentor. Alexander is a major contributor.
  • Oleg Bartunov can mentor. Oleg is a major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.
Expected Outcomes

The expected outcome is the ability of extensions to add new scanning strategies to existing opclasses.

Improve PostgreSQL Regression Test Coverage (2018)

Project Description

The current regression test coverage for PostgreSQL isn't great, to the point where some areas of the code are covered only at single-digit-percent levels.

Having good regression tests for such an important project as PostgreSQL is really key to minimizing the chance that any regressions are introduced. While this might seem like a small project, it isn't, particularly when it comes to PostgreSQL. PostgreSQL is over 1.3M lines of code and some of the code paths can be tricky to reach.

The current regression test coverage can be seen here: https://coverage.postgresql.org

PostgreSQL's build system includes a "make coverage-html" target to generate the report.

Please note that this project involves writing SQL code and Perl code, at a minimum, to implement the tests necessary to increase the code coverage of the PostgreSQL regression tests. This is not a small task as PostgreSQL is currently at only about 73% LOC coverage.

Skills needed
  • Perl, as many PostgreSQL regression tests are written using the Perl TAP system and new ones will likely need to be.
  • SQL, to craft tests that hit certain code paths
  • Ability to read C code enough to work through a way to get a particular line or lines of code tested.
Difficulty Level

For someone with the skills listed, even at a relatively beginner level, this should be a very straightforward, if tedious, project.

Potential Mentors
  • Stephen Frost can mentor. Stephen is a major contributor and committer and has been working on improving regression tests in PG for quite a while.
Expected Outcomes
  • Significantly improved code coverage for the PostgreSQL regression test suite, e.g. from about 73% to 80%.

Enhancing amcheck for all AMs (2018)

Project Description

amcheck is a PostgreSQL extension that verifies the integrity of indexes against invariants that should always hold in a valid index. The tool is designed to diagnose corruption and to help developers during the implementation of new features in access methods. Currently, amcheck supports only B-tree; work on GiST is in progress (https://github.com/petergeoghegan/amcheck/pull/11). But amcheck could be used for many other index types: GIN, SP-GiST, BRIN, RUM. For each AM it is necessary to deduce the invariants to check, implement those checks, and test them against various index states. It would also be useful to unite all AM check methods in a single entry point for checking an index. The interface of the check functions can also be enhanced to provide more detailed corruption information. It would be useful to model various kinds of corruption, including both those which can be found by data_checksums and those which cannot.
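
For illustration of the kind of invariant such a checker verifies, here is a simplified stand-in (the structures below are not PostgreSQL's real page layout): on a B-tree leaf page, keys must appear in non-decreasing order:

  #include <stdbool.h>
  #include <stdio.h>

  /* Simplified stand-in for a leaf page; real index tuples hold arbitrary datums. */
  typedef struct LeafPage
  {
      int         nkeys;
      int        *keys;
  } LeafPage;

  /* Check the key-ordering invariant on one leaf page. */
  static bool
  check_leaf_key_order(const LeafPage *page, int blkno)
  {
      for (int i = 1; i < page->nkeys; i++)
      {
          if (page->keys[i - 1] > page->keys[i])
          {
              fprintf(stderr, "invariant violated on block %d: key %d out of order\n",
                      blkno, i);
              return false;
          }
      }
      return true;
  }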

Skills needed
  • Good C code skills
  • Understanding of access method ideas and details
Difficulty Level

The project requires grasping the algorithms behind very sophisticated data structures, along with concurrency and recovery over them.

Potential Mentors
  • Andrey Borodin can mentor. Peter Geoghegan, the original author of amcheck, can consult the team on controversial cases.
  • Stephen Frost can mentor.
Expected Outcomes
  • Support for all major PostgreSQL AMs in amcheck.

Develop Performance Farm Database and Website (2018)

Project Description

The PostgreSQL Performance Farm project is a community project to collect performance data from tests as code changes are made to PostgreSQL. To support this effort, a database needs to be created for storing results, and a Web site developed to review results. This project will focus on developing the Web site on top of the database.

The database will be using PostgreSQL in the back-end. Test results will come in the form of JSON and flat files. The Web application will be developed using the Django Web framework.

For reference, the code that will be supplying test results is https://git.postgresql.org/gitweb/?p=pgperffarm.git;a=summary.

As an example, the PostgreSQL Build Farm site [1] is a central repository for the results of testing source code changes for PostgreSQL as they occur, on a wide variety of platforms.

Skills needed
  • Python programming
  • SQL programming
Difficulty Level

This project requires familiarity with Python programming and basic database experience.

Potential Mentors
  • Mark Wong can mentor.
Expected Outcomes
  • A functional Web site where clients can upload test results, and users can search and review uploaded results.
References

[1] https://buildfarm.postgresql.org/