GSoC 2019

From PostgreSQL wiki
Jump to navigationJump to search

This page is for collecting ideas for future Summer of Code projects.

Regarding Project Ideas

Project ideas are to be added here by community members.

NOTE: Google wants each idea to be supported by a few sentences and then a list of skills/reqs, difficulty level, potential mentors, expected outcomes, etc. Please add sections in this format.

Mentors (2019)

The following individuals have been listed as possible mentors on the below projects, and/or offered to be mentors for student-proposed projects:

  • Stephen Frost
  • David Steele
  • Andrey Borodin
  • Vladimir Leskov
  • Dave Page
  • Andreas Scherbaum
  • Mark Wong

WAL-G safety features (2019)

Project Description

WAL-G is the simple backup tool for PostgreSQL. Essentially, it supports backup and WAL archiving to cloud storages. WAL-G repository. Currently, we want to improve backup storing. To be more precise, we would like to see implemented following features:

  • Retention policy improvement -- currently, we can only delete everything before specified backup or retain fix number of backups. We want to improve backup management, E.g. give user ability to mark important backup permanent to avoid its deletion.
  • WAL's history consistency check -- we want to ensure, that we can provide PITR (point in time recovery) for the user.
  • Page checksum verification -- it's one of the key features needed for backup verification.
Skills needed
  • Go language
  • Ability to read technical papers and documentation to understand PostgreSQL inner structure.
Difficulty level

Moderate

Potential Mentors
  • Andrey Borodin can mentor. Andrey is WAL-G developer and PostgreSQL contributor.
  • Vladimir Leskov can mentor. Vladimir is WAL-G developer.
Expected Outcomes

All before mentioned features properly implemented, covered with tests and Pull-Requested to WAL-G.

GiST API advancement (2019)

Project Description

GiST API was designed at the beginning of 90th to reduce boilerplate code around data access methods over a balanced tree. Now, after 30 years, there are some ideas on improving this API.

Opclass developer must specify 4 core operations to make a type GiST-indexable:

1. Split: a function to split set of datatype instances into two parts.

2. Penalty calculation: a function to measure penalty for unification of two keys.

3. Collision check: a function which determines whether two keys may have overlap or are not intersecting.

4. Unification: a function to combine two keys into one so that combined key collides with both input keys.

Functions 2 and 3 can be improved.

For example, Revised R*-tree[1] algorithm of insertion cannot be expressed in terms of penalty-based algorithms. There were some attempts to bring parts of RR*-tree insertion, but they come down to ugly hacks [2]. Current GiST API, due to penalty-based insertion algorithm, does not allow to implement an important feature of RR*-tree: overlap optimization. As Norbert Beckman, author of RR*-tree, put it in the discussion: “Overlap optimization is one of the main elements, if not the main query performance tuning element of the RR*-tree. You would fall back to old R-Tree times if that would be left off.”

Collision check currently returns a binary result:

1. Query may be collides with subtree MBR

2. Query do not collides with subtree

This result may be augmented with a third state: subtree is totally within the query. In this case, GiST scan can scan down subtree without key checks.

The potential effect of these improvements must be benchmarked. Probably, implementation of these two will spawn more ideas on GiST performance improvements.

Another feature of RR*-tree is storing in the page additional MBR. Such additional MBR covers all the entries MBRs just after page creation but don't updated afterwards. Then, when this page is being split, such additional MBR could be used to select better split ration, and in turn in order to have better space utilization. Currently GiST doesn't allow to store some additional datum per page, but it would be nice to have.

Finally, GiST do not provide API for bulk loading. Alexander Korotkov during GSoC 2011 implemented buffered GiST build. This index construction is faster, but yields the index tree with virtually same querying performance. There are different algorithms aiming to provide better indexing tree due to some knowledge of data, e.g. [3]

Skills needed
  • ability to understand searches within GiST
  • ability to describe and set up datasets for performance features demostration
  • ability to produce committable patch
Difficulty Level

Easy

Potential Mentors
  • Andrey Borodin can mentor.
  • Alexander Korotkov can mentor. Alexander is major contributor and committer.
Expected Outcomes

Expected outcomes are the benchmark for API features and committable patch implementing them.

References

[1] Beckmann, Norbert, and Bernhard Seeger. "A revised r*-tree in comparison with related index structures." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.

[2] https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg@mail.gmail.com

[3] Achakeev, Daniar, Bernhard Seeger, and Peter Widmayer. https://www.mathematik.uni-marburg.de/~achakeye/publications/sortBasedLoading.pdf Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012.

TOAST'ing in slices (2019)

Project Description

Currently, an entire individual value is compressed all together and then stored in data chunks in the TOAST table without an indication of which piece of the original data made it into what chunk of the TOAST table. What this ends up meaning is that to get a subset of the TOAST'd value, all of the chunks (or at least all of them up to the point which is being searched for, if that is known) have to be re-formed and the entire value de-compressed.

Project details

The best approach to slicing a given value will likely depend on the data type. A text field, for example, would most likely be naturally split upon character (not byte) boundaries, while an array data type would likely be split upon its element boundary. A JSON document might be split in another way (presumably based on how a JSONB is structured). A possible additional task would be to consider if it's possible to include in an index the information about which keys or values exist in a given TOAST chunk. For the JSONB case, in particular, one can imagine that it would eventually be possible to extract out just the JSON chunks which have the key or value being searched for.

Skills needed
  • C skills
  • Ability to understand complex on-disk data structures
  • Ability to understand inverted indexing
  • Ability to understand how traditional compression and block-based compression algorithms function
Difficulty Level
  • Moderate-level
Potential Mentors
  • Stephen Frost can mentor. Stephen is a major contributor and committer.
Expected Outcomes

Expected outcomes are a patch which implements the ability to split up a single value into known-sized pre-compressed chunks which are then stored in TOAST, and functions able to be optimized based on that knowledge.


de-TOAST'ing using an iterator (2019)

Project Description

Currently, an entire individual value is compressed all together and then stored in data chunks in the TOAST table without an indication of which piece of the original data made it into what chunk of the TOAST table. What this ends up meaning is that to get a subset of the TOAST'd value, all of the chunks (or at least all of them up to the point which is being searched for, if that is known) have to be re-formed and the entire value de-compressed.

Project details

This project would aim to provide the ability to de-TOAST a fully TOAST'd and compressed field using an iterator, and then update the appropriate parts of the code to use the iterator where possible instead of de-TOAST'ing and de-compressing the entire value. Examples where this can be helpful include using substring() from the beginning of the value, or doing a pattern or substring match. This is distinct and simpler than the above project which envisions a larger change to the TOAST system.

Skills needed
  • C skills
  • Ability to understand how traditional compression and block-based compression algorithms function
Difficulty Level
  • Easy-level
Potential Mentors
  • Stephen Frost can mentor. Stephen is a major contributor and committer.
Expected Outcomes

Expected outcomes are a patch which implements de-TOAST'ing a value using an iterator and then updates to the other parts of the code to use that iterator, resulting in a significant performance improvements for certain queries.


Improve PostgreSQL Regression Test Coverage (2019)

Project Description

The current regression test coverage for PostgreSQL isn't great, to the point where some areas of the code are covered only at single-digit-percent levels.

Having good regression tests for such an important project as PostgreSQL is really key to minimizing the chance that any regressions are introduced. While this might seem like a small project, it isn't, particularly when it comes to PostgreSQL. PostgreSQL is over 1.3M lines of code and some of the code paths can be tricky to reach.

The current regression test coverage can be see here: https://coverage.postgresql.org

PostgreSQL's build system includes a "make coverage-html" to generate the report.

Please note that this project involves writing SQL code and Perl code, at a minimum, to implement the tests necessary to increase the code coverage of the PostgreSQL regression tests. This is not a small task as PostgreSQL is currently at only about 73% LOC coverage.

Skills needed
  • Perl, as many PostgreSQL regression tests are written using the Perl TAP system and new ones will likely need to be.
  • SQL, to craft tests that hit certain code paths
  • Ability to read C code enough to work through a way to get a particular line or lines of code tested.
Difficulty Level

For someone with the skills listed, even at a relatively beginner level, should make this a very straight-forward if tedious project.

Potential Mentors
  • Stephen Frost can mentor. Stephen is a major contributor and committer and has been working on improving regression tests in PG for quite a while.
  • Andreas Scherbaum can mentor. Andreas mentored GSoC and Code-In in the past, and in general helps boosting PostgreSQL in Europe.
Expected Outcomes
  • Significantly improved code coverage for the PostgreSQL regression test suite, ie: 73% -> 80%.

Enhancing amcheck for all AMs (2019)

Project Description

Amcheck is a PostgreSQL extension to verify the integrity of index against invariants that should always hold in the valid index. This tool is designed to diagnose corruption and help developers during the implementation of new features in access methods. Currently, amcheck supports only B-tree. Also, work on GiST is in progress https://github.com/petergeoghegan/amcheck/pull/11 But amcheck could be used for many other indexes: GIN, SP-GiST, BRIN, RUM. For each AM it is necessary to deduce invariants to check, implement this checks and test against various index states. Also, it would be useful to unite all AM check methods in a single entry point for checking index. The interface of check functions can also be enhanced in favor of more detailed corruption information. It would be useful to model various corruptions, including both those which can be found by data_checksums and those which cannot.

Skills needed
  • Good C code skills
  • Understanging of access methods ideas and details
Difficulty Level

The project requires to grasp algorithms behind very sophisticated data structures, along with concurrency and recovery over them.

Potential Mentors
  • Andrey Borodin can mentor. Peter Geoghegan can consult team on controversial cases. Peter Geoghegan is the original author of amcheck.
Expected Outcomes
  • Support for all major PostgreSQL AMs in amcheck.

Develop Performance Farm Database and Website (2019)

Project Description

The PostgreSQL Performance Farm project is a community project to collect performance data from tests as code changes are made to PostgreSQL. To support this effort, a database needs to be created for storing results, and a Web site developed to review results. This project will focus on developing the Web site on top of the database.

The database will be using PostgreSQL in the back-end. Test results will come in the form of JSON and flat files. The Web application will be developed using the Django Web framework.

For reference, the code that will be supplying test results is https://git.postgresql.org/gitweb/?p=pgperffarm.git;a=summary.

As an example, the PostgreSQL Build Farm site [1] is a central repository for the results of testing source code changes for PostgreSQL as they occur, on a wide variety of platforms.

Skills needed
  • Python programming
  • SQL programming
Difficulty Level

This project requires familiarity with Python programming and basic database experience.

Potential Mentors
  • Mark Wong can mentor.
  • Andreas Scherbaum can mentor.
  • Pavan Agrawal can mentor.
Expected Outcomes
  • A functional Web site where clients can upload test results, and users can search and review uploaded results.
References

[1] https://buildfarm.postgresql.org/

Read/write transaction-level routing in Odyssey (2019)

Project Description

Odyssey is an advanced multi-threaded PostgreSQL connection pooler and request router. Odyssey allows defining client routing rules by specifying client database, client user name and storage connection. In some HA scenarios, it is necessary to define different routing for read-only and read-write transactions. Odyssey can parse transaction opening statements and assign different routes depending on transaction types.

It would be cool to support many consistency models for reads on standby: 1. Redirecting read query on standby only if standby is lagging no more than X seconds 2. Redirecting read query on standby unless there was recent write (writer should see his writes) 3. Redirecting some read queries on syn or async standby

Skills needed
  • C programming
Difficulty Level

Moderate.

Potential Mentors
  • Andrey Borodin can mentor.
Expected Outcomes
  • Pull request to Odyssey with described feature
References

[1] https://github.com/yandex/odyssey

pgAdmin 4 Query Tool Graphing (2019)

Project Description

pgAdmin has a GIS Viewer built into it's Query Tool as the result of a 2018 GSoC Project, that allows the user to render PostGIS shapes and points etc. and view them on a plain canvas or over one of a number of map sources.

This project would be to add a similar capability to allow the user to plot query results in graph formats, for example, selecting column(s) to view on the X/Y axis and then rendering the output as a line chart, bar chart, or even a pie chart.

Skills needed
  • Python
  • Javascript/JQuery
  • HTML/CSS
Difficulty Level

Moderate.

Potential Mentors
  • Dave Page
Expected Outcomes
  • Feature committed to the pgAdmin tree for the next release, with documentation and test cases.
References

[1] https://www.pgadmin.org/

[2] https://git.postgresql.org/gitweb/?p=pgadmin4.git;a=summary

pgAdmin 4 Query Tool bytea support (2019)

Project Description

pgAdmin's Query Tool currently doesn't render bytea data in its Query Tool, or allow options to update such columns. This project would have 2 core aims:

  • Update the Query Tool to enable bytea data to be uploaded when editing a row.
  • Add viewers to the Query Tool output grid such that when it can be detected that the data in a field is of a known type (image, sound or video file), a button can be displayed in the grid that when clicked will display or play the media.
Skills needed
  • Python
  • Javascript/JQuery
  • HTML/CSS
Difficulty Level

Moderate.

Potential Mentors
  • Dave Page
Expected Outcomes
  • Feature committed to the pgAdmin tree for the next release, with documentation and test cases.
References

[1] https://www.pgadmin.org/

[2] https://git.postgresql.org/gitweb/?p=pgadmin4.git;a=summary

pgAdmin 4 Query Tool automatic mode detection (2019)

Project Description

pgAdmin's Query Tool currently works in 2 different modes:

  • View Data mode allows modifications to the SQL Query only through the UI, but allows editing of data when used with a table with an appropriate primary key.
  • Query mode allows arbitrary queries to be written and executed, but the user cannot edit the results.

The aim of this project is to automatically detect if the query entered will produce in an updatable resultset and enable/disable editing of the results and other parts of the UI as appropriate (for example, disabling the sort/filter UI options if the query string is not one that can be programmatically changed).

Skills needed
  • Python
  • Javascript/JQuery
  • HTML/CSS
Difficulty Level

High.

Potential Mentors
  • Dave Page
Expected Outcomes
  • Feature committed to the pgAdmin tree for the next release, with documentation and test cases.
References

[1] https://www.pgadmin.org/

[2] https://git.postgresql.org/gitweb/?p=pgadmin4.git;a=summary

pgBackRest port to Windows (2019)

Project Description

pgBackRest is being reimplemented into C currently and it is well on its way. There is already enough of the code in C that it's time to start working on porting pgBackRest to Windows. This project would aim to have all of the C code ported to Windows, which would also include some changes in how pgBackRest operates. The necessary changes for running pgBackRest on Windows have been contemplated during the architecture and design of pgBackRest and therefore should be reasonably straight-forward to implement.

Skills needed
  • C
  • Working in a Windows development environment
Difficulty Level

Medium.

Potential Mentors
  • Stephen Frost
  • David Steele
Expected Outcomes
  • pgBackRest able to be built and run on Windows.
References

[1] https://www.pgbackrest.org/

[2] https://github.com/pgbackrest

Extract scanning strategy to the separate entity from GiST/GIN/SP-GiST opclasses (2019)

Project Description

PostgreSQL offers great indexing extensibility. It supports index access methods which are templates for particular kinds of indexes. Opclasses specify application of access method to particular data type. There could be multiple opclasses for same access method and data type with different capabilities. Moreover, since PostgreSQL 9.6+ index access methods could be user-defined themselves.

However, there are still shortcomings in this extendability system. Opclass specifies particular kind of index and its scanning strategies. One problem is that scanning strategies of opclass can't be extended. For instance, extension could add more spatial operators to built-in GiST opclasses for geometrical datatypes; or fuzzy string matching operators to build-in SP-GiST opclass for text (radix tree). In order to overcome such shortcoming we need to make a scanning strategy a separate entity. Such entity should contain search/order by operators itself as well as corresponding supporting functions such as:

  • consistent, distance for GiST,
  • inner_consistent and leaf_consistent for SP-GiST,
  • extractQuery, consistent, triConsistent and comparePartial for GIN.
Skills needed
Difficulty Level

The project is difficult from the implementation side, because it requires consistent modification of many places in the source code. Design question doesn't seem to be hard tough.

Potential Mentors
  • Alexander Korotkov can mentor. Alexander is a major contributor and committer.
  • Oleg Bartunov can mentor. Oleg is a major contributor.
  • Teodor Sigaev can mentor. Teodor is a major contributor and committer.
Expected Outcomes

The expected outcome is ability of extensions to add new extensions to existing opclasses.