Postgres Professional roadmap
Postgres Professional (Postgres Pro) is a company founded in 2015 by three PostgreSQL major contributors: Oleg Bartunov, Alexander Korotkov and Teodor Sigaev. Postgres Pro develops its own proprietary fork, Postgres Pro Enterprise, and also contributes to PostgreSQL. We share our development roadmap in the hope of coordinating resources better and avoiding duplicated work and conflicts of interest.
Multimaster cluster with sharding
A multimaster cluster with sharding, providing both read and write scalability as well as high availability, is obviously one of the most wanted DBMS features. PostgreSQL community development experience shows that features this large should be implemented step by step. Postgres Professional joins the community efforts in this direction.
- An essential property of a distributed database is support for distributed transactions, including atomic commit and atomic visibility. Atomic commit might be implemented using 2PC, 3PC or another distributed commit protocol, while atomic visibility requires even more sophisticated techniques (a minimal 2PC sketch is given after this list). In order to solve both of these problems we proposed an extensible transaction manager API (xTM). It is a pluggable API with two implementations: DTM (snapshot sharing) and tsDTM (Clock-SI [1]). xTM was presented at PGConf.EU 2015, and patches are published in pgsql-hackers [2]. xTM has not yet received a deep enough review to understand how to get it into the PostgreSQL core. We're looking for opportunities to attract more community members to review this subject.
- We've developed a multi-master cluster with high availability based on tsDTM and logical decoding. It will be presented at PGConf.EU 2016 [3]. An open source version is available at [4]; a version with commercial support is shipped as part of Postgres Pro Enterprise. We would like to eventually get multi-master into the PostgreSQL core. However, following the step-by-step approach, we should deal with xTM first.
- We're working on a sharding solution based on our multi-master cluster and FDW. We're going to provide a beta release during autumn 2017.
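The sketch below shows the standard two-phase commit building block (PREPARE TRANSACTION / COMMIT PREPARED) that a coordinator could drive against two nodes; it is only an illustration of atomic commit, not of the xTM API itself, and the node roles, table and transaction identifier are made up for the example.

-- on node A
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'xfer_42';

-- on node B
BEGIN;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
PREPARE TRANSACTION 'xfer_42';

-- once both nodes have prepared successfully, the coordinator commits on each node
COMMIT PREPARED 'xfer_42';  -- on node A
COMMIT PREPARED 'xfer_42';  -- on node B

Note that 2PC alone gives atomic commit but not atomic visibility: a concurrent reader may still observe node A committed while node B is not yet, which is exactly what DTM/tsDTM address on top of such a protocol.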
Vectorized execution
Vectorized execution is a widely used technique which can accelerate some OLAP queries many times over. Frequently, unleashing the full power of vectorized execution also requires vectorized storage (i.e. columnar storage). Konstantin Knizhnik from Postgres Professional developed the vops extension implementing both vectorized storage (the so-called "parquet" layout) and vectorized execution. Vops is implemented using the FDW interface and has somewhat limited functionality; however, it's already useful for many practical applications.
We hope that one day the functionality offered by vops will become part of the PostgreSQL core. This is why we're participating in pgsql-hackers threads related to pluggable storage and executor improvements (including JIT).
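A rough usage sketch, assuming the tile types and the populate() helper described in the vops README (exact names and signatures should be checked against the extension's documentation):

CREATE EXTENSION vops;
-- ordinary row-oriented table
CREATE TABLE trades (price real, volume real);
-- vops counterpart: the same columns stored as tiles of values
CREATE TABLE trades_vops (price vops_float4, volume vops_float4);
-- copy data from the row store into the vectorized representation
SELECT populate(destination := 'trades_vops'::regclass, source := 'trades'::regclass);
-- aggregates over tile columns are executed in a vectorized manner
SELECT sum(price) FROM trades_vops;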
Pluggable storages
We have achieved significant progress in PostgreSQL extensibility: FDWs, custom access methods, generic WAL. And we definitely need a pluggable storage API. Alexander Korotkov presented a concept of such an API at PGCon 2016. In short, this approach means developing a special kind of FDW with extended capabilities (a native way to store files in the database cluster, vacuum, index definition, etc.). However, in this case it's especially hard to write a new storage engine, because you can't reuse existing index access methods.
Another approach is currently under development in the pgsql-hackers mailing list [5]. The idea of this approach is to extract the interface methods of the heap into a special API called the storage access method API. Thus, users would be able to replace the heap with some other data structure compatible with existing index AMs. However, any storage access method fully compatible with existing index AMs has to be very similar to our heap. So, in order to implement something really different from the current heap, we would have to extend the index AM API as well, including: storage of an arbitrary payload instead of a TID, retail deletion of index tuples, and updates of the payload. Therefore, the ability to reuse existing index AMs might cost us a very long step-by-step journey.
During PGConf.EU 2017, we're going to present an in-memory OLTP storage based on the FDW interface. The main goal of this talk is to show how different a data storage PostgreSQL might have. We hope that eventually the storage access method API will cover our in-memory engine.
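From the user's point of view, a storage access method could eventually look like the following hypothetical sketch, assuming syntax along the lines discussed in [5]; the inmem method and its handler function are invented for illustration:

CREATE ACCESS METHOD inmem TYPE TABLE HANDLER inmem_tableam_handler;
CREATE TABLE hot_data (id int PRIMARY KEY, payload jsonb) USING inmem;

The substance of the API discussion is what such a handler must provide so that existing index AMs, VACUUM and the executor keep working against non-heap storage.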
Effective partitioning
We identify the following important features of effective partitioning:
- HASH, RANGE, LIST partition type support
- Fast partition selection – select the required partitions without a linear-time scan over all of them (see blog post).
- Filter condition simplification – simplify the filter conditions that are passed down to particular partitions (see blog post).
- Hash-join pushdown – automatically switch from a join of partitioned tables to joins of individual partitions when possible.
- Partition selection at execution time – in some cases it's impossible to select the required partitions during query planning, but it is possible at execution time, when the values of parameters are already known. These cases are: nested loops, parameters obtained from subqueries, and generic plans for prepared statements.
- Subpartitioning
We are developing these features in the pg_pathman extension, keeping in mind the declarative partitioning syntax project by Amit Langote.
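A minimal pg_pathman sketch, with function names as given in the pg_pathman README (treat the schema and parameters as illustrative):

CREATE EXTENSION pg_pathman;
CREATE TABLE journal (id serial, dt timestamp NOT NULL, msg text);
-- split the table into 12 monthly RANGE partitions starting from 2017-01-01
SELECT create_range_partitions('journal', 'dt',
                               '2017-01-01'::timestamp, interval '1 month', 12);
-- only the partitions covering March are scanned, without a linear scan of all partitions
SELECT * FROM journal WHERE dt BETWEEN '2017-03-01' AND '2017-03-31';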
Adaptive query planning
Query plan management
Business-critical applications require continuous monitoring and adjustment of query plans. The problem is compounded by the fact that the actual resource consumption of query execution can differ significantly between plans whose estimated costs are very close. Thus, there is a permanent risk of bad plan selection and overall system performance degradation after collecting new statistics. DBAs want to protect themselves from this risk by freezing the query plans of the queries that are most critical to their particular workload. Postgres Professional provides the sr_plan extension which allows DBAs to freeze plans of selected queries. Currently it uses complete serialization/deserialization of query plans. The downside of this approach is portability issues and sensitivity to schema changes. We are looking for a better solution based on platform-independent "logical" serialization of execution plans which could also be less sensitive to schema changes.
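A rough sr_plan workflow, loosely following the extension's README (the GUC and table names below are an assumption and should be checked there):

CREATE EXTENSION sr_plan;
-- record the plans of the statements we are about to run
SET sr_plan.write_mode = true;
SELECT * FROM orders WHERE customer_id = 42;
SET sr_plan.write_mode = false;
-- freeze the recorded plan so the planner keeps using it for this query
UPDATE sr_plans SET enable = true;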
Machine learning
The query planner selects the "cheapest" query plan based on the estimated cost of the plan. But that cost is calculated using many rough assumptions, which is why the estimated cost can differ significantly from the real execution cost. One possibility is to improve the cost estimation mechanism itself by adding features like multivariate statistics. Another possibility is to use query execution feedback: see how estimated parameter values differ from actual values (the EXPLAIN ANALYZE sketch after the list below shows the kind of feedback involved). We can apply machine learning techniques to improve the cost estimates using this feedback, so that the DBMS learns from its own mistakes. We've already done this in a simple case, and further work is planned in the following directions:
- Extend implemented model to cover more use cases,
- Provide the infrastructure necessary to make our machine learning an extension.
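The feedback signal such a model consumes is the gap between estimated and actual row counts, visible in EXPLAIN ANALYZE output (table, predicate and numbers below are purely illustrative):

EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending' AND region = 'EU';
--  Seq Scan on orders  (cost=0.00..1693.00 rows=5 width=97)
--                      (actual time=0.041..12.103 rows=4821 loops=1)

Here the planner assumed the two conditions were independent and estimated 5 rows, while execution returned 4821; accumulating such misestimates is what lets the model correct future estimates.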
Execution-time planning
Currently, query planning strictly precedes query execution. Sometimes this appears to be a serious limitation: when one part of a plan has already been executed, it could be possible to significantly improve the rest of the plan on the basis of the statistics gathered so far. We see two cases where this approach could be applied:
- Online reordering of filter expressions. During sequential scans of large tables, it's important to do the cheapest and most selective checks first. However, the estimated selectivity and cost of filtering are inaccurate, and thus the order of applying filters based on those estimates can be far from optimal. Filter expressions could instead be reordered online on the basis of statistics from their previous executions (see the sketch after this list).
- Some queries could be divided into a sequence of steps where subsequent steps could be replanned on the basis of the results of previous steps. For instance, suppose that step 1 is a scan of table A and that step 2 is a join of tables A and B. We could use the actual row count and data distribution from step 1 to choose the join algorithm for step 2.
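A small illustration of the filter reordering case (table and function names are hypothetical):

SELECT * FROM events
WHERE expensive_geo_check(location)   -- costly, rarely filters anything out
  AND status = 'error';               -- cheap and highly selective

If execution statistics show that status = 'error' rejects almost all rows, evaluating it first avoids most calls to expensive_geo_check(), regardless of what the planner originally estimated.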
Backup
Postgres Professional offers its own backup solution called pg_probackup. It was originally forked from pg_arman, but has been significantly reworked since then and can now be considered a separate product. There is an open source version of pg_probackup; a commercially supported version is shipped as part of Postgres Pro Enterprise.
Block-level incremental backup
Block-level incremental backup appears to be a good alternative to full backup and continuous WAL archiving when the following two conditions are satisfied:
- The number of changed blocks is low in comparison to the total number of blocks.
- The volume of the WAL archive is much larger than the volume of changed blocks. In particular, this means that the same blocks were changed repeatedly.
The main problem is how to get the map of blocks changed since the last backup. There are several options for that:
- Do a full scan of all blocks of the database cluster being backed up.
- Extract changed blocks from the WAL.
- Make PostgreSQL maintain a map of changed blocks (either a bit or an LSN per page).
Barman implements #1, but suffers from high I/O load on the database cluster during backup. We've implemented #2 and #3 in pg_probackup. #3 requires a patch to the core; such a patch is integrated into Postgres Pro Enterprise. There are discussions about adding this functionality to the PostgreSQL core [6], [7].
Backup validation
Many users complain that once a file-level backup has been made, it is hard to check that the backup is valid. There are a lot of things that could be checked in a backed up database cluster. However, these checks should be fast enough to be suitable for running after each backup, while still protecting against typical errors. pg_probackup performs backup validation: it verifies that every data file has been backed up and that the checksums are valid.
Partial backup and partial restore
The current restriction of file-level backups is that only a full database cluster can be backed up. However, users want to back up only some of a cluster's databases, as well as restore only some databases into an existing cluster. The latter feature would require something invasive, like rewriting all xids in the heaps of the databases being restored, but it still seems to be much cheaper than pg_dump/pg_restore.
Connection pooling
Connection pooling was considered to be a purely external solution (like pgbouncer or pgpool) for a very long time. However, external poolers have restrictions. For instance, in session/transaction mode the benefit of poolers is limited: each active session/transaction still needs a separate backend. In statement mode, additional effort is needed to keep all connections in the same state; otherwise you may face a situation where a prepared statement you're using is not defined, or GUCs have wrong values. The true solution to the connection pooling problem would be a built-in pooling facility which guarantees that all user sessions work exactly as they would in separate backends. However, this needs significant infrastructure rework, such as storing whole session states in shared memory.
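The statement-mode problem in a nutshell (names are illustrative; the error is what PostgreSQL reports when a prepared statement is missing):

PREPARE get_user (int) AS SELECT * FROM users WHERE id = $1;
SET work_mem = '256MB';
-- with a statement-mode pooler the next statement may be routed to a different
-- backend, where neither the prepared statement nor the GUC setting exists:
EXECUTE get_user(42);   -- ERROR:  prepared statement "get_user" does not exist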
Page-level data compression
We currently use PGLZ compression of individual values. That's good, but sometimes significant compression can only be achieved by compressing multiple values together. This is why we're considering page-level data compression. The important properties of compressed pages which allow operating on them normally are:
- Decompression of a single page item shouldn't require decompression of the whole page;
- Deletion of a page item shouldn't increase the compressed page size;
- Update of an item header shouldn't increase the compressed page size.
In order to meet those requirements, we propose to compress each item of the page independently, but with a common dictionary, leaving item headers uncompressed.
Wait events monitoring
It's essential for a DBA to see what the DBMS is spending time on. Wait event monitoring allows a DBA to quickly identify bottlenecks in production systems. The basic infrastructure for wait event monitoring was committed in 9.6. This infrastructure allows reading the current wait event of a particular process. The community continuously works on increasing the number of available wait events.
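For example, the current wait event of every backend can be read from pg_stat_activity since 9.6:

SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;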
Statistics on wait events at individual time instants can be accumulated by sampling. Postgres Professional provides pg_wait_sampling for gathering this kind of statistics.
However, when the sampling frequency is low, the statistics might not be accurate enough, and when the sampling frequency is high, statistics collection might cause high overhead. This is why an alternative to sampling is measuring the duration of individual wait events.
Another issue is exposing the parameters of wait events. For example, using the existing wait event monitoring infrastructure one can observe that there is high contention over some buffer. Determining which relation this buffer belongs to would require seeing a wait event parameter: the oid of the relation. However, that would require either a synchronization protocol for reporting wait events, or tolerating occasionally inconsistent wait event information.
Both of these features were implemented by Postgres Professional. During discussions in pgsql-hackers, these features were criticized for causing high overhead. The real-life overhead depends on hardware, OS and workload. For instance, Yandex, which is among the top 20 internet companies in the world by traffic, runs both of these features in production on a heavily loaded database without noticing any overhead from them.
Now it's important to find out whether the PostgreSQL community needs any of these features, or rejects both of them.
Better temporary tables
Temporary tables outside of system catalog
Some applications create temporary tables at a very high rate. In such cases, writing all of these temporary tables to the system catalog becomes an issue, as:
- The system catalog suffers from bloat;
- OIDs are consumed at a high rate, with a risk of wraparound.
We've proposed a patch solving this problem by storing temporary table meta-information in memory, without touching persistent data structures [8]. Discussion showed that this approach has restrictions; for example, it doesn't support tracking dependencies from temporary tables to objects created in other sessions. However, we would like this feature to get into the PostgreSQL core even if it supports only limited cases.
Temporary tables on standbys
Temporary tables on standbys are a very frequently requested feature. Assuming we can store temporary table meta-information in memory, we can implement single-transaction temporary tables on standbys. In order to implement multi-transaction temporary tables, we need to solve the MVCC problem: currently a standby receives information about xids from the master and can't issue any xid itself. The solution could be the introduction of a parallel axis of "local xids" on the standby.
SQL/JSON standard
PostgreSQL was the first relational DBMS to provide an effective binary JSON datatype. However, time has passed and SQL/JSON has become part of the SQL standard. We've submitted a patch implementing SQL/JSON for PostgreSQL [9]. We hope the initial implementation of SQL/JSON will be committed to PostgreSQL 11, while a lot of improvements remain for further releases.
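A taste of the SQL/JSON functions defined by the standard, which the patch targets (the table and path expressions are illustrative, and the exact function set committed may differ):

SELECT JSON_VALUE(js, '$.customer.name') AS customer,
       JSON_EXISTS(js, '$.lines[*] ? (@.qty > 100)') AS has_big_line
FROM orders;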
Generic WAL
Generic WAL was introduced in PostgreSQL 9.6; it allows extensions (such as pluggable index access methods) to write WAL records that contain a per-byte difference between pages. However, when a page modification involves a data shift (for instance, insertion or removal of an index tuple shifts the page tail), the generic WAL size becomes unreasonably high. Oleg Ivanov proposed an improvement to generic WAL including a variation of the alignment algorithm. This improvement enables a compact representation of data shifts in generic WAL records. The patch is published in pgsql-hackers and registered in the commitfest.
Indexing
Covering indexes
Index-only scan is a wonderful feature, and users may want to add more columns to their indexes so that index-only scans are used for more queries. However, when you also need to enforce uniqueness on some column subset, you might end up with two indexes instead of one.
CREATE UNIQUE INDEX olduniqueidx ON oldt USING btree (c1, c2);
CREATE INDEX oldcoveringidx ON oldt USING btree (c1, c2, c3, c4);
Covering indexes solve this issue. Columns that are needed only to be fetched from the index during an index-only scan can be put into the INCLUDING clause.
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDING (c3, c4);
The patch is published in pgsql-hackers and registered in the commitfest.
HOT support for projection indexes
The jsonb datatype introduced in PostgreSQL 9.4 has been continuously growing in popularity. However, HOT, which is a vital optimization for many kinds of workloads, isn't used at full capacity for jsonb. The thing is, HOT works only when no indexed column has been changed, while expression indexes on jsonb are frequently used. When a jsonb document is changed, the indexed expressions might not be affected, yet HOT is still not applied. The proposed patch adds the ability to recheck whether an update of a particular column has really modified any indexed expression, so that the HOT technique can be effectively applied to jsonb. The patch is published in pgsql-hackers and registered in the commitfest.
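The case the patch targets looks roughly like this (the schema is illustrative):

CREATE TABLE docs (id int PRIMARY KEY, js jsonb);
CREATE INDEX docs_name_idx ON docs ((js ->> 'name'));
-- this update changes the document but leaves js ->> 'name' unchanged,
-- so with the recheck in place it can still be a HOT update:
UPDATE docs SET js = jsonb_set(js, '{address}', '"new street 5"') WHERE id = 1;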
Miscellaneous
Incremental sort
Incremental sort provides a cheaper sort over a list of columns when the dataset is already sorted by a prefix of that list. For example, you have an index on c1 and you need to sort the dataset by c1, c2. Then incremental sort can help, because instead of sorting the whole dataset it sorts the individual groups that share the same value of c1. Incremental sort is extremely helpful when you have a LIMIT clause.
The patch is published in pgsql-hackers and registered in the commitfest.
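For illustration (table and column names are made up):

CREATE INDEX t_c1_idx ON t (c1);
SELECT * FROM t ORDER BY c1, c2 LIMIT 10;

The index already delivers rows ordered by c1, so only each group of equal c1 values has to be sorted by c2, and with the LIMIT the scan can stop as soon as ten rows have been produced.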
Jsonb transform for pl/perl and pl/python
A jsonb transform allows you to see the corresponding values as native data structures in stored procedures, rather than as strings. As a further enhancement, we're going to deserialize jsonb into the native representation lazily (i.e. fetching a single value from a jsonb document wouldn't require full deserialization).
Patches [10], [11] are published in pgsql-hackers and registered in the commitfest.
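A sketch of what using the transform from PL/Python could look like, assuming a jsonb_plpythonu transform extension as proposed in the patches (the function and document structure are illustrative):

CREATE EXTENSION plpythonu;
CREATE EXTENSION jsonb_plpythonu;
CREATE FUNCTION total_qty(order_doc jsonb) RETURNS numeric
TRANSFORM FOR TYPE jsonb
LANGUAGE plpythonu AS $$
    # with the transform, order_doc arrives as a Python dict, not a string
    return sum(line["qty"] for line in order_doc["lines"])
$$;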
64-bit xids
Despite the freeze map, which helps a lot, users with high transactional throughput still suffer from wraparound issues. Real 64-bit xids are the fundamental solution to this problem. We're proposing a patch implementing 64-bit xids without significantly bloating the heap with 64-bit values, in the following manner: xmin and xmax remain 32-bit values, and a 64-bit base is stored in the page header. On overflow of the 32-bit values, a single-page freeze is performed on the fly.
The patch is published in pgsql-hackers and registered in the commitfest.