The Future of Storage
For some use cases, our current storage design is sub-optimal. We believe it can be improved by adding an abstraction layer between the "heap" operations and storage.
Currently, storage is assumed to consist of row-organized pages: heapam.h is written on the assumption that each tuple has exactly one tuple header and a single data area (a HeapTuple), and at the same time it contains code for logical operations such as tuple deletion, update, and locking. Similarly, the executor represents tuples through the TupleTableSlot abstraction layer, which currently assumes a HeapTuple underneath.
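For reference, the in-memory representation both layers assume is the HeapTupleData struct, abridged here from src/include/access/htup.h: exactly one header pointer and one contiguous data area per tuple.

```c
/* Abridged from src/include/access/htup.h */
typedef struct HeapTupleData
{
    uint32          t_len;      /* length of *t_data */
    ItemPointerData t_self;     /* tuple's (block, item) position, if on disk */
    Oid             t_tableOid; /* table the tuple came from */
    HeapTupleHeader t_data;     /* -> the single tuple header and data area */
} HeapTupleData;

typedef HeapTupleData *HeapTuple;
```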
During 2015, 2ndQuadrant worked on a project to implement columnar storage in Postgres. What follows is a plan derived from the lessons learned during that implementation:
Project Outline
There are six sub-projects:
- Vertical partitioning
- Executor batching
- Executor vectorization
- Columnar indexes
- Pluggable storage for tables
- Columnar storage plug-in
The highest performance gains reported by other "columnar" database systems are found when a vectorized execution engine is plugged into columnar storage: it is possible to do columnar storage without vectorization, but the gains are modest [citation needed] (because CPUs still work on one element at a time); and it is possible to do vectorization without columnar storage, but the gains are also modest [citation needed] (because vectorization first requires converting row-based data into column-based form, a slow operation).
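A minimal illustration of why the two combine (plain standalone C, not PostgreSQL code): when a column is already a dense array, the compiler can emit SIMD instructions for the loop, whereas row-organized data forces strided access, or a transposition step, first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row-organized layout: one struct per tuple. */
typedef struct
{
    int32_t a;
    int32_t b;
} Row;

/* Row-at-a-time: each iteration strides over a whole Row, which
 * largely defeats SIMD; the CPU effectively handles one element
 * at a time. */
int64_t
sum_a_rowwise(const Row *rows, size_t n)
{
    int64_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += rows[i].a;
    return sum;
}

/* Column-at-a-time: "a" is already a dense array, so most compilers
 * auto-vectorize this same loop into SIMD adds. */
int64_t
sum_a_columnwise(const int32_t *a, size_t n)
{
    int64_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```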
Vertical Partitioning
The ability to split a table's storage into several areas, putting a subset of columns into each storage area. There are several points to this:
- skip reading any storage area containing columns not used in the query
- different columns can use different storage strategies (row-based or column-based; different column-based implementations, for experimentation; with or without compression; etc.)
- reconstructing a full tuple from a table with more than one storage area requires joins between the areas (see the sketch after the challenges list below)
Challenges:
- Joining a table's storage areas back together needs separate handling
- Join elimination is key
- Logical/physical tuple representation needs to change (in particular, pg_attribute with a single attrelid value no longer represents the tuple descriptor for a table)
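As a minimal illustration of the split and the implied join, here is a standalone C sketch; every name in it is invented for the example, not a proposed API:

```c
#include <stdint.h>

/* Hypothetical layout: one table vertically partitioned into two
 * storage areas, correlated by a shared logical row number. */
typedef struct
{
    uint64_t rownum;            /* shared logical row identifier */
    int32_t  id;                /* narrow, frequently queried columns */
    int32_t  price;
} Area1Row;

typedef struct
{
    uint64_t rownum;            /* same identifier as in Area1Row */
    char     description[64];   /* wide, rarely read column */
} Area2Row;

/* A query reading only "price" scans area 1 and never touches area 2.
 * A query needing all columns must join the two areas on rownum,
 * which is why join elimination is key to making this perform. */
```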
Executor Batching
This refers to the executor's ability to process multiple tuples at a time in a single node execution, rather than one at a time as it does now. It requires changes to TupleTableSlot with large fallout, as well as changes to individual node execution routines.
This is intended for 9.7.
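To make the batching idea concrete, here is a hypothetical sketch, with all names invented (nothing like this exists in PostgreSQL today): a batched slot would carry per-column arrays of values and null flags for up to a fixed number of tuples, instead of the single Datum/isnull pair per attribute that TupleTableSlot holds now.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;    /* stand-in for PostgreSQL's Datum */

#define BATCH_SIZE 1024     /* tuples processed per node invocation */

/* Hypothetical batched analogue of TupleTableSlot: values are kept
 * column-wise so a node can work on BATCH_SIZE tuples per call. */
typedef struct TupleBatchSlot
{
    int     natts;          /* attributes in the tuple descriptor */
    int     ntuples;        /* tuples currently held, <= BATCH_SIZE */
    Datum **values;         /* values[att][row]: dense array per column */
    bool  **isnull;         /* isnull[att][row]: parallel null flags */
} TupleBatchSlot;
```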
Executor Vectorization
This refers to the executor's ability to use specialized functions that exploit SIMD at the CPU level. It builds on top of executor batching. Aggregates need to provide specialized code to take advantage of it.
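As an illustration, a specialized batch transition function for sum(int4) might look like the following plain C, which most compilers auto-vectorize into SIMD adds; the function name and signature are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical batch transition function for sum(int4): consumes one
 * column of a batch per call.  Branch-free null handling keeps the
 * loop amenable to SIMD auto-vectorization. */
int64_t
int4_sum_batch(int64_t state, const int32_t *values,
               const bool *isnull, int ntuples)
{
    for (int i = 0; i < ntuples; i++)
        state += isnull[i] ? 0 : values[i];
    return state;
}
```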
Columnar indexes
This project is about a new index access method that uses columnar storage. One obvious output is insight into what columnar storage methods work best.
Advantages:
- Columnar indexes are much more compact than standard indexes, so index scans are faster
Pluggable storage for tables
This project is about creating an access-method-like interface for table storage. Currently, all storage goes through heapam.c; this would enable different implementations to be written.
The heapam.c interface assumes that the user has a Relation and a TID (tuple identifier). Currently the TID is simply the (block,item number) physical position of the tuple within the relation. This project might need to change tuple identifiers to suit different storage implementations.
At the same time, the current heapam.c implementation returns a HeapTuple struct containing a tuple, but different implementations might represent tuples in storage in completely different ways. Because we will want to use those different representations without "heapifying" them, further changes may be necessary so that tuples can be passed to the executor code. How this works is unclear and needs more research. Executor batching could rely on this to act on several tuples at once.
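Such an interface would presumably mirror the existing index access method API: a struct of callbacks that each storage implementation fills in. Here is a hypothetical sketch, with every name invented (PostgreSQL supplies the Relation, ItemPointer and TupleTableSlot types via its headers):

```c
#include "postgres.h"
#include "executor/tuptable.h"   /* TupleTableSlot */
#include "storage/itemptr.h"     /* ItemPointer */
#include "utils/rel.h"           /* Relation */

/* Hypothetical table-storage API, modeled on index access methods.
 * None of these callbacks exist in PostgreSQL today; this only
 * sketches the shape such an interface might take. */
typedef struct TableStorageRoutine
{
    /* scan support */
    void   *(*scan_begin) (Relation rel);
    bool    (*scan_getnext) (void *scan, TupleTableSlot *slot);
    void    (*scan_end) (void *scan);

    /* tuple-level operations, addressed by a storage-specific TID */
    bool    (*tuple_fetch) (Relation rel, ItemPointer tid,
                            TupleTableSlot *slot);
    void    (*tuple_insert) (Relation rel, TupleTableSlot *slot);
    void    (*tuple_delete) (Relation rel, ItemPointer tid);
    void    (*tuple_update) (Relation rel, ItemPointer tid,
                             TupleTableSlot *slot);
} TableStorageRoutine;
```

Note how the sketch hands tuples back through a TupleTableSlot rather than as a HeapTuple; that is exactly the executor-facing question raised above.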
Cautionary tale from Tom Lane
We need to avoid a rewrite of the DDL code. Currently, all the utility code assumes that HeapTuples can be passed all over the place. This would fail completely with different storage formats. We need some way to keep this project from becoming tangled in endless refactoring of utility code.
The solution seems simple: we do not need to attack this problem in the system catalogs right away. If we forbid the use of different storage formats for system catalogs, we are protected from having to edit enormous amounts of utility code.
In the future, somebody could refactor code touching an individual catalog to allow pluggable (non-heapam) storage to be used for that catalog. This could be done piecemeal, lifting the restriction on one specific catalog as the code to handle that catalog is reworked.
Columnar storage plug-in
This would be a module that plugs into the table storage mechanism (per the subproject above) and provides column-oriented storage for tables.
Analysis of existing use-cases
While there is a lot of discussion about what a PostgreSQL pluggable storage API should look like, it is useful to analyze the experience of other DBMSes in this field.
MySQL/MariaDB
MySQL and MariaDB provide pluggable table engines. See documentation for MySQL and MariaDB.
| Table engine name(s) | Description | Summary: do we need something like this in PostgreSQL? |
|---|---|---|
| InnoDB | Provides index-organized tables where previous versions of rows are displaced into an undo tablespace. Secondary indexes are indirect, with separate versioning. | Yes. Index-organized tables, an undo tablespace and indirect indexes are useful features. It seems that at least some of them could be implemented using a pluggable storage API. |
| MyISAM | Table engine without transactions or recovery. | No. We have unlogged tables if needed. |
| CSV, Federated, Cassandra, CONNECT, SphinxSE | These table engines access local or remote foreign data sources. | No, because we have FDWs. |
| MRG_MyISAM, Merge | These engines allow defining tables that are unions of a set of primary tables. Such a union is updatable: updates are pushed down to the primary tables. | No, because we have updatable views, partitioning, table inheritance, etc. |
| Archive | Append-only, compressed storage for archival data. | Yes, archive storage would be useful. |
| Blackhole | Table engine that silently "eats" all inserted data. It has use cases in proxies that reduce network traffic in partial replication by filtering replication traffic on the master side. | No, our logical decoding already provides this capability. Moreover, there is a Blackhole FDW with the same functionality. |
| Mroonga | Mroonga implements non-transactional tables with FTS, geospatial and other indexes. | No, because in PostgreSQL new index types should be implemented as index access methods. |
| OQGraph | Table engine that allows querying and indexing of graphs. OQGraph lets users query a kind of view while the primary data is stored in another table. | No, because this table engine is not intended to store primary data. If we needed a similar solution, we should implement it using index access methods, views, materialized views and so on. |
| Sequence | This table engine provides a functional analogue of the generate_series() function. | No, generate_series() is completely sufficient for that purpose. |
| Aria, XtraDB | Enhanced versions of MyISAM and InnoDB from MariaDB. | No, we should avoid fragmentation if possible. |
| TokuDB, RocksDB | These engines provide write-optimized tables. TokuDB implements fractal trees while RocksDB implements LSM-trees. | Yes, write-optimized tables are nice to have. |
| ScaleDB, Spider | Provide clustering built into the table engine. | No, pluggable storage doesn't seem to be the appropriate mechanism for clustering. |
| Memory | The Memory table engine implements non-persistent tables that reside completely in memory. Setting aside the shortcomings of the MySQL implementation, an in-memory engine could give us the following benefits: faster in-memory operations that bypass the buffer manager, and optimized disk usage for persistent in-memory tables thanks to full data snapshots and row-level WAL. | Yes, it would be nice to have in-memory tables if they performed faster. One could object that PostgreSQL is an on-disk database that shouldn't use in-memory storage, but users would be interested in such a storage engine if it gave serious performance advantages. |
MongoDB
MongoDB provides pluggable storage engines. See the documentation.