PostgreSQL 9.6 timeline

PG-Strom extension

PL/CUDA (done)
- It allows raw CUDA code to be written as an SQL function. PG-Strom moves a batch of data to the GPU, then launches the user-defined kernel function.
CPU+GPU Hybrid Parallel (in-progress)
- It places CustomScan nodes under the Gather node. Each worker uses the GPU for finer-grained parallel execution.
SSD-to-GPU Direct DMA (in-progress)
- It enables direct data transfer from SSD blocks to GPU RAM, so the CPU receives only valid rows, joined records, or partially aggregated values.

Sparse Matrix Datatype - It represents vector/matrix data as a basis for analytic/statistical algorithms in the database system. It is intended to (1) handle large data sets of more than 1GB, and (2) keep memory consumption reasonable if the matrix is actually sparse.
Analytic Functions - As a use case of the sparse matrix type, we plan to submit several analytic/statistical functions as a contrib module. Unsupervised learning algorithms (k-means, Ward clustering) are candidates.
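
To illustrate why a sparse representation keeps memory consumption reasonable, here is a minimal sketch of the common compressed sparse row (CSR) layout. This is only an assumption about what such a datatype could look like, not the planned on-disk format; the names to_csr and csr_matvec are hypothetical.

```python
from array import array


def to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays.

    Returns (values, col_idx, row_ptr): only non-zero cells are stored,
    so memory grows with the number of non-zeros, not rows * columns.
    """
    values = array("d")          # non-zero values, row by row
    col_idx = array("i")         # column index of each stored value
    row_ptr = array("i", [0])    # row i occupies values[row_ptr[i]:row_ptr[i+1]]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x, touching only non-zeros."""
    result = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        result.append(acc)
    return result


# Hypothetical sample data: 3x4 matrix with three non-zero cells.
dense = [
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 2.0],
]
values, col_idx, row_ptr = to_csr(dense)
product = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
```

Matrix-vector products like this are the inner step of algorithms such as k-means, which is why the storage layout matters for the analytic functions.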

Parallel-aware Append - When multiple partitioned tables are scanned, or sub-queries are combined with a simple UNION ALL, we have a good chance to run these portions in parallel under the Gather node. It is a feature Oracle does not have but people often want.
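
The idea can be sketched outside the executor as follows. This is a toy model, assuming each child of the Append is an independent list of rows and each worker is a thread; scan_partition and parallel_append are hypothetical names, not PostgreSQL functions.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows, predicate):
    """Scan one child relation and return the rows that pass the qualifier."""
    return [r for r in rows if predicate(r)]


def parallel_append(partitions, predicate, max_workers=4):
    """Scan every child concurrently, then concatenate the results.

    The concatenation plays the role of the Gather node collecting
    tuples produced by the workers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = pool.map(lambda part: scan_partition(part, predicate), partitions)
        gathered = []
        for chunk in chunks:
            gathered.extend(chunk)
    return gathered


# Hypothetical sample data: three partitions of one logical table.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
even_rows = parallel_append(partitions, lambda r: r % 2 == 0)
```

Note that pool.map keeps the children in order, which a real Gather node does not promise; the sketch only shows that the child scans are independent of each other.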

Improvement - Continuous improvement to follow up on planner / executor enhancements. The goal of CustomScan is to allow extensions to implement arbitrary logic as if it were a built-in feature.
Limit support - pass_down_bound() tells an underlying node how many rows it is required to produce if that node is Sort, MergeAppend or Result. If sorting logic is implemented on top of CustomScan, it cannot see this hint.
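
To show why the bound matters to a sorting node, here is a minimal sketch of a bounded (top-N) sort compared with a full sort followed by a limit. It assumes plain numeric rows and ascending order; the function names are hypothetical and this is not the executor's implementation.

```python
import heapq


def full_sort_then_limit(rows, limit):
    """Without the hint: sort everything, then discard all but `limit` rows."""
    return sorted(rows)[:limit]


def bounded_sort(rows, bound):
    """With the hint: keep at most `bound` rows in memory at any time.

    A max-heap (stored as negated values) holds the smallest rows seen so
    far, so a larger incoming row is dropped immediately.
    """
    if bound <= 0:
        return []
    heap = []
    for r in rows:
        if len(heap) < bound:
            heapq.heappush(heap, -r)
        elif r < -heap[0]:
            # r is smaller than the largest row kept so far; swap it in.
            heapq.heapreplace(heap, -r)
    return sorted(-v for v in heap)


# Hypothetical sample data.
rows = [5, 1, 9, 3, 7, 2]
top3 = bounded_sort(rows, 3)
```

Both functions return the same rows, but the bounded version never holds more than `bound` rows, which is the saving a CustomScan-based sort misses when the hint does not reach it.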

Right now, I don't have enough time to work on the topics below, even though I am motivated.

Risk factor in nrows estimation
- The execution cost of a nested loop grows rapidly if the number of outer rows is much larger than the estimate. Right now, the optimizer relies on the estimated nrows; however, its accuracy depends entirely on the complexity of the underlying sub-plan. If the outer plan is a simple table scan, its estimated nrows is almost correct, so its risk factor should be small. On the other hand, if the outer plan is a multi-table join with complicated qualifiers, the estimate is less reliable.
- For OLAP workloads, we have to pay attention to the variation of the estimated nrows. The planner may need to switch to a more robust algorithm (HashJoin, MergeJoin) when the estimated nrows is less reliable.
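
The following toy cost model sketches how a risk factor could change the plan choice. All constants, function names and the risk_factor parameter are invented for illustration; this is not PostgreSQL's cost model.

```python
def nestloop_cost(outer_rows, probe_cost=4.0):
    """Nested loop with an index probe on the inner side:
    cost is proportional to the number of outer rows."""
    return outer_rows * probe_cost


def hashjoin_cost(outer_rows, inner_rows, build_cost=1.0, probe_cost=0.5):
    """Hash join: pay once to build the hash table, then a cheap probe per outer row."""
    return inner_rows * build_cost + outer_rows * probe_cost


def choose_join(est_outer_rows, inner_rows, risk_factor=1.0):
    """Pick the cheaper join at a pessimistic outer row count.

    risk_factor is a hypothetical multiplier: 1.0 trusts the estimate
    (e.g. a simple table scan), larger values model a less reliable
    estimate (e.g. a multi-table join with complicated qualifiers).
    """
    pessimistic_rows = est_outer_rows * risk_factor
    nl = nestloop_cost(pessimistic_rows)
    hj = hashjoin_cost(pessimistic_rows, inner_rows)
    return "NestLoop" if nl <= hj else "HashJoin"


# Same estimate (100 outer rows), different confidence in it.
trusted = choose_join(100, 10000, risk_factor=1.0)
risky = choose_join(100, 10000, risk_factor=100.0)
```

With these made-up numbers the nested loop wins while the estimate is trusted, and the hash join wins once the outer side may be 100 times larger than estimated.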
Group by before JOIN
- Under some conditions, we can rewrite the query to place GROUP BY before the join. It reduces the number of rows to be joined, and allows massively parallel processors to be utilized more efficiently.
- Earlier study: http://www.comp.nus.edu.sg/~cs5226/papers/groupby-join-icde94.pdf
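
As a small worked example of this rewrite, the sketch below computes a per-region SUM both ways. It assumes cust_id is unique in the customers table and that the aggregate (SUM) can be computed from partial sums; the table and function names are hypothetical.

```python
from collections import defaultdict


def join_then_group(orders, customers):
    """Join every order row to its customer, then aggregate by region."""
    region_of = dict(customers)
    totals = defaultdict(int)
    joined_rows = 0
    for cust_id, amount in orders:
        if cust_id in region_of:
            joined_rows += 1
            totals[region_of[cust_id]] += amount
    return dict(totals), joined_rows


def group_then_join(orders, customers):
    """Pre-aggregate orders by the join key, join the (fewer) partial rows,
    then finish the aggregation by region."""
    partial = defaultdict(int)
    for cust_id, amount in orders:
        partial[cust_id] += amount
    region_of = dict(customers)
    totals = defaultdict(int)
    joined_rows = 0
    for cust_id, subtotal in partial.items():
        if cust_id in region_of:
            joined_rows += 1
            totals[region_of[cust_id]] += subtotal
    return dict(totals), joined_rows


# Hypothetical sample data: (cust_id, region) and (cust_id, amount).
customers = [(1, "east"), (2, "west"), (3, "east")]
orders = [(1, 10), (1, 20), (2, 5), (3, 7), (3, 3), (2, 15)]
late_totals, late_joined = join_then_group(orders, customers)
early_totals, early_joined = group_then_join(orders, customers)
```

Both plans produce the same totals, but the second one joins three rows instead of six; the pre-aggregation step is also the kind of work that parallelizes well.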
Wise optimization on some corner cases
- In some TPC-DS workloads, a parameterized sub-query causes massive performance degradation, even though most of the problematic queries can be rewritten in a mechanical way. Oracle pulls up these queries correctly.
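
The kind of mechanical rewrite meant here can be sketched as follows. The example query (rows whose amount exceeds the average of their own store) is a hypothetical stand-in, not an actual TPC-DS query, and the code does not reflect how any particular optimizer implements the pull-up.

```python
from collections import defaultdict


def correlated(sales):
    """Parameterized sub-query: the inner average is recomputed for every
    outer row, so the cost is quadratic in the number of rows."""
    out = []
    for store, amount in sales:
        same_store = [a for st, a in sales if st == store]  # re-executed per outer row
        if amount > sum(same_store) / len(same_store):
            out.append((store, amount))
    return out


def pulled_up(sales):
    """Rewritten form: aggregate once per store, then join on the store key."""
    sums = defaultdict(lambda: [0, 0])  # store -> [total, count]
    for store, amount in sales:
        sums[store][0] += amount
        sums[store][1] += 1
    avg = {store: total / count for store, (total, count) in sums.items()}
    return [(store, amount) for store, amount in sales if amount > avg[store]]


# Hypothetical sample data: (store, amount).
sales = [("A", 10), ("A", 30), ("B", 5), ("B", 5), ("B", 20)]
slow_result = correlated(sales)
fast_result = pulled_up(sales)
```

The two functions return the same rows; only the second one scans the inner side a single time.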