From PostgreSQL wiki

Revision as of 14:28, 21 April 2014 by Simon (Talk | contribs)

Jump to: navigation, search

Project to implement Column Oriented STorage in PostgreSQL

As of 9.4, PostgreSQL is a row-oriented database server. The project and requirements described here have the objective of implementing performance improvements using column storage techniques.


Overview of Column Oriented STorage (COST)

Row oriented storage means that columns in the table are generally stored in a single heap, with each column stored on a single tuple.

Various database systems have implemented physical storage so that individual columns are stored separately, with a values of each tuple stored in an array or similar structure, which is known as Column Oriented DBMS: [Wikipedia]

Column oriented storage has become associated with Data Warehousing and Business Intelligence systems where in conjunction with other related features it can enhance performance. While true, this is a simplistic viewpoint and we need to understand where the speed comes from and how that could work for PostgreSQL.

Advantages of COST

Column Oriented Storage would give PostgreSQL

  • Faster business intelligence queries
  • Ability to improve compression of individual columns
  • Ability for PostgreSQL to exceed the current 1600 column limit in one tuple


What would COST look like?

Just as we set storage parameters for each column for TOAST, we would set them for COST also...



We might also want to store columns in groups together, like so



which would give us the ability to store columns in secondary relations in groups, like Cassandra.

From a User Interface perspective, that's it. We want these performance gains to be mostly transparent to the user.

But we also want to continue to use Row Oriented Storage as the default choice, probably forever.

Deeper Insights

What's wrong with TOAST?

The Oversized Attribute Storage Technique (TOAST) does a great job at what it does - storing large values for columns. There are some improvements that can be made, such as improving compression performance or making it more configurable. However, those don't change the basic nature of TOAST: it is designed for storage, not as a means to improve performance of queries.

Specifically, TOAST can be slow for queries because

  • TOAST stores all column values separately - multiple row values are not stored adjacent to each other
  • TOAST chops up values into chunks, though without much pattern - the column value must be reconstructed for use in search queries

Dangers of Simple Solutions

Row vs. Column Stores

"The elevator pitch behind (the column store) performance difference is straightforward: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query. This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false. We compare the performance of a commercial row-store under a variety of different configurations with a column-store and show that the row-store performance is significantly slower on a recently proposed data warehouse benchmark. We then analyze the performance difference and show that there are some important differences between the two systems at the query executor level (in addition to the obvious differences at the storage layer level). Using the column-store, we then tease apart these differences, demonstrating the impact on performance of a variety of column-oriented query execution techniques, including vectorized query processing, compression, and a new join algorithm we introduce in this paper. We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach."

That's all a bit negative, which is a shame.

Basically, we end up by saying that if we store columns individually, compress them all together and then access them without decompression using bulk access methods, then we'll go faster.

Anyway, that's going some way forwards into design details, so lets keep it light for now.

Personal tools