ColumnOrientedSTorage

From PostgreSQL wiki

(Difference between revisions)
Jump to: navigation, search
Line 1: Line 1:
= Project to implement Column Oriented STorage in PostgreSQL
+
----
 +
Project to implement Column Oriented STorage in PostgreSQL
  
 
As of 9.4, PostgreSQL is a row-oriented database server. The project and requirements described here have the objective of implementing performance improvements using column storage techniques.
 
As of 9.4, PostgreSQL is a row-oriented database server. The project and requirements described here have the objective of implementing performance improvements using column storage techniques.
 +
----
  
= Overview of Column Oriented STorage (COST)
+
= Overview of Column Oriented STorage (COST) =
  
 
Row oriented storage means that columns in the table are generally stored in a single heap, with each column stored on a single tuple. Various database systems have implemented physical storage so that individual columns are stored separately, with a value for each tuple in the table stored.
 
Row oriented storage means that columns in the table are generally stored in a single heap, with each column stored on a single tuple. Various database systems have implemented physical storage so that individual columns are stored separately, with a value for each tuple in the table stored.
Line 11: Line 13:
 
This style of storage has become associated with Data Warehousing and Business Intelligence systems where in conjunction with other related features it can enhance performance.
 
This style of storage has become associated with Data Warehousing and Business Intelligence systems where in conjunction with other related features it can enhance performance.
  
= Advantages of COST
+
= Advantages of COST =
  
 
Column Oriented Storage would give PostgreSQL
 
Column Oriented Storage would give PostgreSQL
Line 19: Line 21:
 
* Ability for PostgreSQL to exceed the current 1600 column limit in one tuple
 
* Ability for PostgreSQL to exceed the current 1600 column limit in one tuple
  
= Requirements
+
= Requirements =
== What would COST look like?
+
== What would COST look like? ==
  
 
Just as we set storage parameters for each column for TOAST, we would set them for COST also...
 
Just as we set storage parameters for each column for TOAST, we would set them for COST also...
Line 36: Line 38:
 
which would give us the ability to store columns in secondary relations in groups, like Cassandra.
 
which would give us the ability to store columns in secondary relations in groups, like Cassandra.
  
= Deep Insights
+
= Deeper Insights =
  
== What's wrong with TOAST?
+
== What's wrong with TOAST? ==
  
 
The Oversized Attribute Storage Technique (TOAST) does a great job at what it does - storing large values for columns. There are some improvements that can be made, such as improving compression performance or making it more configurable. However, those don't change the basic nature of TOAST: it is designed for storage, not as a means to improve performance of queries.
 
The Oversized Attribute Storage Technique (TOAST) does a great job at what it does - storing large values for columns. There are some improvements that can be made, such as improving compression performance or making it more configurable. However, those don't change the basic nature of TOAST: it is designed for storage, not as a means to improve performance of queries.
Line 46: Line 48:
 
* TOAST chops up values into chunks, though without much pattern - the column value must be reconstructed for use in search queries
 
* TOAST chops up values into chunks, though without much pattern - the column value must be reconstructed for use in search queries
  
== Dangers of Simple Solutions
+
== Dangers of Simple Solutions ==
  
 
[Row vs. Column Stores http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf]
 
[Row vs. Column Stores http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf]

Revision as of 14:20, 21 April 2014


Project to implement Column Oriented STorage in PostgreSQL

As of 9.4, PostgreSQL is a row-oriented database server. The project and requirements described here have the objective of implementing performance improvements using column storage techniques.


Contents

Overview of Column Oriented STorage (COST)

Row oriented storage means that columns in the table are generally stored in a single heap, with each column stored on a single tuple. Various database systems have implemented physical storage so that individual columns are stored separately, with a value for each tuple in the table stored.

[Wikipedia http://en.wikipedia.org/wiki/Column-oriented_DBMS]

This style of storage has become associated with Data Warehousing and Business Intelligence systems where in conjunction with other related features it can enhance performance.

Advantages of COST

Column Oriented Storage would give PostgreSQL

  • Faster business intelligence queries
  • Ability to improve compression of individual columns
  • Ability for PostgreSQL to exceed the current 1600 column limit in one tuple

Requirements

What would COST look like?

Just as we set storage parameters for each column for TOAST, we would set them for COST also...

ALTER TABLE foo

 ALTER COLUMN foocol
 SET ORIENTATION column;

We might also want to store columns in groups together, like so

ALTER TABLE foo

 ALTER COLUMN foocol1 SET COLUMN GROUP 2,
 ALTER COLUMN foocol2 SET COLUMN GROUP 2;

which would give us the ability to store columns in secondary relations in groups, like Cassandra.

Deeper Insights

What's wrong with TOAST?

The Oversized Attribute Storage Technique (TOAST) does a great job at what it does - storing large values for columns. There are some improvements that can be made, such as improving compression performance or making it more configurable. However, those don't change the basic nature of TOAST: it is designed for storage, not as a means to improve performance of queries.

Specifically, TOAST can be slow for queries because

  • TOAST stores all column values separately - multiple row values are not stored adjacent to each other
  • TOAST chops up values into chunks, though without much pattern - the column value must be reconstructed for use in search queries

Dangers of Simple Solutions

[Row vs. Column Stores http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf]

"The elevator pitch behind (the column store) performance difference is straightforward: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query. This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false. We compare the performance of a commercial row-store under a variety of different configurations with a column-store and show that the row-store performance is significantly slower on a recently proposed data warehouse benchmark. We then analyze the performance difference and show that there are some important differences between the two systems at the query executor level (in addition to the obvious differences at the storage layer level). Using the column-store, we then tease apart these differences, demonstrating the impact on performance of a variety of column-oriented query execution techniques, including vectorized query processing, compression, and a new join algorithm we introduce in this paper. We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach."

That's all a bit negative, which is a shame.

Basically, we end up by saying that if we store columns individually, compress them all together and then access them without decompression, then we'll go faster.

Personal tools