Cross Columns Stats

This page is an overview of an attempt to implement cross-column statistics, i.e. an effort to enhance the optimizer so that it can perform better in case the columns of a table are not independent. You'll find examples of fail cases here, ideas on how the solution might work, links to papers on this topic, etc.

Most of the information listed comes from the discussions on a mailing list (especially this thread which has a follow-up), but it's quite difficult to follow the topic once it reaches some size. So from this point of view, this page is just an easily readable summary.

Possible road-map

Simply said, there is no clear road-map, rather a general direction that may suddenly change. The current intent is to build a set of proof of concepts, attempting to build various possibilities, and then maybe a contrib module for estimating COUNT(*) results (which is very useful for pagination). And then, if everything works fine, this might be eventually used within the core.

Attribute Value Independence assumption

The attribute value independence assumption is the root cause of the problems - it's equivalent to the statistical independence of columns. According to this assumption, the joint distribution is equal to multiplication of the probability distributions of individual columns.

When the columns are not statistically independent, this usually leads to severe underestimate of the number of rows. And in reality, columns are almost never independent. The question is how strong the dependence actually is.

Let's see a real-world example.

Example

A typical fail case is a table with ZIP codes and city names. Those two columns are highly dependent, as a ZIP code generally determines a city.

Assume there are 100 cities, each of them has 100 ZIP codes. That means there are 100 cities and 10.000 ZIP codes. When estimating a number of rows satisfying a condition

WHERE zip_code = '12345' AND city = 'cityname'

the optimizer currently does this

estimates the selectivity of "zip_code = '12345'" (assuming uniform distribution, the selectivity is 1/10.000 = 0.01%)
estimates the selectivity of "city = 'cityname'" (assuming uniform distribution, the selectivity is 1/100 = 1%)
multiplies the selectivities and gets selectivity 0.0001%

The problem is the actual selectivity is 0.01%, as the ZIP code implies the city. So the obtained estimate will be 100x underestimated. And by combining more conditions, this might lead to even worse estimates. This may be serious issue, as the planner might decide to use an index scan instead of a sequential scan in such cases. And with a complicated query, the effects are sometimes very difficult to predict.

All this is due to "attribute value independence" which is basically the same as statistical independence in probability theory.

If you need a real-world sample data to play with, try 1999 U.S. Postal Service ZIP Codes (or directly ZIP). It's not exactly fresh, but it's good enough. It's in DBF format so you'll need to use something to extract the data (e.g. PgDBF).

Assumptions, assumptions, assumptions ...

When collecting stats and using them to estimate the number of rows, there is always some assumption that says "if this holds, then the estimate should not be very far from the actual value." The problem is real-world data sets almost never satisfy these assumptions perfectly.

One example of such assumption - attribute value independence assumption (AVI) - was already mentioned above, along with a fail case. But there are several other assumptions, some of them are used at a different place, some of them are a possible replacement for the AVI assumption. Just a very short list

uniform distribution within a histogram bin - the optimizer assumes that within a histogram bin, the values are uniformly distributed
uniform correlation - instead of independence, it's assumed that P(A=a|B=b)=c (constant), so that P(A=a,B=b)=P(A=a|B=b)*P(B=b)=c*P(B=b)
conditional independence - this is a bit more complicated, see for example the paper Selectivity Estimation using Probabilistic Models [4]

The idea is to replace a very strong assumption (e.g. AVI) with a much weaker one (e.g. the uniform correlation or conditional independence). Breaking a weaker assumption usually results in a much smaller error of the estimate.

Instances of the problem

As I've explained in this post, I think there are four very different instances of this problem. Maybe there is a 'unified theory' that handles all of them, but I'm not aware of it. So what instances are there?

First, there are two quite different types of variables (columns) - numeric and discrete. And by discrete I don't mean integers, but values that serve as a label - names (city, person) are an excellent example. Discrete columns may be actually encoded as numbers, as for example ZIP codes. At first sight these types may seem equal, but that really is not the case - you can do a lot of things with numeric values that either can't be done at all with discrete values or the result does not make much sense. For example subtracting ZIP codes, computing average of city names, and so on.

Second, there are two types of conditions - equality and inequality (range) conditions. These two types of conditions usually need different types of stats, so let's handle them separately.

So in the end, there are four distinct instances of the problem:

	equality conditions	inequality conditions
discrete values	A	D
numeric values	C	B

A) Discrete values and equality conditions

One of the papers (A Bayesian Approach to Estimating The Selectivity of Conjuctive Predicates) describes a quite interesting solution to this problem - I've already posted a description on how to apply it to the ZIP code / city name problem - see this post and the following discussion.

This basically replaces the AVI assumption with a uniform correlation assumption, which is much weaker. There are some nice features (easy extension to more than two columns, a small amount of data to keep), and some unsolved issues related to estimating number of distinct values (for individual columns and for the combination that is used in a query). The current (sampling-based) estimator has known problems, especially in case of the combination - an effort to implement better estimator is described here.

B) Numeric values and inequality conditions

Most of the papers dealing with this problem are based on discretization and multi-dimensional histograms to approximate the joint distribution. So I guess my original proposal (see the first PoC) was not a complete nonsense, as it was based on this approach. Once we have a working histogram-based solution, we can work on precision and efficiency (how much space is needed to store the histogram, how long does it take to compute an estimate, etc.). According to the papers, there are two ways to do that (if you know about other solutions, let us know).

First, there are several papers offering various enhanced types of multi-dimensional histograms (GenHist, SGRID, VMHIST, ...). Sure, every paper states that the histogram they described is actually the best one (most efficient, most precise, etc.) which is a bit suspicious. Anyway there are promising ways to build better multi-dimensional histograms.

Second, the paper "Selectivity estimation using probabilistic models") describes a completely different solution based on Bayesian Networks. That seems to be a really interesting alternative (and it actually it addresses join selectivity estimation too).

So although the initial implementation is probably going to be inefficient and imprecise, I'm quite confident we can improve that significantly. Either by an advanced histogram or using Bayesian Networks.

C) Numeric values and equality conditions

I'm not sure how to handle this case. But the queries against numeric queries are range queries in most cases I guess, so maybe that's not that big deal.

D) Discrete values and inequality conditions

Basically, this can be handled just like numeric values after discretization, i.e. using a histogram. But just like in the previous case this is not a very frequent case. E.g. how often do you run something like this

SELECT * FROM foo WHERE (zip_code BETWEEN '12899' AND '23890') AND (city_name BETWEEN 'Boston' AND 'Chicago')

Probably not very often.

Combination of discrete / numeric columns

I'm not sure how to deal with this right now. Obviously it's possible to build multi-dimensional histogram, and estimate as many queries as possible.

Proof of Concepts

Until now, I've built two proof of concepts - the first one addresses the case B (numeric data with range conditions), the second one addresses case A (discrete data with equality conditions).

Numeric data & range queries

This proof of concept was described in the initial proposal. It's based on multi-dimensional (a quite primitive one).

Discrete data & equality queries

This proof of concept is thoroughly described in this post. It's based on the paper A Bayesian Approach to Estimating the Selectivity of Conjunctive Predicates [2].

Possible features

This section discusses various features the solution might possibly have, what are the pros/cons of those features etc.

Optional vs. automatic

It's not very likely the stats will be collected automatically - see the next section.

Collecting data for all combinations of columns

I really don't think we should collect statistics for every combination of columns of a table. Collecting cross-column stats may consume a lot of resources (time, CPU, memory, I/O, ...), so collecting it for every combination is not very efficient. The plan is to allow a DBA to enable cross-column stats (using an ALTER TABLE, a PL/pgSQL procedure etc.) only when really needed, i.e. when these conditions are met:

the columns are dependent
the columns are used in a query together (frequently)
the current estimate is significantly imprecise, resulting in a choice of an inefficient query plan

It's not very likely the stats will be collected automatically for every combination of columns.

Collecting stats for multi-column indexes

The only case where collecting the cross-column stats may be collected automatically is when there is a multi-column index. This usually indicates that the columns are frequently used together (which is one of the conditions), and there's a slight chance that the index may be used to build the histogram much more efficiently. But there are some counterarguments

a multi-column index does not mean the columns are dependent (producing invalid estimates)
many developers often replace multi-column index with a collection of simple indexes (and leave the database to handle it using a Bitmap Index Scan)

Testing for independence

There are independence tests for contingency tables (e.g. Pearson's Chi-squared test), so that it's easy to find out whether the columns are independent. If such test shows that the columns are independent, we can just throw away the cross-column stats and use the simple estimation based on attribute value independence.

Identifying columns frequently used together

I've noticed demands to collect data about columns used frequently together in a single WHERE condition. This might be an interesting feature (especially when the estimate significantly differs from the actual value), but it's not a part of this effort (currently).

History & Discussions

This is just a list of thread in archives discussing cross-column statistics. This by no means a comprehensive listing (e.g. there's a lot of items in pgsql-performance archive discussing issues caused by incorrect estimates):

Hypothetical suggestions for planner, indexing improvement - pgsql-performance, May 2003
Cross column statistics - pgsql-hackers, February 2006
An Idea for planner hints - pgsql-hackers, August 2006
Stats for multi-column indexes - pgsql-hackers, March 2007
Multi-Dimensional Histograms - pgsql-hackers, June 2009
Cross-column statistics revisited - pgsql-hackers, October 2010
cross column correlation revisted - pgsql-hackers, July 2010
proposal: cross-column stats - pgsql-hackers, December 2010

Interesting papers

There are many interesting papers on selectivity estimation out there. This is a list of papers I've read recently, along with a short description. If you know about another interesting paper on this topic (selectivity estimation, especially with dependent columns), just put it here. The section is split into two - the 'recommended articles' list the really interesting articles (not obsolete, giving general into into the fields or describing a very interesting approach). The 'additional articles' is used for articles that may be interesting in the future, are obsolete, or describe a solution that is not directly applicable to PostgreSQL (e.g. involving random sampling).

Recommended papers

The New Jersey Data Reduction Report citeseerx PDF
- published in 1997, authors: Daniel Barbara, William Dumouchel, Christos Faloutsos, Peter J. Haas, Joseph M. Hellerstein, Yannis Ioannidis, H. V. Jagadish, Theodore Johnson, Raymond Ng, Viswanath Poosala, Kenneth A. Ross, Kenneth C. Sevcik
- a very thorough introduction into 'data reduction' i.e. representing data with a model
- covers about all the possibilities mentioned here (histograms, SVD, sampling) and some more (log linear models)
- does not cover probabilistic models
- anyway this is probably the best starting point if you need an intro into this topic - it's thorough, well written, covers most of the knowledge, etc.
A Bayesian Approach to Estimating the Selectivity of Conjunctive Predicates PDF
- published in 2009, authors: Max Heimel, Volker Markl, Keshava Murthy
- is based assuming "uniform correlation" which is a much weaker assumption compared to attribute value independence
- is easily extensible to more than two columns
- gives good estimates for highly correlated columns
- does not need sophisticated techniques as multi-dimensional histograms etc.
- seems to be a quite good solution to the "ZIP code" fail case
- there are some weak points - most serious one is the need to get good estimates of number of distinct values (individual columns and the combination used for a query)
Selectivity Estimation Without the Attribute Value Independence Assumption citeseerx PDF
- published in 1997, authors: Viswanath Poosala, Yannis E. Ioannidis
- there's nothing revolutionary new - it just shows alternative way to build multidimensional histograms (and a completely different approach based on SVD)
- the histograms are designed to be more efficient and accurate, so it might be a quite interesting improvement in the future, once we have a working (but imprecise) solution
Selectivity Estimation using Probabilistic Models citeseerx PDF
- published in 2001, authors Lise Getoor, Ben Taskar, Daphne Koller
- exchanges the dreaded "attribute value independence" assumption for a much weaker "conditional independence" assumption, which is then used to build Bayesian Network (BN) / Probability Relational Model (PRM)
- obviously this may be used to deduce estimates for a single table as well as for joins
- I haven't studied the BN / PRM construction thoroughtly, but it seems quite tractable (and there's a list of articles that describe this)
- the main effect of the BN/PRM approach is that it significantly reduces the amount of data needed to store joint distribution (instead of storing the whole table, it stores just a few conditional probabilities)
- this may be actually built on top of a multidimensional histogram, as we need to discretize the values somehow (unless the data already is discrete, of course)
Selectivity estimators for multidimensional range queries over real attributes citeseerx PDF
- published in 2005, authors: Dimitrios Gunopulos, George Kollios, Vassilis J. Tsotras, Carlotta Domeniconi
- a really nice wrap-up of problems related to cross-column statistics, and a summary of possible solutions (histograms, SVD, DCT, ...)
- summarizes problems when building histograms in higher (more than 1D) dimensions, i.e. number of cells growing exponentially (more space to store, more intersections for queries) vs. bigger (less accurate) cells.
- offers two new estimators - GENHIST and kernel estimators
- GENHIST is an alternative way to build histograms, allowing intersecting cells (contrary to usual histograms partitioning the space into non-overlapping regions)
- kernel estimators are an enhanced version of random sampling, with a quite interesting speed/accuracy and and other advantages (not based on histograms), but I guess that's not very useful to us as we're not doing any sampling

Additional papers

Query Optimization citeseerx PDF
- published in 1996, author: Yannis E. Ioannidis
- a quite nice theoretical introduction into selectivity estimation, worth reading if you know nothing about the general principles
Improved Histograms for Selectivity Estimation of Range Predicates citeseerx PDF
- published in 1996, authors: Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, Eugene J. Shekita
- about the same as "Selectivity Estimation Without the Attribute Value Independence Assumption" (written by two of the authors in 1997)
- does not discuss some of the alternatives (SVD for example) and contains a slightly different tests.
- there is a very interesting chapter 7 on computational techniques (spreads, quantiles, distinct values, construction costs for the histograms, required sample size etc.)
STHoles: A MultidimensionalWorkloadAware Histogram citeseerx PDF
- published in 2001, authors: Nicolas Bruno, Surajit Chaudhuri, Luis Gravano
- another type of histogram - this case it's not build directly from the datasets, but from query results, and it's not a one-time process, the histogram is continuously improved (refined in the frequently queried regions).
- I haven't read the whole article, but although it seems quite interesting it's not very probable this would get into the core soon, and there's nothing like this (updating stats based on query results)
- so it's an interesting topic, but it's out of scope for now.
Summary Grids: Building Accurate Multidimensional Histograms citeseerx PDF
- published in 1999, authors: Pedro Furtado, H. Madeira
- describes an alternative way to build multidimensional histograms - traditional algrithms (MHIST, PHASE) work top-down, i.e. partition the space into smaller and smaller more homogenous areas (buckets), this new algorithm (called SGRID) works bottom-up, i.e. it incrementally joins homogenous areas into larger buckets
- again, this is not quite interesting right at the beginning, maybe later when improving the solution (making it more efficient, occupy less space etc.)
Vmhist: Efficient Multidimensional Histograms with Improved Accuracy PDF
- published in 2000, authors: Pedro Furtado, Henrique Madeira
- description of another histogram, called VMHIST
- authors claim this gives better results than MHIST (see one of the other articles) and is better scalable
- the article is quite coarse, there are not many details about the algorithm and the comparison is not very thorough
Balancing histogram optimality and practicality for query result size estimation PDF
- published in 1995, authors: Yannis E. Ioannidis, Viswanath Poosala
- seems to be a quite old article, and is mostly obsolete (i.e. the following articles from Ioannidis are much more interesting as more advanced histograms are presented)
- contains some basic definitions, lemmas etc.
Fast and Effective Histogram Construction WWW PDF
- published in 2009, authors: Felix Halim, Panagiotis Karras, Roland H. C. Yap
- another article proposing a new way to build histograms, comparing to older histograms, etc.
HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Predicates PDF
- published in 2006, authors: Xiaohui Yu1, Nick Koudas1, Calisto Zuzarte
- attempts to combine estimates from synopsis (pregenerated info - e.g. histogram) and random sampling (performed when the query is planned)
- currently these two approaches are used separately, and they attempt to combine, eliminate drawbacks and get the best parts of both
- right now this is rather useless for us, as we're not using random sampling at all (maybe in the future this might be an interesting improvement)
Histogram-Based Approximation of Set-Valued Query Answers citeseerx PDF
- published in 1999, authors: Yannis E. Ioannidis, Viswanath Poosala
- not exactly about estimation, rather about approximating results of queries returning sets of rows
- contains some interesting info, but not exactly relevant to estimation