This page describes possible ways to improve estimates of the number of distinct values. Originally this was a side-effect of an effort to implement cross-column statistics, as one of the proposed approaches needs more precise distinct estimates, but as it's a somewhat separate effort, it got a separate page.
Sampling based estimators
The traditional estimators in statistics are based on a small sample (say 1% of the population). This works quite well except for distinct values, where it fails unless a very large portion of the table is sampled (see the following section).
Charikar and Chaudhuri
In their paper (published in 2000), they stated and proved that estimators based on sampling are a dead end. The theorem they proved (Theorem 1) basically says that for every estimator based on a small sample, there's a data set where the ratio error can be made arbitrarily large. The theorem is a bit more complicated (it relates the size of the sample, the maximal error, and the probability of getting such a data set), but in the end it says that you can't get a good estimator based on a small sample. And if you replace one estimator with another, you may fix the behavior for one data set, but there will be another data set where the new one fails.
JOSH BERKUS: Sorry, that paper does NOT say the above. Read it again. It says that perfect sample-based estimates are impossible ... but so are perfect stream-based estimates. In general, the paper says that you can reduce the sample size required or the error rate by knowing enough to choose a sample algorithm based on the expected distribution type of the error, and that reducing the error rate is difficult without this knowledge. Also, I'll note that 2000 was hardly the end of sample estimation papers.
They provide an "optimal estimator" that consistently reaches the lower bound of the ratio error, but in general there are better estimators (although for some data sets those fail much harder).
TOMAS VONDRA: But that's exactly the point of that theorem. They basically say "for each estimator based on a sample, we'll give you a data set where it fails with an error inversely proportional to the sample size (i.e. the smaller the sample, the bigger the error)." Sure, there are estimators that perform better than the proposed "optimal estimator" for some inputs, but the beauty (and purpose) of the optimal estimator is that it's consistent for all possible data sets.
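For reference, the Duj1 sampling-based estimator (the Haas/Stokes estimator mentioned in the paper list below) can be sketched in a few lines of Python. This is a simplified illustration - the function name and the toy data are made up for the example, only the formula n*d / (n - f1 + f1*n/N) comes from the paper:

```python
import random

def duj1_estimate(sample, total_rows):
    """Duj1 estimator (Haas/Stokes 1996).

    sample     -- list of values sampled from one column
    total_rows -- N, the total number of rows in the table
    """
    n = len(sample)
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    d = len(counts)                                  # distinct values in the sample
    f1 = sum(1 for c in counts.values() if c == 1)   # values seen exactly once
    # Duj1 = n*d / (n - f1 + f1*n/N)
    return n * d / (n - f1 + f1 * n / total_rows)

# toy column: 100k rows drawn uniformly from 1000 distinct values, 1% sample
random.seed(0)
population = [random.randint(1, 1000) for _ in range(100_000)]
sample = random.sample(population, 1_000)
print(duj1_estimate(sample, len(population)))
```

On a uniform distribution like this the estimate lands near the true 1000 distinct values; the problematic data sets from the Charikar/Chaudhuri theorem are the skewed ones, where most values are rare and simply never show up in the sample.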
Stream based estimators
Databases are not the only field where the number of distinct values is needed - another field that needs it is data stream analysis. The proposed estimators are based on one pass through the data with incremental updates of a bitmap - the first such estimator was based on probabilistic counting, the next one on Wegman's adaptive sampling, etc. In 2010 an algorithm with arbitrary precision and O(log(n)) space complexity was described - that's very promising.
This is very similar to a Bloom filter, but a Bloom filter needs more space because it provides more information - it's designed to identify elements of the set (in this case distinct values), which is not the case with the bitmaps used in probabilistic counting etc.
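To illustrate how small those bitmaps are, here's a minimal Python sketch of probabilistic counting with stochastic averaging (PCSA-style). The bitmap trick and the constant 0.77351 come from the Flajolet/Martin paper; the hashing details and the function name are a simplification for this example:

```python
import hashlib

def fm_estimate(values, num_bitmaps=64):
    """Probabilistic counting with stochastic averaging (PCSA-style sketch).

    Each value is hashed; part of the hash picks a bitmap, and the rest
    sets the bit at the position of its lowest set bit.  R, the position
    of the lowest *unset* bit in a bitmap, grows like log2(0.77351 * n),
    so averaging R over several bitmaps gives the estimate below.
    """
    bitmaps = [0] * num_bitmaps
    for v in values:
        h = int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
        m = h % num_bitmaps                           # which bitmap to update
        h //= num_bitmaps
        r = (h & -h).bit_length() - 1 if h else 63    # lowest set bit position
        bitmaps[m] |= 1 << r
    total_r = 0
    for b in bitmaps:
        r = 0
        while b & (1 << r):                           # find the lowest zero bit
            r += 1
        total_r += r
    return num_bitmaps * 2 ** (total_r / num_bitmaps) / 0.77351
```

Note the state is just 64 small bitmaps, regardless of how many rows stream through - duplicates set bits that are already set, so only the distinct values matter.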
A very interesting approach, called Distinct sampling, was described by Gibbons in 2001. Don't be confused by the 'sampling' - it's not random sampling, it is based on adaptive selection of a sample during one pass through the data (the principle is very similar to Wegman's adaptive sampling). This algorithm needs much more space, but it can answer questions like 'how many distinct values satisfy predicate P', which is not possible with the 'simple' algorithms.
So this is much more interesting, but there are a few drawbacks. First, these estimators require one pass through the data (followed by incremental updates), which is completely different from the current estimators. Second, these estimators are designed for 'data streams', where there are no deletes by default. Some of the algorithms do describe a way to handle deletions, though.
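The core idea of distinct sampling can be sketched roughly like this (a simplified, hypothetical version - the class and method names are invented for the example, and the real algorithm from the Gibbons paper additionally keeps a small sample of rows per value so that event reports work):

```python
import hashlib

def _level(value):
    """Trailing-zero count of the value's hash (geometrically distributed)."""
    h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
    return (h & -h).bit_length() - 1 if h else 64

class DistinctSample:
    """One-pass distinct sample in the spirit of Gibbons (2001).

    Keeps every value whose hash level is >= the current level; when the
    buffer overflows, the level is incremented (halving the sampling
    rate) and lower-level values are evicted.
    """
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.level = 0
        self.values = {}                 # value -> its hash level

    def add(self, value):
        lv = _level(value)
        if lv >= self.level:
            self.values[value] = lv
            while len(self.values) > self.capacity:
                self.level += 1          # halve the sampling rate
                self.values = {v: l for v, l in self.values.items()
                               if l >= self.level}

    def estimate(self, predicate=lambda v: True):
        # each kept value represents 2^level distinct values in the stream
        return sum(1 for v in self.values if predicate(v)) * 2 ** self.level
```

Because the sample contains actual values (not just bits), `estimate(lambda v: v % 2 == 0)` and similar predicate queries work - that's exactly what the bitmap-based algorithms cannot do, and why this needs more space.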
This is a list of papers relevant to distinct estimation, classified into three sections. The papers are always sorted from newest to oldest. If you know about an interesting paper not listed here, feel free to add it.
- Towards Estimation Error Guarantees for Distinct Values PDF
- published: 2000
- authors: Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya
- presents a proof that with a limited sample, you really can't get a precise estimate (with limited error)
- in other words: for each estimator based on a limited sample, there's a probability distribution where the estimator fails spectacularly
- so to get a good estimate, you really need to sample most of the table (almost all of it)
- they provide an "optimal estimator" (a hybrid estimator composed of several simple estimators) in the sense that it reaches the lowest possible error (among sampling based estimators)
- Estimating the Number of Classes in a Finite Population citeseerx PDF
- published: 1996
- authors: Peter J. Haas, Lynne Stokes
- presents several sampling-based estimators, we're currently using one of them (D_uj1)
- Sampling-Based Estimation of the Number of Distinct Values of an Attribute PDF
- published: 1995
- authors: Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Lynne Stokes
- presents several sampling-based estimators, compares them etc.
- this is a year older than the article from Haas/Stokes, so read that one instead
- An Optimal Algorithm for the Distinct Elements Problem citeseerx PDF
- published: 2010 (PODS’10, June 6–11, 2010)
- authors: Daniel M. Kane, Jelani Nelson, David P. Woodruff
- basically an improved version of the "probabilistic counting" (see the paper by P. Flajolet and G. N. Martin)
- they present an algorithm with O(log(n)) bits of space and O(1) update complexity
- the precision may be improved by combining several such estimators
- Distinct-Values Estimation over Data Streams PDF
- published: 2009
- author: Phillip B. Gibbons
- this is a quite nice summary of the possible approaches - a short paragraph about sampling-based algorithms and why that's a dead end, a more thorough analysis of streaming-based algorithms (the Flajolet-Martin probabilistic counting algorithm and then another algorithm from Alon, Matias and Szegedy) and then a section about "coordinated sampling"
- there is a very nice table of various algorithms summarizing their features (whether they maintain a sample of distinct values, whether deletions are handled, etc.)
- Distinct Counting with a Self-Learning Bitmap PDF
- published: 2009
- authors: Aiyou Chen, Jin Cao
- they describe an algorithm that performs adaptive sampling using a bitmap
- the algorithm is based on a Markov chain model, but in general it seems similar to Wegman's adaptive sampling (see the paper from 1990)
- Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports citeseerx PDF
- published: 2001
- author: Phillip B. Gibbons
- although the title says "sampling" this paper is not about a traditional sampling (collecting a small random sample from a table and then computing an estimate from it), it's about collecting a sample from a stream of data (one pass through the table) and choosing a "distinct sample"
- this provides not just an estimate of the number of distinct values, but an estimate of the number of distinct values satisfying an arbitrary predicate on the rows
- seems very interesting, although it needs much more space than the other 'probabilistic counting' algorithms
- On Adaptive Sampling citeseerx PDF
- published: 1990
- author: Philippe Flajolet
- presents an algorithm alternative to probabilistic counting, based on Wegman's Adaptive Sampling
- this algorithm is less precise than the original probabilistic counting algorithm, but is better for small files (unbiased)
- Probabilistic Counting Algorithms for Data Base Applications citeseerx PDF
- published: 1985
- authors: Philippe Flajolet, G. Nigel Martin
- this is the first paper on this topic, describes a basic algorithm and an improved "stochastic" version (PCSA)
- includes quite thorough proofs of theorems, etc.
- Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes PDF
- published: 2005
- authors: Xiaohui Yu, Calisto Zuzarte, Kenneth C. Sevcik
- this paper is not about estimating the number of distinct values for individual columns, but for combinations of multiple columns, using knowledge of their distributions (or an approximation in the form of a histogram)
- the algorithm they propose is called COLSCARD
- a big disadvantage of the paper is that they assume independence of the columns, but it seems this could be solved using a multi-dimensional histogram (just replace the multiplication of the distributions with the value from the histogram)
- this could be a way forward if we can't use a 'streaming' solution (with a streaming solution we could rather easily collect data not just about individual columns but about interesting combinations too)