"clusteredness" metric
Background
Records can be located anywhere in the table's storage. Depending on the access pattern used to build the data, it could be stored in a completely random fashion or in a very organized fashion, such as with monotonically increasing values of time. However, it's also possible for an attribute to be stored in a well-organized fashion that is *not* simply described as "monotonically increasing".
It's important for Postgres to know how well organized the data is in order to estimate how much random access will be needed to read it. If all the matching data is packed closely together, it can likely be scanned efficiently; if it is strewn across the table, reading it will require many slow seeks.
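As a rough illustration (not Postgres's actual cost model), the sketch below counts how many distinct disk blocks must be touched to fetch the same number of matching rows under a packed layout versus a scattered one; the block capacity and row positions are invented for the example.

```python
# Rough illustration only: the same 100 matching rows cost very different
# numbers of block fetches depending on how they are laid out on disk.
ROWS_PER_BLOCK = 100  # assumed block capacity, purely for the example

def blocks_touched(row_positions, rows_per_block=ROWS_PER_BLOCK):
    """Number of distinct blocks that must be read to fetch the given rows."""
    return len({pos // rows_per_block for pos in row_positions})

# 100 matching rows packed contiguously: a single block to read.
packed = list(range(1000, 1100))

# The same 100 matching rows spread evenly over a million-row table:
# every matching row lands in its own block.
scattered = list(range(0, 1_000_000, 10_000))

print(blocks_touched(packed))     # 1 block: sequential-friendly
print(blocks_touched(scattered))  # 100 blocks: one random read per row
```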
Problem Description
Consider data such as:
Location | A | B | C | D |
---|---|---|---|---|
1 | 1 | 5 | 3 | 5 |
1 | 1 | 5 | 1 | 5 |
1 | 1 | 5 | 5 | 5 |
2 | 3 | 3 | 2 | 1 |
2 | 3 | 3 | 4 | 1 |
2 | 3 | 3 | 1 | 1 |
3 | 5 | 1 | 5 | 3 |
3 | 5 | 1 | 4 | 3 |
3 | 5 | 1 | 2 | 3 |
(Note that "location" refers to a disk block, so locations will always be monotonically increasing.)
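To see why a single linear correlation between physical location and value (roughly the kind of ordering statistic described above) is not enough, the sketch below computes it for each column of the example table. This is only a stand-in illustration, with the table transcribed as Python lists: A and B come out at +1 and -1, but D lands in between even though every value of D lives in exactly one location.

```python
# Pearson correlation between physical location and column value, used here
# as a stand-in for a simple "how ordered is this column" statistic.
from math import sqrt

location = [1, 1, 1, 2, 2, 2, 3, 3, 3]
columns = {
    "A": [1, 1, 1, 3, 3, 3, 5, 5, 5],
    "B": [5, 5, 5, 3, 3, 3, 1, 1, 1],
    "C": [3, 1, 5, 2, 4, 1, 5, 4, 2],
    "D": [5, 5, 5, 1, 1, 1, 3, 3, 3],
}

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

for name, values in columns.items():
    print(name, round(pearson(location, values), 2))
# A and B score +1.0 and -1.0, C scores close to zero, but D scores only
# -0.5 even though each value of D is confined to a single location.
```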
We need a metric for each of A, B, C, and D which:
- Can be stored efficiently (such as a single floating point value)
- Can be calculated from a "small" random sample of the rows (where "small" is comparable to the sample size needed to estimate E(col) to roughly 95% confidence).
And which can be used to predict some of the following:
- If all the rows for a given value of a column are read, how many different "locations" will be included (see the sketch below).
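As a concrete target for what such a metric should predict, this sketch computes the quantity in the last point directly from the example table: for each column, the average number of distinct locations holding the rows for one value. The data is just the example table transcribed again, so this is an illustration of the prediction target rather than a proposed metric.

```python
# For each column of the example table, compute how many distinct locations
# must be read, on average, to fetch all rows carrying one value.
from collections import defaultdict

location = [1, 1, 1, 2, 2, 2, 3, 3, 3]
columns = {
    "A": [1, 1, 1, 3, 3, 3, 5, 5, 5],
    "B": [5, 5, 5, 3, 3, 3, 1, 1, 1],
    "C": [3, 1, 5, 2, 4, 1, 5, 4, 2],
    "D": [5, 5, 5, 1, 1, 1, 3, 3, 3],
}

def avg_locations_per_value(locations, values):
    locs_by_value = defaultdict(set)
    for loc, val in zip(locations, values):
        locs_by_value[val].add(loc)
    return sum(len(locs) for locs in locs_by_value.values()) / len(locs_by_value)

for name, values in columns.items():
    print(name, avg_locations_per_value(location, values))
# A, B and D each need exactly one location per value; C needs nearly two
# on average, even though the whole table spans only three locations.
```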