From PostgreSQL wiki
This wiki page is design discussion for a feature that does not, at time of writing, exist in PostgreSQL. Check the official documentation for details on supported features.
TABLESAMPLE is an interesting sql clause. It is defined in SQL standard 2003. An example is
It will return a sample of the underlying table of which the size depends on the number specified in the bracket. The detail of the specification of this query from SQL standard 2003 is described below.
FROM emp TABLESAMPLE SYSTEM (50);
Microsoft SQL Server and DB2 have implemented this clause. Querying a sample of a table is often occurring in people’s work. An paper on elaborating the usage of sampling is on a paper from IBM. In page 1 and 2, the author describes the benefits and usage of a fast sampling method towards the discovering general trends and patterns in data.
It will be useful for PostgreSQL to implement this feature and make it available to the users.
This is currently done as a Google Summer of Code 2012 project --- Implementing TableSample for Postgres.
About TABLESAMPLE Clause
<sample clause> can be specified to return a subset of result rows depending on
<sample method> and
<sample percentage>. If the
<sample clause> contains <code><repeatable clause>,
then repeated executions of that
<table reference> return a result table with identical rows for a given
<repeat argument>, provided certain implementation-defined conditions are satisfied.
<table reference> ::= <table factor> | <joined table> <table factor> ::= <table primary> [ <sample clause> ] <table primary> ::= <table or query name> [ [ AS ] <correlation name> ] <sample clause> ::= TABLESAMPLE <sample method> <left paren> <sample percentage> <right paren> [ <repeatable clause> ] <sample method> ::= BERNOULLI | SYSTEM <repeatable clause> ::= REPEATABLE <left paren> <repeat argument> <right paren> <sample percentage> ::= <numeric value expression> <repeat argument> ::= <numeric value expression>
Let TP be the
<table primary> immediately contained in a
<table factor> TF. Let RT be the result of
<sample clause>is specified, then:
- (a) Let N be the number of rows in RT and let S be the value of
- (b) If S is the null value or if S < 0 (zero) or if S > 100, then an exception condition is raised: “data exception — invalid sample size”.
- (c) If
<repeatable clause>is specified, then let RPT be the value of
<repeat argument>. If RPT is the null value, then an exception condition is raised: data exception — invalid repeat argument in a sample clause”.
- (d) Case:
<sample method>specifies BERNOULLI, then the result of TF is a table containing approximately (N ∗ S/100) rows of RT. The probability of a row of RT being included in result of TF is S/100. Further, whether a given row of RT is included in result of TF is independent of whether other rows of RT are included in result of TF.
- Otherwise, result of TF is a table containing approximately (N ∗ S/100) rows of RT. The probability of a row of RT being included in result of TF is S/100.
- (e) If TF contains outer references, then a table with identical rows is generated every time TF is evaluated with a given set of values for outer references.
- (a) Let N be the number of rows in RT and let S be the value of
- Otherwise, result of TF is RT.
sample method is specified in two types: BERNOULLI and SYSTEM
- BERNOULLI implies picking tuples with a specified probability.
- SYSTEM implies picking pages with a specified probability.
Working With Tablesample
TABLESAMPLE is a query dealing with table sampling. Querying "select * from foo TABLESAMPLE SYSTEM (1)" is similiar to "select * from foo where random()<0.01". When you query tablesample, you have to specify the sampling method. Currently, there are two methods, SYSTEM and BERNOULLI, as they are ANSI SQL required. There might be other sampling methods adding into support, if it is of interest to some users and necessary. You can optionally specify the REPEATABLE option, which can give you the same sample in different runs.
The select, delete or update query directly using TABLESAMPLE will use a scan node called SAMPLESCAN. If you use explain to see the query plan used by the postgres optimizer, you will find the "sample scan" which directly scan the sampled table.
Generally, TABLESAMPLE can be used by SELECT, DELETE and UPDATE DML query, but not for DDL commands. You can also use it with any join query and aggregation. But there are still some constraints for TABLESAMPLE, like cursor cannot have TABLESAMPLE. The detail will be specified below.
TABLESAMPLE currently only works with sampling percentage, you can only specify an integer and query will take a sample of that number percent. You cannot specify the number of rows in the query, just like what you can do in SQL Server.
TABLESAMPLE SYSTEM method returns an approximate percentage of rows. It generates a random number for each physical storage page for the underlying relation. Based on this random number and the sampling percentage specified, it either includes or exclude the corresponding storage page. If that page is included, the whole page will be returned in the result set. By the above mechanism, the following *strange* behavior might happen, as you also might have noticed.
- The result set size will vary in different runs. The percentage of result set size to the total tuple size will be sometimes larger, sometimes smaller than the percentage specified. You might use "limit <number>" to get the top <number> of tuples.
- If the underlying relation contains only one page, either the entire page or none tuple gets returned.
TABLESAMPLE BERNOULLI method samples directly on each row of the underlying relation. The algorithm it uses is called Vitter's Reservoir Sampling method, which is used also inside ANALYZE command. The algo must know the number of records it wants to sample first. Thus, the result set will always be of size sample_percentage*relation_size.
The BERNOULLI method is meant to be fast, but in Vitter's algo, it is actually slower, but the sample randomness will be better.
Notice: Before using BERNOULLI method on a relation, make sure you have done analyze on that relation, otherwise the result set will be 0 size.
In REPEATABLE clause, you can specify a random seed number. That number will be used to generate a seeding for the PRNG random generator in Postgres backend. In different runs, if the number is the same, the result set for those runs will be the same, as long as no change has been made to the table. If different numbers are specified, the result set generally will be different. The following actions to the table are considered changes: inserting, updating, deleting, index rebuilding, index defragmenting, restoring a database, and attaching a database.reference
(On August 2012)
- Dealing with CURSOR: Any query with tablesample should not be called in a cursor. For now, it is disabled. If you query like "DECLEAR foo CURSOR FOR select * from table1 tablesample system (10)", the current transaction will be aborted and error message be thrown.
- The implementation on the joins is not really finish. On a join query with tablesample inside, the join will be quite slow and merge join cannot work correctly now. If you try explain these queries, you can see the optimizer will assume samplescan returns 0 rows, which is not true. More work will be needed.
To Do Items
- Improve on joins
- Add support under CURSOR
- TEST and try on any other cases where a select query might apply, as this has not been fully checked.
- Need to review the BERNOULLI sample method. The following problems have to be consider again:
a. The current implementation use Vitter's algo, while this algo is sampling with a reservoir, which need to have an array to store the sample result first. So it takes memory. We might even have memory overflow, if the sample is too large. Should we change the algo? Or we just add the overflow handling?
b. The algo is actually slower than query using "random()<0.1". One reason should be the algo is complex.
c. The code of Vitter's algo for analyze.c and tablesample is still separate. As after trying to merge, there is bug coming, but after reverse back, the bug is gone.
- The items discussed on the mail thread can be discussed in detail and see if possible implementation can be made.
You may wanna read some relevant resources describing this query.
- The community has been discussing this topic for some time, on how to do this as a gSoc project. You may find the thread by the following links or just search tablesample in the archive.
- Some links discussing tablesample, especially on SQL server.
Some ideas are applicable also to Postgres TableSample Query, some are not. These links are just for your reference.
- Selecting a sample
SELECT * FROM foo TABLESAMPLE SYSTEM (10); --Returns about 10% of rows in foo using SYSTEM method SELECT * FROM foo TABLESAMPLE BERNOULLI (10); --Using BERNOULLI sampling method
- Deleting a percentage of rows
DELETE FROM foo TABLESAMPLE SYSTEM (1); --Delete 1 percent of rows from foo
- Updating a percentage of rows
UPDATE foo TABLESAMPLE SYSTEM (1) SET col1='col1'; -- Update the attribute value of col1 in 1 percent of rows in foo to "col1"
- Selecting a sample with limit and order by
SELECT * FROM foo TABLESAMPLE SYSTEM (1) ORDER BY col1 LIMIT 100; -- Select 1 percent of rows from foo, display the first 100 rows, order by column col1
- Selecting with repeatable
SELECT * FROM foo TABLESAMPLE SYSTEM (1) REPEATABLE (200); SELECT * FROM foo TABLESAMPLE SYSTEM (1) REPEATABLE (200); --The result set is the same as above SELECT * FROM foo TABLESAMPLE SYSTEM (1) REPEATABLE (100); --The result set different from above