TABLESAMPLE Implementation

From PostgreSQL wiki
Jump to navigationJump to search

Design page

This wiki page is design discussion for a feature that did not, at time of writing, exist in PostgreSQL. This feature is now available since version 9.5 and its details can be found in the official documentation.

Introduction

TABLESAMPLE is an interesting sql clause. It is defined in SQL standard 2003. An example is
SELECT avg(salary)
FROM emp TABLESAMPLE SYSTEM (50);
It will return a sample of the underlying table of which the size depends on the number specified in the bracket. The detail of the specification of this query from SQL standard 2003 is described below.
Microsoft SQL Server and DB2 have implemented this clause. Querying a sample of a table is often occurring in people’s work. An paper on elaborating the usage of sampling is on a paper from IBM. In page 1 and 2, the author describes the benefits and usage of a fast sampling method towards the discovering general trends and patterns in data.
It will be useful for PostgreSQL to implement this feature and make it available to the users.
This is currently done as a Google Summer of Code 2012 project --- Implementing TableSample for Postgres.

Project Details

About TABLESAMPLE Clause

Concepts

In a <table reference>, <sample clause> can be specified to return a subset of result rows depending on the <sample method> and <sample percentage>. If the <sample clause> contains <repeatable clause>, then repeated executions of that <table reference> return a result table with identical rows for a given <repeat argument>, provided certain implementation-defined conditions are satisfied.

Syntax

<table reference> ::= <table factor> | <joined table>
<table factor> ::= <table primary> [ <sample clause> ]
<table primary> ::= <table or query name> [ [ AS ] <correlation name> ]
<sample clause> ::= TABLESAMPLE <sample method> <left paren> 
                    <sample percentage> <right paren> [ <repeatable clause> ]
<sample method> ::= BERNOULLI | SYSTEM
<repeatable clause> ::= REPEATABLE <left paren> <repeat argument> <right paren>
<sample percentage> ::= <numeric value expression>
<repeat argument> ::= <numeric value expression>

General Rules

Let TP be the <table primary> immediately contained in a <table factor> TF. Let RT be the result of TP. Case:

  1. If <sample clause> is specified, then:
    (a) Let N be the number of rows in RT and let S be the value of <sample percentage>.
    (b) If S is the null value or if S < 0 (zero) or if S > 100, then an exception condition is raised: “data exception — invalid sample size”.
    (c) If <repeatable clause> is specified, then let RPT be the value of <repeat argument>. If RPT is the null value, then an exception condition is raised: data exception — invalid repeat argument in a sample clause”.
    (d) Case:
    1. If <sample method> specifies BERNOULLI, then the result of TF is a table containing approximately (N ∗ S/100) rows of RT. The probability of a row of RT being included in result of TF is S/100. Further, whether a given row of RT is included in result of TF is independent of whether other rows of RT are included in result of TF.
    2. Otherwise, result of TF is a table containing approximately (N ∗ S/100) rows of RT. The probability of a row of RT being included in result of TF is S/100.
    (e) If TF contains outer references, then a table with identical rows is generated every time TF is evaluated with a given set of values for outer references.
  2. Otherwise, result of TF is RT.

- sample method is specified in two types: BERNOULLI and SYSTEM
- BERNOULLI implies picking tuples with a specified probability.
- SYSTEM implies picking pages with a specified probability.

Working With Tablesample

TABLESAMPLE is a query dealing with table sampling. Querying "select * from foo TABLESAMPLE SYSTEM (1)" is similiar to "select * from foo where random()<0.01". When you query tablesample, you have to specify the sampling method. Currently, there are two methods, SYSTEM and BERNOULLI, as they are ANSI SQL required. There might be other sampling methods adding into support, if it is of interest to some users and necessary. You can optionally specify the REPEATABLE option, which can give you the same sample in different runs.
The select query directly using TABLESAMPLE will use a scan node called SAMPLESCAN. If you use explain to see the query plan used by the postgres optimizer, you will find the "sample scan" which directly scan the sampled table.
Generally, TABLESAMPLE can be used only by SELECT query. You can also use it with any join query and aggregation.
TABLESAMPLE currently only works with sampling percentage, you can only specify an float (or expression returning float) and query will take a sample of that number percent. You cannot specify the number of rows in the query, just like what you can do in SQL Server.

SYSTEM Option

TABLESAMPLE SYSTEM method returns an approximate percentage of rows. It generates a random number for each physical storage page for the underlying relation. Based on this random number and the sampling percentage specified, it either includes or exclude the corresponding storage page. If that page is included, the whole page will be returned in the result set. There are some side effects because of the fact that the sampling is done on block (page) level:

  1. The result set size will vary in different runs. The percentage of result set size to the total tuple size will be sometimes larger, sometimes smaller than the percentage specified. You might use "limit <number>" to get the top <number> of tuples.
  2. If the underlying relation contains only one page, either the entire page or none tuple gets returned.

BERNOULLI Option

TABLESAMPLE BERNOULLI method samples directly on each row of the underlying relation. This sampling method will actually scan the whole relation and randomly pick individual tuples (it basically does "coin flip" for each tuple). This algorithm gives better random distribution but will be slower for small percentages.

REPEATABLE Option

In REPEATABLE clause, you can specify a random seed number. That number will be used to generate a seeding for the PRNG random generator in Postgres backend. In different runs, if the number is the same, the result set for those runs will be the same, as long as no change has been made to the table. If different numbers are specified, the result set generally will be different. The following actions to the table are considered changes: inserting, updating, deleting, index rebuilding, index defragmenting, restoring a database, and attaching a database.reference

Examples

  • Selecting a sample
 
Select * 
from foo TABLESAMPLE SYSTEM (10);    --Returns about 10% of rows in foo using SYSTEM method
Select *
from foo TABLESAMPLE BERNOULLI (10); --Using BERNOULLI sampling method
  • Deleting a percentage of rows
Delete
from foo TABLESAMPLE SYSTEM (1);  --Delete 1 percent of rows from foo
  • Updating a percentage of rows
Update foo TABLESAMPLE SYSTEM (1)
set col1='col1';  -- Update the attribute value of col1 in 1 percent of rows in foo to "col1"
  • Selecting a sample with limit and order by
Select *
from foo TABLESAMPLE SYSTEM (1)
order by col1
limit 100;  -- Select 1 percent of rows from foo, display the first 100 rows, order by column col1
  • Selecting with repeatable
Select *
from foo TABLESAMPLE SYSTEM (1) REPEATABLE (200); 
Select *
from foo TABLESAMPLE SYSTEM (1) REPEATABLE (200); --The result set is the same as above 
Select *
from foo TABLESAMPLE SYSTEM (1) REPEATABLE (100); --The result set different from above

References