Grouping Sets

From PostgreSQL wiki

(Difference between revisions)
Jump to: navigation, search
(+cat)
(Tidy this up a little, capitalize SQL keywords.)
 
Line 13: Line 13:
 
Table cars:
 
Table cars:
 
<pre>
 
<pre>
create table cars(name varchar, place varchar, count integer);
+
CREATE TABLE CARS(name text, place text, count integer);
insert into cars values('skoda', 'czech rep.', 10000);
+
INSERT INTO CARS VALUES
insert into cars values('skoda', 'germany', 5000);
+
    ('skoda', 'czech rep.', 10000),
insert into cars values('bmw', 'czech rep.', 100);
+
    ('skoda', 'germany', 5000),
insert into cars values('bmw', 'germany', 1000);
+
    ('bmw', 'czech rep.', 100),
insert into cars values('opel', 'czech rep.', 7000);
+
    ('bmw', 'germany', 1000),
insert into cars values('opel', 'germany', 7000);
+
    ('opel', 'czech rep.', 7000),
 +
    ('opel', 'germany', 7000);
  
postgres=# select * from cars;
+
postgres=# SELECT * FROM cars;
 
  name  |  place    | count  
 
  name  |  place    | count  
 
-------+------------+-------
 
-------+------------+-------
Line 34: Line 35:
 
Classic queries are:
 
Classic queries are:
 
<pre>
 
<pre>
postgres=# select name, sum(count) from cars group by name;
+
postgres=# SELECT name, SUM(count) FROM cars GROUP BY name;
 
  name  |  sum   
 
  name  |  sum   
 
-------+-------
 
-------+-------
Line 42: Line 43:
 
(3 rows)
 
(3 rows)
  
postgres=# select place, sum(count) from cars group by place;
+
postgres=# SELECT place, SUM(count) FROM cars GROUP BY place;
 
   place    |  sum   
 
   place    |  sum   
 
------------+-------
 
------------+-------
Line 55: Line 56:
 
<pre>
 
<pre>
  
<postgres=# select name, place, sum(count) from cars group by grouping sets(name, place, ());
+
<postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY GROUPING SETS(name, place, ());
 
  name  |  place    |  sum   
 
  name  |  place    |  sum   
 
-------+------------+-------
 
-------+------------+-------
Line 68: Line 69:
 
Last query allows fast view on count of cars grouped by name and by place (there are two dimensions). Last query is equivalent to query:
 
Last query allows fast view on count of cars grouped by name and by place (there are two dimensions). Last query is equivalent to query:
 
<pre>
 
<pre>
select name, null as place, sum(count) from cars group by name
+
SELECT name, null as place, SUM(count) FROM cars GROUP BY name
 
union all  
 
union all  
select null, place, sum(count) from cars group by place
+
SELECT null, place, SUM(count) FROM cars GROUP BY place
 
union all
 
union all
select null, null, sum(count) from cars;
+
SELECT null, null, SUM(count) FROM cars;
 
</pre>
 
</pre>
  
 
All combination of attributes do CUBE operator:
 
All combination of attributes do CUBE operator:
 
<pre>
 
<pre>
postgres=# select name, place, sum(count) from cars group by cube(name, place);
+
postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY CUBE(name, place);
 
  name  |  place    |  sum   
 
  name  |  place    |  sum   
 
-------+------------+-------
 
-------+------------+-------
Line 96: Line 97:
 
Rollup analyze is supported with ROLLUP operator
 
Rollup analyze is supported with ROLLUP operator
 
<pre>
 
<pre>
postgres=# select name, place, sum(count) from cars group by rollup(name, place);
+
postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY ROLLUP(name, place);
 
  name  |  place    |  sum   
 
  name  |  place    |  sum   
 
-------+------------+-------
 
-------+------------+-------
Line 113: Line 114:
 
ROLLUP and CUBE operators should be simple to transform to GROUPING SETS spec:
 
ROLLUP and CUBE operators should be simple to transform to GROUPING SETS spec:
 
<pre>
 
<pre>
select name, place, sum(count) from cars group by grouping sets((name, place),name, place,());
+
SELECT name, place, SUM(count) FROM cars GROUP BY GROUPING SETS((name, place),name, place,());
 
  name  |  place    |  sum   
 
  name  |  place    |  sum   
 
-------+------------+-------
 
-------+------------+-------
Line 130: Line 131:
 
(12 rows)
 
(12 rows)
 
</pre>
 
</pre>
 +
 
==Virtual values==
 
==Virtual values==
 
Sure, there are some fields that don't have any value (SQL uses NULL value for this case). We need to detect NULLs in data and also generated NULLs. For this purpose ANSI SQL has function grouping(col) that returns one when column is used for grouping or zero when column and all derivated values are NULL. From world of proprietary databases comes the function grouping_id:
 
Sure, there are some fields that don't have any value (SQL uses NULL value for this case). We need to detect NULLs in data and also generated NULLs. For this purpose ANSI SQL has function grouping(col) that returns one when column is used for grouping or zero when column and all derivated values are NULL. From world of proprietary databases comes the function grouping_id:

Latest revision as of 19:44, 19 June 2014

I collect info about Grouping Sets feature.

What is it:

Grouping Sets allows more repeated grouping clauses in one query. The purpose is to support the analytic multidimensional view of data. The keywords CUBE and ROLLUP were used originally only as a group clause flag, but in ANSI SQL they provide more general solutions. It's a big enhancement of GROUP BY clause:

Contents

[edit] Introduction

Table cars:

CREATE TABLE CARS(name text, place text, count integer);
INSERT INTO CARS VALUES
    ('skoda', 'czech rep.', 10000),
    ('skoda', 'germany', 5000),
    ('bmw', 'czech rep.', 100),
    ('bmw', 'germany', 1000),
    ('opel', 'czech rep.', 7000),
    ('opel', 'germany', 7000);

postgres=# SELECT * FROM cars;
 name  |   place    | count 
-------+------------+-------
 skoda | czech rep. | 10000
 skoda | germany    |  5000
 bmw   | czech rep. |   100
 bmw   | germany    |  1000
 opel  | czech rep. |  7000
 opel  | germany    |  7000
(6 rows)

Classic queries are:

postgres=# SELECT name, SUM(count) FROM cars GROUP BY name;
 name  |  sum  
-------+-------
 bmw   |  1100
 skoda | 15000
 opel  | 14000
(3 rows)

postgres=# SELECT place, SUM(count) FROM cars GROUP BY place;
   place    |  sum  
------------+-------
 germany    | 13000
 czech rep. | 17100
(2 rows)

But these queries are isolated and don't allow rich view on data - they are one dimensional.

Multidimensional queries allow us to combine partial views into one report:


<postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY GROUPING SETS(name, place, ());
 name  |   place    |  sum  
-------+------------+-------
 bmw   |            |  1100
 skoda |            | 15000
 opel  |            | 14000
       | germany    | 13000
       | czech rep. | 17100
       |            | 30100
(6 rows)

Last query allows fast view on count of cars grouped by name and by place (there are two dimensions). Last query is equivalent to query:

SELECT name, null as place, SUM(count) FROM cars GROUP BY name
union all 
SELECT null, place, SUM(count) FROM cars GROUP BY place
union all
SELECT null, null, SUM(count) FROM cars;

All combination of attributes do CUBE operator:

postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY CUBE(name, place);
 name  |   place    |  sum  
-------+------------+-------
 skoda | czech rep. | 10000
 opel  | germany    |  7000
 bmw   | germany    |  1000
 bmw   | czech rep. |   100
 opel  | czech rep. |  7000
 skoda | germany    |  5000
 bmw   |            |  1100
 skoda |            | 15000
 opel  |            | 14000
       | germany    | 13000
       | czech rep. | 17100
       |            | 30100
(12 rows)

Rollup analyze is supported with ROLLUP operator

postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY ROLLUP(name, place);
 name  |   place    |  sum  
-------+------------+-------
 skoda | czech rep. | 10000
 opel  | germany    |  7000
 bmw   | germany    |  1000
 bmw   | czech rep. |   100
 opel  | czech rep. |  7000
 skoda | germany    |  5000
 bmw   |            |  1100
 skoda |            | 15000
 opel  |            | 14000
       |            | 30100
(10 rows)

ROLLUP and CUBE operators should be simple to transform to GROUPING SETS spec:

SELECT name, place, SUM(count) FROM cars GROUP BY GROUPING SETS((name, place),name, place,());
 name  |   place    |  sum  
-------+------------+-------
 skoda | czech rep. | 10000
 opel  | germany    |  7000
 bmw   | germany    |  1000
 bmw   | czech rep. |   100
 opel  | czech rep. |  7000
 skoda | germany    |  5000
 bmw   |            |  1100
 skoda |            | 15000
 opel  |            | 14000
       | germany    | 13000
       | czech rep. | 17100
       |            | 30100
(12 rows)

[edit] Virtual values

Sure, there are some fields that don't have any value (SQL uses NULL value for this case). We need to detect NULLs in data and also generated NULLs. For this purpose ANSI SQL has function grouping(col) that returns one when column is used for grouping or zero when column and all derivated values are NULL. From world of proprietary databases comes the function grouping_id:

postgres=# select name, place, sum(count), grouping(name), grouping(place), grouping_id(name, place) 
              from cars group by grouping sets((name, place),name, place,());
 name  |   place    |  sum  | grouping | grouping | grouping_id 
-------+------------+-------+----------+----------+-------------
 bmw   | czech rep. |   100 |        0 |        0 |           0
 skoda | germany    |  5000 |        0 |        0 |           0
 opel  | czech rep. |  7000 |        0 |        0 |           0
 opel  | germany    |  7000 |        0 |        0 |           0
 skoda | czech rep. | 10000 |        0 |        0 |           0
 bmw   | germany    |  1000 |        0 |        0 |           0
 bmw   |            |  1100 |        0 |        1 |           1
 skoda |            | 15000 |        0 |        1 |           1
 opel  |            | 14000 |        0 |        1 |           1
       | germany    | 13000 |        1 |        0 |           2
       | czech rep. | 17100 |        1 |        0 |           2
       |            | 30100 |        1 |        1 |           3
(12 rows)

[edit] Implementation

[edit] Feeder based patch

Current implementation is based on modification of normal hash aggregation. Hash are used because it allows one scan of data and parallel processing of all grouping sets (it doesn't need materialization). The disadvantage is that only types that support hashing are supported, and there could be some "out of memory problem" for very large results. This feature is a little bit similar CTE. It allows repeated scan of some source (result, table), and probably GROUPING SETS should be based on CTE (non hashed mode). CTEs are more general, but GROUPING SETS would be implemented more effectively, because difference to normal query is only in GROUP BY clause. Internally GROUPING SETS suffers from same problem as CTE. It broke current internal system of PostgreSQL data pipeline which is based on data pulling. GS needs data pushing. It's solved with helper node that should carry one tuple (feeder) and special node operation, these ensure only one tuple processing. Hash operations in PostgreSQL are bulk operations - first, hash table is completely filled and later is read row by row.

You can see feeder node in query plan (one feeder is used many times):

 postgres=# explain select name, place, sum(count), 
              grouping(name), grouping(place), grouping_id(name, place) 
             from cars group by grouping sets((name, place),name, place,());
                        QUERY PLAN                         
-----------------------------------------------------------
 Grouping Sets  (cost=1.08..1.29 rows=9 width=18)
   ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=18)
   ->  HashAggregate  (cost=1.10..1.14 rows=3 width=18)
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
   ->  HashAggregate  (cost=1.09..1.13 rows=3 width=18)
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
   ->  HashAggregate  (cost=1.09..1.11 rows=2 width=18)
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
   ->  Aggregate  (cost=1.08..1.09 rows=1 width=18)
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
(10 rows)

Calls of functions grouping and grouping_id are replaced at planner time with adequate values:

postgres=# explain verbose select name, place, sum(count), 
           grouping(name), grouping(place), grouping_id(name, place) 
           from cars group by grouping sets((name, place),name, place,());
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Grouping Sets  (cost=1.08..1.29 rows=9 width=18)
   Output: name, place, sum(count), 0, 0, 0
   ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=18)
         Output: name, place, count
   ->  HashAggregate  (cost=1.10..1.14 rows=3 width=18)
         Output: name, place, sum(count), 1, 1, 3
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: name, place, count
   ->  HashAggregate  (cost=1.09..1.13 rows=3 width=18)
         Output: name, NULL::character varying, sum(count), 1, 0, 2
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: name, place, count
   ->  HashAggregate  (cost=1.09..1.11 rows=2 width=18)
         Output: NULL::character varying, place, sum(count), 0, 1, 1
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: name, place, count
   ->  Aggregate  (cost=1.08..1.09 rows=1 width=18)
         Output: NULL::character varying, NULL::character varying, sum(count), 0, 0, 0
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: name, place, count
(20 rows)

[edit] CTE based patch

postgres=# explain verbose select name, place, sum(count), 
           grouping(name), grouping(place), grouping_id(name, place) 
           from cars group by grouping sets((name, place),name, place,());
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Append  (cost=1.23..2.09 rows=19 width=44)
   CTE **g**
     ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=18)
           Output: cars.name, cars.place, cars.count
   ->  HashAggregate  (cost=0.17..0.24 rows=6 width=68)
         Output: "**g**".name, "**g**".place, sum("**g**".count), 0, 0, 0
         ->  CTE Scan on "**g**"  (cost=0.00..0.12 rows=6 width=68)
               Output: "**g**".name, "**g**".place, "**g**".count
   ->  HashAggregate  (cost=0.15..0.23 rows=6 width=36)
         Output: "**g**".name, NULL::character varying, sum("**g**".count), 0, 1, 1
         ->  CTE Scan on "**g**"  (cost=0.00..0.12 rows=6 width=36)
               Output: "**g**".name, "**g**".place, "**g**".count
   ->  HashAggregate  (cost=0.15..0.23 rows=6 width=36)
         Output: NULL::character varying, "**g**".place, sum("**g**".count), 1, 0, 2
         ->  CTE Scan on "**g**"  (cost=0.00..0.12 rows=6 width=36)
               Output: "**g**".name, "**g**".place, "**g**".count
   ->  Aggregate  (cost=0.14..0.15 rows=1 width=4)
         Output: NULL::character varying, NULL::character varying, sum("**g**".count), 1, 1, 3
         ->  CTE Scan on "**g**"  (cost=0.00..0.12 rows=6 width=4)
               Output: "**g**".name, "**g**".place, "**g**".count
(20 rows)

[edit] Known limits

Only hash based mode is supported - code for materialization mode (with rescans) should be probably shared with CTE. Limits are same as nonrecursive CTE

[edit] Known bugs (feeder based patch)

a) problem with functions - when target list contains functions, that are not in group by clause:

postgres=# explain verbose select sum(sin(count)) from cars group by grouping sets(name,place);
                        QUERY PLAN                         
-----------------------------------------------------------
 Grouping Sets  (cost=1.09..1.21 rows=5 width=18)
   Output: sum(sin((count)::double precision))
   ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=18)
         Output: count, name, place
   ->  HashAggregate  (cost=1.09..1.14 rows=3 width=18)
         Output: sum(sin((count)::double precision))
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: count, name, place
   ->  HashAggregate  (cost=1.09..1.12 rows=2 width=18)
         Output: sum(sin((count)::double precision))
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: count, name, place
(12 rows)
postgres=# select sum(sin(count)) from cars group by grouping sets(name,place);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

Problem is in source plan, it should to do scalar function call too, like:

postgres=# explain verbose select sum(sin(count)) from cars group by grouping sets(sin(count));
                         QUERY PLAN                          
-------------------------------------------------------------
 Grouping Sets  (cost=1.09..1.20 rows=5 width=4)
   Output: sum(sin((count)::double precision))
   ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=4)
         Output: count, sin((count)::double precision)
   ->  HashAggregate  (cost=1.09..1.20 rows=5 width=4)
         Output: sum((sin((count)::double precision)))
         ->  Feeder  (cost=0.00..1.06 rows=6 width=4)
               Output: count, sin((count)::double precision)
(8 rows)

b) some strange hashing (in some cases)

postgres=# select name, place from cars group by grouping sets(name, place);
 name  |   place    
-------+------------
 skoda | 
 bmw   | 
 opel  | 
       | czech rep.
       | czech rep.
       | czech rep.
(6 rows)
--with one aggregate the result is correct:
postgres=# select name, place, sum(count) from cars group by grouping sets(name, place);
 name  |   place    |  sum  
-------+------------+-------
 skoda |            | 15000
 bmw   |            |  1100
 opel  |            | 14000
       | germany    | 13000
       | czech rep. | 17100
(5 rows)

CTE based patch hasn't know bugs.

[edit] Current state

Current patch is very experimental. Mainly, the planer stage should be refactored. There are a lot of lines of code redundant with GROUP clause implementation (that would be subset of GROUPING SETS implementation)- Any planner experts are welcome! Current GROUPING SETS planner part is a reason for lot of crashes :(.

CTE based patch is +/- stable, and it is available for testing

[edit] ToDo

  • transformation from CTE to native tuple store interface
  • develop SetsHashAggregate
  • documentation

[edit] Notes

  • using qualifier in GROUP BY clause (default is ALL)
SELECT ...
   GROUP BY [ALL|DISTINCT] GROUPING SETS(...
  
  GROUP BY DISTINCT means
 
  SELECT ...
  UNION 
  SELECT ...
  UNION ...
  SELECT

[edit] new develop cycle - 2010

actualised patch http://archives.postgresql.org/pgsql-hackers/2010-08/msg00647.php version003

- state:

  • there are not issues with empty grouping sets and targetList without aggregates: the solution is relative simple - replace empty grouping sets with fake - GROUP BY true
  • there isn't issues with having - it is necessary transform targetList (replace a GroupingSets function and replace a nongrouped columns)
  • there is issue with WHERE clause - GroupingSets function cannot be there
  • there is issue with GroupingSets functions and ORDER BY clause

[edit] Solved questions

  • parser: ROLLUP and CUBE have to be a reserved keyword - functions and datatypes from "cube" contrib will be renamed

[edit] Unsolved questions

  • Where and when and how to implement a GroupingSets functions - GROUPING and GROUPING_ID
    • early - replace in transformation stage - there are necessary data available, but problem with ORDER BY
    • late - evaluate it in eval_const_expressions_mutator - but there are not necessary data, maybe put a values to context->boundParams (preferred)
  • Some optimization mainly special executor node for ROLLUP

[edit] Plan

Replace trivial transformation to CTE by simple method based on Tuplestore. Can be replaced by non recursive CTE in future.

postgres=# select name, place, sum(count), grouping(name), grouping(place) from cars group by rollup(name, place);
 name  |   place    |  sum  | grouping | grouping 
-------+------------+-------+----------+----------
 bmw   | czech rep. |   100 |        0 |        0
 skoda | germany    |  5000 |        0 |        0
 opel  | czech rep. |  7000 |        0 |        0
 opel  | germany    |  7000 |        0 |        0
 skoda | czech rep. | 10000 |        0 |        0
 bmw   | germany    |  1000 |        0 |        0
 bmw   |            |  1100 |        0 |        1
 skoda |            | 15000 |        0 |        1
 opel  |            | 14000 |        0 |        1
       |            | 30100 |        1 |        1
(10 rows)

postgres=# explain select name, place, sum(count), grouping(name), grouping(place) from cars group by rollup(name, place);
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Append  (cost=41.12..91.67 rows=402 width=52)
   CTE GroupingSets
     ->  Seq Scan on cars  (cost=0.00..18.30 rows=830 width=68)
   ->  HashAggregate  (cost=22.82..25.32 rows=200 width=68)
         ->  CTE Scan on "GroupingSets"  (cost=0.00..16.60 rows=830 width=68)
   ->  HashAggregate  (cost=20.75..23.25 rows=200 width=36)
         ->  CTE Scan on "GroupingSets"  (cost=0.00..16.60 rows=830 width=36)
   ->  Subquery Scan on "*SELECT* 3"  (cost=0.00..20.79 rows=2 width=4)
         ->  GroupAggregate  (cost=0.00..20.77 rows=2 width=4)
               ->  CTE Scan on "GroupingSets"  (cost=0.00..16.60 rows=830 width=4)
(10 rows)
Personal tools