Grouping Sets

From PostgreSQL wiki

This page collects information about the Grouping Sets feature.

What is it:

Grouping Sets allow multiple grouping clauses in a single query. The purpose is to support an analytic, multidimensional view of data. The keywords CUBE and ROLLUP were originally used only as grouping-clause flags, but in ANSI SQL they provide a more general solution. It is a major enhancement of the GROUP BY clause:

Introduction

Table cars:

CREATE TABLE CARS(name text, place text, count integer);
INSERT INTO CARS VALUES
    ('skoda', 'czech rep.', 10000),
    ('skoda', 'germany', 5000),
    ('bmw', 'czech rep.', 100),
    ('bmw', 'germany', 1000),
    ('opel', 'czech rep.', 7000),
    ('opel', 'germany', 7000);

postgres=# SELECT * FROM cars;
 name  |   place    | count 
-------+------------+-------
 skoda | czech rep. | 10000
 skoda | germany    |  5000
 bmw   | czech rep. |   100
 bmw   | germany    |  1000
 opel  | czech rep. |  7000
 opel  | germany    |  7000
(6 rows)

Classic queries are:

postgres=# SELECT name, SUM(count) FROM cars GROUP BY name;
 name  |  sum  
-------+-------
 bmw   |  1100
 skoda | 15000
 opel  | 14000
(3 rows)

postgres=# SELECT place, SUM(count) FROM cars GROUP BY place;
   place    |  sum  
------------+-------
 germany    | 13000
 czech rep. | 17100
(2 rows)

But these queries are isolated and don't allow a rich view of the data - they are one-dimensional.

Multidimensional queries allow us to combine partial views into one report:


postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY GROUPING SETS(name, place, ());
 name  |   place    |  sum  
-------+------------+-------
 bmw   |            |  1100
 skoda |            | 15000
 opel  |            | 14000
       | germany    | 13000
       | czech rep. | 17100
       |            | 30100
(6 rows)

The last query gives a quick view of the car counts grouped by name and by place (two dimensions). It is equivalent to this query:

SELECT name, NULL AS place, SUM(count) FROM cars GROUP BY name
UNION ALL
SELECT NULL, place, SUM(count) FROM cars GROUP BY place
UNION ALL
SELECT NULL, NULL, SUM(count) FROM cars;

The CUBE operator produces all combinations of the attributes:

postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY CUBE(name, place);
 name  |   place    |  sum  
-------+------------+-------
 skoda | czech rep. | 10000
 opel  | germany    |  7000
 bmw   | germany    |  1000
 bmw   | czech rep. |   100
 opel  | czech rep. |  7000
 skoda | germany    |  5000
 bmw   |            |  1100
 skoda |            | 15000
 opel  |            | 14000
       | germany    | 13000
       | czech rep. | 17100
       |            | 30100
(12 rows)

Rollup analysis is supported with the ROLLUP operator:

postgres=# SELECT name, place, SUM(count) FROM cars GROUP BY ROLLUP(name, place);
 name  |   place    |  sum  
-------+------------+-------
 skoda | czech rep. | 10000
 opel  | germany    |  7000
 bmw   | germany    |  1000
 bmw   | czech rep. |   100
 opel  | czech rep. |  7000
 skoda | germany    |  5000
 bmw   |            |  1100
 skoda |            | 15000
 opel  |            | 14000
       |            | 30100
(10 rows)

The ROLLUP and CUBE operators are simple to transform into a GROUPING SETS specification:

SELECT name, place, SUM(count) FROM cars GROUP BY GROUPING SETS((name, place),name, place,());
 name  |   place    |  sum  
-------+------------+-------
 skoda | czech rep. | 10000
 opel  | germany    |  7000
 bmw   | germany    |  1000
 bmw   | czech rep. |   100
 opel  | czech rep. |  7000
 skoda | germany    |  5000
 bmw   |            |  1100
 skoda |            | 15000
 opel  |            | 14000
       | germany    | 13000
       | czech rep. | 17100
       |            | 30100
(12 rows)
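The same rewriting works for ROLLUP: under the ANSI expansion rules, ROLLUP(name, place) is shorthand for the grouping sets (name, place), (name), and (), so the 10-row ROLLUP result shown earlier can also be written as:

```
SELECT name, place, SUM(count) FROM cars
  GROUP BY GROUPING SETS((name, place), (name), ());
```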

Virtual values

Naturally, some fields have no value (SQL uses NULL for this case), so we need to distinguish NULLs present in the data from generated NULLs. For this purpose ANSI SQL provides the function grouping(col), which returns 0 when the column is used for grouping in the current grouping set, and 1 when the column was aggregated away and its output value is a generated NULL. The function grouping_id comes from the world of proprietary databases:

postgres=# select name, place, sum(count), grouping(name), grouping(place), grouping_id(name, place) 
              from cars group by grouping sets((name, place),name, place,());
 name  |   place    |  sum  | grouping | grouping | grouping_id 
-------+------------+-------+----------+----------+-------------
 bmw   | czech rep. |   100 |        0 |        0 |           0
 skoda | germany    |  5000 |        0 |        0 |           0
 opel  | czech rep. |  7000 |        0 |        0 |           0
 opel  | germany    |  7000 |        0 |        0 |           0
 skoda | czech rep. | 10000 |        0 |        0 |           0
 bmw   | germany    |  1000 |        0 |        0 |           0
 bmw   |            |  1100 |        0 |        1 |           1
 skoda |            | 15000 |        0 |        1 |           1
 opel  |            | 14000 |        0 |        1 |           1
       | germany    | 13000 |        1 |        0 |           2
       | czech rep. | 17100 |        1 |        0 |           2
       |            | 30100 |        1 |        1 |           3
(12 rows)
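As the output suggests, grouping_id simply packs the individual grouping bits into one integer: with two arguments, grouping_id(name, place) = 2*grouping(name) + grouping(place), which matches the values 0, 1, 2, 3 above. That makes it convenient for picking out specific subtotal rows - for example, only the grand-total row (both bits set), assuming grouping functions are allowed in HAVING as in the ANSI spec:

```
SELECT name, place, SUM(count) FROM cars
  GROUP BY GROUPING SETS((name, place), name, place, ())
  HAVING grouping_id(name, place) = 3;
```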

Implementation

Feeder based patch

The current implementation is based on a modification of normal hash aggregation. Hashing is used because it allows a single scan of the data and parallel processing of all grouping sets (it doesn't need materialization). The disadvantages are that only types that support hashing can be used, and there could be out-of-memory problems for very large results.

This feature is a little similar to CTEs. A CTE allows a repeated scan of some source (a result, a table), and GROUPING SETS should probably be based on CTEs (a non-hashed mode). CTEs are more general, but GROUPING SETS can be implemented more efficiently, because the only difference from a normal query is in the GROUP BY clause. Internally, GROUPING SETS suffers from the same problem as CTEs: it breaks PostgreSQL's internal data pipeline, which is based on data pulling, while GS needs data pushing. This is solved with a helper node that carries one tuple (the feeder) and a special node operation; together these ensure one-tuple-at-a-time processing. Hash operations in PostgreSQL are bulk operations - first the hash table is completely filled, and later it is read row by row.

You can see the feeder node in the query plan (one feeder is used many times):

 postgres=# explain select name, place, sum(count), 
              grouping(name), grouping(place), grouping_id(name, place) 
             from cars group by grouping sets((name, place),name, place,());
                        QUERY PLAN                         
-----------------------------------------------------------
 Grouping Sets  (cost=1.08..1.29 rows=9 width=18)
   ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=18)
   ->  HashAggregate  (cost=1.10..1.14 rows=3 width=18)
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
   ->  HashAggregate  (cost=1.09..1.13 rows=3 width=18)
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
   ->  HashAggregate  (cost=1.09..1.11 rows=2 width=18)
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
   ->  Aggregate  (cost=1.08..1.09 rows=1 width=18)
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
(10 rows)

Calls to the functions grouping and grouping_id are replaced at planning time with the appropriate values:

postgres=# explain verbose select name, place, sum(count), 
           grouping(name), grouping(place), grouping_id(name, place) 
           from cars group by grouping sets((name, place),name, place,());
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Grouping Sets  (cost=1.08..1.29 rows=9 width=18)
   Output: name, place, sum(count), 0, 0, 0
   ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=18)
         Output: name, place, count
   ->  HashAggregate  (cost=1.10..1.14 rows=3 width=18)
         Output: name, place, sum(count), 1, 1, 3
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: name, place, count
   ->  HashAggregate  (cost=1.09..1.13 rows=3 width=18)
         Output: name, NULL::character varying, sum(count), 1, 0, 2
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: name, place, count
   ->  HashAggregate  (cost=1.09..1.11 rows=2 width=18)
         Output: NULL::character varying, place, sum(count), 0, 1, 1
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: name, place, count
   ->  Aggregate  (cost=1.08..1.09 rows=1 width=18)
         Output: NULL::character varying, NULL::character varying, sum(count), 0, 0, 0
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: name, place, count
(20 rows)

CTE based patch

postgres=# explain verbose select name, place, sum(count), 
           grouping(name), grouping(place), grouping_id(name, place) 
           from cars group by grouping sets((name, place),name, place,());
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Append  (cost=1.23..2.09 rows=19 width=44)
   CTE *g*
     ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=18)
           Output: cars.name, cars.place, cars.count
   ->  HashAggregate  (cost=0.17..0.24 rows=6 width=68)
         Output: "*g*".name, "*g*".place, sum("*g*".count), 0, 0, 0
         ->  CTE Scan on "*g*"  (cost=0.00..0.12 rows=6 width=68)
               Output: "*g*".name, "*g*".place, "*g*".count
   ->  HashAggregate  (cost=0.15..0.23 rows=6 width=36)
         Output: "*g*".name, NULL::character varying, sum("*g*".count), 0, 1, 1
         ->  CTE Scan on "*g*"  (cost=0.00..0.12 rows=6 width=36)
               Output: "*g*".name, "*g*".place, "*g*".count
   ->  HashAggregate  (cost=0.15..0.23 rows=6 width=36)
         Output: NULL::character varying, "*g*".place, sum("*g*".count), 1, 0, 2
         ->  CTE Scan on "*g*"  (cost=0.00..0.12 rows=6 width=36)
               Output: "*g*".name, "*g*".place, "*g*".count
   ->  Aggregate  (cost=0.14..0.15 rows=1 width=4)
         Output: NULL::character varying, NULL::character varying, sum("*g*".count), 1, 1, 3
         ->  CTE Scan on "*g*"  (cost=0.00..0.12 rows=6 width=4)
               Output: "*g*".name, "*g*".place, "*g*".count
(20 rows)

Known limits

Only the hash-based mode is supported - the code for a materialization mode (with rescans) should probably be shared with CTE. The limits are the same as for a non-recursive CTE.

Known bugs (feeder based patch)

a) a problem with functions - when the target list contains functions that are not in the GROUP BY clause:

postgres=# explain verbose select sum(sin(count)) from cars group by grouping sets(name,place);
                        QUERY PLAN                         
-----------------------------------------------------------
 Grouping Sets  (cost=1.09..1.21 rows=5 width=18)
   Output: sum(sin((count)::double precision))
   ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=18)
         Output: count, name, place
   ->  HashAggregate  (cost=1.09..1.14 rows=3 width=18)
         Output: sum(sin((count)::double precision))
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: count, name, place
   ->  HashAggregate  (cost=1.09..1.12 rows=2 width=18)
         Output: sum(sin((count)::double precision))
         ->  Feeder  (cost=0.00..1.06 rows=6 width=18)
               Output: count, name, place
(12 rows)
postgres=# select sum(sin(count)) from cars group by grouping sets(name,place);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

The problem is in the source plan; it should perform the scalar function call too, as in:

postgres=# explain verbose select sum(sin(count)) from cars group by grouping sets(sin(count));
                         QUERY PLAN                          
-------------------------------------------------------------
 Grouping Sets  (cost=1.09..1.20 rows=5 width=4)
   Output: sum(sin((count)::double precision))
   ->  Seq Scan on cars  (cost=0.00..1.06 rows=6 width=4)
         Output: count, sin((count)::double precision)
   ->  HashAggregate  (cost=1.09..1.20 rows=5 width=4)
         Output: sum((sin((count)::double precision)))
         ->  Feeder  (cost=0.00..1.06 rows=6 width=4)
               Output: count, sin((count)::double precision)
(8 rows)

b) strange hashing in some cases:

postgres=# select name, place from cars group by grouping sets(name, place);
 name  |   place    
-------+------------
 skoda | 
 bmw   | 
 opel  | 
       | czech rep.
       | czech rep.
       | czech rep.
(6 rows)
--with one aggregate the result is correct:
postgres=# select name, place, sum(count) from cars group by grouping sets(name, place);
 name  |   place    |  sum  
-------+------------+-------
 skoda |            | 15000
 bmw   |            |  1100
 opel  |            | 14000
       | germany    | 13000
       | czech rep. | 17100
(5 rows)

The CTE based patch has no known bugs.

Current state

The current patch is very experimental. Mainly, the planner stage should be refactored: there are many lines of code redundant with the GROUP BY clause implementation (which should be a subset of the GROUPING SETS implementation). Any planner experts are welcome! The GROUPING SETS planner part is currently the cause of many crashes :(.

The CTE based patch is more or less stable, and it is available for testing.

ToDo

  • transformation from CTE to native tuple store interface
  • develop SetsHashAggregate
  • documentation

Notes

  • using a qualifier in the GROUP BY clause (the default is ALL)
SELECT ...
   GROUP BY [ALL|DISTINCT] GROUPING SETS(...
  
  GROUP BY DISTINCT means
 
  SELECT ...
  UNION 
  SELECT ...
  UNION ...
  SELECT
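As a sketch of the note's claim (assuming DISTINCT removes duplicate result rows, the way UNION does relative to UNION ALL), the two qualifiers would correspond to:

```
-- GROUP BY ALL GROUPING SETS(name, place) ~ UNION ALL of the per-set queries
SELECT name, NULL AS place, SUM(count) FROM cars GROUP BY name
UNION ALL
SELECT NULL, place, SUM(count) FROM cars GROUP BY place;

-- GROUP BY DISTINCT GROUPING SETS(name, place) ~ UNION (duplicates removed)
SELECT name, NULL AS place, SUM(count) FROM cars GROUP BY name
UNION
SELECT NULL, place, SUM(count) FROM cars GROUP BY place;
```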

New development cycle - 2010

Updated patch: http://archives.postgresql.org/pgsql-hackers/2010-08/msg00647.php (version 003)

State:

  • there are no issues with empty grouping sets and a target list without aggregates: the solution is relatively simple - replace the empty grouping set with a fake one, GROUP BY true
  • there are no issues with HAVING - it is necessary to transform the target list (replace the GroupingSets functions and the non-grouped columns)
  • there is an issue with the WHERE clause - GroupingSets functions cannot appear there
  • there is an issue with GroupingSets functions in the ORDER BY clause

Solved questions

  • parser: ROLLUP and CUBE have to be reserved keywords - the functions and datatypes from the "cube" contrib module will be renamed

Unsolved questions

  • where, when, and how to implement the GroupingSets functions GROUPING and GROUPING_ID
    • early - replace them in the transformation stage - the necessary data are available there, but there is a problem with ORDER BY
    • late - evaluate them in eval_const_expressions_mutator - but the necessary data are not available there; maybe put the values into context->boundParams (preferred)
  • some optimization, mainly a special executor node for ROLLUP

Plan

Replace the trivial transformation to a CTE with a simple method based on a Tuplestore. It can be replaced by a non-recursive CTE in the future.

postgres=# select name, place, sum(count), grouping(name), grouping(place) from cars group by rollup(name, place);
 name  |   place    |  sum  | grouping | grouping 
-------+------------+-------+----------+----------
 bmw   | czech rep. |   100 |        0 |        0
 skoda | germany    |  5000 |        0 |        0
 opel  | czech rep. |  7000 |        0 |        0
 opel  | germany    |  7000 |        0 |        0
 skoda | czech rep. | 10000 |        0 |        0
 bmw   | germany    |  1000 |        0 |        0
 bmw   |            |  1100 |        0 |        1
 skoda |            | 15000 |        0 |        1
 opel  |            | 14000 |        0 |        1
       |            | 30100 |        1 |        1
(10 rows)

postgres=# explain select name, place, sum(count), grouping(name), grouping(place) from cars group by rollup(name, place);
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Append  (cost=41.12..91.67 rows=402 width=52)
   CTE GroupingSets
     ->  Seq Scan on cars  (cost=0.00..18.30 rows=830 width=68)
   ->  HashAggregate  (cost=22.82..25.32 rows=200 width=68)
         ->  CTE Scan on "GroupingSets"  (cost=0.00..16.60 rows=830 width=68)
   ->  HashAggregate  (cost=20.75..23.25 rows=200 width=36)
         ->  CTE Scan on "GroupingSets"  (cost=0.00..16.60 rows=830 width=36)
   ->  Subquery Scan on "*SELECT* 3"  (cost=0.00..20.79 rows=2 width=4)
         ->  GroupAggregate  (cost=0.00..20.77 rows=2 width=4)
               ->  CTE Scan on "GroupingSets"  (cost=0.00..16.60 rows=830 width=4)
(10 rows)