PgCon 2020 Developer Unconference/parallel grouping sets and refactor nodeagg notes

From PostgreSQL wiki
Jump to: navigation, search

Parallel Grouping Sets Proposal

Parallel Grouping Sets patchset (authored by Pengzhou Tang and Richard Guo)

Example Grouping Sets Query and Plan in master vs proposed patch

select avg(a), max(c), sum(c) from foo group by grouping sets ((b, c), (a, e), (c, d), (c), (e, b), (b), (a, b));

Serial plan (master):

                    QUERY PLAN                     
---------------------------------------------------
 MixedAggregate
   Output: avg(a), max(c), sum(c), b, c, a, e, d
   Hash Key: foo.a, foo.e
   Hash Key: foo.c, foo.d
   Group Key: foo.c, foo.b
   Group Key: foo.c
   Sort Key: foo.b, foo.e
     Group Key: foo.b, foo.e
     Group Key: foo.b
   Sort Key: foo.a, foo.b
     Group Key: foo.a, foo.b
   ->  Sort
         Output: b, c, a, e, d
         Sort Key: foo.c, foo.b
         ->  Gather
               Output: b, c, a, e, d
               Workers Planned: 1
               ->  Parallel Seq Scan on public.foo
                     Output: b, c, a, e, d

Parallel plan (patch above):

                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 Finalize MixedAggregate
   Output: avg(a), max(c), sum(c), b, c, a, e, d
   Filtered by: (GROUPINGSETID())
   Sort Key: foo.b, foo.c
     Group Key: foo.b, foo.c
   Sort Key: foo.b
     Group Key: foo.b
   Sort Key: foo.e, foo.b
     Group Key: foo.e, foo.b
   Sort Key: foo.a, foo.b
     Group Key: foo.a, foo.b
   Hash Key: foo.c, foo.d
   Hash Key: foo.a, foo.e
   ->  Gather
         Output: b, c, a, e, d, (PARTIAL avg(a)), (PARTIAL max(c)), (PARTIAL sum(c)), (GROUPINGSETID())
         Workers Planned: 1
         ->  Partial MixedAggregate
               Output: b, c, a, e, d, PARTIAL avg(a), PARTIAL max(c), PARTIAL sum(c), GROUPINGSETID()
               Sort Key: foo.b, foo.c
                 Group Key: foo.b, foo.c
                 Group Key: foo.b
               Sort Key: foo.e, foo.b
                 Group Key: foo.e, foo.b
               Sort Key: foo.a, foo.b
                 Group Key: foo.a, foo.b
               Hash Key: foo.c, foo.d
               Hash Key: foo.a, foo.e
               ->  Parallel Seq Scan on public.foo
                     Output: b, c, a, e, d


Pain points for parallelizing grouping sets

  • Sort path being added beneath group agg
    • a Sort below the FinalAgg is superfluous and adds cost for no reason
    • Potential solutions
      • make first GroupAgg do internal sort similar to other phases (done in the patchset by Richard Guo and Pengzhou Tang)
      • all agg nodes could do internal sorts -- and Sort paths are not added
      • separate out all phases into different agg nodes and decide whether or not to add Sort paths beneath each one (don't add Sorts benath FinalAgg phases)
  • all hashagg grouping sets are in the same phase making hash function expression evaluation difficult
    • there is an optimization for hashagg in which the hash function can be evaluated for multiple columns/groups of columns in a single evaluation, however this is undesirable for FinalAgg

Potential improvements to nodeAgg.c for grouping sets

Separate phases into different nodes

  • Goal: simplify code for Grouping Sets phases, especially Mixed Aggregates
  • get rid of the chain as well as the internal sorts and make a new agg node for each phase/rollup
  • have a CTEScan-like Scan operation which allows different GroupAgg phases to access the same tuples (proposed by Melanie Plageman with input from Jesse Zhang)

create table t1 (a int, b bit(4), c int, d int, e xid); -- b is unhashable, e is not sortable select sum(h) from t1 group by grouping sets ((a,e),(a),(c,b),(b))

Current plan:

 MixedAggregate
   Hash Key: a, e
   Hash Key: e
   Group Key: b, c
   Group Key: b
   ->  Sort
         Sort Key: b, c
         ->  Seq Scan on t1

Proposed plan:

GatherAppendOrMergeOrRedirect
    -> HashAgg ((a,e),(a))
        ->Scan(1)
    -> GroupAgg ((c,b),(b))
      -> Sort
          ->Scan(1)

Proposed parallel plan:

GatherAndCollateAndMerge
  ->FinalHashAgg ((a,e), gsetid)
  ->FinalHashAgg ((e), gsetid)
  ->FinalGroupAgg ((c,b), gsetid)
  ->FinalGroupAgg ((b), gsetid)
      ->GatherAppendOrMergeRedirect
          ->PartialHashAgg ((a,e),(a))
              ->Scan(1)
          ->PartialGroupAgg ((c,b),(b))
              ->Sort
                  ->Scan(1)


After discussion it was determined that the above is infeasible and undesirable because:

  • losing interleaving behavior of current tuplesort construction (reuse memory for tuplesort across phases) -- not cache-friendly to not reuse the memory
  • more memory intensive
  • too many ExecProcNode calls when adding more nodes leading to decreased performance because of indirect function calls

"Stacking" Grouping Sets

  • a kind of tuple pass through which allows tuples to skip phases -- "agg node stacking" instead of chaining (from Jesse Zhang and Jeff Davis)
  • Each phase will check a grouping set identifier and only process it if it matches
  • Pros:
    • Kills the chain
    • still cache and memory usage friendly, like the chain
    • It lends itself perfectly to pulling out "rollup" and making that faster
    • Parallel friendly
    • No magic intra-node dispatch
  • Con:
    • We do lose some optimization for hash agg, in the "one bang evaluation of all hashes"

SELECT max(a) FROM foo GROUP BY GROUPING SETS (ROLLUP(b,c), (c,d));

Master

Agg
  Output: max(a), b, c, d
  Group Key: (b,c)
  Group Key: (b)
  Group Key: ()
  Group Key: (c,d)
    Sort:
      Sort Key: (c,d)
 -> Sort
      Sort Key: b
     -> Seq Scan on foo


Proposed plan

Finalize GroupAggregate
  Output: max(a), b, c, d, gsetid
  Group Key: (c,d)
  Input gset: 0b100
 -> Finalize GroupAggregate
      Output: max(a), b, c, d, gsetid
      Group Key: ()
      Input gset: 0b111
     -> Finalize GroupAggregate
          Output: max(a), b, c, d, gsetid
          Group Key: (b)
          Input gset: 0b011
         -> Finalize GroupAggregate
              Output: max(a), b, c, d, gsetid
              Group Key: (b,c)
              Input gset: 0b001
             -> Partial GroupAgg
                  Group Key: (c,d)
                  Output: max(PARTIAL a), b, c, d, gsetid
                  Output gset: 0b100
                  Sort
                    Sort Key: (c,d)
                 -> Partial GroupAgg
                      Group Key: ()
                      Output: max(PARTIAL a), b, c, d, gsetid
                      Output gset: 0b111
                      Input gset: 0b011
                     -> Partial GroupAgg
                          Group Key: (b)
                          Output: max(PARTIAL a), b, c, d, gsetid
                          Output gset: 0b011
                          Input gset: 0b001
                         -> Initial Partial GroupAgg
                              Output: max(PARTIAL a), b, c, d, gsetid
                              Group Key: (b,c)
                              Output gset: 0b001
                              Sort
                                Sort Key: b
                             -> Seq Scan on foo


Other ideas for making nodeAgg.c better

  • Move code out of ExecInitAgg which deduplicates and combines transition functions and puts it in the planner (Andres Freund)
  • Clean up AggState members that impose complexity and are easy to fix (cur*, redundant ExprContext*, AggStatePer* stuff) (Andres Freund)
  • Simplify state management for Agg ([1]) (Andres Freund)
  • Better document grouping sets phases (Andres Freund)
  • Reorder file to have related functions together (Andres Freund)
  • Splitting up nodeAgg.c into multiple files (Heikki Linnakangas)
  • Reordering the grouping sets phases (not squeezing hashaggs into phase 0) (Richard Guo and Pengzhou Tang)
    • see this commit from Pengzhou Tang and Richard Guo that does that (in preparation for implementing parallel grouping sets)
  • Eliminating the fake sort nodes used for GroupAgg phases in the chain (phases after the first GroupAgg phase) (Melanie Plageman and Andres Freund)

Optimizations and future improvements to aggregate execution

  • Hash rollup and groupagg rollup optimization (Jesse Zhang)
  • Spilling partial tuples (disk-based hash aggregation) (Jesse Zhang and Jeff Davis)