PgCon 2020 Developer Unconference/parallel grouping sets and refactor nodeagg notes
From PostgreSQL wiki
Jump to navigationJump to searchParallel Grouping Sets Proposal
Parallel Grouping Sets patchset (authored by Pengzhou Tang and Richard Guo)
Example Grouping Sets Query and Plan in master
vs proposed patch
select avg(a), max(c), sum(c) from foo group by grouping sets ((b, c), (a, e), (c, d), (c), (e, b), (b), (a, b));
Serial plan (master):
QUERY PLAN --------------------------------------------------- MixedAggregate Output: avg(a), max(c), sum(c), b, c, a, e, d Hash Key: foo.a, foo.e Hash Key: foo.c, foo.d Group Key: foo.c, foo.b Group Key: foo.c Sort Key: foo.b, foo.e Group Key: foo.b, foo.e Group Key: foo.b Sort Key: foo.a, foo.b Group Key: foo.a, foo.b -> Sort Output: b, c, a, e, d Sort Key: foo.c, foo.b -> Gather Output: b, c, a, e, d Workers Planned: 1 -> Parallel Seq Scan on public.foo Output: b, c, a, e, d
Parallel plan (patch above):
QUERY PLAN -------------------------------------------------------------------------------------------------------- Finalize MixedAggregate Output: avg(a), max(c), sum(c), b, c, a, e, d Filtered by: (GROUPINGSETID()) Sort Key: foo.b, foo.c Group Key: foo.b, foo.c Sort Key: foo.b Group Key: foo.b Sort Key: foo.e, foo.b Group Key: foo.e, foo.b Sort Key: foo.a, foo.b Group Key: foo.a, foo.b Hash Key: foo.c, foo.d Hash Key: foo.a, foo.e -> Gather Output: b, c, a, e, d, (PARTIAL avg(a)), (PARTIAL max(c)), (PARTIAL sum(c)), (GROUPINGSETID()) Workers Planned: 1 -> Partial MixedAggregate Output: b, c, a, e, d, PARTIAL avg(a), PARTIAL max(c), PARTIAL sum(c), GROUPINGSETID() Sort Key: foo.b, foo.c Group Key: foo.b, foo.c Group Key: foo.b Sort Key: foo.e, foo.b Group Key: foo.e, foo.b Sort Key: foo.a, foo.b Group Key: foo.a, foo.b Hash Key: foo.c, foo.d Hash Key: foo.a, foo.e -> Parallel Seq Scan on public.foo Output: b, c, a, e, d
Pain points for parallelizing grouping sets
Sort
path being added beneath group agg- a Sort below the FinalAgg is superfluous and adds cost for no reason
- Potential solutions
- make first GroupAgg do internal sort similar to other phases (done in the patchset by Richard Guo and Pengzhou Tang)
- all agg nodes could do internal sorts -- and Sort paths are not added
- separate out all phases into different agg nodes and decide whether or not to add Sort paths beneath each one (don't add Sorts benath FinalAgg phases)
- all hashagg grouping sets are in the same phase making hash function expression evaluation difficult
- there is an optimization for hashagg in which the hash function can be evaluated for multiple columns/groups of columns in a single evaluation, however this is undesirable for FinalAgg
Potential improvements to nodeAgg.c for grouping sets
Separate phases into different nodes
- Goal: simplify code for Grouping Sets phases, especially Mixed Aggregates
- get rid of the chain as well as the internal sorts and make a new agg node for each phase/rollup
- have a CTEScan-like Scan operation which allows different GroupAgg phases to access the same tuples (proposed by Melanie Plageman with input from Jesse Zhang)
create table t1 (a int, b bit(4), c int, d int, e xid);
-- b is unhashable, e is not sortable
select sum(h) from t1 group by grouping sets ((a,e),(a),(c,b),(b))
Current plan:
MixedAggregate Hash Key: a, e Hash Key: e Group Key: b, c Group Key: b -> Sort Sort Key: b, c -> Seq Scan on t1
Proposed plan:
GatherAppendOrMergeOrRedirect -> HashAgg ((a,e),(a)) ->Scan(1) -> GroupAgg ((c,b),(b)) -> Sort ->Scan(1)
Proposed parallel plan:
GatherAndCollateAndMerge ->FinalHashAgg ((a,e), gsetid) ->FinalHashAgg ((e), gsetid) ->FinalGroupAgg ((c,b), gsetid) ->FinalGroupAgg ((b), gsetid) ->GatherAppendOrMergeRedirect ->PartialHashAgg ((a,e),(a)) ->Scan(1) ->PartialGroupAgg ((c,b),(b)) ->Sort ->Scan(1)
After discussion it was determined that the above is infeasible and undesirable because:
- losing interleaving behavior of current tuplesort construction (reuse memory for tuplesort across phases) -- not cache-friendly to not reuse the memory
- more memory intensive
- too many ExecProcNode calls when adding more nodes leading to decreased performance because of indirect function calls
"Stacking" Grouping Sets
- a kind of tuple pass through which allows tuples to skip phases -- "agg node stacking" instead of chaining (from Jesse Zhang and Jeff Davis)
- Each phase will check a grouping set identifier and only process it if it matches
- Pros:
- Kills the chain
- still cache and memory usage friendly, like the chain
- It lends itself perfectly to pulling out "rollup" and making that faster
- Parallel friendly
- No magic intra-node dispatch
- Con:
- We do lose some optimization for hash agg, in the "one bang evaluation of all hashes"
SELECT max(a) FROM foo GROUP BY GROUPING SETS (ROLLUP(b,c), (c,d));
Master
Agg Output: max(a), b, c, d Group Key: (b,c) Group Key: (b) Group Key: () Group Key: (c,d) Sort: Sort Key: (c,d) -> Sort Sort Key: b -> Seq Scan on foo
Proposed plan
Finalize GroupAggregate Output: max(a), b, c, d, gsetid Group Key: (c,d) Input gset: 0b100 -> Finalize GroupAggregate Output: max(a), b, c, d, gsetid Group Key: () Input gset: 0b111 -> Finalize GroupAggregate Output: max(a), b, c, d, gsetid Group Key: (b) Input gset: 0b011 -> Finalize GroupAggregate Output: max(a), b, c, d, gsetid Group Key: (b,c) Input gset: 0b001 -> Partial GroupAgg Group Key: (c,d) Output: max(PARTIAL a), b, c, d, gsetid Output gset: 0b100 Sort Sort Key: (c,d) -> Partial GroupAgg Group Key: () Output: max(PARTIAL a), b, c, d, gsetid Output gset: 0b111 Input gset: 0b011 -> Partial GroupAgg Group Key: (b) Output: max(PARTIAL a), b, c, d, gsetid Output gset: 0b011 Input gset: 0b001 -> Initial Partial GroupAgg Output: max(PARTIAL a), b, c, d, gsetid Group Key: (b,c) Output gset: 0b001 Sort Sort Key: b -> Seq Scan on foo
Other ideas for making nodeAgg.c better
- Move code out of
ExecInitAgg
which deduplicates and combines transition functions and puts it in the planner (Andres Freund) - Clean up
AggState
members that impose complexity and are easy to fix(cur*, redundant ExprContext*, AggStatePer* stuff)
(Andres Freund) - Simplify state management for Agg ([1]) (Andres Freund)
- Better document grouping sets phases (Andres Freund)
- Reorder file to have related functions together (Andres Freund)
- Splitting up nodeAgg.c into multiple files (Heikki Linnakangas)
- Reordering the grouping sets phases (not squeezing hashaggs into phase 0) (Richard Guo and Pengzhou Tang)
- see this commit from Pengzhou Tang and Richard Guo that does that (in preparation for implementing parallel grouping sets)
- Eliminating the fake sort nodes used for GroupAgg phases in the chain (phases after the first GroupAgg phase) (Melanie Plageman and Andres Freund)
Optimizations and future improvements to aggregate execution
- Hash rollup and groupagg rollup optimization (Jesse Zhang)
- Spilling partial tuples (disk-based hash aggregation) (Jesse Zhang and Jeff Davis)