Range aggregation

From PostgreSQL wiki

Jump to: navigation, search

Library Snippets

Range aggregation

Works with PostgreSQL

8.4

Written in

SQL

Depends on

Nothing


Assume you have a table of ranges expressed as "start" (s) and "end" (e) columns; we'll assume that these are half-open intervals, i.e. each row represents the range [s,e). We also assume that the constraint (e > s) is enforced.

The problem is to aggregate all overlapping ranges and produce a result with one row for each disjoint collection of ranges.

For example (using integers for simplicity but this will often involve other types, such as timestamps, in practice):

test=# select * from intervals order by s,e;
 s  | e  
----+----
  1 |  3
  2 |  4
  5 |  6
  5 |  8
  6 |  9
  7 | 10
  8 | 10
 10 | 11
 10 | 15
 11 | 12
 12 | 13
(11 rows)

The desired output from this is:

  s  |  e 
-----+-----
   1 |   4
   5 |  10
  10 |  15
(3 rows)

Unusually for SQL, this problem can be solved most easily by thinking about it in purely procedural terms.

First, consider each range one at a time in ascending order of (s,e). For each range there are only two possibilities: either it overlaps with a range which we have already processed, or it begins a new disjoint range. Using PG 8.4 or later, we can express this idea using window functions as follows:

SELECT s, e,
       CASE WHEN s < max(le) OVER (ORDER BY s,e) THEN NULL ELSE s END AS new_start
  FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le FROM intervals) s1;

The subquery here is because we really want max(lag(e)) in order to exclude the current row's value of "e" but window functions cannot be nested directly. (In 9.0 or later, it may be more convenient to simply change the window framing clause to exclude the current row; this feature is unavailable in 8.4.) The new column "new_start" contains a non-null value only if the current row represents the left edge of a new disjoint interval; all rows which merely extend existing intervals have null values. The output on our test data is therefore:

 s  | e  | new_start 
----+----+-----------
  1 |  3 |         1
  2 |  4 |          
  5 |  6 |         5
  5 |  8 |          
  6 |  9 |          
  7 | 10 |          
  8 | 10 |          
 10 | 11 |        10
 10 | 15 |          
 11 | 12 |          
 12 | 13 |          
(11 rows)

We now extend this result to label every row with the left edge of the aggregated range of which it is part. We do this by wrapping the above in another level of subquery, using max(new_start) as a window function to propagate the value from each "left edge" row to its successors:

SELECT s, e,
       max(new_start) OVER (ORDER BY s,e) AS left_edge
  FROM (SELECT s, e,
               CASE WHEN s < max(le) OVER (ORDER BY s,e) THEN NULL ELSE s END AS new_start
          FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le FROM intervals) s1) s2;

The output from our test data is now:

 s  | e  | left_edge 
----+----+-----------
  1 |  3 |         1
  2 |  4 |         1
  5 |  6 |         5
  5 |  8 |         5
  6 |  9 |         5
  7 | 10 |         5
  8 | 10 |         5
 10 | 11 |        10
 10 | 15 |        10
 11 | 12 |        10
 12 | 13 |        10
(11 rows)

We can now simply group by left_edge in order to aggregate all overlapping rows together:

SELECT min(s) AS s, max(e) AS e
  FROM (SELECT s, e,
               max(new_start) OVER (ORDER BY s,e) AS left_edge
          FROM (SELECT s, e,
                       CASE WHEN s < max(le) OVER (ORDER BY s,e) THEN NULL ELSE s END AS new_start
                  FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le FROM intervals) s1) s2) s3
 GROUP BY left_edge;

producing the desired output:

 s  | e  
----+----
  1 |  4
  5 | 10
 10 | 15
(3 rows)
Personal tools