Aggregate Median

From PostgreSQL wiki

Jump to: navigation, search

The statistical median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half. A median is only defined on ordered one-dimensional data, and is independent of any distance metric.

Contents

median(numeric)

Snippets

Aggregate Median

Works with PostgreSQL

8.4

Written in

SQL

Depends on

Nothing

This snippet is also part of the ulib_agg user-defined library.

CREATE OR REPLACE FUNCTION _final_median(numeric[])
   RETURNS numeric AS
$$
   SELECT AVG(val)
   FROM (
     SELECT val
     FROM unnest($1) val
     ORDER BY 1
     LIMIT  2 - MOD(array_upper($1, 1), 2)
     OFFSET CEIL(array_upper($1, 1) / 2.0) - 1
   ) sub;
$$
LANGUAGE 'sql' IMMUTABLE;
 
CREATE AGGREGATE median(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=_final_median,
  INITCOND='{}'
);

Usage

SELECT median(num_value) AS median_value FROM t;

Caution

Unlike mode() and range(), this one will only work with PostgreSQL 8.4 and higher due to the dynamic limit and offset.

NOTE: This version does not take NULL values into consideration, e.g., median('{1,NULL,NULL}') is NULL.

This implementation is also quite slow. On tables of ~500K elements orafce implementation is ~400 times faster.

median(anyelement)

Snippets

Aggregate Median

Works with PostgreSQL

8.4

Written in

SQL

Depends on

Nothing

Here's a refinement of the above with two innovations: it takes any numerical value via anyelement. It also culls nulls which requires an extra pass so will be a bit slower. Also, I switched it to return float8 which is mainly a cosmetic preference. This snippet is also part of the ulib_agg user-defined library.

CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$ 
  WITH q AS
  (
     SELECT val
     FROM unnest($1) val
     WHERE VAL IS NOT NULL
     ORDER BY 1
  ),
  cnt AS
  (
    SELECT COUNT(*) AS c FROM q
  )
  SELECT AVG(val)::float8
  FROM 
  (
    SELECT val FROM q
    LIMIT  2 - MOD((SELECT c FROM cnt), 2)
    OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)  
  ) q2;
$$ LANGUAGE sql IMMUTABLE;
 
CREATE AGGREGATE median(anyelement) (
  SFUNC=array_append,
  STYPE=anyarray,
  FINALFUNC=_final_median,
  INITCOND='{}'
);

Usage

SELECT median(value) AS median_value FROM t;

Caution

If you are stuck with PostgreSQL 8.3 or lower the following modification to the above median aggregate will work (note you first have to define unnest and create the PL/pgsql language (CREATE LANGUAGE plpgsql;) if you haven't already. Also note more efficient median functions are possible on unsorted arrays, e.g., in O(N) using a selection algorithm vs this O(N lg N) that first comparison sorts the array.

CREATE OR REPLACE FUNCTION final_median(anyarray) RETURNS float8 AS
$$ 
DECLARE
  cnt INTEGER;
BEGIN
  cnt := (SELECT count(*) FROM unnest($1) val WHERE val IS NOT NULL);
  RETURN (SELECT avg(tmp.val)::float8 
            FROM (SELECT val FROM unnest($1) val
                    WHERE val IS NOT NULL 
                    ORDER BY 1 
                    LIMIT 2 - MOD(cnt, 2) 
                    OFFSET CEIL(cnt/ 2.0) - 1
                  ) AS tmp
         );
END
$$ LANGUAGE plpgsql;
 
CREATE AGGREGATE median(anyelement) (
  SFUNC=array_append,
  STYPE=anyarray,
  FINALFUNC=final_median,
  INITCOND='{}'
);

External implementations

There are many compiled versions for "external" median calculus:

See Also

Aggregate Mode Aggregate Median

Personal tools