Aggregate Median
NOTE: This page is obsolete.
To get the median in PostgreSQL, use percentile_cont(0.5) WITHIN GROUP (ORDER BY num_value). See the documentation for more details. However ordered-set aggregates cannot be used as Window functions, while the alternatives given below can be.
The statistical median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half. A median is only defined on ordered one-dimensional data, and is independent of any distance metric.
A more efficient implementation of medians can be obtained by using the quantile extension with quantile(some_column,0.5)
.
median(numeric)
This snippet is also part of the ulib_agg user-defined library.
CREATE OR REPLACE FUNCTION _final_median(numeric[])
RETURNS numeric AS
$$
SELECT AVG(val)
FROM (
SELECT val
FROM unnest($1) val
ORDER BY 1
LIMIT 2 - MOD(array_upper($1, 1), 2)
OFFSET CEIL(array_upper($1, 1) / 2.0) - 1
) sub;
$$
LANGUAGE 'sql' IMMUTABLE;
CREATE AGGREGATE median(numeric) (
SFUNC=array_append,
STYPE=numeric[],
FINALFUNC=_final_median,
INITCOND='{}'
);
Usage
SELECT median(num_value) AS median_value FROM t;
Caution
Unlike mode() and range(), this one will only work with PostgreSQL 8.4 and higher due to the dynamic limit and offset.
NOTE: This version does not take NULL values into consideration, e.g., median('{1,NULL,NULL}') is NULL.
This implementation is also quite slow. On tables of ~500K elements orafce implementation is ~400 times faster.
median(anyelement)
Here's a refinement of the above with two innovations: it takes any numerical value via anyelement. It also culls nulls which requires an extra pass so will be a bit slower. Also, I switched it to return float8 which is mainly a cosmetic preference.
This snippet is also part of the ulib_agg user-defined library.
CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$
WITH q AS
(
SELECT val
FROM unnest($1) val
WHERE VAL IS NOT NULL
ORDER BY 1
),
cnt AS
(
SELECT COUNT(*) as c FROM q
)
SELECT AVG(val)::float8
FROM
(
SELECT val FROM q
LIMIT 2 - MOD((SELECT c FROM cnt), 2)
OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)
) q2;
$$ LANGUAGE sql IMMUTABLE;
CREATE AGGREGATE median(anyelement) (
SFUNC=array_append,
STYPE=anyarray,
FINALFUNC=_final_median,
INITCOND='{}'
);
Usage
SELECT median(value) AS median_value FROM t;
Caution
If you are stuck with PostgreSQL 8.3 or lower the following modification to the above median aggregate will work (note you first have to define unnest and create the PL/pgsql language (CREATE LANGUAGE plpgsql;) if you haven't already. Also note more efficient median functions are possible on unsorted arrays, e.g., in O(N) using a selection algorithm vs this O(N lg N) that first comparison sorts the array.
CREATE OR REPLACE FUNCTION final_median(anyarray) RETURNS float8 AS
$$
DECLARE
cnt INTEGER;
BEGIN
cnt := (SELECT count(*) FROM unnest($1) val WHERE val IS NOT NULL);
RETURN (SELECT avg(tmp.val)::float8
FROM (SELECT val FROM unnest($1) val
WHERE val IS NOT NULL
ORDER BY 1
LIMIT 2 - MOD(cnt, 2)
OFFSET CEIL(cnt/ 2.0) - 1
) AS tmp
);
END
$$ LANGUAGE plpgsql;
CREATE AGGREGATE median(anyelement) (
SFUNC=array_append,
STYPE=anyarray,
FINALFUNC=final_median,
INITCOND='{}'
);
External implementations
There are many compiled versions for "external" median calculus:
- C code for libpq:
- PL/R (the R Language is also a statistical package)
- orafce has quite fast median implementation. It's packages are also available in Debian/Ubuntu repositories as postgresql-x.x-orafce.
- ...