Aggregate Median

From PostgreSQL wiki

(Difference between revisions)
Jump to: navigation, search
m (Added "GREATEST( xxxx, 0) to offset -- if there are no not null values the offset would be -1 which is invalid)
(restructuring)
Line 1: Line 1:
{{SnippetInfo|Aggregate Median|version=8.4|lang=SQL}}
+
The [http://en.wikipedia.org/wiki/Median statistical median] is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half. A median is only defined on ordered one-dimensional data, and is independent of any distance metric.  
  
by Scott Bailey 'Artacus'
+
== median(numeric) ==
 +
{{SnippetInfo2|Aggregate Median|version=8.4|lang=SQL}}
  
Unlike mode() and range(), this one will only work with PostgreSQL 8.4 and higher due to the dynamic limit and offset.
+
This snippet is also part of the [[ulib_agg|ulib_agg user-defined library]].
 
+
NOTE: This version does not take NULL values into consideration, e.g., median('{1,NULL,NULL}') is NULL.
+
  
 
<source lang="sql">
 
<source lang="sql">
Line 30: Line 29:
 
</source>
 
</source>
  
By Merlin Moncure:
+
=== Usage ===
 +
<source lang="sql">SELECT median(num_value) AS median_value FROM t;</source>
 +
 
 +
=== Caution ===
 +
 
 +
Unlike <tt>mode()</tt> and <tt>range()</tt>, this one will only work with PostgreSQL 8.4 and higher due to the dynamic limit and offset.
 +
 
 +
NOTE: This version does not take NULL values into consideration, e.g., median('{1,NULL,NULL}') is NULL.
 +
 
 +
==median(anyelement)==
 +
{{SnippetInfo2|Aggregate Median|version=8.4|lang=SQL}}
 +
 
 
Here's a refinement of the above with two innovations: it takes any numerical value via anyelement.  It also culls nulls which requires an extra pass so will be a bit slower. Also, I switched it to return float8 which is mainly a cosmetic preference.
 
Here's a refinement of the above with two innovations: it takes any numerical value via anyelement.  It also culls nulls which requires an extra pass so will be a bit slower. Also, I switched it to return float8 which is mainly a cosmetic preference.
 +
This snippet is also part of the [[ulib_agg|ulib_agg user-defined library]].
  
 
<source lang="sql">
 
<source lang="sql">
CREATE OR REPLACE FUNCTION _final_median(anyarray)
+
CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$  
  RETURNS float8 AS
+
$$  
+
 
   WITH q AS
 
   WITH q AS
 
   (
 
   (
Line 55: Line 64:
 
     OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)   
 
     OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)   
 
   ) q2;
 
   ) q2;
$$
+
$$ LANGUAGE sql IMMUTABLE;
LANGUAGE sql IMMUTABLE;
+
  
 
CREATE AGGREGATE median(anyelement) (
 
CREATE AGGREGATE median(anyelement) (
Line 65: Line 73:
 
);
 
);
 
</source>
 
</source>
 +
 +
=== Usage ===
 +
<source lang="sql">SELECT median(value) AS median_value FROM t;</source>
 +
 +
=== Caution ===
  
 
If you are stuck with PostgreSQL 8.3 or lower the following modification to the above median aggregate will work (note you first have to define [[Array_Unnest | unnest]] and create the PL/pgsql language (CREATE LANGUAGE plpgsql;) if you haven't already.  Also note more efficient median functions are possible on unsorted arrays, e.g., in O(N) using a [http://en.wikipedia.org/wiki/Selection_algorithm selection algorithm] vs this O(N lg N) that first comparison sorts the array.
 
If you are stuck with PostgreSQL 8.3 or lower the following modification to the above median aggregate will work (note you first have to define [[Array_Unnest | unnest]] and create the PL/pgsql language (CREATE LANGUAGE plpgsql;) if you haven't already.  Also note more efficient median functions are possible on unsorted arrays, e.g., in O(N) using a [http://en.wikipedia.org/wiki/Selection_algorithm selection algorithm] vs this O(N lg N) that first comparison sorts the array.
Line 83: Line 96:
 
         );
 
         );
 
END
 
END
$$
+
$$ LANGUAGE plpgsql;
LANGUAGE plpgsql;
+
  
 
CREATE AGGREGATE median(anyelement) (
 
CREATE AGGREGATE median(anyelement) (
Line 93: Line 105:
 
);
 
);
 
</source>
 
</source>
=== See Also ===
 
  
[[Aggregate Mode]]
 
[[Aggregate Median]]
 
  
 +
== C Version ==
 +
There are many compiled versions for "external" ''median'' calculus:
 +
* C code for [[libpq]]:
 +
** [[Aggregate_Median/C code]]
 +
** [http://madlib.net/ MADlib]
 +
* [http://www.joeconway.com/plr/ PL/R]
 +
* ...
  
 +
== See Also ==
  
= C Version =
+
[[Aggregate Mode]]
by Chris Mayfield
+
[[Aggregate Median]]
  
I needed a quick and easy median aggregate for millions of groups of around 10 values each.  Here's a C version that is application specific, however one could change the data type (currently float4), buffer length (14), etc.  This code is a contrib module named statfunc, which maybe someday will implement other things (e.g., quantiles).  There's also http://poststat.projects.postgresql.org/ and http://www.joeconway.com/plr/ for doing statistics in PostgreSQL.
 
 
After posting this code I discovered a related discussion on pgsql-hackers: http://markmail.org/message/wr6rmfd56rhpwnij
 
 
<source lang="sql">
 
--
 
-- SQL interface to statfunc library
 
--
 
 
DROP AGGREGATE IF EXISTS med(real);
 
DROP FUNCTION IF EXISTS med_trans(state bytea, next real);
 
DROP FUNCTION IF EXISTS med_final(state bytea);
 
 
CREATE FUNCTION med_trans(state bytea, next real)
 
RETURNS bytea AS '$libdir/statfunc'
 
LANGUAGE C IMMUTABLE;
 
 
COMMENT ON FUNCTION med_trans(state bytea, next real)
 
IS 'accumulates non-null values into an array';
 
 
CREATE FUNCTION med_final(state bytea)
 
RETURNS real AS '$libdir/statfunc'
 
LANGUAGE C IMMUTABLE STRICT;
 
 
COMMENT ON FUNCTION med_final(state bytea)
 
IS 'sorts the array and returns the median';
 
 
CREATE AGGREGATE med(real) (
 
  SFUNC = med_trans,
 
  STYPE = bytea,
 
  FINALFUNC = med_final
 
);
 
 
COMMENT ON AGGREGATE med(real)
 
IS 'median of all input values, excluding nulls';
 
</source>
 
 
<source lang="c">
 
/*
 
* statfunc.c - statistical functions for PostgreSQL
 
*/
 
 
#include "postgres.h"
 
#include "funcapi.h"
 
 
#define MAXLEN 14
 
 
/*
 
* Internal state: simple array of floats.
 
*/
 
typedef struct {
 
int32 vl_len_;        /* varlena header (do not touch directly!) */
 
int32 nval;          /* number of values accumulated */
 
float4 vals[MAXLEN];  /* array of non-null values */
 
} State;
 
 
/*
 
* Ensures that the library is dynamically loaded into a compatible server.
 
*/
 
#ifdef PG_MODULE_MAGIC
 
PG_MODULE_MAGIC;
 
#endif
 
 
/*----------------------------------------------------------------------------*\
 
  C Implementation
 
\*----------------------------------------------------------------------------*/
 
 
/*
 
* Accumulates non-null values into an array.
 
*/
 
static State *c_med_trans(State *state, float4 next) {
 
if (state == NULL) {
 
/* first value; allocate state */
 
state = palloc0(sizeof(State));
 
SET_VARSIZE(state, sizeof(State));
 
state->nval = 1;
 
state->vals[0] = next;
 
} else {
 
/* next value; check capacity */
 
if (state->nval == MAXLEN) {
 
ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 
errmsg("need to increase MAXLEN in statfunc.c")));
 
}
 
state->vals[state->nval] = next;
 
state->nval++;
 
}
 
return state;
 
}
 
 
/*
 
* Comparison function for qsort.
 
*/
 
static int float4cmp(const void *p1, const void *p2) {
 
float4 f1 = *(const float4 *) p1;
 
float4 f2 = *(const float4 *) p2;
 
if (f1 < f2)
 
return -1;
 
if (f1 > f2)
 
return 1;
 
return 0;
 
}
 
 
/*
 
* Sorts the array and returns the median.
 
*/
 
static float4 c_med_final(State *state) {
 
int32 mid;
 
float4 ret;
 
qsort(state->vals, state->nval, sizeof(float4), float4cmp);
 
mid = state->nval / 2;
 
if (state->nval % 2) {
 
/* odd number of elements */
 
ret = state->vals[mid];
 
} else {
 
/* even number of elements */
 
ret = (state->vals[mid] + state->vals[mid - 1]) / 2;
 
}
 
return ret;
 
}
 
 
/*----------------------------------------------------------------------------*\
 
  Postgres V1 Function Prototypes
 
\*----------------------------------------------------------------------------*/
 
 
Datum med_trans(PG_FUNCTION_ARGS);
 
PG_FUNCTION_INFO_V1(med_trans);
 
 
Datum med_final(PG_FUNCTION_ARGS);
 
PG_FUNCTION_INFO_V1(med_final);
 
 
/*----------------------------------------------------------------------------*\
 
  SQL Interface (Wrappers)
 
\*----------------------------------------------------------------------------*/
 
 
Datum med_trans(PG_FUNCTION_ARGS) {
 
State *state;
 
float4 next;
 
if (!fcinfo->context || !IsA(fcinfo->context, AggState)) {
 
ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 
errmsg("med_trans() - must call from aggregate")));
 
}
 
if (PG_ARGISNULL(0)) {
 
state = NULL; /* new group */
 
} else {
 
state = (State *) PG_GETARG_BYTEA_P(0);
 
}
 
if (PG_ARGISNULL(1)) {
 
/* discard NULL input values */
 
} else {
 
next = PG_GETARG_FLOAT4(1);
 
state = c_med_trans(state, next);
 
}
 
/* return the updated state */
 
if (state == NULL) {
 
PG_RETURN_NULL();
 
} else {
 
PG_RETURN_BYTEA_P(state);
 
}
 
}
 
 
Datum med_final(PG_FUNCTION_ARGS) {
 
State *state;
 
float4 ret;
 
if (!fcinfo->context || !IsA(fcinfo->context, AggState)) {
 
ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 
errmsg("med_final() - must call from aggregate")));
 
}
 
state = (State *) PG_GETARG_BYTEA_P(0);
 
ret = c_med_final(state);
 
PG_RETURN_FLOAT4(ret);
 
}
 
</source>
 
 
<source lang="perl">
 
#
 
# Makefile for building PostgreSQL extension modules
 
#
 
 
MODULE_big = statfunc
 
OBJS = statfunc.o
 
 
DATA =
 
DOCS =
 
 
REGRESS =
 
 
ifdef USE_PGXS
 
PG_CONFIG = pg_config
 
PGXS := $(shell $(PG_CONFIG) --pgxs)
 
include $(PGXS)
 
else
 
subdir = contrib/statfunc
 
top_builddir = ../..
 
include $(top_builddir)/src/Makefile.global
 
include $(top_srcdir)/contrib/contrib-global.mk
 
endif
 
</source>
 
  
 
[[Category:SQL]]
 
[[Category:SQL]]
 +
[[Category:{{{category|}}} Snippets]]

Revision as of 10:59, 25 April 2013

The statistical median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half. A median is only defined on ordered one-dimensional data, and is independent of any distance metric.

Contents

median(numeric)

Snippets

Aggregate Median

Works with PostgreSQL

8.4

Written in

SQL

Depends on

Nothing

This snippet is also part of the ulib_agg user-defined library.

CREATE OR REPLACE FUNCTION _final_median(numeric[])
   RETURNS numeric AS
$$
   SELECT AVG(val)
   FROM (
     SELECT val
     FROM unnest($1) val
     ORDER BY 1
     LIMIT  2 - MOD(array_upper($1, 1), 2)
     OFFSET CEIL(array_upper($1, 1) / 2.0) - 1
   ) sub;
$$
LANGUAGE 'sql' IMMUTABLE;
 
CREATE AGGREGATE median(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=_final_median,
  INITCOND='{}'
);

Usage

SELECT median(num_value) AS median_value FROM t;

Caution

Unlike mode() and range(), this one will only work with PostgreSQL 8.4 and higher due to the dynamic limit and offset.

NOTE: This version does not take NULL values into consideration, e.g., median('{1,NULL,NULL}') is NULL.

median(anyelement)

Snippets

Aggregate Median

Works with PostgreSQL

8.4

Written in

SQL

Depends on

Nothing

Here's a refinement of the above with two innovations: it takes any numerical value via anyelement. It also culls nulls which requires an extra pass so will be a bit slower. Also, I switched it to return float8 which is mainly a cosmetic preference. This snippet is also part of the ulib_agg user-defined library.

CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$ 
  WITH q AS
  (
     SELECT val
     FROM unnest($1) val
     WHERE VAL IS NOT NULL
     ORDER BY 1
  ),
  cnt AS
  (
    SELECT COUNT(*) AS c FROM q
  )
  SELECT AVG(val)::float8
  FROM 
  (
    SELECT val FROM q
    LIMIT  2 - MOD((SELECT c FROM cnt), 2)
    OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)  
  ) q2;
$$ LANGUAGE sql IMMUTABLE;
 
CREATE AGGREGATE median(anyelement) (
  SFUNC=array_append,
  STYPE=anyarray,
  FINALFUNC=_final_median,
  INITCOND='{}'
);

Usage

SELECT median(value) AS median_value FROM t;

Caution

If you are stuck with PostgreSQL 8.3 or lower the following modification to the above median aggregate will work (note you first have to define unnest and create the PL/pgsql language (CREATE LANGUAGE plpgsql;) if you haven't already. Also note more efficient median functions are possible on unsorted arrays, e.g., in O(N) using a selection algorithm vs this O(N lg N) that first comparison sorts the array.

CREATE OR REPLACE FUNCTION final_median(anyarray) RETURNS float8 AS
$$ 
DECLARE
  cnt INTEGER;
BEGIN
  cnt := (SELECT count(*) FROM unnest($1) val WHERE val IS NOT NULL);
  RETURN (SELECT avg(tmp.val)::float8 
            FROM (SELECT val FROM unnest($1) val
                    WHERE val IS NOT NULL 
                    ORDER BY 1 
                    LIMIT 2 - MOD(cnt, 2) 
                    OFFSET CEIL(cnt/ 2.0) - 1
                  ) AS tmp
         );
END
$$ LANGUAGE plpgsql;
 
CREATE AGGREGATE median(anyelement) (
  SFUNC=array_append,
  STYPE=anyarray,
  FINALFUNC=final_median,
  INITCOND='{}'
);


C Version

There are many compiled versions for "external" median calculus:

See Also

Aggregate Mode Aggregate Median

Personal tools