How can I return the numerical boxplot data of all results using a single MySQL query?
14-04-2021
Question
[tbl_votes]
- id <!-- unique id of the vote -->
- item_id <!-- vote belongs to item <id> -->
- vote <!-- number 1-10 -->
Of course we can do this by getting:
- the smallest observation (so)
- the lower quartile (lq)
- the median (me)
- the upper quartile (uq)
- and the largest observation (lo)
...one by one using multiple queries, but I am wondering if it can be done with a single query.
In Oracle I can use COUNT OVER and RATIO_TO_REPORT, but these are not supported in MySQL.
For those who don't know what a boxplot is: http://en.wikipedia.org/wiki/Box_plot
Any help would be appreciated.
Solution
Here is an example of calculating the quartiles for e256 value ranges within e32 groups; an index on (e32, e256) is a must in this case:
SELECT
@group:=IF(e32=@group, e32, GREATEST(@index:=-1, e32)) as e32_,
MIN(e256) as so,
MAX(IF(lq_i=(@index:=@index+1), e256, NULL)) as lq,
MAX(IF(me_i=@index, e256, NULL)) as me,
MAX(IF(uq_i=@index, e256, NULL)) as uq,
MAX(e256) as lo
FROM (SELECT @index:=NULL, @group:=NULL) as init, test t
JOIN (
SELECT e32,
COUNT(*) as cnt,
(COUNT(*) div 4) as lq_i, -- lq value index within the group
(COUNT(*) div 2) as me_i, -- me value index within the group
(COUNT(*) * 3 div 4) as uq_i -- uq value index within the group
FROM test
GROUP BY e32
) as cnts
USING (e32)
GROUP BY e32;
If grouping is not needed, the query becomes slightly simpler.
P.S. test is my playground table of random values, where e32 is the result of Python's int(random.expovariate(1.0) * 32), etc.
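To make the index arithmetic in the query easier to follow, here is a minimal Python sketch of the same idea: sort the values within each group and pick the elements at positions cnt div 4, cnt div 2 and cnt * 3 div 4 (0-based). The function name grouped_boxplot is hypothetical, and generating e256 as int(random.expovariate(1.0) * 256) is an assumption based on the "etc." in the P.S.; this is an illustration, not a replacement for the SQL.

```python
import random
from itertools import groupby


def grouped_boxplot(rows):
    """Return {e32: (so, lq, me, uq, lo)} for an iterable of (e32, e256) pairs.

    grouped_boxplot is a hypothetical helper name used only for this sketch.
    """
    result = {}
    # Sorting by (e32, e256) plays the role of the (e32, e256) index in the query.
    for key, group in groupby(sorted(rows), key=lambda row: row[0]):
        values = [value for _, value in group]
        cnt = len(values)
        result[key] = (
            values[0],             # so: smallest observation
            values[cnt // 4],      # lq: value at index cnt div 4
            values[cnt // 2],      # me: value at index cnt div 2
            values[cnt * 3 // 4],  # uq: value at index cnt * 3 div 4
            values[-1],            # lo: largest observation
        )
    return result


# Small hand-checkable input: two groups with 4 and 5 values.
small = [(1, v) for v in (7, 1, 5, 3)] + [(2, v) for v in (10, 20, 30, 40, 50)]
small_summary = grouped_boxplot(small)

# Playground data in the spirit of the P.S. (the e256 formula is assumed).
random.seed(0)
test = [
    (int(random.expovariate(1.0) * 32), int(random.expovariate(1.0) * 256))
    for _ in range(1000)
]
summary = grouped_boxplot(test)
```

With these index positions a group of four values reports its largest element as both uq and lo, which is the same behaviour the integer division in the query produces.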
OTHER TIPS
I've found a solution in PostgreSQL using PL/Python.
However, I am leaving the question open in case someone else comes up with a solution in MySQL.
CREATE TYPE boxplot_values AS (
min numeric,
q1 numeric,
median numeric,
q3 numeric,
max numeric
);
CREATE OR REPLACE FUNCTION _final_boxplot(strarr numeric[])
RETURNS boxplot_values AS
$$
# Assumes PostgreSQL 9.0 or later, where array arguments arrive as Python lists.
a = sorted(strarr)
i = len(a)
# Integer division keeps the indexes valid list positions.
return (a[0], a[i // 4], a[i // 2], a[i * 3 // 4], a[-1])
$$
LANGUAGE 'plpythonu' IMMUTABLE;
CREATE AGGREGATE boxplot(numeric) (
SFUNC=array_append,
STYPE=numeric[],
FINALFUNC=_final_boxplot,
INITCOND='{}'
);
Example:
SELECT customer_id as cid, (boxplot(price)).*
FROM orders
GROUP BY customer_id;
cid | min | q1 | median | q3 | max
-------+---------+---------+---------+---------+---------
1001 | 7.40209 | 7.80031 | 7.9551 | 7.99059 | 7.99903
1002 | 3.44229 | 4.38172 | 4.72498 | 5.25214 | 5.98736
Source: http://www.christian-rossow.de/articles/PostgreSQL_boxplot_median_quartiles_aggregate_function.php
Well, I can do it in two queries. The first query gets the positions of the quartiles, and the second plugs those positions in as LIMIT offsets to fetch the values.
mysql> select floor(count(*)/4) as first_q, floor(count(*)/2) as mid_pos, floor(count(*)/4*3) as third_q from customer_data;
mysql> select min(measure), (select measure from customer_data order by measure limit 2,1) as firstq, (select measure from customer_data order by measure limit 5,1) as median, (select measure from customer_data order by measure limit 8,1) as last_q, max(measure) from customer_data;
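The two-query approach above can be sketched end to end in Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for MySQL, the sample values are made up, and boxplot_two_step is a hypothetical helper name. SQLite divides integers as integers, so the positions are written as COUNT(*) * 3 / 4 rather than with floor().

```python
import sqlite3


def boxplot_two_step(conn):
    """Fetch (min, q1, median, q3, max) of customer_data.measure in two steps.

    boxplot_two_step is a hypothetical helper name used only for this sketch.
    """
    cur = conn.cursor()
    # Step 1: compute the 0-based row positions of the quartiles.
    first_q, mid_pos, third_q = cur.execute(
        "SELECT COUNT(*) / 4, COUNT(*) / 2, COUNT(*) * 3 / 4 FROM customer_data"
    ).fetchone()

    def value_at(offset):
        # Step 2: read the single row at the given offset in sorted order.
        return cur.execute(
            "SELECT measure FROM customer_data ORDER BY measure LIMIT 1 OFFSET ?",
            (offset,),
        ).fetchone()[0]

    smallest, largest = cur.execute(
        "SELECT MIN(measure), MAX(measure) FROM customer_data"
    ).fetchone()
    return (smallest, value_at(first_q), value_at(mid_pos), value_at(third_q), largest)


# Made-up sample data: 11 rows, so the positions come out as 2, 5 and 8.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_data (measure INTEGER)")
conn.executemany(
    "INSERT INTO customer_data VALUES (?)",
    [(v,) for v in (3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)],
)
summary = boxplot_two_step(conn)
print(summary)  # (1, 2, 4, 5, 9)
```

Passing the offsets as bound parameters avoids pasting the numbers from the first query into the second by hand.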