Question

[tbl_votes]
- id <!-- unique id of the vote -->
- item_id <!-- vote belongs to item <id> -->
- vote <!-- number 1-10 -->

Of course we can fix this by getting:

  • the smallest observation (so)
  • the lower quartile (lq)
  • the median (me)
  • the upper quartile (uq)
  • and the largest observation (lo)

...one by one using multiple queries, but I am wondering whether it can be done with a single query.

In Oracle I can use COUNT OVER and RATIO_TO_REPORT, but these are not supported in MySQL.

For those who don't know what a boxplot is: http://en.wikipedia.org/wiki/Box_plot

Any help would be appreciated.

Solution

Here is an example that calculates the quartiles of the e256 values within each e32 group. An index on (e32, e256) is a must in this case, since the query depends on the rows being read in (e32, e256) order:

SELECT
  @group:=IF(e32=@group, e32, GREATEST(@index:=-1, e32)) as e32_,
  MIN(e256) as so,
  MAX(IF(lq_i=(@index:=@index+1), e256, NULL)) as lq,
  MAX(IF(me_i=@index, e256, NULL)) as me,
  MAX(IF(uq_i=@index, e256, NULL)) as uq,
  MAX(e256) as lo
FROM (SELECT @index:=NULL, @group:=NULL) as init, test t
JOIN (
  SELECT e32,
    COUNT(*) as cnt,
    (COUNT(*) div 4) as lq_i,    -- lq value index within the group
    (COUNT(*) div 2) as me_i,    -- me value index within the group
    (COUNT(*) * 3 div 4) as uq_i -- uq value index within the group
  FROM test
  GROUP BY e32
) as cnts
USING (e32)
GROUP BY e32;

If no grouping is needed, the query becomes slightly simpler.

P.S. test is my playground table of random values where e32 is the result of Python's int(random.expovariate(1.0) * 32), etc.
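To make the index arithmetic easier to follow, here is a rough Python sketch of what the query above is meant to compute: within each group the values are sorted, and the rows at positions COUNT(*) div 4, COUNT(*) div 2 and COUNT(*) * 3 div 4 (counting from 0) are picked. The function name boxplot_by_group and the sample rows are made up for illustration; this is not a replacement for the SQL.

```python
from collections import defaultdict

def boxplot_by_group(rows):
    """rows: iterable of (group, value) pairs; returns {group: (so, lq, me, uq, lo)}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)

    result = {}
    for group, values in groups.items():
        values.sort()            # the SQL gets this order from the (e32, e256) index
        n = len(values)
        result[group] = (
            values[0],           # so: MIN(e256)
            values[n // 4],      # lq: row at index COUNT(*) div 4
            values[n // 2],      # me: row at index COUNT(*) div 2
            values[n * 3 // 4],  # uq: row at index COUNT(*) * 3 div 4
            values[-1],          # lo: MAX(e256)
        )
    return result

# Hypothetical sample data: group 0 has eight values, group 1 has three.
rows = [(0, v) for v in (5, 1, 4, 2, 3, 8, 7, 6)] + [(1, v) for v in (10, 30, 20)]
summary = boxplot_by_group(rows)
print(summary)
```

Note that this picks existing observations rather than interpolating between neighbours, so for small groups the quartiles can coincide with the minimum or maximum.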

OTHER TIPS

I've found a solution in PostgreSQL using PL/Python.

However, I'm leaving the question open in case someone else comes up with a solution in MySQL.

CREATE TYPE boxplot_values AS (
  min       numeric,
  q1        numeric,
  median    numeric,
  q3        numeric,
  max       numeric
);

CREATE OR REPLACE FUNCTION _final_boxplot(strarr numeric[])
   RETURNS boxplot_values AS
$$
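    # Older PL/Python versions pass the array as a string such as "{1,2,3}";
    # PostgreSQL 9.0 and later pass it as a Python list.
    if isinstance(strarr, str):
        a = [float(v) for v in strarr.strip("{}").split(",")]
    else:
        a = list(strarr)

    a.sort()
    n = len(a)
    # Floor division keeps the indexes integral in both Python 2 and 3
    return ( a[0], a[n//4], a[n//2], a[n*3//4], a[-1] )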
$$
LANGUAGE 'plpythonu' IMMUTABLE;

CREATE AGGREGATE boxplot(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=_final_boxplot,
  INITCOND='{}'
);

Example:

SELECT customer_id as cid, (boxplot(price)).*
FROM orders
GROUP BY customer_id;

   cid |   min   |   q1    | median  |   q3    |   max
-------+---------+---------+---------+---------+---------
  1001 | 7.40209 | 7.80031 |  7.9551 | 7.99059 | 7.99903
  1002 | 3.44229 | 4.38172 | 4.72498 | 5.25214 | 5.98736

Source: http://www.christian-rossow.de/articles/PostgreSQL_boxplot_median_quartiles_aggregate_function.php

Well, I can do it in two queries. The first query gets the row positions of the quartiles; the second plugs those positions into LIMIT as offsets (2, 5 and 8 for my data) to fetch the values.

mysql> select floor(count(*)/4) as first_q,
    ->        floor(count(*)/2) as mid_pos,
    ->        floor(count(*)/4*3) as third_q
    ->   from customer_data;

mysql> select min(measure),
    ->        (select measure from customer_data order by measure limit 2,1) as firstq,
    ->        (select measure from customer_data order by measure limit 5,1) as median,
    ->        (select measure from customer_data order by measure limit 8,1) as last_q,
    ->        max(measure)
    ->   from customer_data;
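For anyone who wants to script the two steps, here is a small sketch of the same idea driven from Python. It uses the standard-library sqlite3 module with an in-memory table purely as a stand-in for MySQL, and the eleven sample values are invented; only the table and column names come from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_data (measure INTEGER)")
conn.executemany(
    "INSERT INTO customer_data (measure) VALUES (?)",
    [(v,) for v in (50, 10, 110, 30, 70, 90, 20, 40, 60, 80, 100)],  # made-up data
)

# Step 1: row positions of the quartiles (SQLite integer division truncates).
first_q, mid_pos, third_q = conn.execute(
    "SELECT COUNT(*) / 4, COUNT(*) / 2, COUNT(*) * 3 / 4 FROM customer_data"
).fetchone()

def value_at(offset):
    # Step 2: fetch the single row at the given position in sorted order.
    return conn.execute(
        "SELECT measure FROM customer_data ORDER BY measure LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()[0]

lo, hi = conn.execute(
    "SELECT MIN(measure), MAX(measure) FROM customer_data"
).fetchone()
five = (lo, value_at(first_q), value_at(mid_pos), value_at(third_q), hi)
print(five)
```

Passing the offset as a bound parameter means the positions from the first step never have to be pasted into the second query by hand.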

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow