Вопрос

Is there a straightforward way to calculate median for data that is already aggregated by the metric? In other words, I have a table where the measurement is part of the group by, and a count of each measurement is recorded.

CREATE TABLE MEASUREMENTS AS 
SELECT 'RED'  COLOR, 4 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED'  COLOR, 5 MEASUREMENT, 3 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED'  COLOR, 6 MEASUREMENT, 1 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 5 MEASUREMENT, 4 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 6 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL ;

╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED   ║           4 ║             5 ║
║ RED   ║           5 ║             3 ║
║ RED   ║           6 ║             1 ║
║ BLUE  ║           5 ║             4 ║
║ BLUE  ║           6 ║             5 ║
╚═══════╩═════════════╩═══════════════╝

The "natural" solution is to just explode the measurement counts into individuals rows with values, then group-by, using Oracle-delivered MEDIAN - math would look like this:

RED=>(4,4,4,4,4,5,5,5,6), median = 4
BLUE=>(5,5,5,5,6,6,6,6,6), median = 6

But (1) I'm dealing with millions of rows that would explode to 10s of millions of individual measurements and (2) that feels like I'm "undoing and redoing" the mathematically expensive work of median.

Because I'd like to have a view definition on this, and embedding analytics into views tends to cripple execution plans, I'd like to avoid anything like this:

    SELECT  COLOR,
            MIN(MEASUREMENT) MEDIAN_MEASUREMENT
    FROM 
      (        
        SELECT  COLOR, 
                MEASUREMENT, 
                SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR ORDER BY MEASURE_COUNT)  / 
                    SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR) PCT
        FROM    MEASUREMENTS            
      )
    WHERE PCT >=.5  
    GROUP BY COLOR               

I'd much prefer to have something that can be done with straight GROUP BY (example given for AVG), if mathematically possible:

SELECT  COLOR, 
        SUM(MEASUREMENT) / SUM(MEASURE_COUNT) AVG_MEASUREMENT
        -- MEDIAN LOGIC (???) HERE  
FROM    MEASUREMENTS
GROUP BY COLOR
Это было полезно?

Решение

If I understand you correctly, I can see a fairly straightforward way, and I think I can describe it clearly. I'm pretty sure I can't express it in SQL today, but I'll leave this tab open in my browser, and try to get to it tomorrow if nobody else has contributed.

╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED   ║           4 ║             5 ║
║ RED   ║           5 ║             3 ║
║ RED   ║           6 ║             1 ║
║ BLUE  ║           5 ║             4 ║
║ BLUE  ║           6 ║             5 ║
╚═══════╩═════════════╩═══════════════╝

First, calculate which measurement represents the median. You can do that based solely on the counts. For example, for red, there are nine measurements altogether. The median measurement will be the 5th measurement. SQL for this should be straightforward.

Second, I think you can determine which row the median measurement is in with analytic functions. For red, you determine which row the 5th measurement is in; it's in the first row. This is a little like a "running balance" problem. The value of the "measurement" column in that row is the value you're trying to determine.

The wall of code (in standard SQL, I think)

"Unwinding" the aggregates is expensive. So this might not be useful to you. I relied on common table expressions to reduce the load on my brain.

with measurements as (
  select 'red'   color, 4 measurement, 5 measure_count union all
  select 'red'   color, 5 measurement, 3 measure_count union all
  select 'red'   color, 6 measurement, 1 measure_count union all
  select 'blue'  color, 5 measurement, 4 measure_count union all
  select 'blue'  color, 6 measurement, 5 measure_count union all
  -- Added green, even number of measurements, median should be 5.5.
  select 'green' color, 5 measurement, 4 measure_count union all
  select 'green' color, 6 measurement, 4 measure_count union all
  -- Added yellow, extreme differences in measurements, median should be 6.
  select 'yellow' color, 6 measurement, 2 measure_count union all
  select 'yellow' color, 100 measurement, 1 measure_count 
)
, measurement_starts as (
  select 
    *,
    sum(measure_count) over (partition by color order by measurement) total_rows_so_far
  from measurements
)
, extended_measurements as (
  select 
    color, measurement, measure_count,
    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + 1 measure_start_row,
    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + measure_count measure_end_row 
  from measurement_starts
)
, median_row_range as (
  select color, 
    sum(measure_count) num_measurements, 
    ceiling(sum(measure_count)/2.0) start_measurement, 
    case 
      sum(measure_count) % 2 = 0
      when true then ceiling(sum(measure_count)/2.0)+1
      else ceiling(sum(measure_count)/2.0)
    end
    end_measurement
  from measurements
  group by color
)
, median_row_values as (
  select m.color, c.measurement
  from median_row_range m
  inner join extended_measurements c 
          on c.color = m.color 
         and m.start_measurement between c.measure_start_row and c.measure_end_row
  union all
  select m.color, c.measurement
  from median_row_range m
  inner join extended_measurements c 
          on c.color = m.color 
         and m.end_measurement between c.measure_start_row and c.measure_end_row
)
select color, avg(measurement)
from median_row_values
group by color
order by color;

blue    6.00
green   5.50
red     4.00
yellow  6.00

The CTE "extended_measurements" expands the table of measurements to include the starting "row" number and the ending "row" number that you'd find with unaggregated data.

color  measurement  measure_count  measure_start_row  measure_end_row
--
blue   5            4              1                  4
blue   6            5              5                  9
green  5            4              1                  4
green  6            4              5                  8
red    4            5              1                  5
red    5            3              6                  8
red    6            1              4                  4
yellow 6            2              1                  2
yellow 100          1              3                  3

The CTE "median_row_range" determines the starting "row" and ending "row" for the median.

color  num_measurements  start_measurement  end_measurement
--
blue   9                 5                  5
green  8                 4                  5
red    9                 5                  5
yellow 3                 2                  2

This means that the median for 'blue' can be calculated as the average of 5th "row" and the 5th "row". That is, the median for 'blue' is simply the 5th value. The median for green is the average of the 4th "row" and 5th "row".

Другие советы

The idea behind this answer is same as that of Mike's, but varies in execution.

  • First CTE extended_measurements, finds the cumulative sum of counts and the midpoint for each color. If the sum of counts is even, then you should take average of two values. So floor and ceil will give you those points.
  • Second CTE extended_measurements2, tries to find the measurement to which the midpoint corresponds to, by comparing with cumulative sum. This is done for both floor_midpoint and ceil_midpodint. Rank is assigned, since we are interested only in the first record that matches.
  • Final query, selects only the measurement with least ranks and finds the average, which is the MEDIAN value.

SQL Fiddle

Query:

--get the midpoint and cumulative sum of measure_count
with extended_measurements as(
    select color, measurement,
           floor((sum(measure_count) over 
                    (partition by color) + 1) * 0.5)            floor_midpoint,
           ceil((sum(measure_count) over 
                    (partition by color) + 1) * 0.5)            ceil_midpoint,
           sum(measure_count) over 
                    (partition by color order by measurement)   cumltv_sum
    from measurements
),
--assign rank to the measure_count where median lies
extended_measurements2 as(
    select color, measurement,
           case when floor_midpoint <= cumltv_sum
                    then row_number() over (partition by color order by measurement)
                else null
           end r1,
           case when ceil_midpoint <= cumltv_sum
                    then row_number() over (partition by color order by measurement)
                else null
           end r2
    from extended_measurements
)
--get the average of measurements that have least rank
select color, 0.5 * (
                        max(measurement) keep (dense_rank first order by r1) + 
                        max(measurement) keep (dense_rank first order by r2)
                     )  median
from extended_measurements2
group by color
order by color

Result:

|  COLOR | MEDIAN |
|--------|--------|
|   blue |      6 |
|  green |    5.5 |
|    red |      4 |
|  white |      8 |
| yellow |      6 |

Another fiddle for verifying the result of non-aggregated and aggregated data.

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top