If I understand you correctly, I can see a fairly straightforward way, and I think I can describe it clearly. I'm pretty sure I can't express it in SQL today, but I'll leave this tab open in my browser, and try to get to it tomorrow if nobody else has contributed.
╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED ║ 4 ║ 5 ║
║ RED ║ 5 ║ 3 ║
║ RED ║ 6 ║ 1 ║
║ BLUE ║ 5 ║ 4 ║
║ BLUE ║ 6 ║ 5 ║
╚═══════╩═════════════╩═══════════════╝
First, calculate which measurement represents the median. You can do that based solely on the counts. For example, for red, there are nine measurements altogether. The median measurement will be the 5th measurement. SQL for this should be straightforward.
Second, I think you can determine which row the median measurement is in with analytic functions. For red, you determine which row the 5th measurement is in; it's in the first row. This is a little like a "running balance" problem. The value of the "measurement" column in that row is the value you're trying to determine.
The wall of code (in standard SQL, I think)
"Unwinding" the aggregates is expensive. So this might not be useful to you. I relied on common table expressions to reduce the load on my brain.
with measurements as (
select 'red' color, 4 measurement, 5 measure_count union all
select 'red' color, 5 measurement, 3 measure_count union all
select 'red' color, 6 measurement, 1 measure_count union all
select 'blue' color, 5 measurement, 4 measure_count union all
select 'blue' color, 6 measurement, 5 measure_count union all
-- Added green, even number of measurements, median should be 5.5.
select 'green' color, 5 measurement, 4 measure_count union all
select 'green' color, 6 measurement, 4 measure_count union all
-- Added yellow, extreme differences in measurements, median should be 6.
select 'yellow' color, 6 measurement, 2 measure_count union all
select 'yellow' color, 100 measurement, 1 measure_count
)
, measurement_starts as (
select
*,
sum(measure_count) over (partition by color order by measurement) total_rows_so_far
from measurements
)
, extended_measurements as (
select
color, measurement, measure_count,
coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + 1 measure_start_row,
coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + measure_count measure_end_row
from measurement_starts
)
, median_row_range as (
select color,
sum(measure_count) num_measurements,
ceiling(sum(measure_count)/2.0) start_measurement,
case
sum(measure_count) % 2 = 0
when true then ceiling(sum(measure_count)/2.0)+1
else ceiling(sum(measure_count)/2.0)
end
end_measurement
from measurements
group by color
)
, median_row_values as (
select m.color, c.measurement
from median_row_range m
inner join extended_measurements c
on c.color = m.color
and m.start_measurement between c.measure_start_row and c.measure_end_row
union all
select m.color, c.measurement
from median_row_range m
inner join extended_measurements c
on c.color = m.color
and m.end_measurement between c.measure_start_row and c.measure_end_row
)
select color, avg(measurement)
from median_row_values
group by color
order by color;
blue 6.00
green 5.50
red 4.00
yellow 6.00
The CTE "extended_measurements" expands the table of measurements to include the starting "row" number and the ending "row" number that you'd find with unaggregated data.
color measurement measure_count measure_start_row measure_end_row
--
blue 5 4 1 4
blue 6 5 5 9
green 5 4 1 4
green 6 4 5 8
red 4 5 1 5
red 5 3 6 8
red 6 1 4 4
yellow 6 2 1 2
yellow 100 1 3 3
The CTE "median_row_range" determines the starting "row" and ending "row" for the median.
color num_measurements start_measurement end_measurement
--
blue 9 5 5
green 8 4 5
red 9 5 5
yellow 3 2 2
This means that the median for 'blue' can be calculated as the average of 5th "row" and the 5th "row". That is, the median for 'blue' is simply the 5th value. The median for green is the average of the 4th "row" and 5th "row".