Another approach to percentiles?

https://stackoverflow.com/questions/16450779

14-04-2022

Question

I have a dataset which essentially consists of a list of job batches, the number of jobs contained in each batch, and the duration of each job batch. Here is a sample dataset:

CREATE TABLE test_data
(
   batch_id    NUMBER,
   job_count   NUMBER,
   duration    NUMBER
);

INSERT INTO test_data VALUES (1, 37, 9);
INSERT INTO test_data VALUES (2, 47, 4);
INSERT INTO test_data VALUES (3, 66, 6);
INSERT INTO test_data VALUES (4, 46, 6);
INSERT INTO test_data VALUES (5, 54, 1);
INSERT INTO test_data VALUES (6, 35, 1);
INSERT INTO test_data VALUES (7, 55, 9);
INSERT INTO test_data VALUES (8, 82, 7);
INSERT INTO test_data VALUES (9, 12, 9);
INSERT INTO test_data VALUES (10, 52, 4);
INSERT INTO test_data VALUES (11, 3, 9);
INSERT INTO test_data VALUES (12, 90, 2);

Now, I want to calculate some percentiles for the duration field. Typically, this is done with something like the following:

SELECT
       PERCENTILE_DISC( 0.75 )
          WITHIN GROUP (ORDER BY duration ASC)
          AS third_quartile
FROM
       test_data;

(Which gives the result of 9)

My problem is that I don't want the percentiles based on batches; I want them based on individual jobs. I can work this out by hand quite easily by generating a running total of job_count:

SELECT
       batch_id,
       job_count,
       SUM(
            job_count
       )
       OVER (
              ORDER BY duration
              ROWS UNBOUNDED PRECEDING
             )
          AS total_jobs,
        duration
FROM
       test_data
ORDER BY
       duration ASC;

BATCH_ID     JOB_COUNT    TOTAL_JOBS   DURATION     
6            35           35           1            
5            54           89           1            
12           90           179          2            
2            47           226          4            
10           52           278          4            
3            66           344          6            
4            46           390          6            
8            82           472          7            
9            12           484          9            
1            37           521          9            
11           3            524          9            
7            55           579          9

Since I have 579 jobs, the 75th percentile falls at about job 434 (0.75 × 579 = 434.25). Looking at the result set above, that corresponds to a duration of 7, which differs from what the standard function returns.
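To double-check the arithmetic of this hand calculation outside the database, here is a minimal Python sketch. It is not part of the original question; the names `rows`, `by_duration`, `running` and `target` are made up for illustration, and the rows are copied from the sample data.

```python
from itertools import accumulate

# (batch_id, job_count, duration) rows copied from test_data above
rows = [(1, 37, 9), (2, 47, 4), (3, 66, 6), (4, 46, 6), (5, 54, 1), (6, 35, 1),
        (7, 55, 9), (8, 82, 7), (9, 12, 9), (10, 52, 4), (11, 3, 9), (12, 90, 2)]

# Order the batches by duration and keep a running total of job_count,
# like the SUM(job_count) OVER (ORDER BY duration ...) column.
by_duration = sorted(rows, key=lambda r: r[2])
running = list(accumulate(r[1] for r in by_duration))

total_jobs = running[-1]      # total number of individual jobs
target = 0.75 * total_jobs    # position of the 75th percentile job

# First batch whose running total reaches the target position
third_quartile = next(
    duration
    for (_, _, duration), cum in zip(by_duration, running)
    if cum >= target
)
print(total_jobs, target, third_quartile)
```

The order of batches within the same duration may differ from the result set above, but that does not change which duration the target position lands on.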

Essentially, I want to treat each job in a batch as a separate observation and determine percentiles based on those, instead of on the batches.

Is there a relatively simple way to accomplish this?

Solution

I would think of this as "weighted" percentiles. I don't know if there is a built-in analytic function for this in Oracle, but it is easy enough to calculate. And you are on the way there.

The additional idea is to calculate the total number of jobs, and then use arithmetic to select the value you want. For the 75th percentile, the value is the smallest duration such that the cumulative number of jobs is at least 0.75 times the total number of jobs.

Here is the example in SQL:

select pcs.percentile,
       -- smallest duration whose cumulative job count reaches the percentile
       min(case when cumjobs >= totjobs * percentile then duration end) as duration_at_percentile
from (SELECT batch_id, job_count,
             -- no ROWS clause: batches with the same duration share one cumulative total
             SUM(job_count) OVER (ORDER BY duration) as cumjobs,
             sum(job_count) over () as totjobs,
             duration
      FROM test_data
     ) t cross join
     (select 0.25 as percentile from dual union all
      select 0.5 from dual union all
      select 0.75 from dual
     ) pcs
group by pcs.percentile;

This example gives you the percentile values (and, as an added bonus, for three different percentiles), with each percentile on its own row. If you want the values attached to every row of your data, you need to join this result back to your original table.
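The same "weighted percentile" rule can be sketched in plain Python to sanity-check what the query should return for the sample data. This is only an illustration, not a replacement for the SQL; the function name `weighted_percentiles` and its argument layout are assumptions of this sketch.

```python
from collections import Counter

def weighted_percentiles(rows, percentiles):
    """Smallest duration whose cumulative job count reaches p * total jobs.

    rows is an iterable of (job_count, duration) pairs (hypothetical layout).
    """
    # Pool batches that share a duration, as the cumulative sum in the query does.
    jobs_per_duration = Counter()
    for job_count, duration in rows:
        jobs_per_duration[duration] += job_count

    total = sum(jobs_per_duration.values())
    result = {}
    for p in percentiles:
        cumulative = 0
        for duration in sorted(jobs_per_duration):
            cumulative += jobs_per_duration[duration]
            if cumulative >= total * p:
                result[p] = duration
                break
    return result

# (job_count, duration) pairs copied from test_data
sample = [(37, 9), (47, 4), (66, 6), (46, 6), (54, 1), (35, 1),
          (55, 9), (82, 7), (12, 9), (52, 4), (3, 9), (90, 2)]

quartiles = weighted_percentiles(sample, [0.25, 0.5, 0.75])
print(quartiles)  # {0.25: 2, 0.5: 6, 0.75: 7}
```

For the sample data the three quartiles come out as 2, 6 and 7, so the 75th percentile matches the value found by hand in the question.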

Other tips

OK, I think I have your answer. The idea is mine; the implementation is borrowed from this Ask Tom article.

SELECT PERCENTILE_DISC( 0.75 )
       WITHIN GROUP (ORDER BY duration ASC)
       AS third_quartile
FROM(
-- generate the numbers 1 .. max(job_count)
with data as
  (select level l
   from dual, (select max(job_count) max_jobs from test_data)
   connect by level <= max_jobs
  )
  -- repeat each batch row job_count times, giving one row per job
  select *
  from test_data, data
  where l <= job_count
  --ORDER BY duration, batch_id
  ) inner
;

Here is SQL Fiddle.
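The row-expansion idea can also be checked with a few lines of Python. The helper `percentile_disc` below is a hypothetical stand-in that imitates a discrete percentile (the smallest value whose cumulative share of the observations is at least p); it is a sketch of that rule, not Oracle's implementation.

```python
import math

def percentile_disc(values, p):
    """Discrete percentile: smallest value covering at least a fraction p of the data."""
    ordered = sorted(values)
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]

# (job_count, duration) pairs copied from test_data
batches = [(37, 9), (47, 4), (66, 6), (46, 6), (54, 1), (35, 1),
           (55, 9), (82, 7), (12, 9), (52, 4), (3, 9), (90, 2)]

# One observation per job: repeat each batch's duration job_count times,
# which is what joining test_data to the generated numbers achieves.
per_job = [duration for job_count, duration in batches for _ in range(job_count)]

batch_level = percentile_disc([duration for _, duration in batches], 0.75)
job_level = percentile_disc(per_job, 0.75)
print(len(per_job), batch_level, job_level)  # 579 9 7
```

The expansion produces one row per job, so its size grows with the total job count; for large batches the cumulative-sum approach avoids materializing those rows.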

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow