Question

We have a table with millions of rows and two columns, X and Y. The two are correlated: when X is below a certain value, Y tends to be 'B'. (This is not always true; it is a trend, not a certainty.)

I want to find the threshold value for X, call it X1, such that at least 99% of the rows with X less than X1 have Y = 'B'.

This is easy to do in application code, but is there a SQL query that can do the computation?

For the dataset below, the expected result is 6, because more than 99% of the rows with X < 6 are 'B' (7 of 7) and there is no larger value of X for which that holds (for X < 7 it is 7 of 8, or 87.5%). If I lower the required precision to 90%, the result becomes 12, because more than 90% of the rows with X < 12 are 'B' (13 of 14, about 92.9%) and again no larger value of X satisfies the condition.

So we need the largest value X1 such that at least 99% of the rows with X less than X1 are 'B'.

X   Y
------
2   B
3   B
3   B
4   B
5   B
5   B
5   B
6   G
7   B
7   B
7   B
8   B
8   B
8   B
12  G
12  G
12  G
12  G
12  G
12  G
12  G
12  G
13  G
13  G
13  B
13  G
13  G
13  G
13  G
13  G
14  B
14  G
14  G

Solution 2

This is mostly inspired by the previous answer, which had some flaws.

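select max(next_x) from
(
    select
        -- running counts over all rows with x <= the current row's x (ties included)
        count(case when y = 'B' then 1 end) over (order by x) correct,
        count(case when y = 'G' then 1 end) over (order by x) wrong,
        -- x of the following row in sorted order; for the last row of each
        -- group of equal x values this is the next larger x
        lead(x) over (order by x) next_x
    from table_name
)
-- the question asks for "at least 99%", hence >= rather than >
where correct / (correct + wrong) >= 0.99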

Sample data:

create table table_name(x number, y varchar2(1));

insert into table_name
select 2,  'B' from dual union all
select 3,  'B' from dual union all
select 3,  'B' from dual union all
select 4,  'B' from dual union all
select 5,  'B' from dual union all
select 5,  'B' from dual union all
select 5,  'B' from dual union all
select 6,  'G' from dual union all
select 7,  'B' from dual union all
select 7,  'B' from dual union all
select 7,  'B' from dual union all
select 8,  'B' from dual union all
select 8,  'B' from dual union all
select 8,  'B' from dual union all
select 12, 'G' from dual union all
select 12, 'G' from dual union all
select 12, 'G' from dual union all
select 12, 'G' from dual union all
select 12, 'G' from dual union all
select 12, 'G' from dual union all
select 12, 'G' from dual union all
select 12, 'G' from dual union all
select 13, 'G' from dual union all
select 13, 'G' from dual union all
select 13, 'B' from dual union all
select 13, 'G' from dual union all
select 13, 'G' from dual union all
select 13, 'G' from dual union all
select 13, 'G' from dual union all
select 13, 'G' from dual union all
select 14, 'B' from dual union all
select 14, 'G' from dual union all
select 14, 'G' from dual;
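
For anyone without an Oracle instance at hand, the query can be tried against the same sample rows with Python's built-in sqlite3 module. This is only a sketch under a few assumptions: the bundled SQLite is 3.25 or newer (needed for window functions), the ratio is multiplied by 1.0 because SQLite would otherwise do integer division, the comparison is written as >= to follow the "at least" wording of the question, and the names ROWS, QUERY and run_threshold_query are made up for this example.

```python
import sqlite3

# Sample rows (x, y) copied from the question.
ROWS = (
    [(2, "B"), (3, "B"), (3, "B"), (4, "B"), (5, "B"), (5, "B"), (5, "B"), (6, "G")]
    + [(7, "B")] * 3 + [(8, "B")] * 3
    + [(12, "G")] * 8
    + [(13, "G"), (13, "G"), (13, "B")] + [(13, "G")] * 5
    + [(14, "B"), (14, "G"), (14, "G")]
)

# Same shape as the Oracle query; "* 1.0" forces floating-point division in SQLite.
QUERY = """
select max(next_x) from
(
    select
        count(case when y = 'B' then 1 end) over (order by x) as correct,
        count(case when y = 'G' then 1 end) over (order by x) as wrong,
        lead(x) over (order by x) as next_x
    from table_name
)
where correct * 1.0 / (correct + wrong) >= ?
"""


def run_threshold_query(rows, min_ratio):
    """Load rows into an in-memory SQLite table and run the threshold query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table table_name (x integer, y text)")
        conn.executemany("insert into table_name values (?, ?)", rows)
        return conn.execute(QUERY, (min_ratio,)).fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_threshold_query(ROWS, 0.99))
    print(run_threshold_query(ROWS, 0.90))
```

On the sample data this returns 6 for a threshold of 0.99 and 12 for 0.90, which matches the values expected in the question.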

OTHER TIPS

OK, I think this accomplishes what you want to do, but it will not scale to the data volume you mention. I'm posting it anyway in case it helps someone else provide an answer.

This may be one of those cases where the most efficient way is to use a cursor over sorted data. Oracle has some built-in functions for correlation analysis, but I've never worked with them, so I don't know how they work.

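select max(x)
  from (select x
              ,y
              ,num_less
              ,num_b
              ,num_b / nullif(num_less,0) as percent_b
          from (select x
                      ,y
                      -- rows with a smaller x, and how many of those are 'B'
                      -- (table_name stands in for the real table; TABLE itself is a reserved word)
                      ,(select count(*) from table_name b where b.x < a.x) as num_less
                      ,(select count(*) from table_name b where b.x < a.x and b.y = 'B') as num_b
                  from table_name a
               )
         where num_b / nullif(num_less,0) >= 0.99
        );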

The inner select does the following for every value of X:

  • Counts the number of rows with a value less than X
  • Counts how many of those rows are 'B'

The next select computes the ratio of 'B' rows and keeps only the rows where that ratio is at or above the threshold. The outer select then picks max(x) from the remaining rows.
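
The three steps described above can be traced on the question's sample data with a small Python sketch. It is a literal, deliberately quadratic transcription of the query, not a replacement for it; the function name threshold_by_counting and the ROWS list are made up for this illustration.

```python
# Sample rows (x, y) copied from the question.
ROWS = (
    [(2, "B"), (3, "B"), (3, "B"), (4, "B"), (5, "B"), (5, "B"), (5, "B"), (6, "G")]
    + [(7, "B")] * 3 + [(8, "B")] * 3
    + [(12, "G")] * 8
    + [(13, "G"), (13, "G"), (13, "B")] + [(13, "G")] * 5
    + [(14, "B"), (14, "G"), (14, "G")]
)


def threshold_by_counting(rows, min_ratio):
    """Mirror the three query levels: count, compute the ratio, take max(x)."""
    best = None
    for x, _ in rows:
        # Inner select: rows strictly below x, and how many of those are 'B'.
        num_less = sum(1 for other_x, _ in rows if other_x < x)
        num_b = sum(1 for other_x, other_y in rows if other_x < x and other_y == "B")
        # nullif(num_less, 0) turns the ratio into NULL, so such rows are dropped.
        if num_less == 0:
            continue
        # Middle select: keep the row only if the ratio meets the threshold.
        if num_b / num_less >= min_ratio:
            # Outer select: max(x) over the surviving rows.
            best = x if best is None else max(best, x)
    return best


if __name__ == "__main__":
    print(threshold_by_counting(ROWS, 0.99))
    print(threshold_by_counting(ROWS, 0.90))
```

For the sample data the ratios are 7/7 below 6, 7/8 below 7 and 13/14 below 12, so the function returns 6 at a threshold of 0.99 and 12 at 0.90.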

Edit: The part of the above query that does not scale is the semi-cartesian self-join: the two correlated subqueries effectively join every row against all rows with a smaller X.
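
One way to avoid the repeated counting, in the spirit of the "cursor with sorted data" idea mentioned earlier, is to sort once and keep running totals. The following Python sketch assumes the rows fit in memory and uses a made-up function name, threshold_single_pass; it illustrates the idea only and is not Oracle cursor code.

```python
from itertools import groupby

# Sample rows (x, y) copied from the question.
ROWS = (
    [(2, "B"), (3, "B"), (3, "B"), (4, "B"), (5, "B"), (5, "B"), (5, "B"), (6, "G")]
    + [(7, "B")] * 3 + [(8, "B")] * 3
    + [(12, "G")] * 8
    + [(13, "G"), (13, "G"), (13, "B")] + [(13, "G")] * 5
    + [(14, "B"), (14, "G"), (14, "G")]
)


def threshold_single_pass(rows, min_ratio):
    """Largest x whose strictly smaller rows are at least min_ratio 'B'."""
    seen = 0      # rows with a smaller x than the current group
    seen_b = 0    # how many of those rows are 'B'
    best = None
    # Sort once, then visit each distinct x together with all of its rows.
    for x, group in groupby(sorted(rows), key=lambda row: row[0]):
        # Test the totals before adding this group, so only smaller x values count.
        if seen and seen_b >= min_ratio * seen:
            best = x  # groups arrive in ascending order, so a later hit is larger
        for _, y in group:
            seen += 1
            seen_b += (y == "B")
    return best


if __name__ == "__main__":
    print(threshold_single_pass(ROWS, 0.99))
    print(threshold_single_pass(ROWS, 0.90))
```

Each row is touched once after the sort, so the cost is dominated by the sort rather than by a count per row. On the sample data it gives 6 for 0.99 and 12 for 0.90.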

Give this a try and share the results:

Assuming the table is named table_name and the columns are x and y:

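with tab as (
    select -- running row count within each y group, as a percentage of the
           -- number of 'B' rows in that group; nullif avoids a divide-by-zero
           -- error for groups that contain no 'B' rows
           count(x) over (partition by y order by x
                          rows between unbounded preceding and current row)
           / nullif(count(case when y = 'B' then 1 end) over (partition by y), 0) * 100 as cc,
           x,
           y
      from table_name
)
select x, y
  from (select min(cc) over (partition by y) as min_cc, x, cc, y
          from tab
         where cc >= 99)
 where min_cc = cc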
Licensed under: CC-BY-SA with attribution