Searching for continuous range of numbers, while ignoring gaps <= 5

https://stackoverflow.com/questions/23250784

08-07-2023

Question

I am trying to find continuous ranges of numeric values in a dataset in MySQL. However, "gaps" of 5 or fewer values within a range should be ignored. Below is my current code (which works up to a point), split into smaller parts for convenience.

The dataset table contains a "thetime" and a "number" column (both numeric). The final goal is to get all the ranges of "thetime" for which number > 200.

(1) First I select the "gaps" in my dataset, by selecting every "thetime" that has number <= 200.

drop temporary table if exists tmp_gaps;
create temporary table tmp_gaps as 
    (select thetime
    from `dataset` 
    where number <= 200);

(2) I'm partitioning these found gaps in ranges, according to the method explained here.

drop temporary table if exists tmp_gaps_withdelta;
create temporary table tmp_gaps_withdelta as
    (select min(thetime) as start, max(thetime) as theend, max(thetime) - min(thetime) + 1 as delta
        from (select thetime, @curRow := @curRow + 1 as row_number
            from tmp_gaps v
                join (select @curRow := 0) w) v
        group by thetime - row_number);
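To illustrate why grouping by thetime - row_number isolates consecutive runs, here is a minimal Python sketch of the same idea. It is not the MySQL code: the function name consecutive_runs and the sample values are made up for illustration, and it assumes the input is already sorted ascending with no duplicates.

```python
from itertools import groupby

def consecutive_runs(values):
    """Return (start, end, delta) for each run of consecutive integers.

    Assumes `values` is sorted ascending and contains no duplicates.
    """
    runs = []
    # Inside a consecutive run, value - position stays constant,
    # so that difference can serve as the group key.
    numbered = enumerate(values, start=1)
    for _, group in groupby(numbered, key=lambda pair: pair[1] - pair[0]):
        members = [value for _, value in group]
        runs.append((members[0], members[-1], members[-1] - members[0] + 1))
    return runs

print(consecutive_runs([10, 11, 12, 20, 21, 30]))
# [(10, 12, 3), (20, 21, 2), (30, 30, 1)]
```

Each tuple corresponds to the start, theend and delta columns of the query above.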

(3) Now, I'm trying to filter out the gaps <= 5 by joining the original dataset table with tmp_gaps_withdelta. If delta <= 5 or delta is null (meaning there is no entry in tmp_gaps_withdelta corresponding to the original "thetime" in dataset), I consider "thetime" part of a range, and it gets accepted into db_tmp_ranges.

drop temporary table if exists db_tmp_ranges;
create temporary table db_tmp_ranges as 
    (select 
        case
            when gaps.delta is null 
                or gaps.delta <= 5 then edm.thetime
            else null
        end as thetime
    from `dataset` edm
        left join tmp_gaps_withdelta gaps on edm.thetime >= gaps.start
            and edm.thetime < gaps.start + gaps.delta);
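The filtering rule in step (3) can be sketched in Python as follows. This is only an illustration under assumptions: the function name keep_times, its parameters and the sample rows are hypothetical, the rows are assumed to be sorted by thetime, and dropped rows are simply omitted instead of being emitted as NULL as in the query above.

```python
def keep_times(rows, threshold=200, max_gap=5):
    """Return the thetime values that belong to a range.

    rows: list of (thetime, number) tuples, sorted by thetime.
    A row is kept if number > threshold, or if it lies in a run of
    low values (number <= threshold) spanning at most max_gap.
    """
    kept = []
    i = 0
    while i < len(rows):
        thetime, number = rows[i]
        if number > threshold:
            kept.append(thetime)
            i += 1
            continue
        # Extend j to the end of the run of consecutive low rows.
        j = i
        while (j + 1 < len(rows)
               and rows[j + 1][1] <= threshold
               and rows[j + 1][0] == rows[j][0] + 1):
            j += 1
        delta = rows[j][0] - rows[i][0] + 1
        if delta <= max_gap:
            # A short gap is treated as part of the surrounding range.
            kept.extend(t for t, _ in rows[i:j + 1])
        i = j + 1
    return kept

sample = [(1, 300), (2, 300), (3, 100), (4, 100), (5, 300),
          (6, 100), (7, 100), (8, 100), (9, 100), (10, 100),
          (11, 100), (12, 300)]
print(keep_times(sample))
# [1, 2, 3, 4, 5, 12]
```

In the sample, the two-value gap at 3..4 is absorbed into the range, while the six-value gap at 6..11 is dropped.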

Up to this point, everything works as expected. I now have a large set of "thetime" values where "number" from the original table is > 200. The data can be divided into ranges, without gaps <= 5. When I select some data from db_tmp_ranges, I'm getting what I'm expecting.

(4) The plan now is to partition the result into ranges, the same way as in (2).

select *
from
    (select min(thetime) as start, max(thetime) as theend, max(thetime) - min(thetime) + 1 as delta
    from (select thetime, @curRow := @curRow + 1 as row_number
        from db_tmp_ranges p
            join (select @curRow := 0) r
        where thetime is not null) p
    group by thetime - row_number) q

However, the results of this query are completely wrong. I honestly don't know where the fault lies, since this way of partitioning into intervals has always worked for me until now. Any help is greatly appreciated.

EDIT: a specific example of how the query reacts: db_tmp_ranges:

Result from last query:

...
1393001316  1393001319  4
1393001320  1393001591  272
1393001592  1393001595  4
1393001596  1393001881  286
...

As you can see, these numbers should be in 1 interval, instead of 4+. After using SQL fiddle, it appears the query itself isn't a problem.

I really don't get it. When executing...

select * 
from db_tmp_ranges 
where thetime >= 1393001313 
and thetime <= 1393001350 
order by thetime;

... I get a normal-looking list of numeric "thetime" values. But somehow the last query doesn't use db_tmp_ranges as it should.

Solution 2

After wondering a while why my last query works in SQL Fiddle, but not on my "real" MySQL database, I've found the solution.

When building the schema in SQL Fiddle, the thetime values are inserted in ascending order. However, the thetime values produced by the query in (3) come back in no guaranteed order. Because row_number depends on the order in which the rows are processed, the values have to be sorted before feeding them into the last query.

As a result, making the last query work requires the following change:

select *
from
    (select min(thetime) as start, max(thetime) as einde, max(thetime) - min(thetime) + 1 as delta
    from (select thetime, @curRow := @curRow + 1 as row_number
        from (select * from db_tmp_ranges where thetime is not null order by thetime) p
            join (select @curRow := 0) r) p
    group by thetime - row_number) q
order by start
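The effect of row order on this technique can be reproduced with a small Python sketch. The function name ranges_by_row_number and the sample values are invented for illustration. It emulates GROUP BY thetime - row_number by numbering the rows in whatever order they arrive.

```python
from collections import defaultdict

def ranges_by_row_number(values):
    """Emulate GROUP BY value - row_number, numbering rows in arrival order."""
    groups = defaultdict(list)
    for row_number, value in enumerate(values, start=1):
        groups[value - row_number].append(value)
    # One (start, end, delta) tuple per group, ordered by start.
    return sorted((min(g), max(g), max(g) - min(g) + 1) for g in groups.values())

shuffled = [103, 100, 102, 101, 104]
print(ranges_by_row_number(shuffled))
# [(100, 100, 1), (101, 101, 1), (102, 104, 3), (103, 103, 1)]
print(ranges_by_row_number(sorted(shuffled)))
# [(100, 104, 5)]
```

The same five consecutive values yield four fragmented (and even overlapping) intervals when numbered out of order, and a single interval once sorted.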

OTHER TIPS

The easiest way in MySQL is to use variables (in other databases, you can make use of window/analytic functions). The following assigns a grp column to numbers based on your rules:

select ds.*,
       -- start a new group when the step from the previous number exceeds 5
       @grp := if(number - @lastnumber <= 5, @grp, @grp + 1) as grp,
       @lastnumber := number
from dataset ds cross join
     (select @lastnumber := -1, @grp := 0) const
order by number;

To get the actual sequences:

select min(number), max(number), max(number) - min(number) as width,
       count(distinct number) as numNumbers
from (select ds.*,
             -- start a new group when the step from the previous number exceeds 5
             @grp := if(number - @lastnumber <= 5, @grp, @grp + 1) as grp,
             @lastnumber := number
      from dataset ds cross join
           (select @lastnumber := -1, @grp := 0) const
      order by number 
     ) ds
group by grp;
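The grouping rule behind the @grp variable can be sketched in Python as below. This is an illustration, not a translation of the query: the function name group_with_tolerance and the max_step parameter are hypothetical, and the rule assumed is "start a new group whenever the step from the previous value exceeds 5".

```python
def group_with_tolerance(values, max_step=5):
    """Group sorted values, tolerating steps of up to max_step.

    Returns one (min, max, width) tuple per group.
    """
    groups = []
    last = None
    for value in sorted(values):
        if last is None or value - last > max_step:
            groups.append([value, value])  # step too large: open a new group
        else:
            groups[-1][1] = value          # extend the current group
        last = value
    return [(low, high, high - low) for low, high in groups]

print(group_with_tolerance([1, 2, 3, 8, 9, 20, 21, 30]))
# [(1, 9, 8), (20, 21, 1), (30, 30, 0)]
```

Sorting inside the function plays the same role as order by number in the query: the comparison with the previous value only makes sense when rows are visited in ascending order.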

Licensed under: CC-BY-SA with attribution
