Select all overlapping ranges from starting range

https://dba.stackexchange.com/questions/215967

08-01-2021
|

Вопрос

My question is similar to this one with (I think) significant enough difference. I have a base range and I want to find all other ranges in a table that conflict with it and each other. or rather the items that form the ranges, but it doesn't really make a difference.

The line with a asterisk is the starting range. The ranges 1,2 and 3 are the ones that are supposed to expand it. The result range should be the X.

1 | 
3 |    ===1====            ====
5 | ==2==    ====*====           ====
6 |             ====3====     =====          
--+-------------------------------------
  | |<--------X-------->|

I have written this:

WITH   cte
AS (
    SELECT DATA1.ID, DATA1.STARTDATE, DATA1.ENDDATE
    FROM DATA1
    WHERE 
    DATA1.ID = @PARID AND 
    DATA1.STARTDATE > @STARTDATE AND 
    DATA1.ENDDATE < @ENDDATE 

    UNION ALL

    SELECT DATA1.ID, DATA1.STARTDATE, DATA1.ENDDATE
    FROM cte 
    INNER JOIN DATA1 ON (DATA1.ID = cte.ID)
    WHERE 
    DATA1.ID = @PARID AND 
    (cte.STARTDATE < DATA1.ENDDATE AND cte.ENDDATE > DATA1.ENDDATE)
)
SELECT *
FROM cte

But after some fiddling realized this can't really work, since the 2nd SELECT doesn't know what STARTDATE and ENDDATE it should use from the cte. And I can't use subqueries in recursive SQL.

We have previously calculated this in client in .NET via set of recursive functions, but at O(N^2) it was stupidly slow. The intent here is to move the calculations to the server as well as optimize the calculation.

Is what I'm trying to do here at all viable?

SQL Fiddle. I'm having some trouble getting the query to run as it did on my machine (where I used less simple data structure), but at least I added example data. I will keep trying to get it to produce the same result I got on my machine.

With input range from 2018-08-24 12:00:00 to 2018-08-31 12:00:00, the correct output should be IDs 2, 3, 4, 6.

Решение

A straightforward recursive solution splits the task in two parts:

Find the earliest start date by following the chain of overlapping ranges
Find the latest end date by following the chain of overlapping ranges

Details follow:

1. Follow the chain of overlaps leading to the earliest start date

Find the starting range
Find the overlapping range with the latest earlier start date than current
Go to step 2

Example code:

DECLARE @STARTDATE DATETIME = '2018-08-24 12:00:00';
DECLARE @ENDDATE DATETIME = '2018-08-31 12:00:00';

WITH MinStart AS
(
    -- Starting range
    SELECT TOP (1) 
        D.ID,
        D.STARTDATE
    FROM dbo.DATA1 AS D
    WHERE 
        D.STARTDATE >= @STARTDATE
        AND D.ENDDATE <= @ENDDATE
    ORDER BY 
        D.ID ASC

    UNION ALL

    SELECT
        P1.ID,
        P1.STARTDATE
    FROM 
    (
        -- Overlapping ranges with an earlier start date
        -- Numbered starting with the highest start date
        -- Tie-break dates using ID
        SELECT
            RN = ROW_NUMBER() OVER (
                ORDER BY D.STARTDATE DESC, D.ID ASC),
            D.ID,
            D.STARTDATE
        FROM MinStart AS R
        JOIN dbo.DATA1 AS D
            ON D.ENDDATE >= R.STARTDATE
            AND D.STARTDATE < R.STARTDATE
    ) AS P1
    WHERE
        -- Highest start date only
        P1.RN = 1
)
SELECT
    MS.ID,
    MS.STARTDATE
FROM MinStart AS MS
OPTION (MAXRECURSION 0);

2. Follow the chain of overlaps leading to the latest end date

Find the starting range
Find the overlapping range with the earliest later end date than current
Go to step 2

Example code:

DECLARE @STARTDATE DATETIME = '2018-08-24 12:00:00';
DECLARE @ENDDATE DATETIME = '2018-08-31 12:00:00';

WITH MaxEnd AS
(
    -- Starting range
    SELECT TOP (1) 
        D.ID, 
        D.ENDDATE
    FROM dbo.DATA1 AS D
    WHERE
        D.STARTDATE >= @STARTDATE
        AND D.ENDDATE <= @ENDDATE
    ORDER BY 
        D.ID ASC

    UNION ALL

    SELECT
        P1.ID,
        P1.ENDDATE
    FROM 
    (
        -- Overlapping ranges with a later end date
        -- Numbered starting with the lowest end date
        -- Tie-break dates using ID
        SELECT
            RN = ROW_NUMBER() OVER (
                ORDER BY D.ENDDATE ASC, D.ID ASC),
            D.ID,
            D.ENDDATE
        FROM MaxEnd AS R
        JOIN dbo.DATA1 AS D
            ON D.ENDDATE > R.ENDDATE
            AND D.STARTDATE < R.ENDDATE
    ) AS P1
    WHERE 
        -- Lowest end date only
        P1.RN = 1
)
SELECT
    ME.ID,
    ME.ENDDATE
FROM MaxEnd AS ME
OPTION (MAXRECURSION 0);

With suitable indexing, both queries use an execution plan like:

The results given the sample data provided are:

╔════╦═════════════════════════╗
║ ID ║        STARTDATE        ║
╠════╬═════════════════════════╣
║  3 ║ 2018-08-24 12:00:00.000 ║
║  6 ║ 2018-08-23 12:00:00.000 ║
║  2 ║ 2018-08-22 12:00:00.000 ║
╚════╩═════════════════════════╝
╔════╦═════════════════════════╗
║ ID ║         ENDDATE         ║
╠════╬═════════════════════════╣
║  3 ║ 2018-08-29 12:00:00.000 ║
║  6 ║ 2018-08-30 12:00:00.000 ║
║  4 ║ 2018-09-02 12:00:00.000 ║
╚════╩═════════════════════════╝

So IDs 2, 3, 4, and 6 were used. The final range is 2018-08-22 12:00:00.000 (lowest start date found) to 2018-09-02 12:00:00.000 (highest end date found).

Note: The code above includes a tie-break on dates (using the ID column) due to duplicates in the sample data. This may, or may not, be required to solve the real problem.

The indexes used were:

CREATE TABLE dbo.DATA1
(
  ID integer PRIMARY KEY,
  STARTDATE datetime NOT NULL,
  ENDDATE datetime NOT NULL,
);

CREATE UNIQUE INDEX i1 ON dbo.DATA1 (STARTDATE DESC, ID ASC) INCLUDE (ENDDATE);
CREATE UNIQUE INDEX i2 ON dbo.DATA1 (ENDDATE ASC, ID DESC) INCLUDE (STARTDATE);

Demo: db<>fiddle

Note that truly optimal indexing for these types of queries can be difficult. Even so, the approach above often performs well for many practical tasks in this area.

Until proper support for interval queries and predicates is added to SQL Server, higher performance solutions require significant extra work. See:

Interval Queries in SQL Server Part 4 by Dejan Sarka
Interval Queries in SQL Server by Itzik Ben-Gan

Другие советы

Eagerly awaiting an SQL Fiddle to allow testing, please have a look at http://sqlfiddle.com/#!18/10ab4/5

DECLARE @STARTDATE DATETIME = '2018-08-24 12:00:00';
DECLARE @ENDDATE DATETIME = '2018-08-29 12:00:00';
declare @PARID int = 3;

with ghp as (
  SELECT DATA1.ID, DATA1.STARTDATE, DATA1.ENDDATE
    from DATA1
    where ID = @PARID ),
     ghp2 as (
 select 'A' class, min(data1.startdate) minstart, max(data1.enddate) maxend
    FROM DATA1, ghp
    WHERE data1.enddate >= ghp.startdate
      and data1.startdate <= ghp.enddate
 union all
 select 'B', data1.startdate, data1.enddate
    from data1, ghp2
    WHERE (    data1.enddate   >= ghp2.minstart
           and data1.startdate <  ghp2.minstart )),
     ghp3 as (
 select 'C' class, min(data1.startdate) minstart, max(data1.enddate) maxend
    FROM DATA1, ghp
    WHERE data1.enddate >= ghp.startdate
      and data1.startdate <= ghp.enddate
 union all
 select 'D', data1.startdate, data1.enddate
    from data1, ghp2
    WHERE (    data1.enddate   > ghp2.maxend
           and data1.startdate <=  ghp2.maxend )
 )
select * from ghp2
union all
select * from ghp3

I present my solution for T-SQL.

Fiddle SQL Server 2017

The problem:

Find all records of an input table that are connected in time of which at least 1 item overlaps with a requested period.

The solution:

Step 1: determine date ranges for which all concerning items are somehow connected in time (by overlap or meet).

with packed as (
    select  min(StartDate) StartDate, max(EndDate) EndDate from (
        select StartDate, EndDate, sum(GroupStart) over (order by StartDate) Grp from (
            select StartDate, EndDate
                , case when max(EndDate) 
                    over (order by EndDate, StartDate rows between unbounded preceding and 1 preceding)
                        > StartDate
                    then 0 else 1 end GroupStart
            from dbo.DATA1
        ) p0
    ) p1
    group by Grp
)

Step 2: Determine the relevant date range given your requested period

relevant as (
        select * from packed
        where StartDate < @ENDDATE
        and @STARTDATE < EndDate
)

Step 3: Output all items that are connected in the relevant date range

select d.* from dbo.DATA1 d
inner join relevant r on d.StartDate < r.EndDate and r.StartDate < d.EndDate
where d.StartDate < @ENDDATE and @STARTDATE < d.EndDate
order by d.StartDate, d.EndDate, d.Id;

Setup:

CREATE TABLE dbo.DATA1
(
  ID integer PRIMARY KEY,
  STARTDATE datetime NOT NULL,
  ENDDATE datetime NOT NULL,
);

CREATE UNIQUE INDEX i1 ON dbo.DATA1 (STARTDATE DESC, ID ASC) INCLUDE (ENDDATE);
CREATE UNIQUE INDEX i2 ON dbo.DATA1 (ENDDATE ASC, ID DESC) INCLUDE (STARTDATE);

INSERT INTO DATA1 (ID, STARTDATE, ENDDATE) VALUES 
  (1, '2018-08-17 12:00:00', '2018-08-20 12:00:00'),
  (2, '2018-08-22 12:00:00', '2018-08-25 12:00:00'),
  (3, '2018-08-24 12:00:00', '2018-08-29 12:00:00'),
  (4, '2018-08-26 12:00:00', '2018-09-02 12:00:00'),
  (5, '2018-09-05 12:00:00', '2018-09-25 12:00:00'),
  (6, '2018-08-23 12:00:00', '2018-08-30 12:00:00'),
  (7, '2018-08-17 12:00:00', '2018-08-20 12:00:00'),
  (8, '2018-08-22 12:00:00', '2018-08-25 12:00:00'),
  (9, '2018-08-24 12:00:00', '2018-08-29 12:00:00'),
  (10, '2018-08-26 12:00:00', '2018-09-02 12:00:00'),
  (11, '2018-09-05 12:00:00', '2018-09-25 12:00:00'),
  (12, '2018-08-23 12:00:00', '2018-08-30 12:00:00');

Full Query:

DECLARE @STARTDATE DATETIME = '2018-08-24 12:00:00';
DECLARE @ENDDATE DATETIME = '2018-08-31 12:00:00';

with packed as (
    select  min(StartDate) StartDate, max(EndDate) EndDate from (
        select StartDate, EndDate, sum(GroupStart) over (order by StartDate) Grp from (
            select StartDate, EndDate
                , case when max(EndDate) 
                    over (order by EndDate, StartDate rows between unbounded preceding and 1 preceding)
                        > StartDate
                    then 0 else 1 end GroupStart
            from dbo.DATA1
        ) p0
    ) p1
    group by Grp
)
, relevant as (
        select * from packed
        where StartDate < @ENDDATE
        and @STARTDATE < EndDate
)
select d.* from dbo.DATA1 d
inner join relevant r on d.StartDate < r.EndDate and r.StartDate < d.EndDate
where d.StartDate < @ENDDATE and @STARTDATE < d.EndDate
order by d.StartDate, d.EndDate, d.Id;

Лицензировано под: CC-BY-SA с атрибуция

Не связан с dba.stackexchange