Select all overlapping ranges from starting range
-
08-01-2021 - |
Вопрос
My question is similar to this one with (I think) significant enough difference. I have a base range and I want to find all other ranges in a table that conflict with it and each other. or rather the items that form the ranges, but it doesn't really make a difference.
The line with a asterisk is the starting range. The ranges 1,2 and 3 are the ones that are supposed to expand it. The result range should be the X.
1 |
3 | ===1==== ====
5 | ==2== ====*==== ====
6 | ====3==== =====
--+-------------------------------------
| |<--------X-------->|
I have written this:
WITH cte
AS (
SELECT DATA1.ID, DATA1.STARTDATE, DATA1.ENDDATE
FROM DATA1
WHERE
DATA1.ID = @PARID AND
DATA1.STARTDATE > @STARTDATE AND
DATA1.ENDDATE < @ENDDATE
UNION ALL
SELECT DATA1.ID, DATA1.STARTDATE, DATA1.ENDDATE
FROM cte
INNER JOIN DATA1 ON (DATA1.ID = cte.ID)
WHERE
DATA1.ID = @PARID AND
(cte.STARTDATE < DATA1.ENDDATE AND cte.ENDDATE > DATA1.ENDDATE)
)
SELECT *
FROM cte
But after some fiddling realized this can't really work, since the 2nd SELECT doesn't know what STARTDATE
and ENDDATE
it should use from the cte
. And I can't use subqueries in recursive SQL.
We have previously calculated this in client in .NET via set of recursive functions, but at O(N^2) it was stupidly slow. The intent here is to move the calculations to the server as well as optimize the calculation.
Is what I'm trying to do here at all viable?
SQL Fiddle. I'm having some trouble getting the query to run as it did on my machine (where I used less simple data structure), but at least I added example data. I will keep trying to get it to produce the same result I got on my machine.
With input range from 2018-08-24 12:00:00
to 2018-08-31 12:00:00
, the correct output should be IDs 2, 3, 4, 6
.
Решение
A straightforward recursive solution splits the task in two parts:
- Find the earliest start date by following the chain of overlapping ranges
- Find the latest end date by following the chain of overlapping ranges
Details follow:
1. Follow the chain of overlaps leading to the earliest start date
- Find the starting range
- Find the overlapping range with the latest earlier start date than current
- Go to step 2
Example code:
DECLARE @STARTDATE DATETIME = '2018-08-24 12:00:00';
DECLARE @ENDDATE DATETIME = '2018-08-31 12:00:00';
WITH MinStart AS
(
-- Starting range
SELECT TOP (1)
D.ID,
D.STARTDATE
FROM dbo.DATA1 AS D
WHERE
D.STARTDATE >= @STARTDATE
AND D.ENDDATE <= @ENDDATE
ORDER BY
D.ID ASC
UNION ALL
SELECT
P1.ID,
P1.STARTDATE
FROM
(
-- Overlapping ranges with an earlier start date
-- Numbered starting with the highest start date
-- Tie-break dates using ID
SELECT
RN = ROW_NUMBER() OVER (
ORDER BY D.STARTDATE DESC, D.ID ASC),
D.ID,
D.STARTDATE
FROM MinStart AS R
JOIN dbo.DATA1 AS D
ON D.ENDDATE >= R.STARTDATE
AND D.STARTDATE < R.STARTDATE
) AS P1
WHERE
-- Highest start date only
P1.RN = 1
)
SELECT
MS.ID,
MS.STARTDATE
FROM MinStart AS MS
OPTION (MAXRECURSION 0);
2. Follow the chain of overlaps leading to the latest end date
- Find the starting range
- Find the overlapping range with the earliest later end date than current
- Go to step 2
Example code:
DECLARE @STARTDATE DATETIME = '2018-08-24 12:00:00';
DECLARE @ENDDATE DATETIME = '2018-08-31 12:00:00';
WITH MaxEnd AS
(
-- Starting range
SELECT TOP (1)
D.ID,
D.ENDDATE
FROM dbo.DATA1 AS D
WHERE
D.STARTDATE >= @STARTDATE
AND D.ENDDATE <= @ENDDATE
ORDER BY
D.ID ASC
UNION ALL
SELECT
P1.ID,
P1.ENDDATE
FROM
(
-- Overlapping ranges with a later end date
-- Numbered starting with the lowest end date
-- Tie-break dates using ID
SELECT
RN = ROW_NUMBER() OVER (
ORDER BY D.ENDDATE ASC, D.ID ASC),
D.ID,
D.ENDDATE
FROM MaxEnd AS R
JOIN dbo.DATA1 AS D
ON D.ENDDATE > R.ENDDATE
AND D.STARTDATE < R.ENDDATE
) AS P1
WHERE
-- Lowest end date only
P1.RN = 1
)
SELECT
ME.ID,
ME.ENDDATE
FROM MaxEnd AS ME
OPTION (MAXRECURSION 0);
With suitable indexing, both queries use an execution plan like:
The results given the sample data provided are:
╔════╦═════════════════════════╗ ║ ID ║ STARTDATE ║ ╠════╬═════════════════════════╣ ║ 3 ║ 2018-08-24 12:00:00.000 ║ ║ 6 ║ 2018-08-23 12:00:00.000 ║ ║ 2 ║ 2018-08-22 12:00:00.000 ║ ╚════╩═════════════════════════╝ ╔════╦═════════════════════════╗ ║ ID ║ ENDDATE ║ ╠════╬═════════════════════════╣ ║ 3 ║ 2018-08-29 12:00:00.000 ║ ║ 6 ║ 2018-08-30 12:00:00.000 ║ ║ 4 ║ 2018-09-02 12:00:00.000 ║ ╚════╩═════════════════════════╝
So IDs 2, 3, 4, and 6 were used. The final range is 2018-08-22 12:00:00.000
(lowest start date found) to 2018-09-02 12:00:00.000
(highest end date found).
Note: The code above includes a tie-break on dates (using the ID
column) due to duplicates in the sample data. This may, or may not, be required to solve the real problem.
The indexes used were:
CREATE TABLE dbo.DATA1
(
ID integer PRIMARY KEY,
STARTDATE datetime NOT NULL,
ENDDATE datetime NOT NULL,
);
CREATE UNIQUE INDEX i1 ON dbo.DATA1 (STARTDATE DESC, ID ASC) INCLUDE (ENDDATE);
CREATE UNIQUE INDEX i2 ON dbo.DATA1 (ENDDATE ASC, ID DESC) INCLUDE (STARTDATE);
Demo: db<>fiddle
Note that truly optimal indexing for these types of queries can be difficult. Even so, the approach above often performs well for many practical tasks in this area.
Until proper support for interval queries and predicates is added to SQL Server, higher performance solutions require significant extra work. See:
- Interval Queries in SQL Server Part 4 by Dejan Sarka
- Interval Queries in SQL Server by Itzik Ben-Gan
Другие советы
Eagerly awaiting an SQL Fiddle to allow testing, please have a look at http://sqlfiddle.com/#!18/10ab4/5
DECLARE @STARTDATE DATETIME = '2018-08-24 12:00:00';
DECLARE @ENDDATE DATETIME = '2018-08-29 12:00:00';
declare @PARID int = 3;
with ghp as (
SELECT DATA1.ID, DATA1.STARTDATE, DATA1.ENDDATE
from DATA1
where ID = @PARID ),
ghp2 as (
select 'A' class, min(data1.startdate) minstart, max(data1.enddate) maxend
FROM DATA1, ghp
WHERE data1.enddate >= ghp.startdate
and data1.startdate <= ghp.enddate
union all
select 'B', data1.startdate, data1.enddate
from data1, ghp2
WHERE ( data1.enddate >= ghp2.minstart
and data1.startdate < ghp2.minstart )),
ghp3 as (
select 'C' class, min(data1.startdate) minstart, max(data1.enddate) maxend
FROM DATA1, ghp
WHERE data1.enddate >= ghp.startdate
and data1.startdate <= ghp.enddate
union all
select 'D', data1.startdate, data1.enddate
from data1, ghp2
WHERE ( data1.enddate > ghp2.maxend
and data1.startdate <= ghp2.maxend )
)
select * from ghp2
union all
select * from ghp3
I present my solution for T-SQL.
The problem:
Find all records of an input table that are connected in time of which at least 1 item overlaps with a requested period.
The solution:
Step 1: determine date ranges for which all concerning items are somehow connected in time (by overlap or meet).
with packed as (
select min(StartDate) StartDate, max(EndDate) EndDate from (
select StartDate, EndDate, sum(GroupStart) over (order by StartDate) Grp from (
select StartDate, EndDate
, case when max(EndDate)
over (order by EndDate, StartDate rows between unbounded preceding and 1 preceding)
> StartDate
then 0 else 1 end GroupStart
from dbo.DATA1
) p0
) p1
group by Grp
)
Step 2: Determine the relevant date range given your requested period
relevant as (
select * from packed
where StartDate < @ENDDATE
and @STARTDATE < EndDate
)
Step 3: Output all items that are connected in the relevant date range
select d.* from dbo.DATA1 d
inner join relevant r on d.StartDate < r.EndDate and r.StartDate < d.EndDate
where d.StartDate < @ENDDATE and @STARTDATE < d.EndDate
order by d.StartDate, d.EndDate, d.Id;
Setup:
CREATE TABLE dbo.DATA1
(
ID integer PRIMARY KEY,
STARTDATE datetime NOT NULL,
ENDDATE datetime NOT NULL,
);
CREATE UNIQUE INDEX i1 ON dbo.DATA1 (STARTDATE DESC, ID ASC) INCLUDE (ENDDATE);
CREATE UNIQUE INDEX i2 ON dbo.DATA1 (ENDDATE ASC, ID DESC) INCLUDE (STARTDATE);
INSERT INTO DATA1 (ID, STARTDATE, ENDDATE) VALUES
(1, '2018-08-17 12:00:00', '2018-08-20 12:00:00'),
(2, '2018-08-22 12:00:00', '2018-08-25 12:00:00'),
(3, '2018-08-24 12:00:00', '2018-08-29 12:00:00'),
(4, '2018-08-26 12:00:00', '2018-09-02 12:00:00'),
(5, '2018-09-05 12:00:00', '2018-09-25 12:00:00'),
(6, '2018-08-23 12:00:00', '2018-08-30 12:00:00'),
(7, '2018-08-17 12:00:00', '2018-08-20 12:00:00'),
(8, '2018-08-22 12:00:00', '2018-08-25 12:00:00'),
(9, '2018-08-24 12:00:00', '2018-08-29 12:00:00'),
(10, '2018-08-26 12:00:00', '2018-09-02 12:00:00'),
(11, '2018-09-05 12:00:00', '2018-09-25 12:00:00'),
(12, '2018-08-23 12:00:00', '2018-08-30 12:00:00');
Full Query:
DECLARE @STARTDATE DATETIME = '2018-08-24 12:00:00';
DECLARE @ENDDATE DATETIME = '2018-08-31 12:00:00';
with packed as (
select min(StartDate) StartDate, max(EndDate) EndDate from (
select StartDate, EndDate, sum(GroupStart) over (order by StartDate) Grp from (
select StartDate, EndDate
, case when max(EndDate)
over (order by EndDate, StartDate rows between unbounded preceding and 1 preceding)
> StartDate
then 0 else 1 end GroupStart
from dbo.DATA1
) p0
) p1
group by Grp
)
, relevant as (
select * from packed
where StartDate < @ENDDATE
and @STARTDATE < EndDate
)
select d.* from dbo.DATA1 d
inner join relevant r on d.StartDate < r.EndDate and r.StartDate < d.EndDate
where d.StartDate < @ENDDATE and @STARTDATE < d.EndDate
order by d.StartDate, d.EndDate, d.Id;