Split intervals
13-03-2021
Question
I have two tables. Each holds some attributes for a business entity and the date range for which those attributes were valid. I want to combine these tables into one, matching rows on the common business key and splitting the time ranges.
The real-world example is two source temporal tables feeding a type-2 dimension table in the data warehouse.
The entity can be present in neither, one or both of the source systems at any point in time. Once an entity is recorded in a source system the intervals are well-behaved - no gaps, duplicates or other monkey business. Membership in the sources can end at different dates.
The business rules state we only want to return intervals where the entity is present in both sources simultaneously.
What query will give this result?
This illustrates the situation:
Month        J     F     M     A     M     J     J
Source A:  <--><----------><----------><---->
Source B:            <----><----><----------------><-->
Result:              <----><----><----><---->
Sample Data
For simplicity I've used closed date intervals; likely any solution could be extended to half-open intervals with a little typing.
drop table if exists dbo.SourceA;
drop table if exists dbo.SourceB;
go
create table dbo.SourceA
(
    BusinessKey int,
    StartDate   date,
    EndDate     date,
    Attribute   char(9)
);
create table dbo.SourceB
(
    BusinessKey int,
    StartDate   date,
    EndDate     date,
    Attribute   char(9)
);
go
insert dbo.SourceA(BusinessKey, StartDate, EndDate, Attribute)
values
    (1, '19990101', '19990113', 'black'),
    (1, '19990114', '19990313', 'red'),
    (1, '19990314', '19990513', 'blue'),
    (1, '19990514', '19990613', 'green'),
    (2, '20110714', '20110913', 'pink'),
    (2, '20110914', '20111113', 'white'),
    (2, '20111114', '20111213', 'gray');
insert dbo.SourceB(BusinessKey, StartDate, EndDate, Attribute)
values
    (1, '19990214', '19990313', 'left'),
    (1, '19990314', '19990413', 'right'),
    (1, '19990414', '19990713', 'centre'),
    (1, '19990714', '19990730', 'back'),
    (2, '20110814', '20110913', 'top'),
    (2, '20110914', '20111013', 'middle'),
    (2, '20111014', '20120113', 'bottom');
Desired output
BusinessKey StartDate  EndDate    a_Colour  b_Placement
----------- ---------- ---------- --------- -----------
1           1999-02-14 1999-03-13 red       left
1           1999-03-14 1999-04-13 blue      right
1           1999-04-14 1999-05-13 blue      centre
1           1999-05-14 1999-06-13 green     centre
2           2011-08-14 2011-09-13 pink      top
2           2011-09-14 2011-10-13 white     middle
2           2011-10-14 2011-11-13 white     bottom
2           2011-11-14 2011-12-13 gray      bottom
Solution
I may have misunderstood your question, but this query appears to produce the result you asked for:
select a.businesskey
     -- greatest(a.startdate, b.startdate)
     , case when a.startdate > b.startdate
            then a.startdate
            else b.startdate
       end as startdate
     -- least(a.enddate, b.enddate)
     , case when a.enddate < b.enddate
            then a.enddate
            else b.enddate
       end as enddate
     , a.attribute as a_color
     , b.attribute as b_placement
from dbo.SourceA a
join dbo.SourceB b
    on a.businesskey = b.businesskey
   and (a.startdate between b.startdate and b.enddate
     or b.startdate between a.startdate and a.enddate)
order by 1,2
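Since the intervals need to overlap, most of the work can be done by a join that uses the overlap test as its predicate. After that it's just a matter of taking the intersection of each matched pair of intervals: the later of the two start dates and the earlier of the two end dates.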
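The overlap test in the join can also be written in a shorter, symmetric form. The sketch below is not taken from the answer above. It assumes every interval is closed and has StartDate <= EndDate, in which case both forms of the predicate should accept the same pairs of rows. The table variables @A and @B are illustrative stand-ins holding a few of the sample rows for business key 1.

```sql
-- Sketch only: assumes closed intervals with StartDate <= EndDate.
-- @A and @B are hypothetical stand-ins for dbo.SourceA and dbo.SourceB.
declare @A table (BusinessKey int, StartDate date, EndDate date, Attribute char(9));
declare @B table (BusinessKey int, StartDate date, EndDate date, Attribute char(9));

insert @A values (1, '19990114', '19990313', 'red'),
                 (1, '19990314', '19990513', 'blue');
insert @B values (1, '19990214', '19990313', 'left'),
                 (1, '19990314', '19990413', 'right'),
                 (1, '19990414', '19990713', 'centre');

select a.BusinessKey,
       a.Attribute as a_Colour,
       b.Attribute as b_Placement
from @A as a
join @B as b
    on  a.BusinessKey = b.BusinessKey
    and a.StartDate <= b.EndDate   -- a begins no later than b ends
    and b.StartDate <= a.EndDate   -- b begins no later than a ends
order by a.StartDate, b.StartDate;
```

With these rows the join should return three pairs: red/left, blue/right and blue/centre.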
LEAST and GREATEST do not seem to be available as functions, so I used CASE expressions instead.
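When more than two values have to be compared, CASE expressions get long. One common workaround, shown here as a sketch rather than as part of the answer, is a subquery over a VALUES row constructor. SQL Server 2022 added built-in GREATEST and LEAST functions, so on that version or later they can be used directly. The four variables are hypothetical stand-ins for one overlapping pair of source rows (blue and centre from the sample data).

```sql
-- Hypothetical stand-ins for one overlapping pair of source rows.
declare @a_start date = '19990314', @a_end date = '19990513';
declare @b_start date = '19990414', @b_end date = '19990713';

select
    -- later of the two start dates (the role GREATEST would play)
    StartDate = (select max(d) from (values (@a_start), (@b_start)) as v (d)),
    -- earlier of the two end dates (the role LEAST would play)
    EndDate   = (select min(d) from (values (@a_end), (@b_end)) as v (d));
```

For this pair the result is 1999-04-14 to 1999-05-13, which matches the third row of the desired output.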
Other answers
This solution reduces the source intervals to just their start dates. Combining the two lists gives the set of output interval start dates. From these, a window function calculates the corresponding output end dates. Because the final output interval must end when either of the two sources ends, that value needs special handling.
;with Dates as
(
    select BusinessKey, StartDate
    from dbo.SourceA
    union
    select BusinessKey, StartDate
    from dbo.SourceB
    union
    select x.BusinessKey, DATEADD(DAY, 1, MIN(x.EndDate))
    from
    (
        select BusinessKey, EndDate = MAX(EndDate)
        from dbo.SourceA
        group by BusinessKey
        union all
        select BusinessKey, EndDate = MAX(EndDate)
        from dbo.SourceB
        group by BusinessKey
    ) as x
    group by x.BusinessKey
),
Intervals as
(
    select
        dt.BusinessKey,
        dt.StartDate,
        EndDate = lead (DATEADD(DAY, -1, dt.StartDate), 1)
                  over (partition by dt.BusinessKey order by dt.StartDate)
    from Dates as dt
)
select
    i.BusinessKey,
    i.StartDate,
    i.EndDate,
    a_Colour = a.Attribute,
    b_Placement = b.Attribute
from Intervals as i
inner join dbo.SourceA as a
    on i.BusinessKey = a.BusinessKey
    and i.StartDate between a.StartDate and a.EndDate
inner join dbo.SourceB as b
    on i.BusinessKey = b.BusinessKey
    and i.StartDate between b.StartDate and b.EndDate
where i.EndDate is not NULL
order by
    i.BusinessKey,
    i.StartDate;
The "Dates" CTE uses UNION rather than UNION ALL to eliminate duplicates. If both sources change on the same date we want only one corresponding output row.
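The effect of UNION here can be seen on the sample data, where both sources have a row starting on 1999-03-14 for business key 1. The following is a minimal sketch: the table variables @A and @B are illustrative and hold only the start dates for that key.

```sql
-- Start dates for business key 1, copied from the sample data.
-- @A and @B are hypothetical stand-ins for the two source tables.
declare @A table (StartDate date);
declare @B table (StartDate date);
insert @A values ('19990101'), ('19990114'), ('19990314'), ('19990514');
insert @B values ('19990214'), ('19990314'), ('19990414'), ('19990714');

-- UNION ALL keeps 1999-03-14 twice: 8 rows.
select count(*) as RowsWithUnionAll
from (select StartDate from @A union all select StartDate from @B) as x;

-- UNION removes the duplicate: 7 rows.
select count(*) as RowsWithUnion
from (select StartDate from @A union select StartDate from @B) as x;
```

With UNION ALL the duplicated date would produce two boundary rows for the same day, and the window function would then build a spurious interval between them.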
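Because we want to close the output when either source closes, the third query in "Dates" adds the earliest final end date, i.e. the MIN of the per-source MAX(EndDate) values. As it is an EndDate masquerading as a StartDate, it must have one day added to it. Its purpose is to let the window function calculate the end of the preceding interval. The row it produces never reaches the output: it is removed by the IS NOT NULL predicate when it is the last date for the key, and otherwise by the inner joins, since the source that closed first has no interval containing it.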
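To see what the window function does with these boundary dates, the sketch below runs the same LEAD expression over the distinct dates for business key 1. The table variable @Dates is illustrative. Its contents were worked out by hand from the sample data, including the extra 1999-06-14 row (the earlier of the two final end dates, 1999-06-13, plus one day).

```sql
-- Distinct boundary dates for business key 1, derived from the sample data.
-- 1999-06-14 is the added row: MIN of the two MAX(EndDate) values, plus one day.
declare @Dates table (StartDate date);
insert @Dates values
    ('19990101'), ('19990114'), ('19990214'), ('19990314'),
    ('19990414'), ('19990514'), ('19990614'), ('19990714');

select
    StartDate,
    -- the day before the next boundary closes the current interval
    EndDate = lead(dateadd(day, -1, StartDate)) over (order by StartDate)
from @Dates
order by StartDate;
```

The 1999-05-14 row gets an EndDate of 1999-06-13 only because the 1999-06-14 row follows it. The last row, 1999-07-14, has no successor, so its EndDate is NULL.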
Using inner joins for the final query eliminates those source intervals for which there is no corresponding value in the other source.