Question

I have two tables. Each holds some attributes for a business entity and the date range for which those attributes were valid. I want to combine these tables into one, matching rows on the common business key and splitting the time ranges.

The real-world example is two source temporal tables feeding a type-2 dimension table in the data warehouse.

The entity can be present in neither, one or both of the source systems at any point in time. Once an entity is recorded in a source system the intervals are well-behaved - no gaps, duplicates or other monkey business. Membership in the sources can end at different dates.

The business rules state we only want to return intervals where the entity is present in both sources simultaneously.

What query will give this result?

This illustrates the situation:

Month          J     F     M     A     M     J     J
Source A:  <--><----------><----------><---->
Source B:            <----><----><----------------><-->
               
Result:              <----><----><----><---->

Sample Data

For simplicity I've used closed date intervals; likely any solution could be extended to half-open intervals with a little typing.

drop table if exists dbo.SourceA;
drop table if exists dbo.SourceB;
go

create table dbo.SourceA
(
    BusinessKey int,
    StartDate   date,
    EndDate     date,
    Attribute   char(9)
);

create table dbo.SourceB
(
    BusinessKey int,
    StartDate   date,
    EndDate     date,
    Attribute   char(9)
);
GO


insert dbo.SourceA(BusinessKey, StartDate, EndDate, Attribute)
values
    (1, '19990101', '19990113', 'black'),
    (1, '19990114', '19990313', 'red'),
    (1, '19990314', '19990513', 'blue'),
    (1, '19990514', '19990613', 'green'),
    (2, '20110714', '20110913', 'pink'),
    (2, '20110914', '20111113', 'white'),
    (2, '20111114', '20111213', 'gray');

insert dbo.SourceB(BusinessKey, StartDate, EndDate, Attribute)
values
    (1, '19990214', '19990313', 'left'),
    (1, '19990314', '19990413', 'right'),
    (1, '19990414', '19990713', 'centre'),
    (1, '19990714', '19990730', 'back'),
    (2, '20110814', '20110913', 'top'),
    (2, '20110914', '20111013', 'middle'),
    (2, '20111014', '20120113', 'bottom');

Desired output

BusinessKey StartDate   EndDate     a_Colour  b_Placement
----------- ----------  ----------  --------- -----------
1           1999-02-14  1999-03-13  red       left     
1           1999-03-14  1999-04-13  blue      right    
1           1999-04-14  1999-05-13  blue      centre   
1           1999-05-14  1999-06-13  green     centre   
2           2011-08-14  2011-09-13  pink      top      
2           2011-09-14  2011-10-13  white     middle   
2           2011-10-14  2011-11-13  white     bottom   
2           2011-11-14  2011-12-13  gray      bottom    

Solution

I may have misunderstood your question, but the results of this query seem to match your desired output:

select a.businesskey
     -- greatest(a.startdate, b.startdate)
     , case when a.startdate > b.startdate 
            then a.startdate 
            else b.startdate 
       end as startdate
     -- least(a.enddate, b.enddate)
     , case when a.enddate < b.enddate 
            then a.enddate 
            else b.enddate 
       end as enddate
     , a.attribute as a_Colour
     , b.attribute as b_placement
from dbo.SourceA a 
join dbo.SourceB b 
        on a.businesskey = b.businesskey
       and (a.startdate between b.startdate and b.enddate 
          or b.startdate between a.startdate and a.enddate)
order by 1,2

Since the intervals need to overlap, most of the work can be done by a join with the overlap test as its predicate. Then it's just a matter of taking the intersection of each pair of overlapping intervals.
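The overlap test can also be written as a pair of plain comparisons, which is a common way to say that two closed intervals intersect. The following is only a sketch against the sample tables, and it assumes StartDate <= EndDate in every row. Under that assumption it should pair up the same rows as the BETWEEN/OR predicate above. The column aliases (a_start, a_end, b_start, b_end) are made up for illustration.

```sql
-- Two closed intervals overlap when each one starts on or before the other ends.
select a.BusinessKey,
       a.StartDate as a_start, a.EndDate as a_end,
       b.StartDate as b_start, b.EndDate as b_end
from dbo.SourceA as a
join dbo.SourceB as b
    on  a.BusinessKey = b.BusinessKey
    and a.StartDate <= b.EndDate
    and b.StartDate <= a.EndDate
-- For half-open intervals [StartDate, EndDate) the comparisons become strict:
--     and a.StartDate < b.EndDate
--     and b.StartDate < a.EndDate
order by a.BusinessKey, a.StartDate, b.StartDate;
```

This form may also be easier to adapt if the tables are later switched to half-open intervals, as the question suggests.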

LEAST and GREATEST seem to be missing as functions, so I used a case expression instead.
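If the case expressions get unwieldy, one alternative that works without GREATEST and LEAST is to aggregate over a small VALUES list inside CROSS APPLY. This is a sketch of that idea, not a tested replacement for the query above. The derived-table aliases s, e and v(d) are invented for the example. As far as I know, GREATEST and LEAST were added in SQL Server 2022, so on that version or later they could be used directly.

```sql
select a.BusinessKey,
       s.StartDate,
       e.EndDate,
       a.Attribute as a_Colour,
       b.Attribute as b_Placement
from dbo.SourceA as a
join dbo.SourceB as b
    on  a.BusinessKey = b.BusinessKey
    and (a.StartDate between b.StartDate and b.EndDate
      or b.StartDate between a.StartDate and a.EndDate)
-- Latest of the two start dates (stands in for GREATEST).
cross apply (select max(v.d) as StartDate
             from (values (a.StartDate), (b.StartDate)) as v(d)) as s
-- Earliest of the two end dates (stands in for LEAST).
cross apply (select min(v.d) as EndDate
             from (values (a.EndDate), (b.EndDate)) as v(d)) as e
order by a.BusinessKey, s.StartDate;
```

The VALUES approach extends naturally to more than two candidate dates, for example if a third source were added, whereas nested case expressions grow quickly.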


OTHER TIPS

This solution deconstructs the source intervals to just their start dates. Combining these two lists gives the set of output interval start dates. From these, the corresponding output end dates are calculated by a window function. As the final output interval must end when either of the two input intervals ends, there is special processing to determine this value.

;with Dates as
(
    select BusinessKey, StartDate
    from dbo.SourceA

    union

    select BusinessKey, StartDate
    from dbo.SourceB

    union

    select x.BusinessKey, DATEADD(DAY, 1, MIN(x.EndDate))
    from
    (
        select BusinessKey, EndDate = MAX(EndDate) 
        from dbo.SourceA
        group by BusinessKey

        union all

        select BusinessKey, EndDate = MAX(EndDate) 
        from dbo.SourceB
        group by BusinessKey
    ) as x
    group by x.BusinessKey
),
Intervals as
(
    select
        dt.BusinessKey,
        dt.StartDate,
        EndDate = lead (DATEADD(DAY, -1, dt.StartDate), 1)
                  over (partition by dt.BusinessKey order by dt.StartDate)
    from Dates as dt
)
select
    i.BusinessKey,
    i.StartDate,
    i.EndDate, 
    a_Colour = a.Attribute,
    b_Placement = b.Attribute
from Intervals as i
inner join dbo.SourceA as a
    on i.BusinessKey = a.BusinessKey
    and i.StartDate between a.StartDate and a.EndDate
inner join dbo.SourceB as b
    on i.BusinessKey = b.BusinessKey
    and i.StartDate between b.StartDate and b.EndDate
where i.EndDate is not NULL
order by
    i.BusinessKey,
    i.StartDate;

The "Dates" CTE uses UNION rather than UNION ALL to eliminate duplicates. If both sources change on the same date we want only one corresponding output row.
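To see which dates the UNION actually de-duplicates, the start dates shared by both sources can be listed with INTERSECT. This is a small diagnostic sketch against the sample data, not part of the solution.

```sql
-- Start dates that appear in both sources for the same business key.
select a.BusinessKey, a.StartDate
from dbo.SourceA as a
intersect
select b.BusinessKey, b.StartDate
from dbo.SourceB as b;
```

With the sample data this returns 1999-03-14 for key 1 and 2011-09-14 for key 2. Had "Dates" used UNION ALL, each of these dates would appear twice, and the LEAD calculation would give the first copy an EndDate one day before its own StartDate, producing a spurious inverted interval in the output.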

As we want to close the output when either source closes, the third query in "Dates" adds the earliest final end date, i.e. the MIN of the per-source MAX of EndDate. As it is an EndDate masquerading as a StartDate, it must have one day added to it. Its purpose is to allow the window function to calculate the end of the preceding interval. The row itself never reaches the output: it is removed either by the EndDate is not NULL predicate or by the inner joins, because the source that ended first has no interval covering that date.
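Running the third branch of "Dates" on its own shows the extra date it contributes for each key. This is the same logic as in the CTE, with two illustrative column names (EarliestFinalEnd and SentinelStart) added so both values are visible.

```sql
select x.BusinessKey,
       EarliestFinalEnd = min(x.EndDate),                  -- MIN of the per-source MAX
       SentinelStart    = dateadd(day, 1, min(x.EndDate))  -- the date added to "Dates"
from
(
    select BusinessKey, EndDate = max(EndDate)
    from dbo.SourceA
    group by BusinessKey

    union all

    select BusinessKey, EndDate = max(EndDate)
    from dbo.SourceB
    group by BusinessKey
) as x
group by x.BusinessKey
order by x.BusinessKey;
```

For the sample data, key 1 gives 1999-06-13 and 1999-06-14, and key 2 gives 2011-12-13 and 2011-12-14. The second value in each pair is what lets LEAD close the last shared interval on 1999-06-13 and 2011-12-13 respectively.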

Using inner joins for the final query eliminates those source intervals for which there is no corresponding value in the other source.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange