Question

Can someone code review this slowly changing dimension for our Kimball data warehouse? This is a Type 2 dimension. I have reviewed a lot of code on the internet, but I want to avoid T-SQL-specific commands (in particular OUTPUT and $action), because we may move to MySQL or PostgreSQL in the future. If you can find a way to make this more efficient/optimal, feel free to edit.

https://www.mssqltips.com/sqlservertip/2883/using-the-sql-server-merge-statement-to-process-type-2-slowly-changing-dimensions/

https://sqlblogcasts.com/blogs/atulthakor/archive/2011/01/10/t-sql-scd-slowly-changing-dimension-type-2-using-a-merge-statement.aspx

Tables:

create table dbo.Stagingfood
(
    FoodNaturalId int primary key identity(1,1),
    FoodName varchar(255),
    FoodCategory varchar(255)
)

create table dbo.Dimfood
(
    DimFoodId int primary key identity(1,1),
    FoodNaturalId int,
    FoodName varchar(255),
    FoodCategory varchar(255),
    begindate datetime,
    enddate datetime
)

Code:

create procedure dbo.DimFoodImport
as

declare @newdatetime datetime

begin transaction


    -- Insert new records which do not exist

    insert into dbo.DimFood
    (
        FoodNaturalId,
        FoodName,
        FoodCategory,
        begindate,
        enddate
    )
    select 
        stgfood.FoodNaturalId,
        stgfood.FoodName,
        stgfood.FoodCategory,
        getdate() as begindate,
        '12/31/9999' as enddate
    from dbo.Stagingfood stgfood 
    left join dbo.DimFood df
        on df.FoodNaturalId = stgfood.FoodNaturalId
    where df.FoodNaturalId is null


    -- Close off existing records that have changed
    set @newdatetime = getdate()

    update dbo.DimFood
    set EndDate = @newdatetime
    from dbo.DimFood  df 
    left join dbo.Stagingfood stgfood 
        on df.FoodNaturalId = stgfood.FoodNaturalId
        and df.enddate = '12/31/9999'
    where 
        stgfood.FoodName <> df.FoodName
        or stgfood.FoodCategory <> df.FoodCategory

    -- Insert new updated records

    insert into dbo.DimFood
    (
        FoodNaturalId,
        FoodName,
        FoodCategory,
        begindate,
        enddate
    )
    select 
        stgfood.FoodNaturalId,
        stgfood.FoodName,
        stgfood.FoodCategory,
        @newdatetime as begindate,
        '12/31/9999' as enddate
    from dbo.Stagingfood stgfood 
    left join dbo.DimFood df
        on df.FoodNaturalId = stgfood.FoodNaturalId
        and df.EndDate = @newdatetime
    where 
        stgfood.FoodName <> df.FoodName
        or stgfood.FoodCategory <> df.FoodCategory

commit transaction

Let's assume I have proper indexes, and future SCD tables will have 10-15 columns. The above is just an example.


Solution

If you can find a way to make this more efficient/optimal, feel free to edit

Without knowing anything about indexes etc, I tried filling up the staging table with some sample data:

insert into dbo.Stagingfood 
(FoodName,FoodCategory)
values('Chicken','Meat')
go 1000

insert into dbo.Stagingfood 
(FoodName,FoodCategory)
values('Veal','Meat')
go 10000


insert into dbo.Stagingfood 
(FoodName,FoodCategory)
values('Porc','Meat')
go 20000

First part: The Insert

One of the things that catches my eye is the first insert:

-- Insert new records which do not exist

insert into dbo.DimFood
(
    FoodNaturalId,
    FoodName,
    FoodCategory,
    begindate,
    enddate
)
select 
    stgfood.FoodNaturalId,
    stgfood.FoodName,
    stgfood.FoodCategory,
    getdate() as begindate,
    '12/31/9999' as enddate
from dbo.Stagingfood stgfood 
left join dbo.DimFood df
    on df.FoodNaturalId = stgfood.FoodNaturalId
where df.FoodNaturalId is null;

This left join will cause problems when your data gets bigger. Written this way, the query has to join all the records first, and only then can it apply the "is null" filter to discard the rows that matched.

My rewrite + added index would be this:

CREATE NONCLUSTERED INDEX IX_DimFood_FoodNaturalId
ON [dbo].[Dimfood] ([FoodNaturalId]);

   insert into dbo.DimFood
    (
        FoodNaturalId,
        FoodName,
        FoodCategory,
        begindate,
        enddate
    )
    select 
        stgfood.FoodNaturalId,
        stgfood.FoodName,
        stgfood.FoodCategory,
        getdate() as begindate,
        '12/31/9999' as enddate
    from dbo.Stagingfood stgfood 
    where not exists 
   (select * from  dbo.DimFood df where df.FoodNaturalId = stgfood.FoodNaturalId);

You can compare the two plans: the version with the left join uses a Merge Join (Left Outer Join) that feeds into a Filter, while the rewrite does a Left Anti Semi Join with no Filter.

Paste the plan:

https://www.brentozar.com/pastetheplan/?id=HyVZigsjm
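
If you want to compare the two forms on your own data without reading the full plans, one option is SQL Server's I/O and timing statistics. The following is only a sketch: it runs the two select parts without the insert, so nothing is modified, and the numbers you get will depend on your data and indexes.

```sql
-- Report logical reads and CPU/elapsed time for each statement that follows
set statistics io on;
set statistics time on;

-- Original form: outer join first, then filter out the rows that matched
select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory
from dbo.Stagingfood stgfood
left join dbo.DimFood df
    on df.FoodNaturalId = stgfood.FoodNaturalId
where df.FoodNaturalId is null;

-- Rewrite: not exists, which the optimizer can run as an anti semi join
select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory
from dbo.Stagingfood stgfood
where not exists
    (select * from dbo.DimFood df where df.FoodNaturalId = stgfood.FoodNaturalId);

set statistics io off;
set statistics time off;
```

In SSMS the results appear on the Messages tab, one set of figures per statement.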

Second part: The update

This one mostly needs to be tested against the actual data set. But using a left join where an inner join would do is, again, an unneeded extra step: you are only checking existing records, so the left join can be changed to an inner join. The other thing is that 'or' clauses generally don't go down well with the optimizer.

The following could be a rewrite, but the 'or' clause might need to be kept in. Again, this needs to be tested. This rewrite plus the index gives better results on my data set:

    CREATE NONCLUSTERED INDEX IX_Enddate_FoodNaturalId
    on [dbo].[Dimfood] (Enddate,[FoodNaturalId])
    include(foodname,foodcategory);

    -- Close off existing records that have changed
    set @newdatetime = getdate();

    update dbo.DimFood
    set EndDate = @newdatetime
    from dbo.DimFood df
    inner join dbo.Stagingfood stgfood
        on df.FoodNaturalId = stgfood.FoodNaturalId
    where df.enddate = '12/31/9999'
        and (stgfood.FoodName <> df.FoodName or stgfood.FoodCategory <> df.FoodCategory);

If the 'or' clause gives trouble on big data sets, you might have to split it into two updates, one per column. Rows closed by the first update no longer have enddate = '12/31/9999', so the second update skips them:

update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood df
inner join dbo.Stagingfood stgfood
    on df.FoodNaturalId = stgfood.FoodNaturalId
where df.enddate = '12/31/9999'
    and stgfood.FoodName <> df.FoodName;


update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood df
inner join dbo.Stagingfood stgfood
    on df.FoodNaturalId = stgfood.FoodNaturalId
where df.enddate = '12/31/9999'
    and stgfood.FoodCategory <> df.FoodCategory;

Plans without splitting the 'or' clause into two parts (first plan original, second plan rewrite):

https://www.brentozar.com/pastetheplan/?id=BJDvNZjjm
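
One thing the <> comparisons depend on is that FoodName and FoodCategory never contain NULL. The table definitions in the question do not declare them NOT NULL, and a <> comparison evaluates to UNKNOWN when either side is NULL, so a change from or to NULL would not close the row. If that can happen in your data, a possible sketch that avoids T-SQL-only syntax is to spell out the NULL cases. This is an assumption about your data, not something the plans above were tested with, and the extra 'or' branches need the same testing as before.

```sql
-- Assumption: FoodName and FoodCategory may be NULL on either side.
declare @newdatetime datetime = getdate();

update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood df
inner join dbo.Stagingfood stgfood
    on df.FoodNaturalId = stgfood.FoodNaturalId
where df.enddate = '12/31/9999'
  and (
        -- FoodName changed, including changes from or to NULL
        stgfood.FoodName <> df.FoodName
        or (stgfood.FoodName is null and df.FoodName is not null)
        or (stgfood.FoodName is not null and df.FoodName is null)
        -- FoodCategory changed, including changes from or to NULL
        or stgfood.FoodCategory <> df.FoodCategory
        or (stgfood.FoodCategory is null and df.FoodCategory is not null)
        or (stgfood.FoodCategory is not null and df.FoodCategory is null)
      );
```

If both sides are NULL, none of the branches is true, so unchanged rows stay open.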

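Third part: Inserting new updated records

Again, if you are only looking at the newly updated records, an inner join can be used to improve the query. You would also not need the where clause: the only rows with EndDate = @newdatetime are the ones the update just closed, and those were closed precisely because FoodName or FoodCategory differs from staging.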

   -- Insert new updated records
insert into dbo.DimFood
(
    FoodNaturalId,
    FoodName,
    FoodCategory,
    begindate,
    enddate
)
select 
    stgfood.FoodNaturalId,
    stgfood.FoodName,
    stgfood.FoodCategory,
    @newdatetime as begindate,
    '12/31/9999' as enddate
from dbo.Stagingfood stgfood 
inner join dbo.DimFood df
on df.FoodNaturalId = stgfood.FoodNaturalId
and df.EndDate = @newdatetime;

Paste the plan (first plan original, second plan rewrite):

https://www.brentozar.com/pastetheplan/?id=BJk2V-sim
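
Putting the three rewrites together, the procedure could look roughly like the sketch below. It assumes the two indexes shown earlier already exist. Two details are my own choices rather than part of the tested rewrites: @newdatetime is set once at the top and used for all three steps, and create or alter is used so the script can be rerun (that syntax needs SQL Server 2016 SP1 or later; on older versions use alter procedure).

```sql
create or alter procedure dbo.DimFoodImport
as
begin
    declare @newdatetime datetime;
    set @newdatetime = getdate();   -- assumption: one timestamp for the whole run

    begin transaction;

    -- 1. Insert new records which do not exist
    insert into dbo.DimFood (FoodNaturalId, FoodName, FoodCategory, begindate, enddate)
    select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory,
           @newdatetime, '12/31/9999'
    from dbo.Stagingfood stgfood
    where not exists
        (select * from dbo.DimFood df where df.FoodNaturalId = stgfood.FoodNaturalId);

    -- 2. Close off existing records that have changed
    update dbo.DimFood
    set EndDate = @newdatetime
    from dbo.DimFood df
    inner join dbo.Stagingfood stgfood
        on df.FoodNaturalId = stgfood.FoodNaturalId
    where df.enddate = '12/31/9999'
      and (stgfood.FoodName <> df.FoodName or stgfood.FoodCategory <> df.FoodCategory);

    -- 3. Insert a new open row for every record closed in step 2
    insert into dbo.DimFood (FoodNaturalId, FoodName, FoodCategory, begindate, enddate)
    select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory,
           @newdatetime, '12/31/9999'
    from dbo.Stagingfood stgfood
    inner join dbo.DimFood df
        on df.FoodNaturalId = stgfood.FoodNaturalId
       and df.EndDate = @newdatetime;

    commit transaction;
end;
go

exec dbo.DimFoodImport;

-- Sanity check: lists any natural key that has more than one open row
select FoodNaturalId, count(*) as OpenRows
from dbo.DimFood
where enddate = '12/31/9999'
group by FoodNaturalId
having count(*) > 1;
```

Rows inserted in step 1 match staging exactly, so step 2 leaves them open, and step 3 only picks up rows whose EndDate was set in this run.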

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange