SQL Server Stored Procedure Slowly Changing Dimension
Problem
Can someone code review this slowly changing dimension for our Kimball data warehouse? This is a Type 2 dimension. I have reviewed a lot of code on the internet, but I want to avoid T-SQL-specific commands (such as OUTPUT and $Action). If you can find a way to make this more efficient/optimal, feel free to edit. We may move to MySQL or PostgreSQL in the future.
Tables:
create table dbo.Stagingfood
(
FoodNaturalId int primary key identity(1,1),
FoodName varchar(255),
FoodCategory varchar(255)
)
create table dbo.Dimfood
(
DimFoodId int primary key identity(1,1),
FoodNaturalId int,
FoodName varchar(255),
FoodCategory varchar(255),
begindate datetime,
enddate datetime
)
Code:
create procedure dbo.DimFoodImport
as
declare @newdatetime datetime
begin transaction
-- Insert new records which do not exist
insert into dbo.DimFood
(
FoodNaturalId,
FoodName,
FoodCategory,
begindate,
enddate
)
select
stgfood.FoodNaturalId,
stgfood.FoodName,
stgfood.FoodCategory,
getdate() as begindate,
'12/31/9999' as enddate
from dbo.Stagingfood stgfood
left join dbo.DimFood df
on df.FoodNaturalId = stgfood.FoodNaturalId
where df.FoodNaturalId is null
-- Close off existing records that have changed
set @newdatetime = getdate()
update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood df
left join dbo.Stagingfood stgfood
on df.FoodNaturalId = stgfood.FoodNaturalId
and df.enddate = '12/31/9999'
where
stgfood.FoodName <> df.FoodName
or stgfood.FoodCategory <> df.FoodCategory
-- Insert new updated records
insert into dbo.DimFood
(
FoodNaturalId,
FoodName,
FoodCategory,
begindate,
enddate
)
select
stgfood.FoodNaturalId,
stgfood.FoodName,
stgfood.FoodCategory,
@newdatetime as begindate,
'12/31/9999' as enddate
from dbo.Stagingfood stgfood
left join dbo.DimFood df
on df.FoodNaturalId = stgfood.FoodNaturalId
and df.EndDate = @newdatetime
where
stgfood.FoodName <> df.FoodName
or stgfood.FoodCategory <> df.FoodCategory
commit transaction
Let's assume I have proper indexes, and that future SCD tables will have 10-15 columns. The above is just an example.
Solution
If you can find a way to make this more efficient/optimal, feel free to edit
Without knowing anything about your indexes, I filled the staging table with some sample data:
insert into dbo.Stagingfood
(FoodName,FoodCategory)
values('Chicken','Meat')
go 1000
insert into dbo.Stagingfood
(FoodName,FoodCategory)
values('Veal','Meat')
go 10000
insert into dbo.Stagingfood
(FoodName,FoodCategory)
values('Porc','Meat')
go 20000
First part: The Insert
One of the things that catches my eye is the first insert:
-- Insert new records which do not exist
insert into dbo.DimFood
(
FoodNaturalId,
FoodName,
FoodCategory,
begindate,
enddate
)
select
stgfood.FoodNaturalId,
stgfood.FoodName,
stgfood.FoodCategory,
getdate() as begindate,
'12/31/9999' as enddate
from dbo.Stagingfood stgfood
left join dbo.DimFood df
on df.FoodNaturalId = stgfood.FoodNaturalId
where df.FoodNaturalId is null;
This left join will cause problems as your data grows. With the left join, the query has to join all the records first, and only then can it apply the filter on df.FoodNaturalId is null.
My rewrite, with an added index, would be this:
CREATE NONCLUSTERED INDEX IX_DimFood_FoodNaturalId
ON [dbo].[Dimfood] ([FoodNaturalId]);
insert into dbo.DimFood
(
FoodNaturalId,
FoodName,
FoodCategory,
begindate,
enddate
)
select
stgfood.FoodNaturalId,
stgfood.FoodName,
stgfood.FoodCategory,
getdate() as begindate,
'12/31/9999' as enddate
from dbo.Stagingfood stgfood
where not exists
(select * from dbo.DimFood df where df.FoodNaturalId = stgfood.FoodNaturalId);
You can compare the two plans: the version with the left join uses a Merge Join (Left Outer Join) feeding into a Filter, while the rewrite does a Left Anti Semi Join without a Filter.
Paste the plan:
https://www.brentozar.com/pastetheplan/?id=HyVZigsjm
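To reproduce the comparison on your own data, one option is to run the two SELECT forms side by side with I/O and timing statistics switched on. This is only a sketch: it assumes the tables from the question, reads only the natural key, and the plans for the full insert may differ from what these reduced queries show.

```sql
-- Report logical reads and CPU/elapsed time for each statement
set statistics io on;
set statistics time on;

-- Original form: outer join followed by a NULL filter
select stgfood.FoodNaturalId
from dbo.Stagingfood stgfood
left join dbo.DimFood df
    on df.FoodNaturalId = stgfood.FoodNaturalId
where df.FoodNaturalId is null;

-- Rewritten form: NOT EXISTS (anti semi join)
select stgfood.FoodNaturalId
from dbo.Stagingfood stgfood
where not exists
    (select * from dbo.DimFood df where df.FoodNaturalId = stgfood.FoodNaturalId);

set statistics io off;
set statistics time off;
```

Both queries should return the same set of keys; the difference to look for is in the reads and in the plan operators.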
Second Part: The update
This one mostly needs to be tested against the actual data set. But using a left join where an inner join would do is, again, an unnecessary extra step: you are only checking records that already exist, so the left join can be changed to an inner join. The other thing is that OR clauses generally don't go down well with the optimizer.
The OR clause could be rewritten as well, but it might need to be kept in; again, this needs to be tested.
This rewrite plus index gives better results on my dataset:
-- Close off existing records that have changed
CREATE NONCLUSTERED INDEX IX_Enddate_FoodNaturalId
on [dbo].[Dimfood] (Enddate,[FoodNaturalId])
include(foodname,foodcategory);
set @newdatetime = getdate();
update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood df
inner join dbo.Stagingfood stgfood
on df.FoodNaturalId = stgfood.FoodNaturalId
where( df.enddate = '12/31/9999'
and (stgfood.FoodName <> df.FoodName or stgfood.FoodCategory <> df.FoodCategory));
If the 'or' clause gives trouble on big datasets, you might have to spread them out over two updates:
update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood df
inner join dbo.Stagingfood stgfood
on df.FoodNaturalId = stgfood.FoodNaturalId
where( df.enddate = '12/31/9999'
and stgfood.FoodName <> df.FoodName);
update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood df
inner join dbo.Stagingfood stgfood
on df.FoodNaturalId = stgfood.FoodNaturalId
where( df.enddate = '12/31/9999'
and stgfood.FoodCategory <> df.FoodCategory);
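With 10-15 tracked columns, one update per column gets unwieldy. A possible alternative, which is not part of the rewrite above and would need the same testing, is to compare all tracked columns at once with EXISTS ... EXCEPT. The sketch below assumes the tables from the question and declares @newdatetime itself so it can run on its own. Note that EXCEPT treats two NULLs as equal, so unlike <> it also flags a change from NULL to a value.

```sql
declare @newdatetime datetime;
set @newdatetime = getdate();

update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood df
inner join dbo.Stagingfood stgfood
    on df.FoodNaturalId = stgfood.FoodNaturalId
where df.enddate = '12/31/9999'
  and exists
  (
      -- Yields a row only when at least one tracked column differs
      select stgfood.FoodName, stgfood.FoodCategory
      except
      select df.FoodName, df.FoodCategory
  );
```

Adding a column to the comparison then means adding it to both select lists rather than writing another predicate or another update.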
Plan without cutting the 'or' clause in two parts: (first one original, second one rewrite)
https://www.brentozar.com/pastetheplan/?id=BJDvNZjjm
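Since the question mentions a possible move to MySQL or PostgreSQL: UPDATE ... FROM with a join is not standard SQL and its syntax differs between engines. A more portable shape is a correlated EXISTS, sketched below under the assumption that the tables are those from the question. The variable declaration is still T-SQL and would need adapting, and the date literal is kept as in the original. Whether this performs as well as the join form has not been tested here.

```sql
declare @newdatetime datetime;
set @newdatetime = getdate();

-- Close off current rows whose tracked columns differ from staging
update dbo.DimFood
set enddate = @newdatetime
where enddate = '12/31/9999'
  and exists
  (
      select 1
      from dbo.Stagingfood stgfood
      where stgfood.FoodNaturalId = dbo.DimFood.FoodNaturalId
        and (stgfood.FoodName <> dbo.DimFood.FoodName
             or stgfood.FoodCategory <> dbo.DimFood.FoodCategory)
  );
```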
Third part: inserting new updated records
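Again here, if you are only going to be checking the newly updated records, an inner join could be used to improve the query. You would also not need the where clause, because the rows with EndDate = @newdatetime are exactly the ones the previous update just closed off.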
-- Insert new updated records
insert into dbo.DimFood
(
FoodNaturalId,
FoodName,
FoodCategory,
begindate,
enddate
)
select
stgfood.FoodNaturalId,
stgfood.FoodName,
stgfood.FoodCategory,
@newdatetime as begindate,
'12/31/9999' as enddate
from dbo.Stagingfood stgfood
inner join dbo.DimFood df
on df.FoodNaturalId = stgfood.FoodNaturalId
and df.EndDate = @newdatetime;
Paste the plan (first plan original, second plan rewrite):
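After running the three steps, a quick sanity check can confirm that the Type 2 invariant still holds, namely one open row per natural key. The two queries below are a sketch assuming the tables from the question; both should return no rows if the import behaved as intended.

```sql
-- Natural keys with more than one open (current) row
select FoodNaturalId, count(*) as OpenRowCount
from dbo.DimFood
where enddate = '12/31/9999'
group by FoodNaturalId
having count(*) > 1;

-- Staging keys that ended up with no open row at all
select stgfood.FoodNaturalId
from dbo.Stagingfood stgfood
where not exists
    (select *
     from dbo.DimFood df
     where df.FoodNaturalId = stgfood.FoodNaturalId
       and df.enddate = '12/31/9999');
```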