Create Slowly Changing Dimension table from Repeated Data
Question
I am receiving transaction snapshot data from a file that contains a history of repeated rows. I am trying to build a Slowly Changing Dimension on a table whose business key is [ProductId]. The table has several attributes, such as ProductTitle and ProductCategory; this is a sample table, and the real one has around 10 more attributes. How do I write a query that creates a Product Slowly Changing Dimension table?
I am looking for a performance-optimized approach: with 10 columns, I am not sure that a GROUP BY on all 10 is optimal.
With SQL Server 2016, is there a function to obtain this data? Should I use LEAD/LAG, FIRST_VALUE/LAST_VALUE, or some other newer analytic syntax? An attempted query is below.
Note: Data comes from a 1970 legacy file system containing historical data.
Data:
create table dbo.Product
(
ProductId int,
ProductTitle varchar(55),
ProductCategory varchar(255),
Loaddate datetime
)
insert into dbo.Product
values
(1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')
....more data with ProductId =2,3,4, etc
Current Repeated Data in File:
Expected Output:
Attempted Query
(seems to be inefficient, especially when having 10 attribute columns)
select
product.Productid
,product.ProductTitle
,product.ProductCategory
,min(product.LoadDate) as BeginDate
,case when max(product.LoadDate) = (select max(subproduct.LoadDate) from dbo.Product subproduct where subproduct.productid = product.productid) then '12/31/9999' else max(product.loadDate) end as EndDate
from dbo.Product product
group by Productid, ProductTitle, ProductCategory
Solution
"if I have 10 columns, not sure if Group By on 10 columns is optimal"
That is true: GROUP BY on that many columns is suboptimal. But given your data and requirement, I do not see another way.
In my view, a window function with PARTITION BY is worse than GROUP BY here. As I understand it, grouping on those columns is necessary to get the correct output, so a window function would need the same columns in its PARTITION BY clause. Partitioning on that many columns costs more, and you would also need a few extra SELECTs like the one in your example.
So what you are already doing is nearly correct, except for the subquery part.
Try this:
SELECT product.productid,
product.producttitle,
product.productcategory,
Min(product.loaddate) AS BeginDate,
-- Max(product.loaddate) AS BeginDate1,
CASE
WHEN Max(product.loaddate) = Max(oa.enddate1) THEN '12/31/9999'
ELSE Max(product.loaddate)
END AS EndDate
FROM dbo.product product
CROSS apply(SELECT Max(subproduct.loaddate) EndDate1
FROM dbo.product subproduct
WHERE subproduct.productid = product.productid)oa
GROUP BY productid,
producttitle,
productcategory
If the main query is still very slow because of the subquery or my CROSS APPLY, can you split it into two steps?
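One way to read "two steps" is to materialize each product's latest load date once and then join to it. The following is a minimal sketch under that assumption; the temp table #LastLoad and its index name are hypothetical, and whether it beats the single-statement version depends on your data volume and indexes, so compare the execution plans.

```sql
-- Step 1: one row per product with its latest load date
-- (#LastLoad is a hypothetical helper table, not part of the original schema)
SELECT ProductId,
       MAX(LoadDate) AS LastLoadDate
INTO #LastLoad
FROM dbo.Product
GROUP BY ProductId;

CREATE UNIQUE CLUSTERED INDEX IX_LastLoad ON #LastLoad (ProductId);

-- Step 2: same grouping as before, but the "is this the latest version?"
-- check is a join to the small helper table instead of a correlated subquery
SELECT p.ProductId,
       p.ProductTitle,
       p.ProductCategory,
       MIN(p.LoadDate) AS BeginDate,
       CASE
           WHEN MAX(p.LoadDate) = ll.LastLoadDate THEN CAST('9999-12-31' AS datetime)
           ELSE MAX(p.LoadDate)
       END AS EndDate
FROM dbo.Product AS p
JOIN #LastLoad AS ll
    ON ll.ProductId = p.ProductId
GROUP BY p.ProductId,
         p.ProductTitle,
         p.ProductCategory,
         ll.LastLoadDate;

DROP TABLE #LastLoad;
```

Like the query it is derived from, this still groups by attribute values, so a product whose attributes return to an earlier combination would have those two periods merged into one row.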
I think you should say more about your requirement.
How many rows will be updated at a time?
Once the selected rows have been inserted or updated in the new table, they should not be selected again the next time an insert happens.
How are you handling this?
If I have misunderstood your requirement, let me know so that I can correct my answer.
OTHER TIPS
The problem here is the way you're loading data. With a Type 2 SCD (Effective Date) you want to add a new row only when there is a change to the data. The first four rows in your dataset do not change except for the load date.
You need to ETL your data from the source files into your database where you can more easily identify if records have been changed and only add new rows for the changed records. Begin Date would be the date that row is first loaded and End Date would be when one of the values changes marking the original row as no longer current. For the current row, End Date is NULL.
The query then becomes much simpler:
select
product.ProductId
,product.ProductTitle
,product.ProductCategory
,product.BeginDate
-- the current row has EndDate = NULL; show it as the open-ended date
,ISNULL(product.EndDate, '9999-12-31') AS EndDate
from dbo.Product product
-- no GROUP BY: the table already holds one row per version
The end result is that your table would have only 3 rows rather than 10. If you're importing directly using T-SQL scripts, look at using MERGE to UPDATE/INSERT as appropriate. If you're using an ETL tool like SSIS, it has a built-in Slowly Changing Dimension transformation for this. There are lots of ways to solve this, but the key is importing your data in a better way.
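As a rough illustration of loading only the changes, here is a sketch that uses a plain UPDATE followed by an INSERT rather than MERGE, to keep the two steps visible. The table names dbo.DimProduct (the dimension, with EndDate NULL on the current row) and dbo.ProductStage (one row per product for the load being processed) are assumptions for the example, not objects from the question.

```sql
-- Assumed (hypothetical) tables:
--   dbo.DimProduct   (ProductId, ProductTitle, ProductCategory, BeginDate, EndDate)
--   dbo.ProductStage (ProductId, ProductTitle, ProductCategory, LoadDate)

BEGIN TRANSACTION;

-- 1) Close the current version of every product whose attributes changed.
--    Note: <> treats NULL as "unknown", so nullable attributes need extra handling.
UPDATE d
SET d.EndDate = s.LoadDate
FROM dbo.DimProduct AS d
JOIN dbo.ProductStage AS s
    ON s.ProductId = d.ProductId
WHERE d.EndDate IS NULL
  AND (d.ProductTitle <> s.ProductTitle
       OR d.ProductCategory <> s.ProductCategory);

-- 2) Open a new current version for products that now have none:
--    brand-new products and the ones just closed in step 1.
INSERT INTO dbo.DimProduct (ProductId, ProductTitle, ProductCategory, BeginDate, EndDate)
SELECT s.ProductId,
       s.ProductTitle,
       s.ProductCategory,
       s.LoadDate,
       NULL
FROM dbo.ProductStage AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.DimProduct AS d
    WHERE d.ProductId = s.ProductId
      AND d.EndDate IS NULL
);

COMMIT TRANSACTION;
```

Unchanged products are touched by neither statement, so repeated snapshots of the same values add no rows.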
See if this gives you what you want. It seems to work against your test data.
--demo setup
drop table if exists dbo.product
go
create table dbo.Product
(
ProductId int,
ProductTitle varchar(55),
ProductCategory varchar(255),
Loaddate datetime
)
insert into dbo.Product
values
(1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')
Using a common table expression and lag against the other columns you are interested in, I populate a new column called ischange.
;WITH BaseDataAndIsChanged
AS (
SELECT *
,CASE
WHEN (
lag(ProductTitle, 1, '') OVER (
PARTITION BY ProductId ORDER BY LoadDate
)
) = ProductTitle
AND (
lag(ProductCategory, 1, '') OVER (
PARTITION BY ProductId ORDER BY LoadDate
)
) = ProductCategory
THEN 0
ELSE 1
END ischange
FROM dbo.product
)
select * from BaseDataAndIsChanged
| ProductId | ProductTitle | ProductCategory | Loaddate | ischange |
|-----------|--------------|-----------------|-------------------------|----------|
| 1 | Table | ABCD | 2018-03-04 00:00:00.000 | 1 |
| 1 | Table | ABCD | 2018-03-05 00:00:00.000 | 0 |
| 1 | Table | ABCD | 2018-03-05 00:00:00.000 | 0 |
| 1 | Table | ABCD | 2018-03-06 00:00:00.000 | 0 |
| 1 | Table | XYZ | 2018-03-07 00:00:00.000 | 1 |
| 1 | Table | XYZ | 2018-03-08 00:00:00.000 | 0 |
| 1 | Table | XYZ | 2018-03-08 00:00:00.000 | 0 |
| 1 | Table | XYZ | 2018-03-09 00:00:00.000 | 0 |
| 1 | Table-Dinner | GHI | 2018-03-10 00:00:00.000 | 1 |
| 1 | Table-Dinner | GHI | 2018-03-11 00:00:00.000 | 0 |
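With ten or more attributes, the chain of lag comparisons gets long. One possible variation, sketched here as an assumption rather than as part of the tested solution, is to hash the attribute columns into a single value and compare only that. The column name RowHash and the '|' separator are choices made for the example; whether this is actually faster should be measured against your data.

```sql
-- Sketch: detect changes by comparing one hash per row instead of one LAG per column
;WITH Hashed AS (
    SELECT ProductId,
           ProductTitle,
           ProductCategory,
           LoadDate,
           -- the '|' separator keeps ('ab','c') distinct from ('a','bc'),
           -- assuming '|' does not occur in the data;
           -- CONCAT turns NULL into '', so NULL and '' produce the same hash
           HASHBYTES('SHA2_256', CONCAT(ProductTitle, '|', ProductCategory)) AS RowHash
    FROM dbo.Product
)
SELECT ProductId,
       ProductTitle,
       ProductCategory,
       LoadDate,
       -- LAG returns NULL on the first row of each product,
       -- so the comparison is not true and the row is flagged as a change
       CASE
           WHEN LAG(RowHash) OVER (PARTITION BY ProductId ORDER BY LoadDate) = RowHash
           THEN 0
           ELSE 1
       END AS ischange
FROM Hashed;
```

Adding an attribute then means extending the CONCAT list rather than adding another lag block.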
I then use another common table expression to generate two additional columns (BeginDate and EndDate).
;WITH BaseDataAndIsChanged
AS (
SELECT *
,CASE
WHEN (
lag(ProductTitle, 1, '') OVER (
PARTITION BY ProductId ORDER BY LoadDate
)
) = ProductTitle
AND (
lag(ProductCategory, 1, '') OVER (
PARTITION BY ProductId ORDER BY LoadDate
)
) = ProductCategory
THEN 0
ELSE 1
END ischange
FROM dbo.product
)
--select * from BaseDataAndIsChanged
,BaseDataAndBeginEndDates
AS (
SELECT *
,CASE
WHEN ischange = 1
THEN Loaddate
END AS BeginDate
,CASE
WHEN lead(ischange, 1, 0) OVER (
PARTITION BY ProductId ORDER BY LoadDate
) = 1
THEN Loaddate
END AS EndDate
FROM BaseDataAndIsChanged
)
select * from BaseDataAndBeginEndDates
| ProductId | ProductTitle | ProductCategory | Loaddate | ischange | BeginDate | EndDate |
|-----------|--------------|-----------------|-------------------------|----------|-------------------------|-------------------------|
| 1 | Table | ABCD | 2018-03-04 00:00:00.000 | 1 | 2018-03-04 00:00:00.000 | NULL |
| 1 | Table | ABCD | 2018-03-05 00:00:00.000 | 0 | NULL | NULL |
| 1 | Table | ABCD | 2018-03-05 00:00:00.000 | 0 | NULL | NULL |
| 1 | Table | ABCD | 2018-03-06 00:00:00.000 | 0 | NULL | 2018-03-06 00:00:00.000 |
| 1 | Table | XYZ | 2018-03-07 00:00:00.000 | 1 | 2018-03-07 00:00:00.000 | NULL |
| 1 | Table | XYZ | 2018-03-08 00:00:00.000 | 0 | NULL | NULL |
| 1 | Table | XYZ | 2018-03-08 00:00:00.000 | 0 | NULL | NULL |
| 1 | Table | XYZ | 2018-03-09 00:00:00.000 | 0 | NULL | 2018-03-09 00:00:00.000 |
| 1 | Table-Dinner | GHI | 2018-03-10 00:00:00.000 | 1 | 2018-03-10 00:00:00.000 | NULL |
| 1 | Table-Dinner | GHI | 2018-03-11 00:00:00.000 | 0 | NULL | NULL |
Here's the complete solution, where we pull everything together and take min(BeginDate) and max(EndDate), grouped by ProductId and the other columns you're interested in.
--demo setup
drop table if exists dbo.product
go
create table dbo.Product
(
ProductId int,
ProductTitle varchar(55),
ProductCategory varchar(255),
Loaddate datetime
)
insert into dbo.Product
values
(1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')
;WITH BaseDataAndIsChanged
AS (
SELECT *
,CASE
WHEN (
lag(ProductTitle, 1, '') OVER (
PARTITION BY ProductId ORDER BY LoadDate
)
) = ProductTitle
AND (
lag(ProductCategory, 1, '') OVER (
PARTITION BY ProductId ORDER BY LoadDate
)
) = ProductCategory
THEN 0
ELSE 1
END ischange
FROM dbo.product
)
--select * from BaseDataAndIsChanged
,BaseDataAndBeginEndDates
AS (
SELECT *
,CASE
WHEN ischange = 1
THEN Loaddate
END AS BeginDate
,CASE
WHEN lead(ischange, 1, 0) OVER (
PARTITION BY ProductId ORDER BY LoadDate
) = 1
THEN Loaddate
END AS EndDate
FROM BaseDataAndIsChanged
)
--select * from BaseDataAndBeginEndDates
SELECT productid
,ProductTitle
,ProductCategory
,min(begindate) AS BeginDate
,isnull(max(EndDate), '9999-12-31') AS EndDate
FROM BaseDataAndBeginEndDates
GROUP BY productid
,ProductTitle
,ProductCategory
| productid | ProductTitle | ProductCategory | BeginDate | EndDate |
|-----------|--------------|-----------------|-------------------------|-------------------------|
| 1 | Table | ABCD | 2018-03-04 00:00:00.000 | 2018-03-06 00:00:00.000 |
| 1 | Table | XYZ | 2018-03-07 00:00:00.000 | 2018-03-09 00:00:00.000 |
| 1 | Table-Dinner | GHI | 2018-03-10 00:00:00.000 | 9999-12-31 00:00:00.000 |
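One caveat with grouping on the attribute values: if a product ever returns to an earlier combination (for example ABCD, then XYZ, then ABCD again), the two ABCD periods collapse into a single row. A possible way around that, sketched here under the assumption that each unbroken run should become its own row, is to turn the change flag into a running version number and group by that instead. The names Flagged, Numbered and VersionNo are made up for the example.

```sql
;WITH Flagged AS (
    SELECT ProductId,
           ProductTitle,
           ProductCategory,
           LoadDate,
           -- 1 on the first row of each run of identical attribute values
           CASE
               WHEN LAG(ProductTitle) OVER (PARTITION BY ProductId ORDER BY LoadDate) = ProductTitle
                AND LAG(ProductCategory) OVER (PARTITION BY ProductId ORDER BY LoadDate) = ProductCategory
               THEN 0
               ELSE 1
           END AS IsChange
    FROM dbo.Product
),
Numbered AS (
    SELECT *,
           -- running count of changes = version number within the product;
           -- the default RANGE frame gives rows with the same LoadDate the same number
           SUM(IsChange) OVER (PARTITION BY ProductId ORDER BY LoadDate) AS VersionNo
    FROM Flagged
)
SELECT ProductId,
       ProductTitle,
       ProductCategory,
       MIN(LoadDate) AS BeginDate,
       -- the highest version number per product is the current row
       CASE
           WHEN VersionNo = MAX(VersionNo) OVER (PARTITION BY ProductId)
           THEN CAST('9999-12-31' AS datetime)
           ELSE MAX(LoadDate)
       END AS EndDate
FROM Numbered
GROUP BY ProductId,
         VersionNo,
         ProductTitle,
         ProductCategory
ORDER BY ProductId,
         VersionNo;
```

Against the demo data this yields the same three rows as the result above, since no value combination repeats there. Nullable attributes would need explicit handling, because a comparison with NULL always counts as a change.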
I'd also be curious to know if a simple cursor solution gives you the performance you want. Try this example against your dataset.
--demo setup
drop table if exists dbo.product
go
create table dbo.Product
(
ProductId int,
ProductTitle varchar(55),
ProductCategory varchar(255),
Loaddate datetime
)
insert into dbo.Product
values
(1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')
----------------
DECLARE @ProductId INT
DECLARE @ProductTitle VARCHAR(55)
DECLARE @ProductCategory VARCHAR(255)
DECLARE @LoadDate DATETIME
DECLARE @PrevProductId INT
DECLARE @PrevProductTitle VARCHAR(55)
DECLARE @PrevProductCategory VARCHAR(255)
DECLARE @PrevLoadDate DATETIME
DECLARE @BeginDate DATETIME
DECLARE @EndDate DATETIME
DECLARE @ResultTable TABLE (
ProductId INT
,ProductTitle VARCHAR(55)
,ProductCategory VARCHAR(255)
,BeginDate DATETIME
,EndDate DATETIME
)
DECLARE _CURSOR CURSOR LOCAL FORWARD_ONLY STATIC READ_ONLY
FOR
SELECT ProductId
,ProductTitle
,ProductCategory
,LoadDate
FROM dbo.Product
ORDER BY Productid
,Loaddate
OPEN _CURSOR
FETCH NEXT
FROM _CURSOR
INTO @ProductId
,@ProductTitle
,@ProductCategory
,@LoadDate
SET @PrevProductId = @ProductId
SET @PrevProductTitle = @ProductTitle
SET @PrevProductCategory = @ProductCategory
SET @BeginDate = @LoadDate
SET @PrevLoadDate = @LoadDate
WHILE @@FETCH_STATUS = 0
BEGIN
IF @ProductId <> @PrevProductId
OR @ProductTitle <> @PrevProductTitle
OR @ProductCategory <> @PrevProductCategory
BEGIN
IF @ProductId <> @PrevProductId
SET @EndDate = '9999-12-31'
ELSE
SET @EndDate = @PrevLoadDate
INSERT INTO @ResultTable (
ProductId
,ProductTitle
,ProductCategory
,BeginDate
,EndDate
)
VALUES (
@PrevProductId
,@PrevProductTitle
,@PrevProductCategory
,@BeginDate
,@EndDate
)
SET @BeginDate = @LoadDate
END
SET @PrevProductId = @ProductId
SET @PrevProductTitle = @ProductTitle
SET @PrevProductCategory = @ProductCategory
SET @PrevLoadDate = @LoadDate
FETCH NEXT
FROM _CURSOR
INTO @ProductId
,@ProductTitle
,@ProductCategory
,@LoadDate
END --End While
SET @EndDate = '9999-12-31'
INSERT INTO @ResultTable (
ProductId
,ProductTitle
,ProductCategory
,BeginDate
,EndDate
)
VALUES (
@PrevProductId
,@PrevProductTitle
,@PrevProductCategory
,@BeginDate
,@EndDate
)
CLOSE _CURSOR
DEALLOCATE _CURSOR
SELECT *
FROM @ResultTable