Question

I am receiving transaction snapshot data from a file; it contains a history of repeated data. I am trying to derive a Slowly Changing Dimension from a table whose business key is [ProductId]. There are many attributes, such as ProductTitle and Category; this is a sample table, and the real one has around 10 more attributes. How do I write a query that builds a Product Slowly Changing Dimension table?

Searching for a performance-optimized way: if I have 10 columns, not sure if Group By on 10 columns is optimal.

With SQL Server 2016, is there a function to obtain this data? Should I use the LEAD/LAG functions, FIRST_VALUE/LAST_VALUE, or some newer analytic syntax? An attempted query is below.

Note: Data comes from a 1970 legacy file system containing historical data.

Data:

create table dbo.Product
(
    ProductId int,
    ProductTitle varchar(55),
    ProductCategory varchar(255),
    Loaddate datetime
)

insert into dbo.Product
values 
 (1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')
-- ....more data with ProductId = 2, 3, 4, etc.

Current Repeated Data in File:

[Image of the repeated rows; not reproduced here]

Expected Output:

[Image of the expected output; not reproduced here]

Attempted Query

(seems to be inefficient, especially when having 10 attribute columns)

select
    product.Productid
    ,product.ProductTitle
    ,product.ProductCategory
    ,min(product.LoadDate) as BeginDate
    ,case when max(product.LoadDate)  = (select max(subproduct.LoadDate) from dbo.Product subproduct where subproduct.productid = product.productid) then '12/31/9999' else max(product.loadDate) end as EndDate
from dbo.Product product
group by Productid, ProductTitle, ProductCategory

Solution

"if I have 10 columns, not sure if Group By on 10 columns is optimal"

It is true that GROUP BY on so many columns is suboptimal, but given your data and requirement I do not see another way.

A window function with PARTITION BY is unlikely to do better than GROUP BY here.

As I understand it, grouping on those columns is necessary to get the correct output, so if you use a window function you have to put the same columns in its PARTITION BY clause too.

Partitioning on that many columns is therefore no cheaper, and you would also need a few more SELECTs, as in the example above.

So what you are already doing is nearly correct, except for the subquery part.

Try this:

SELECT product.productid, 
       product.producttitle, 
       product.productcategory, 
       Min(product.loaddate) AS BeginDate, 
       CASE 
         WHEN Max(product.loaddate) = Max(oa.enddate1) THEN '12/31/9999' 
         ELSE Max(product.loaddate) 
       END                   AS EndDate 
FROM   dbo.product product 
       CROSS apply(SELECT Max(subproduct.loaddate) EndDate1 
                   FROM   dbo.product subproduct 
                   WHERE  subproduct.productid = product.productid)oa 
GROUP  BY productid, 
          producttitle, 
          productcategory 

If your main query is still very slow because of the subquery or my CROSS APPLY, could you split the query into two steps?
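
One way to read the two-step idea, as a sketch only: first compute each product's latest load date once, then join that small result into the grouped query instead of running a correlated lookup. The example uses Python's built-in sqlite3 module as a stand-in for SQL Server so it can be run anywhere. The helper name scd_two_step, the temp table LastLoad, and the ISO date strings are assumptions for illustration. In T-SQL the first step would more likely fill a #temp table.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings so that
# MIN/MAX on TEXT columns sort chronologically in SQLite.
ROWS = [
    (1, "Table", "ABCD", "2018-03-04"),
    (1, "Table", "ABCD", "2018-03-05"),
    (1, "Table", "ABCD", "2018-03-05"),
    (1, "Table", "ABCD", "2018-03-06"),
    (1, "Table", "XYZ", "2018-03-07"),
    (1, "Table", "XYZ", "2018-03-08"),
    (1, "Table", "XYZ", "2018-03-08"),
    (1, "Table", "XYZ", "2018-03-09"),
    (1, "Table-Dinner", "GHI", "2018-03-10"),
    (1, "Table-Dinner", "GHI", "2018-03-11"),
]


def scd_two_step(rows):
    """Return (ProductId, Title, Category, BeginDate, EndDate) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Product (ProductId INT, ProductTitle TEXT, "
        "ProductCategory TEXT, LoadDate TEXT)"
    )
    con.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", rows)

    # Step 1: one row per product holding its latest load date.
    con.execute(
        """
        CREATE TEMP TABLE LastLoad AS
        SELECT ProductId, MAX(LoadDate) AS LastLoadDate
        FROM Product
        GROUP BY ProductId
        """
    )

    # Step 2: the same GROUP BY as before, joined to the precomputed dates.
    result = con.execute(
        """
        SELECT p.ProductId, p.ProductTitle, p.ProductCategory,
               MIN(p.LoadDate) AS BeginDate,
               CASE WHEN MAX(p.LoadDate) = l.LastLoadDate
                    THEN '9999-12-31'
                    ELSE MAX(p.LoadDate)
               END AS EndDate
        FROM Product AS p
        JOIN LastLoad AS l ON l.ProductId = p.ProductId
        GROUP BY p.ProductId, p.ProductTitle, p.ProductCategory, l.LastLoadDate
        ORDER BY p.ProductId, BeginDate
        """
    ).fetchall()
    con.close()
    return result
```

Calling scd_two_step(ROWS) gives one row per attribute combination, with the most recent one ending on 9999-12-31. Whether this is actually faster than the CROSS APPLY depends on your indexes and data volume, so compare the execution plans.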

I think you should say more about your requirement.

How many rows will be updated at a time?

If the selected rows are inserted or updated in a new table, those rows should not be selected again the next time an insert happens.

How are you handling this?

If I have misunderstood your requirement, let me know so that I can correct my answer.

OTHER TIPS

The problem here is the way you're loading data. With a Type 2 SCD (Effective Date) you want to add a new row only when there is a change to the data. The first four rows in your dataset do not change except for the load date.

You need to ETL your data from the source files into your database where you can more easily identify if records have been changed and only add new rows for the changed records. Begin Date would be the date that row is first loaded and End Date would be when one of the values changes marking the original row as no longer current. For the current row, End Date is NULL.

The query then becomes much simpler:

-- assumes dbo.Product now holds one row per version, with BeginDate and EndDate columns
select
    product.Productid
    ,product.ProductTitle
    ,product.ProductCategory
    ,product.BeginDate
    ,CASE
        WHEN product.EndDate IS NULL THEN '12/31/9999'
        ELSE product.EndDate
     END AS EndDate
from dbo.Product product

The end result is that your table would have only 3 rows rather than 10. If you're importing directly using T-SQL scripts, look at using MERGE to UPDATE/INSERT as appropriate. If you're using an ETL tool like SSIS, it has built-in support for handling SCDs. There are lots of ways to solve this, but the key is importing your data in a better way.
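
To illustrate the load-time approach, here is a minimal sketch in Python, with the built-in sqlite3 module standing in for SQL Server. The table name ProductDim, the functions make_dim, load_snapshot_row and read_dim, and the ISO date strings are all hypothetical. In T-SQL the same compare, close, and insert logic could be written with MERGE or with an UPDATE followed by an INSERT.

```python
import sqlite3


def make_dim():
    """Create an empty, hypothetical ProductDim table in memory."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE ProductDim (ProductId INT, ProductTitle TEXT, "
        "ProductCategory TEXT, BeginDate TEXT, EndDate TEXT)"
    )
    return con


def load_snapshot_row(con, product_id, title, category, load_date):
    """Apply one snapshot row; rows must arrive in LoadDate order."""
    current = con.execute(
        "SELECT ProductTitle, ProductCategory FROM ProductDim "
        "WHERE ProductId = ? AND EndDate IS NULL",
        (product_id,),
    ).fetchone()
    if current == (title, category):
        return  # only the load date differs, so no new version is needed
    if current is not None:
        # Close the current version on the date the change was observed.
        con.execute(
            "UPDATE ProductDim SET EndDate = ? "
            "WHERE ProductId = ? AND EndDate IS NULL",
            (load_date, product_id),
        )
    # Open a new current version; a NULL EndDate marks it as current.
    con.execute(
        "INSERT INTO ProductDim VALUES (?, ?, ?, ?, NULL)",
        (product_id, title, category, load_date),
    )


def read_dim(con):
    """Read the dimension, showing 9999-12-31 for the current version."""
    return con.execute(
        "SELECT ProductId, ProductTitle, ProductCategory, BeginDate, "
        "COALESCE(EndDate, '9999-12-31') AS EndDate "
        "FROM ProductDim ORDER BY ProductId, BeginDate"
    ).fetchall()


# The ten sample rows from the question, in load order.
SNAPSHOTS = [
    (1, "Table", "ABCD", "2018-03-04"),
    (1, "Table", "ABCD", "2018-03-05"),
    (1, "Table", "ABCD", "2018-03-05"),
    (1, "Table", "ABCD", "2018-03-06"),
    (1, "Table", "XYZ", "2018-03-07"),
    (1, "Table", "XYZ", "2018-03-08"),
    (1, "Table", "XYZ", "2018-03-08"),
    (1, "Table", "XYZ", "2018-03-09"),
    (1, "Table-Dinner", "GHI", "2018-03-10"),
    (1, "Table-Dinner", "GHI", "2018-03-11"),
]

con = make_dim()
for row in SNAPSHOTS:
    load_snapshot_row(con, *row)
dim_rows = read_dim(con)
```

With the sample data, dim_rows holds three versions. Note that in this sketch EndDate is the date the change was seen (for example 2018-03-07 for ABCD), not the last date the old values were loaded. If you need the latter, keep track of the previous load date as well.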

See if this gives you what you want. It seems to work against your test data.

--demo setup 
drop table if exists dbo.product
go
create table dbo.Product
(
    ProductId int,
    ProductTitle varchar(55),
    ProductCategory varchar(255),
    Loaddate datetime
)

insert into dbo.Product
values 
 (1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')

Using a common table expression and lag against the other columns you are interested in, I populate a new column called ischange:

;WITH BaseDataAndIsChanged
AS (
    SELECT *
        ,CASE 
            WHEN (
                    lag(ProductTitle, 1, '') OVER (
                        PARTITION BY ProductId ORDER BY LoadDate
                        )
                    ) = ProductTitle
                AND (
                    lag(ProductCategory, 1, '') OVER (
                        PARTITION BY ProductId ORDER BY LoadDate
                        )
                    ) = ProductCategory
                THEN 0
            ELSE 1
            END ischange
    FROM dbo.product
    )
select * from BaseDataAndIsChanged

| ProductId | ProductTitle | ProductCategory | Loaddate                | ischange |
|-----------|--------------|-----------------|-------------------------|----------|
| 1         | Table        | ABCD            | 2018-03-04 00:00:00.000 | 1        |
| 1         | Table        | ABCD            | 2018-03-05 00:00:00.000 | 0        |
| 1         | Table        | ABCD            | 2018-03-05 00:00:00.000 | 0        |
| 1         | Table        | ABCD            | 2018-03-06 00:00:00.000 | 0        |
| 1         | Table        | XYZ             | 2018-03-07 00:00:00.000 | 1        |
| 1         | Table        | XYZ             | 2018-03-08 00:00:00.000 | 0        |
| 1         | Table        | XYZ             | 2018-03-08 00:00:00.000 | 0        |
| 1         | Table        | XYZ             | 2018-03-09 00:00:00.000 | 0        |
| 1         | Table-Dinner | GHI             | 2018-03-10 00:00:00.000 | 1        |
| 1         | Table-Dinner | GHI             | 2018-03-11 00:00:00.000 | 0        |

I then use another common table expression to generate two additional columns (BeginDate and EndDate).

;WITH BaseDataAndIsChanged
AS (
    SELECT *
        ,CASE 
            WHEN (
                    lag(ProductTitle, 1, '') OVER (
                        PARTITION BY ProductId ORDER BY LoadDate
                        )
                    ) = ProductTitle
                AND (
                    lag(ProductCategory, 1, '') OVER (
                        PARTITION BY ProductId ORDER BY LoadDate
                        )
                    ) = ProductCategory
                THEN 0
            ELSE 1
            END ischange
    FROM dbo.product
    )
--select * from BaseDataAndIsChanged
    ,BaseDataAndBeginEndDates
AS (
    SELECT *
        ,CASE 
            WHEN ischange = 1
                THEN Loaddate
            END AS BeginDate
        ,CASE 
            WHEN lead(ischange, 1, 0) OVER (
                    PARTITION BY ProductId ORDER BY LoadDate
                    ) = 1
                THEN Loaddate
            END AS EndDate
    FROM BaseDataAndIsChanged
    )
select * from BaseDataAndBeginEndDates

| ProductId | ProductTitle | ProductCategory | Loaddate                | ischange | BeginDate               | EndDate                 |
|-----------|--------------|-----------------|-------------------------|----------|-------------------------|-------------------------|
| 1         | Table        | ABCD            | 2018-03-04 00:00:00.000 | 1        | 2018-03-04 00:00:00.000 | NULL                    |
| 1         | Table        | ABCD            | 2018-03-05 00:00:00.000 | 0        | NULL                    | NULL                    |
| 1         | Table        | ABCD            | 2018-03-05 00:00:00.000 | 0        | NULL                    | NULL                    |
| 1         | Table        | ABCD            | 2018-03-06 00:00:00.000 | 0        | NULL                    | 2018-03-06 00:00:00.000 |
| 1         | Table        | XYZ             | 2018-03-07 00:00:00.000 | 1        | 2018-03-07 00:00:00.000 | NULL                    |
| 1         | Table        | XYZ             | 2018-03-08 00:00:00.000 | 0        | NULL                    | NULL                    |
| 1         | Table        | XYZ             | 2018-03-08 00:00:00.000 | 0        | NULL                    | NULL                    |
| 1         | Table        | XYZ             | 2018-03-09 00:00:00.000 | 0        | NULL                    | 2018-03-09 00:00:00.000 |
| 1         | Table-Dinner | GHI             | 2018-03-10 00:00:00.000 | 1        | 2018-03-10 00:00:00.000 | NULL                    |
| 1         | Table-Dinner | GHI             | 2018-03-11 00:00:00.000 | 0        | NULL                    | NULL                    |

Here's the complete solution, where we pull everything together and find the min(BeginDate) and max(EndDate), grouped by ProductId and the other columns you're interested in:

--demo setup 
drop table if exists dbo.product
go
create table dbo.Product
(
    ProductId int,
    ProductTitle varchar(55),
    ProductCategory varchar(255),
    Loaddate datetime
)

insert into dbo.Product
values 
 (1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')

;WITH BaseDataAndIsChanged
AS (
    SELECT *
        ,CASE 
            WHEN (
                    lag(ProductTitle, 1, '') OVER (
                        PARTITION BY ProductId ORDER BY LoadDate
                        )
                    ) = ProductTitle
                AND (
                    lag(ProductCategory, 1, '') OVER (
                        PARTITION BY ProductId ORDER BY LoadDate
                        )
                    ) = ProductCategory
                THEN 0
            ELSE 1
            END ischange
    FROM dbo.product
    )
--select * from BaseDataAndIsChanged
    ,BaseDataAndBeginEndDates
AS (
    SELECT *
        ,CASE 
            WHEN ischange = 1
                THEN Loaddate
            END AS BeginDate
        ,CASE 
            WHEN lead(ischange, 1, 0) OVER (
                    PARTITION BY ProductId ORDER BY LoadDate
                    ) = 1
                THEN Loaddate
            END AS EndDate
    FROM BaseDataAndIsChanged
    )
--select * from BaseDataAndBeginEndDates
SELECT productid
    ,ProductTitle
    ,ProductCategory
    ,min(begindate) AS BeginDate
    ,isnull(max(EndDate), '9999-12-31') AS EndDate
FROM BaseDataAndBeginEndDates
GROUP BY productid
    ,ProductTitle
    ,ProductCategory

| productid | ProductTitle | ProductCategory | BeginDate               | EndDate                 |
|-----------|--------------|-----------------|-------------------------|-------------------------|
| 1         | Table        | ABCD            | 2018-03-04 00:00:00.000 | 2018-03-06 00:00:00.000 |
| 1         | Table        | XYZ             | 2018-03-07 00:00:00.000 | 2018-03-09 00:00:00.000 |
| 1         | Table-Dinner | GHI             | 2018-03-10 00:00:00.000 | 9999-12-31 00:00:00.000 |
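
One caveat worth checking against your data: because the final step groups by the attribute values, two separate periods with identical values (ABCD, then XYZ, then ABCD again) fall into the same group and come out as one row. A possible variation, sketched here with Python's built-in sqlite3 module as a stand-in for SQL Server, numbers each run with a running sum of ischange and groups by that number instead. The function name scd_islands, the column grp, and the CTE names are assumptions for illustration. The window functions used (LAG, LEAD, SUM ... OVER) also exist in SQL Server 2016, but I have only sketched this in SQLite syntax.

```python
import sqlite3


def scd_islands(rows):
    """Return one (ProductId, Title, Category, BeginDate, EndDate) per run."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Product (ProductId INT, ProductTitle TEXT, "
        "ProductCategory TEXT, LoadDate TEXT)"
    )
    con.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", rows)
    result = con.execute(
        """
        WITH Flagged AS (
            -- 1 when any tracked attribute differs from the previous row
            SELECT *,
                   CASE WHEN LAG(ProductTitle, 1, '') OVER (
                                 PARTITION BY ProductId ORDER BY LoadDate
                             ) = ProductTitle
                         AND LAG(ProductCategory, 1, '') OVER (
                                 PARTITION BY ProductId ORDER BY LoadDate
                             ) = ProductCategory
                        THEN 0 ELSE 1 END AS ischange
            FROM Product
        ),
        Numbered AS (
            -- running sum of the flags gives each run its own number
            SELECT *,
                   SUM(ischange) OVER (
                       PARTITION BY ProductId ORDER BY LoadDate
                       ROWS UNBOUNDED PRECEDING
                   ) AS grp
            FROM Flagged
        ),
        Islands AS (
            SELECT ProductId, ProductTitle, ProductCategory, grp,
                   MIN(LoadDate) AS BeginDate,
                   MAX(LoadDate) AS LastSeen
            FROM Numbered
            GROUP BY ProductId, grp, ProductTitle, ProductCategory
        )
        SELECT ProductId, ProductTitle, ProductCategory, BeginDate,
               -- the last run of each product is the current version
               CASE WHEN LEAD(grp) OVER (
                             PARTITION BY ProductId ORDER BY grp
                         ) IS NULL
                    THEN '9999-12-31' ELSE LastSeen END AS EndDate
        FROM Islands
        ORDER BY ProductId, BeginDate
        """
    ).fetchall()
    con.close()
    return result


# Hypothetical data in which the category reverts to an earlier value.
REVERTING = [
    (1, "Table", "ABCD", "2018-03-04"),
    (1, "Table", "ABCD", "2018-03-05"),
    (1, "Table", "XYZ", "2018-03-06"),
    (1, "Table", "XYZ", "2018-03-07"),
    (1, "Table", "ABCD", "2018-03-08"),
    (1, "Table", "ABCD", "2018-03-09"),
]
```

For REVERTING this returns three rows, with the second ABCD period as the current one. The grouping key is ProductId plus grp, so the attribute columns in the GROUP BY are only there to carry the values through.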

I'd also be curious to know if a simple cursor solution gives you the performance you want. Try this example against your dataset.

--demo setup 
drop table if exists dbo.product
go
create table dbo.Product
(
    ProductId int,
    ProductTitle varchar(55),
    ProductCategory varchar(255),
    Loaddate datetime
)

insert into dbo.Product
values 
 (1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')

----------------

DECLARE @ProductId INT
DECLARE @ProductTitle VARCHAR(55)
DECLARE @ProductCategory VARCHAR(255)
DECLARE @LoadDate DATETIME
DECLARE @PrevProductId INT
DECLARE @PrevProductTitle VARCHAR(55)
DECLARE @PrevProductCategory VARCHAR(255)
DECLARE @PrevLoadDate DATETIME
DECLARE @BeginDate DATETIME
DECLARE @EndDate DATETIME
DECLARE @ResultTable TABLE (
    ProductId INT
    ,ProductTitle VARCHAR(55)
    ,ProductCategory VARCHAR(255)
    ,BeginDate DATETIME
    ,EndDate DATETIME
    )

DECLARE _CURSOR CURSOR LOCAL FORWARD_ONLY STATIC READ_ONLY
FOR
SELECT ProductId
    ,ProductTitle
    ,ProductCategory
    ,LoadDate
FROM dbo.Product
ORDER BY Productid
    ,Loaddate

OPEN _CURSOR

FETCH NEXT
FROM _CURSOR
INTO @ProductId
    ,@ProductTitle
    ,@ProductCategory
    ,@LoadDate

SET @PrevProductId = @ProductId
SET @PrevProductTitle = @ProductTitle
SET @PrevProductCategory = @ProductCategory
SET @BeginDate = @LoadDate
SET @PrevLoadDate = @LoadDate

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @ProductId <> @PrevProductId
        OR @ProductTitle <> @PrevProductTitle
        OR @ProductCategory <> @PrevProductCategory
    BEGIN
        IF @ProductId <> @PrevProductId
            SET @EndDate = '9999-12-31'
        ELSE
            SET @EndDate = @PrevLoadDate

        INSERT INTO @ResultTable (
            ProductId
            ,ProductTitle
            ,ProductCategory
            ,BeginDate
            ,EndDate
            )
        VALUES (
            @PrevProductId
            ,@PrevProductTitle
            ,@PrevProductCategory
            ,@BeginDate
            ,@EndDate
            )

        SET @BeginDate = @LoadDate
    END

    SET @PrevProductId = @ProductId
    SET @PrevProductTitle = @ProductTitle
    SET @PrevProductCategory = @ProductCategory
    SET @PrevLoadDate = @LoadDate

    FETCH NEXT
    FROM _CURSOR
    INTO @ProductId
        ,@ProductTitle
        ,@ProductCategory
        ,@LoadDate
END --End While

SET @EndDate = '9999-12-31'

INSERT INTO @ResultTable (
    ProductId
    ,ProductTitle
    ,ProductCategory
    ,BeginDate
    ,EndDate
    )
VALUES (
    @PrevProductId
    ,@PrevProductTitle
    ,@PrevProductCategory
    ,@BeginDate
    ,@EndDate
    )

CLOSE _CURSOR

DEALLOCATE _CURSOR

SELECT *
FROM @ResultTable
Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange