SQL Server 2008 filling gaps with dimension

https://stackoverflow.com/questions/16692819

30-05-2022

Question

I have a data table as below

#data
---------------
Account AccountType
---------------
1       2
2       0
3       5
4       2
5       1
6       5

AccountType 2 marks headers and 5 marks totals. An account of type 2 has to look ahead to the next account of type 1 or 0 to determine whether its Dim value is 1 or 0. A total of type 5 has to look back to the nearest preceding account of type 1 or 0 to determine its Dim value. Accounts of type 1 or 0 use their own type as Dim.

Accounts of type 2 appear as islands, so it is not enough to just check RowNumber + 1, and the same goes for accounts of type 5.
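
As an illustration of the rules above, here is a minimal sketch that computes Dim for the six sample rows by seeking the nearest type 0 or 1 account in each direction, so runs of consecutive headers or totals are handled too. The table variable @data and the aliases nxt and prv are assumptions made for this example, and the query is written for readability, not for speed on large tables.

```sql
-- Sample rows from the question, held in a table variable so the sketch is self-contained
DECLARE @data TABLE (Account INT PRIMARY KEY, AccountType INT);
INSERT INTO @data (Account, AccountType)
VALUES (1, 2), (2, 0), (3, 5), (4, 2), (5, 1), (6, 5);

SELECT d.Account,
       d.AccountType,
       CASE
           WHEN d.AccountType IN (0, 1) THEN d.AccountType  -- own type is the Dim
           WHEN d.AccountType = 2 THEN nxt.AccountType      -- header: next 0/1 account
           WHEN d.AccountType = 5 THEN prv.AccountType      -- total: previous 0/1 account
       END AS Dim
FROM @data AS d
OUTER APPLY (SELECT TOP (1) n.AccountType
             FROM @data AS n
             WHERE n.Account > d.Account AND n.AccountType IN (0, 1)
             ORDER BY n.Account ASC) AS nxt
OUTER APPLY (SELECT TOP (1) p.AccountType
             FROM @data AS p
             WHERE p.Account < d.Account AND p.AccountType IN (0, 1)
             ORDER BY p.Account DESC) AS prv
ORDER BY d.Account;
```

For the sample data this yields Dim 0 for accounts 1 to 3 and Dim 1 for accounts 4 to 6.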

I have arrived at the following table using CTEs, but I can't find a quick way to get from here to my final result of Account, AccountType, Dim for all accounts.

T3
-------------------
StartRow  EndRow AccountType  Dim
-------------------
1           1         2        0
2           2         0        0
3           3         5        0
4           4         2        1
5           5         1        1
6           6         5        1

The code below is T-SQL; copy and paste all of it to see it run. The final join on the CTE select statement is extremely slow: even for 500 rows it takes 30 seconds, and I have 100,000 rows to handle. I have written a cursor-based solution that does it in 10-20 seconds, which is workable, and a fast recursive CTE solution that does it in 5 seconds for 100,000 rows, but it depends on the fragmentation of the #data table. I should add that this is simplified; the real problem has a lot more dimensions that need to be taken into account, but it works the same way as this simple problem.

Is there a fast way to do this using joins or another set-based solution?

SET NOCOUNT ON

IF OBJECT_ID('tempdb..#data') IS NOT NULL
    DROP TABLE #data

CREATE TABLE #data
(
Account INTEGER IDENTITY(1,1),
AccountType INTEGER,
)

BEGIN -- TEST DATA
DECLARE @Counter INTEGER = 0
DECLARE @MaxDataRows INTEGER = 50 -- Change here to check performance
DECLARE @Type INTEGER
    WHILE(@Counter < @MaxDataRows)
    BEGIN
    SET @Type = CASE 
        WHEN @Counter % 10 < 3 THEN 2 
        WHEN @Counter % 10 >= 8 THEN 5 
        WHEN @Counter % 10 >= 3 THEN (CASE WHEN @Counter < @MaxDataRows / 2.0 THEN 0 ELSE 1 END )
        ELSE 0 
        END
    INSERT INTO #data VALUES(@Type)
    SET @Counter = @Counter + 1
    END
END -- TEST DATA END



;WITH groupIds_cte AS
(
    SELECT *,
    ROW_NUMBER() OVER (PARTITION BY AccountType ORDER BY Account) - Account AS GroupId  
    FROM #data
),

islandRanges_cte AS
(
SELECT
    MIN(Account) AS StartRow,
    MAX(Account) AS EndRow,
    AccountType
FROM groupIds_cte
GROUP BY GroupId,AccountType
),

T3 AS
(
SELECT I.*, J.AccountType AS Dim
FROM islandRanges_cte I
INNER JOIN islandRanges_cte J
ON (I.EndRow + 1 = J.StartRow AND I.AccountType = 2)
UNION ALL
SELECT I.*, J.AccountType AS Dim
FROM islandRanges_cte I
INNER JOIN islandRanges_cte J
ON (I.StartRow - 1 = J.EndRow AND I.AccountType = 5)
UNION ALL
SELECT *, AccountType AS Dim
FROM islandRanges_cte
WHERE AccountType = 0 OR AccountType = 1
),

T4 AS 
(
SELECT Account, Dim
    FROM (
    SELECT FlattenRow AS Account, StartRow, EndRow, Dim
    FROM T3 I   
    CROSS APPLY (VALUES(StartRow),(EndRow)) newValues (FlattenRow)
    ) T
)

--SELECT * FROM T3 ORDER BY StartRow
--SELECT * FROM T4 ORDER BY Account

-- Final correct result but very very slow
SELECT D.Account, D.AccountType, I.Dim FROM T3 I
INNER JOIN #data D
ON D.Account BETWEEN I.StartRow AND I.EndRow
ORDER BY Account
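
The grouping step in groupIds_cte relies on the fact that, within a run of consecutive accounts of the same type, the row number and the account number both grow by one, so their difference stays constant. A minimal sketch of just that step follows; the nine sample rows and the names @data and grouped are made up for the illustration so that some islands are longer than one row.

```sql
-- Hypothetical sample with multi-row islands (not the question's data)
DECLARE @data TABLE (Account INT PRIMARY KEY, AccountType INT);
INSERT INTO @data (Account, AccountType)
VALUES (1, 2), (2, 2), (3, 0), (4, 0), (5, 5), (6, 2), (7, 1), (8, 5), (9, 5);

WITH grouped AS
(
    -- The difference is constant inside each run of consecutive accounts of one type
    SELECT Account,
           AccountType,
           ROW_NUMBER() OVER (PARTITION BY AccountType ORDER BY Account) - Account AS GroupId
    FROM @data
)
SELECT MIN(Account) AS StartRow,
       MAX(Account) AS EndRow,
       AccountType
FROM grouped
GROUP BY AccountType, GroupId
ORDER BY StartRow;
```

With these rows the query returns the islands 1-2 (type 2), 3-4 (type 0), 5-5 (type 5), 6-6 (type 2), 7-7 (type 1) and 8-9 (type 5).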

EDIT with some time testing

SET NOCOUNT ON

IF OBJECT_ID('tempdb..#times') IS NULL
CREATE TABLE #times
(
RecId INTEGER IDENTITY(1,1),
Batch INTEGER,
Method NVARCHAR(255),
MethodDescription NVARCHAR(255),
RunTime INTEGER
)

IF OBJECT_ID('tempdb..#batch') IS NULL
CREATE TABLE #batch 
(
Batch INTEGER IDENTITY(1,1),
Bit BIT
)

INSERT INTO #batch VALUES(0)

IF OBJECT_ID('tempdb..#data') IS NOT NULL
    DROP TABLE #data

CREATE TABLE #data
(
Account INTEGER
)

CREATE NONCLUSTERED INDEX data_account_index ON #data (Account)

IF OBJECT_ID('tempdb..#islands') IS NOT NULL
    DROP TABLE #islands

CREATE TABLE #islands
(
AccountFrom INTEGER ,
AccountTo INTEGER,
Dim INTEGER,
)

CREATE NONCLUSTERED INDEX islands_from_index ON #islands (AccountFrom, AccountTo, Dim)

BEGIN -- TEST DATA
    INSERT INTO #data
    SELECT TOP 100000 ROW_NUMBER() OVER(ORDER BY t1.number) AS N
    FROM master..spt_values t1 
    CROSS JOIN master..spt_values t2

    INSERT INTO #islands
    SELECT MIN(Account) AS Start, MAX(Account), Grp
    FROM (SELECT *, NTILE(10) OVER (ORDER BY Account) AS Grp FROM #data) T
    GROUP BY Grp ORDER BY Start
END -- TEST DATA END

--SELECT * FROM #data
--SELECT * FROM #islands

--PRINT CONVERT(varchar(20),DATEDIFF(MS,@RunDate,GETDATE()))+' ms Sub Query'
DECLARE @RunDate datetime
SET @RunDate=GETDATE()

SELECT Account, (SELECT Dim From #islands WHERE Account BETWEEN AccountFrom AND AccountTo) AS Dim
FROM #data

INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'subquery','',DATEDIFF(MS,@RunDate,GETDATE()))
SET @RunDate=GETDATE()

SELECT D.Account, V.Dim
FROM #data D
CROSS APPLY
(
SELECT Dim From #islands V
WHERE D.Account BETWEEN V.AccountFrom AND V.AccountTo
) V

INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'crossapply','',DATEDIFF(MS,@RunDate,GETDATE()))
SET @RunDate=GETDATE()

SELECT D.Account, I.Dim 
FROM #data D
JOIN #islands I
ON D.Account BETWEEN I.AccountFrom AND I.AccountTo

INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'join','',DATEDIFF(MS,@RunDate,GETDATE()))
SET @RunDate=GETDATE()

;WITH cte AS
(
SELECT Account, AccountFrom, AccountTo, Dim, 1 AS Counting
FROM #islands
CROSS APPLY (VALUES(AccountFrom),(AccountTo)) V (Account)
UNION ALL
SELECT Account + 1 ,AccountFrom, AccountTo, Dim, Counting + 1
FROM cte
WHERE (Account + 1) > AccountFrom AND (Account + 1) < AccountTo
)
SELECT Account, Dim, Counting FROM cte OPTION(MAXRECURSION 32767)

INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'recursivecte','',DATEDIFF(MS,@RunDate,GETDATE()))

You can select from the #times table to see the run times :)

Solution

I think you want a join, but using an inequality rather than an equality:

select tt.id, tt.dim1, it.dim2
from TallyTable tt join
     IslandsTable it
     on tt.id between it."from" and it."to"

This works for the data that you provide in the question.
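
To make the range join concrete, here is a small self-contained sketch. The table variables @accounts and @islands are hypothetical stand-ins for TallyTable and IslandsTable, using the column names from the question's #islands table, and the two ranges are invented for the example.

```sql
-- Hypothetical island ranges, each carrying one Dim value
DECLARE @islands TABLE
(
    AccountFrom INT,
    AccountTo   INT,
    Dim         INT,
    PRIMARY KEY (AccountFrom, AccountTo)
);
INSERT INTO @islands (AccountFrom, AccountTo, Dim)
VALUES (1, 3, 0), (4, 6, 1);

-- One row per account that needs a Dim
DECLARE @accounts TABLE (Account INT PRIMARY KEY);
INSERT INTO @accounts (Account)
VALUES (1), (2), (3), (4), (5), (6);

-- Each account picks up the Dim of the range that contains it
SELECT a.Account, i.Dim
FROM @accounts AS a
JOIN @islands AS i
    ON a.Account BETWEEN i.AccountFrom AND i.AccountTo
ORDER BY a.Account;
```

Because the ranges do not overlap, every account matches exactly one island: accounts 1 to 3 get Dim 0 and accounts 4 to 6 get Dim 1.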

Here is another idea that might work:

select d.*,
       (select top 1 d2.AccountType
        from #data d2
        where d2.Account > d.Account and d2.AccountType not in (2, 5)
        order by d2.Account  -- without this, TOP 1 may return an arbitrary matching row
       ) nextAccountType
from #data d 
order by d.account;

I just ran this on 50,000 rows and this version took 17 seconds on my system. Changing the table to:

CREATE TABLE #data (
    Account INTEGER IDENTITY(1,1) primary key,
    AccountType INTEGER,
);

has actually slowed it down to about 1:33, quite to my surprise. Perhaps one of these will help you.

Licensed under: CC-BY-SA with attribution
