SQL Server 2008 Optimize FULL JOIN with ISNULL statements

https://stackoverflow.com/questions/3812595

26-09-2019
|

Question

HI All

I was hoping someone could help me improve a query I have to run periodically. At the moment it takes more than 40 minutes to execute. It uses the full allocated memory during this time, but CPU usage mostly meanders at 2% - 5%, every now and then jumping to 40% for a few seconds.

I have this table (simplified example):

    CREATE TABLE [dbo].[dataTable]
    (
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [dteEffectiveDate] [date] NULL,
    [dtePrevious] [date] NULL,
    [dteNext] [date] NULL,
    [Age] [int] NULL,
    [Count] [int] NULL
    ) ON [PRIMARY]

    GO

Here are some input values:

INSERT INTO [YourDB].[dbo].[dataTable]
           ([dteEffectiveDate]
           ,[dtePrevious]
           ,[dteNext]
           ,[Age]
           ,[Count])
     VALUES
('2009-01-01',NULL,'2010-01-01',40,300),
('2010-01-01','2009-01-01', NULL,40,200),
('2009-01-01',NULL, '2010-01-01',20,100),
('2010-01-01','2009-01-01', NULL,20,50),
('2009-01-01',NULL,'2010-01-01',30,10)
GO

Each entry has a dteEffectiveDate field. In addition, each has a dtePrevious and dteNext, which reflects the dates of the nearest previous/next effective date. Now what I want is a query that will calculate the mid-value on the Count fields between successive periods, within a specific age.

So for example, in the data above, for age 40 we have 300 at 2009/01/01 and 200 at 2010/01/01 so the query should produce 250.

Note that age 30 has only one entry, 10. This is at 2009/01/01. There is no entry at 2010/01/01, but we know that data was captured at this point, so the fact that there is nothing means that 30 is 0 at this date. Hence the query should produce 5.

In order to achieve this I use a FULL JOIN of the table on itself, and use ISNULL to select values. Here is my code:

SELECT

    ISNULL(T1.dteEffectiveDate,T2.dtePrevious) as [Start Date]
    ,ISNULL(T1.dteNext,T2.dteEffectiveDate)  as [End Date]
    ,ISNULL(T1.Age,T2.Age) as Age 
    ,ISNULL(T1.[Count],0) as [Count Start]
    ,ISNULL(T2.[Count],0)   as [Count End]
    ,(ISNULL(T1.[Count],0)+ISNULL(T2.[Count],0))/2 as [Mid Count]

    FROM
    [ExpDBClient].[dbo].[dataTable] as T1
    FULL JOIN [ExpDBClient].[dbo].[dataTable] as T2

    ON 
    T2.dteEffectiveDate = T1.dteNext
    AND T2.Age = T1.Age

    WHERE ISNULL(T1.dteEffectiveDate,T2.dtePrevious) is not null
    AND ISNULL(T1.dteNext,T2.dteEffectiveDate) is not null

GO

which outputs:

Start Date  End Date    Age Count Start Count End   Mid Lives
2009-01-01  2010-01-01  40  300         200         250
2009-01-01  2010-01-01  20  100         50          75
2009-01-01  2010-01-01  30  10          0           5

It works perfectly, but when I run this on the actual data, which is about 7m records, it takes painfully long to execute.

Does anyone have any suggestions?

Thanks
Karl

Solution

It's hard to make a lot of recommendations.

One thing I'd would definitely recommend is indices on those columns that you use as foreign keys in your JOIN conditions, e.g.

Age
dteEffectiveDate
dteNext

Create a NONCLUSTERED index on each of those columns separately and measure again. With just a few data rows, there's no improvement measurable - but with millions of rows, it might make a difference.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow