Most efficient way to generate a diff

https://dba.stackexchange.com/questions/111223

29-09-2020
|

Question

I have a table in SQL server that looks like this:

Id    |Version  |Name    |date    |fieldA   |fieldB ..|fieldZ
1     |1        |Foo     |20120101|23       |       ..|25334123
2     |2        |Foo     |20120101|23       |NULL   ..|NULL
3     |2        |Bar     |20120303|24       |123......|NULL
4     |2        |Bee     |20120303|34       |-34......|NULL

I am working on a stored procedure to diff, that takes input data and a version number. The input data has columns from Name uptil fieldZ. Most of the field columns are expected to be NULL, i.e., each row usually has data for only the first few fields, the rest are NULL. The name, date and version form a unique constraint on the table.

I need to diff the data that is input with respect to this table, for a given version. Each row needs to be diffed - a row is identified by the name, date and version, and any change in any of the values in the field columns will need to show in the diff.

Update: all the fields need not be of type decimal. Some of them may be nvarchars. I would prefer the diff to happen without converting the type, although the diff output could convert everything to nvarchar since it is to be used only for display purposed.

Suppose the input is the following, and the requested version is 2,:

Name    |date    |fieldA   |fieldB|..|fieldZ
Foo     |20120101|25       |NULL  |.. |NULL
Foo     |20120102|26       |27    |.. |NULL
Bar     |20120303|24       |126   |.. |NULL
Baz     |20120101|15       |NULL  |.. |NULL

The diff needs to be in the following format:

name    |date    |field    |oldValue    |newValue
Foo     |20120101|FieldA   |23          |25
Foo     |20120102|FieldA   |NULL        |26
Foo     |20120102|FieldB   |NULL        |27
Bar     |20120303|FieldB   |123         |126
Baz     |20120101|FieldA   |NULL        |15

My solution so far is to first generate a diff, using EXCEPT and UNION. Then convert the diff to the desired output format using a JOIN and CROSS APPLY. Although this seems to be working, I am wondering if there is a cleaner and more efficient way to do this. The number of fields is close to a 100, and each place in the code that has a ... is actually a large number of lines. Both the input table and existing table are expected to be quite large over time. I am new to SQL and am still trying to learn performance tuning.

Here is the SQL for it:

CREATE TABLE #diff
(   [change] [nvarchar](50) NOT NULL,
    [name] [nvarchar](50) NOT NULL,
    [date] [int] NOT NULL,
    [FieldA] [decimal](38, 10) NULL,
    [FieldB] [decimal](38, 10) NULL,
    .....
    [FieldZ] [decimal](38, 10) NULL
)

--Generate the diff in a temporary table
INSERT INTO #diff
SELECT * FROM
(

(
    SELECT
        'old' as change,
        name,
        date,
        FieldA,
        FieldB,
        ...,
        FieldZ
    FROM 
        myTable mt 
    WHERE 
        version = @version
        AND mt.name + '_' + CAST(mt.date AS VARCHAR) IN (SELECT name + '_' + CAST(date AS VARCHAR) FROM @diffInput) 
    EXCEPT
    SELECT 'old' as change,* FROM @diffInput
)
UNION

(
    SELECT 'new' as change, * FROM @diffInput
    EXCEPT
    SELECT
        'new' as change,
        name,
        date,
        FieldA, 
        FieldB,
        ...,
        FieldZ
    FROM 
        myTable mt 
    WHERE 
        version = @version 
        AND mt.name + '_' + CAST(mt.date AS VARCHAR) IN (SELECT name + '_' + CAST(date AS VARCHAR) FROM @diffInput) 
) 
) AS myDiff

SELECT 
d3.name, d3.date, CrossApplied.field, CrossApplied.oldValue, CrossApplied.newValue
FROM
(
    SELECT 
        d2.name, d2.date, 
        d1.FieldA AS oldFieldA, d2.FieldA AS newFieldA, 
        d1.FieldB AS oldFieldB, d2.FieldB AS newFieldB,
        ...
        d1.FieldZ AS oldFieldZ, d2.FieldZ AS newFieldZ,
    FROM #diff AS d1
    RIGHT OUTER JOIN #diff AS d2
    ON 
        d1.name = d2.name
        AND d1.date = d2.date
        AND d1.change = 'old'
    WHERE d2.change = 'new'
) AS d3
CROSS APPLY (VALUES ('FieldA', oldFieldA, newFieldA), 
                ('FieldB', oldFieldB, newFieldB),
                ...
                ('FieldZ', oldFieldZ, newFieldZ))
                CrossApplied (field, oldValue, newValue)
WHERE 
    crossApplied.oldValue != crossApplied.newValue 
    OR (crossApplied.oldValue IS NULL AND crossApplied.newValue IS NOT NULL) 
    OR (crossApplied.oldValue IS NOT NULL AND crossApplied.newValue IS NULL)

Thank you!

Solution

Here is another approach:

SELECT
  di.name,
  di.date,
  x.field,
  x.oldValue,
  x.newValue
FROM
  @diffInput AS di
  LEFT JOIN dbo.myTable AS mt ON
    mt.version = @version
    AND mt.name = di.name
    AND mt.date = di.date
  CROSS APPLY
  (
    SELECT
      'fieldA',
      mt.fieldA,
      di.fieldA
    WHERE
      NOT EXISTS (SELECT mt.fieldA INTERSECT SELECT di.fieldA)

    UNION ALL

    SELECT
      'fieldB',
      mt.fieldB,
      di.fieldB
    WHERE
      NOT EXISTS (SELECT mt.fieldB INTERSECT SELECT di.fieldB)

    UNION ALL

    SELECT
      'fieldC',
      mt.fieldC,
      di.fieldC
    WHERE
      NOT EXISTS (SELECT mt.fieldC INTERSECT SELECT di.fieldC)

    UNION ALL

    ...
  ) AS x (field, oldValue, newValue)
;

This is how it works:

The two tables are joined using an outer join, @diffInput being on the outer side to match your right join.
The result of the join is conditionally unpivoted using CROSS APPLY, where "conditionally" means that each pair of columns is tested individually and returned only if the columns differ.
The pattern of each test condition
```
NOT EXISTS (SELECT oldValue INTERSECT SELECT newValue)
```
is equivalent to your
```
oldValue != newValue
OR (oldValue IS NULL AND newValue IS NOT NULL)
OR (oldValue IS NOT NULL AND newValue IS NULL)
```
only more concise. You can read more about this use of INTERSECT in detail in Paul White's article Undocumented Query Plans: Equality Comparisons.

On a different note, since you are saying,

Both the input table and existing table are expected to be quite large over time

you may want to consider replacing the table variable you are using for the input table with a temporary table. There is a very comprehensive answer by Martin Smith that explores differences between the two:

What's the difference between a temp table and table variable in SQL Server?

In short, certain properties of table variables, like e.g. absence of column statistics, may make them less query optimiser-friendly for your scenario than temporary tables.

OTHER TIPS

Edit regarding fields having different types, not just decimal.

You can try to use sql_variant type. I never used it personally, but it may be a good solution for your case. To try it just replace all [decimal](38, 10) with sql_variant in the SQL script. The query itself remains exactly as it is, no explicit conversion is needed for performing the comparison. The end result would have a column with values of different types in it. Most likely, eventually you would have to know somehow which type is in which field to process the results in your application, but the query itself should work fine without conversions.

By the way, it is a bad idea to store dates as int.

Instead of using EXCEPT and UNION to calculate the diff, I'd use FULL JOIN. For me, personally, it is difficult to follow the logic behind EXCEPT and UNION approach.

I would start with unpivoting the data, rather than doing it last (using CROSS APPLY(VALUES) as you do). You can get rid of unpivoting of the input, if you do it in advance, on the caller side.

You would have to list all 100 columns only in CROSS APPLY(VALUES).

The final query is pretty simple, so temp table is not really needed. I think it is easier to write and maintain than your version. Here is SQL Fiddle.

Set up sample data

DECLARE @TMain TABLE (
    [ID] [int] NOT NULL,
    [Version] [int] NOT NULL,
    [Name] [nvarchar](50) NOT NULL,
    [dt] [date] NOT NULL,
    [FieldA] [decimal](38, 10) NULL,
    [FieldB] [decimal](38, 10) NULL,
    [FieldZ] [decimal](38, 10) NULL
);

INSERT INTO @TMain ([ID],[Version],[Name],[dt],[FieldA],[FieldB],[FieldZ]) VALUES
(1,1,'Foo','20120101',23,23  ,25334123),
(2,2,'Foo','20120101',23,NULL,NULL),
(3,2,'Bar','20120303',24,123 ,NULL),
(4,2,'Bee','20120303',34,-34 ,NULL);

DECLARE @TInput TABLE (
    [Name] [nvarchar](50) NOT NULL,
    [dt] [date] NOT NULL,
    [FieldA] [decimal](38, 10) NULL,
    [FieldB] [decimal](38, 10) NULL,
    [FieldZ] [decimal](38, 10) NULL
);

INSERT INTO @TInput ([Name],[dt],[FieldA],[FieldB],[FieldZ]) VALUES
('Foo','20120101',25,NULL,NULL),
('Foo','20120102',26,27  ,NULL),
('Bar','20120303',24,126 ,NULL),
('Baz','20120101',15,NULL,NULL);

DECLARE @VarVersion int = 2;

Main query

CTE_Main is unpivoted original data filtered to the given Version. CTE_Input is input table, which could be provided already in this format. Main query uses FULL JOIN, which adds to result rows with Bee. I think they should be returned, but if you don't want to see them, you can filter them out by adding AND CTE_Input.FieldValue IS NOT NULL or maybe using LEFT JOIN instead of FULL JOIN, I didn't look into details there, because I think they should be returned.

WITH
CTE_Main
AS
(
    SELECT
        Main.ID
        ,Main.Version
        ,Main.Name
        ,Main.dt
        ,FieldName
        ,FieldValue
    FROM
        @TMain AS Main
        CROSS APPLY
        (
            VALUES
                ('FieldA', Main.FieldA),
                ('FieldB', Main.FieldB),
                ('FieldZ', Main.FieldZ)
        ) AS CA(FieldName, FieldValue)
    WHERE
        Main.Version = @VarVersion
)
,CTE_Input
AS
(
    SELECT
        Input.Name
        ,Input.dt
        ,FieldName
        ,FieldValue
    FROM
        @TInput AS Input
        CROSS APPLY
        (
            VALUES
                ('FieldA', Input.FieldA),
                ('FieldB', Input.FieldB),
                ('FieldZ', Input.FieldZ)
        ) AS CA(FieldName, FieldValue)
)

SELECT
    ISNULL(CTE_Main.Name, CTE_Input.Name) AS FullName
    ,ISNULL(CTE_Main.dt, CTE_Input.dt) AS FullDate
    ,ISNULL(CTE_Main.FieldName, CTE_Input.FieldName) AS FullFieldName
    ,CTE_Main.FieldValue AS OldValue
    ,CTE_Input.FieldValue AS NewValue
FROM
    CTE_Main
    FULL JOIN CTE_Input ON 
        CTE_Input.Name = CTE_Main.Name
        AND CTE_Input.dt = CTE_Main.dt
        AND CTE_Input.FieldName = CTE_Main.FieldName
WHERE
    (CTE_Main.FieldValue <> CTE_Input.FieldValue)
    OR (CTE_Main.FieldValue IS NULL AND CTE_Input.FieldValue IS NOT NULL)
    OR (CTE_Main.FieldValue IS NOT NULL AND CTE_Input.FieldValue IS NULL)
--ORDER BY FullName, FullDate, FullFieldName;

Result

FullName    FullDate    FullFieldName   OldValue        NewValue
Foo         2012-01-01  FieldA          23.0000000000   25.0000000000
Foo         2012-01-02  FieldA          NULL            26.0000000000
Foo         2012-01-02  FieldB          NULL            27.0000000000
Bar         2012-03-03  FieldB          123.0000000000  126.0000000000
Baz         2012-01-01  FieldA          NULL            15.0000000000
Bee         2012-03-03  FieldB          -34.0000000000  NULL
Bee         2012-03-03  FieldA          34.0000000000   NULL

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange