Consolidating a row of data, based on previous rows

https://dba.stackexchange.com/questions/148641

03-10-2020

Question

I'm trying to build a history table out of an audit log (ultimately to build out a type 2 dimension table). Unfortunately, the audit log only records the specific fields being changed. Here's a rough example of what I'm talking about:

CREATE TABLE Staff(
  [ID] int, 
  [Surname] varchar(5), 
  [FirstName] varchar(4), 
  [Office] varchar(9), 
  [Date] varchar(10)
);

INSERT INTO Staff ([ID], [Surname], [FirstName], [Office], [Date])
VALUES
  (001, 'Smith', 'Bill', 'Melbourne', '2015-01-01'),
  (001, NULL, NULL, 'Sydney', '2015-03-01'),
  (002, 'Brown', 'Mary', 'Melbourne', '2014-04-01'),
  (002, 'Jones', NULL, 'Adelaide', '2014-05-01'),
  (002, NULL, NULL, 'Sydney', '2015-01-01'),
  (002, NULL, NULL, 'Perth', '2015-03-01');

The first entry for a particular staff member is for when their record is created, and each subsequent record is an update, but it only shows the fields that were updated (*). I want to "fill out" each update row with the rest of the employee record as it stood at that point, i.e. a result like this:

001, Smith, Bill, Melbourne, 2015-01-01
001, Smith, Bill, Sydney, 2015-03-01
002, Brown, Mary, Melbourne, 2014-04-01
002, Jones, Mary, Adelaide, 2014-05-01
002, Jones, Mary, Sydney, 2015-01-01
002, Jones, Mary, Perth, 2015-03-01

I know I can do this using a while loop or a cursor but I suspect there is probably a more performant option.

(*) A NULL always means "value didn't change" rather than "value changed to NULL".

Solution

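I was able to do this with a recursive CTE, although that is not very different from a cursor, and recursive CTEs tend not to scale well over large volumes. Note that the query relies on a Change column (a per-ID change number starting at 0, as added in the while-loop answer further down) that is not in the question's table. Have a look through the code and see what you think.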

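;WITH cte AS (
-- Anchor: each ID's creation row (Change = 0), tagged with step x = 0
SELECT 0 x, Change, ID, Surname, FirstName, Office, [Date]
FROM dbo.Staff
WHERE Change = 0

UNION ALL

-- Recursive step: overlay change number x on the state built so far,
-- keeping the previous value wherever the change row holds NULL
SELECT x + 1, s.Change, c.ID, ISNULL( s.Surname, c.Surname ) , ISNULL( s.FirstName, c.FirstName ), ISNULL( s.Office, c.Office ), s.[Date]
FROM cte c
    INNER JOIN dbo.Staff s ON c.ID = s.ID
WHERE s.Change = c.x
)
-- The x = 0 rows are re-emitted at x = 1, so skip them to avoid duplicates
SELECT Change, ID, Surname, FirstName, Office, [Date]
FROM  cte
WHERE x > 0
ORDER BY ID, x;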
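
A minimal sketch of deriving the Change column that the query above reads, assuming each ID has distinct [Date] values and that numbering should start at 0 to match the anchor's WHERE Change = 0. The #StaffNumbered temporary table is a hypothetical name used only for illustration:

```sql
-- Hypothetical helper: number each ID's rows in date order, starting at 0,
-- so that the creation row for every ID gets Change = 0.
SELECT
    Change = ROW_NUMBER() OVER (PARTITION BY S.ID ORDER BY S.[Date]) - 1,
    S.ID,
    S.Surname,
    S.FirstName,
    S.Office,
    S.[Date]
INTO #StaffNumbered
FROM dbo.Staff AS S;
```

The recursive CTE could then read from #StaffNumbered instead of dbo.Staff. Recursive CTEs stop with an error after 100 recursion levels by default, so an ID with more changes than that would also need an OPTION (MAXRECURSION n) hint on the final SELECT.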

OTHER TIPS

Sample data with the Date column typed as date:

CREATE TABLE dbo.Staff
(
  [ID] integer NOT NULL, 
  [Surname] varchar(5) NULL, 
  [FirstName] varchar(4) NULL, 
  [Office] varchar(9) NULL, 
  [Date] date NOT NULL,

  PRIMARY KEY (ID, [Date])
);

INSERT INTO Staff ([ID], [Surname], [FirstName], [Office], [Date])
VALUES
  (001, 'Smith', 'Bill', 'Melbourne', '2015-01-01'),
  (001, NULL, NULL, 'Sydney', '2015-03-01'),
  (002, 'Brown', 'Mary', 'Melbourne', '2014-04-01'),
  (002, 'Jones', NULL, 'Adelaide', '2014-05-01'),
  (002, NULL, NULL, 'Sydney', '2015-01-01'),
  (002, NULL, NULL, 'Perth', '2015-03-01');

The idea of the following solution is to lag back as many rows from the current row as there are preceding nulls:

SELECT
    G.ID,
    Surname = LAG(G.Surname, G.SurnameLag) OVER (
        PARTITION BY G.ID 
        ORDER BY G.[Date]),
    FirstName = LAG(G.FirstName, G.FirstNameLag) OVER (
        PARTITION BY G.ID 
        ORDER BY G.[Date]),
    Office = LAG(G.Office, G.OfficeLag) OVER (
        PARTITION BY G.ID 
        ORDER BY G.[Date]),
    G.[Date]
FROM 
(
    -- Find the LAG offset per column
    SELECT
        S.ID,
        S.Surname,
        SurnameLag = SUM(IIF(S.Surname IS NULL, 1, 0)) OVER (
            PARTITION BY S.ID
            ORDER BY S.[Date]
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        S.FirstName,
        FirstNameLag = SUM(IIF(S.FirstName IS NULL, 1, 0)) OVER (
            PARTITION BY S.ID
            ORDER BY S.[Date]
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        S.Office,
        OfficeLag = SUM(IIF(S.Office IS NULL, 1, 0)) OVER (
            PARTITION BY S.ID
            ORDER BY S.[Date]
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        S.[Date]
    FROM dbo.Staff AS S
) AS G
ORDER BY
    G.ID, G.[Date];

For versions of SQL Server prior to 2012, the IIF expressions can be written as CASE WHEN <column> IS NULL THEN 1 ELSE 0 END.
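
One thing to watch: a running count of NULLs equals the distance back to the last non-NULL value only while a column's NULLs are never followed by a later non-NULL value for the same ID, which happens to hold for the sample data. If a column can be set again after a NULL (for example Surname values 'A', NULL, 'B'), the offset on the 'B' row would be 1 and LAG would return the NULL. A sketch of one way around this, computing the offset as the current row number minus the row number of the latest non-NULL value; the CTE names and the rn column are illustrative and not part of the original answer:

```sql
WITH Numbered AS
(
    -- Number each ID's rows in date order
    SELECT
        S.ID, S.Surname, S.FirstName, S.Office, S.[Date],
        rn = CONVERT(integer, ROW_NUMBER() OVER (
            PARTITION BY S.ID ORDER BY S.[Date]))
    FROM dbo.Staff AS S
),
Offsets AS
(
    -- Offset = rows back to the latest non-NULL value.
    -- It is 0 when the current row has a value, or when no value exists yet.
    SELECT
        N.ID, N.Surname, N.FirstName, N.Office, N.[Date],
        SurnameLag = N.rn - ISNULL(
            MAX(CASE WHEN N.Surname IS NOT NULL THEN N.rn END) OVER (
                PARTITION BY N.ID ORDER BY N.[Date]
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), N.rn),
        FirstNameLag = N.rn - ISNULL(
            MAX(CASE WHEN N.FirstName IS NOT NULL THEN N.rn END) OVER (
                PARTITION BY N.ID ORDER BY N.[Date]
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), N.rn),
        OfficeLag = N.rn - ISNULL(
            MAX(CASE WHEN N.Office IS NOT NULL THEN N.rn END) OVER (
                PARTITION BY N.ID ORDER BY N.[Date]
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), N.rn)
    FROM Numbered AS N
)
SELECT
    O.ID,
    Surname   = LAG(O.Surname,   O.SurnameLag)   OVER (PARTITION BY O.ID ORDER BY O.[Date]),
    FirstName = LAG(O.FirstName, O.FirstNameLag) OVER (PARTITION BY O.ID ORDER BY O.[Date]),
    Office    = LAG(O.Office,    O.OfficeLag)    OVER (PARTITION BY O.ID ORDER BY O.[Date]),
    O.[Date]
FROM Offsets AS O
ORDER BY
    O.ID, O.[Date];
```

On the sample data this variant should return the same rows as the query above; the two differ only when a column is updated again after a NULL.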

This will all become much easier when Microsoft implements LAG with the IGNORE NULLS option.
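
IGNORE NULLS has since been added to LAG, LEAD, FIRST_VALUE and LAST_VALUE in SQL Server 2022. A sketch assuming that version or later; it uses LAST_VALUE rather than LAG so that the window frame can include the current row, which avoids a separate COALESCE against the current value:

```sql
-- Assumes SQL Server 2022 (16.x) or later, where IGNORE NULLS is available.
-- For each column, take the most recent non-NULL value up to the current row.
SELECT
    S.ID,
    Surname = LAST_VALUE(S.Surname) IGNORE NULLS OVER (
        PARTITION BY S.ID
        ORDER BY S.[Date]
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    FirstName = LAST_VALUE(S.FirstName) IGNORE NULLS OVER (
        PARTITION BY S.ID
        ORDER BY S.[Date]
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    Office = LAST_VALUE(S.Office) IGNORE NULLS OVER (
        PARTITION BY S.ID
        ORDER BY S.[Date]
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    S.[Date]
FROM dbo.Staff AS S
ORDER BY
    S.ID, S.[Date];
```

The explicit ROWS frame matters here: it ends the window at the current row, so a later change can never leak into an earlier history row.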

For more options, see The Last non NULL Puzzle by Itzik Ben-Gan.

You could use a window MAX() to find, for each column, the date of the most recent non-NULL value so far:

SELECT
  *,
  LastSurnameDate   = MAX(CASE WHEN Surname   IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
  LastFirstNameDate = MAX(CASE WHEN FirstName IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
  LastOfficeDate    = MAX(CASE WHEN Office    IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC)
FROM
  dbo.Staff

The next, and last, step would be to use a window MAX() again to look up the values from the corresponding rows:

SELECT
  ID,
  Surname   = MAX(Surname  ) OVER (PARTITION BY ID, LastSurnameDate   ORDER BY [Date] ASC),
  FirstName = MAX(FirstName) OVER (PARTITION BY ID, LastFirstNameDate ORDER BY [Date] ASC),
  Office    = MAX(Office   ) OVER (PARTITION BY ID, LastOfficeDate    ORDER BY [Date] ASC),
  [Date]
FROM
  (
    SELECT
      *,
      LastSurnameDate   = MAX(CASE WHEN Surname   IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
      LastFirstNameDate = MAX(CASE WHEN FirstName IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
      LastOfficeDate    = MAX(CASE WHEN Office    IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC)
    FROM
      dbo.Staff
  ) AS s
;

ORDER BY is not necessary in the second step but it can reduce the search range.
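
To illustrate the remark about ORDER BY: within each (ID, Last...Date) partition, only the earliest row carries a non-NULL value for that column, so an aggregate over the whole partition returns the same result. A sketch of the second step without ORDER BY, relying on the (ID, [Date]) primary key from the sample table:

```sql
SELECT
  ID,
  -- Each partition holds exactly one non-NULL value, so MAX simply picks it up
  Surname   = MAX(Surname  ) OVER (PARTITION BY ID, LastSurnameDate),
  FirstName = MAX(FirstName) OVER (PARTITION BY ID, LastFirstNameDate),
  Office    = MAX(Office   ) OVER (PARTITION BY ID, LastOfficeDate),
  [Date]
FROM
  (
    SELECT
      *,
      LastSurnameDate   = MAX(CASE WHEN Surname   IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
      LastFirstNameDate = MAX(CASE WHEN FirstName IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
      LastOfficeDate    = MAX(CASE WHEN Office    IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC)
    FROM
      dbo.Staff
  ) AS s
ORDER BY
  ID, [Date];
```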

Same method but using a CTE instead of a derived table:

WITH LastValues AS
  (
    SELECT
      *,
      LastSurnameDate   = MAX(CASE WHEN Surname   IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
      LastFirstNameDate = MAX(CASE WHEN FirstName IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
      LastOfficeDate    = MAX(CASE WHEN Office    IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC)
    FROM
      dbo.Staff
  )
SELECT
  ID,
  Surname   = MAX(Surname  ) OVER (PARTITION BY ID, LastSurnameDate   ORDER BY [Date] ASC),
  FirstName = MAX(FirstName) OVER (PARTITION BY ID, LastFirstNameDate ORDER BY [Date] ASC),
  Office    = MAX(Office   ) OVER (PARTITION BY ID, LastOfficeDate    ORDER BY [Date] ASC),
  [Date]
FROM
  LastValues
;

The difference would be only in syntax and not in performance or results.

Output:

ID          Surname FirstName Office    Date
----------- ------- --------- --------- ----------
1           Smith   Bill      Melbourne 2015-01-01
1           Smith   Bill      Sydney    2015-03-01
2           Brown   Mary      Melbourne 2014-04-01
2           Jones   Mary      Adelaide  2014-05-01
2           Jones   Mary      Sydney    2015-01-01
2           Jones   Mary      Perth     2015-03-01

FYI, here's the answer I have at the moment using a while loop. Note that I've also added a column with the change id, which can easily be generated using something like ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [Date]) - 1 or similar.

drop table if exists Staff;
CREATE TABLE Staff(
  [Change] int,
  [ID] int,
  [Surname] varchar(5), 
  [FirstName] varchar(4), 
  [Office] varchar(9), 
  [Date] varchar(10)
);

drop    table if exists Results;
select  *
into    Results
from    Staff;

INSERT INTO Staff ([Change], [ID], [Surname], [FirstName], [Office], [Date])
VALUES
  (0, 001, 'Smith', 'Bill', 'Melbourne', '2015-01-01'),
  (1, 001, NULL, NULL, 'Sydney', '2015-03-01'),
  (0, 002, 'Brown', 'Mary', 'Melbourne', '2014-04-01'),
  (1, 002, 'Jones', NULL, 'Adelaide', '2014-05-01'),
  (2, 002, NULL, NULL, 'Sydney', '2015-01-01'),
  (3, 002, NULL, NULL, 'Perth', '2015-03-01');


declare @max int = (select max(Change) from Staff);
declare @i int = 0;

while @i <= @max
BEGIN

  insert  into Results (Change, ID, Surname, FirstName, Office, Date)
  select  cur.Change, 
          cur.ID,
          case 
            when cur.Surname is null then prev.Surname 
            else cur.Surname 
          end,
          case 
            when cur.FirstName is null then prev.FirstName 
            else cur.FirstName 
          end,
          case 
            when cur.Office is null then prev.Office 
            else cur.Office 
          end,
          cur.Date
  from    Staff cur
            left outer join Results prev on cur.Change  = prev.Change + 1 and cur.ID = prev.ID
  where cur.Change = @i;

  set @i = @i + 1;

END;

select * from Results
order by ID, Date;

While this does use a loop, the number of iterations is bounded by the maximum number of changes any one employee record has ever had, which isn't likely to be that many.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange