Consolidating a row of data, based on previous rows
03-10-2020
Question
I'm trying to build a history table out of an audit log (ultimately to build out a type 2 dimension table). Unfortunately, the audit log only records the specific fields being changed. Here's a rough example of what I'm talking about:
CREATE TABLE Staff(
[ID] int,
[Surname] varchar(5),
[FirstName] varchar(4),
[Office] varchar(9),
[Date] varchar(10)
);
INSERT INTO Staff ([ID], [Surname], [FirstName], [Office], [Date])
VALUES
(001, 'Smith', 'Bill', 'Melbourne', '2015-01-01'),
(001, NULL, NULL, 'Sydney', '2015-03-01'),
(002, 'Brown', 'Mary', 'Melbourne', '2014-04-01'),
(002, 'Jones', NULL, 'Adelaide', '2014-05-01'),
(002, NULL, NULL, 'Sydney', '2015-01-01'),
(002, NULL, NULL, 'Perth', '2015-03-01');
The first entry for a particular staff member is made when their record is created, and each subsequent record is an update, but it only shows the fields that were updated*. I want to "fill out" each update row with the rest of the employee record as it stood at that point, i.e. a result like this:
001, Smith, Bill, Melbourne, 2015-01-01
001, Smith, Bill, Sydney, 2015-03-01
002, Brown, Mary, Melbourne, 2014-04-01
002, Jones, Mary, Adelaide, 2014-05-01
002, Jones, Mary, Sydney, 2015-01-01
002, Jones, Mary, Perth, 2015-03-01
I know I can do this using a while loop or a cursor, but I suspect there is probably a more performant option.
*A NULL always means "value didn't change" rather than "value changed to NULL".
Solution
I was able to do this with a recursive CTE, which is not that different from a cursor, and recursive CTEs tend not to scale well over large volumes. The query relies on a Change column that numbers each employee's rows from 0 in date order. Have a look through the code and see what you think.
;WITH cte AS (
SELECT 0 x, Change, ID, Surname, FirstName, Office, [Date]
FROM dbo.Staff
WHERE Change = 0
UNION ALL
SELECT x + 1, s.Change, c.ID, ISNULL( s.Surname, c.Surname ) , ISNULL( s.FirstName, c.FirstName ), ISNULL( s.Office, c.Office ), s.[Date]
FROM cte c
INNER JOIN dbo.Staff s ON c.ID = s.ID
WHERE s.Change = c.x
)
SELECT Change, ID, Surname, FirstName, Office, [Date]
FROM cte
WHERE x > 0
ORDER BY ID, x
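If the table has no Change column yet, one way to derive it is with ROW_NUMBER(). This is only a sketch: the CTE name Numbered is made up for illustration, and it assumes each ID has at most one row per [Date] so the numbering is deterministic.

```sql
-- Hypothetical helper: number each employee's rows 0, 1, 2, ... in date order
WITH Numbered AS
(
    SELECT
        Change = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [Date]) - 1,
        ID, Surname, FirstName, Office, [Date]
    FROM dbo.Staff
)
SELECT Change, ID, Surname, FirstName, Office, [Date]
FROM Numbered
ORDER BY ID, Change;
```

The result could be materialised into a table (or the CTE placed ahead of the recursive one) so that the recursive query can join on Change.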
OTHER TIPS
Sample data with the Date column typed as date:
CREATE TABLE dbo.Staff
(
[ID] integer NOT NULL,
[Surname] varchar(5) NULL,
[FirstName] varchar(4) NULL,
[Office] varchar(9) NULL,
[Date] date NOT NULL,
PRIMARY KEY (ID, [Date])
);
INSERT INTO Staff ([ID], [Surname], [FirstName], [Office], [Date])
VALUES
(001, 'Smith', 'Bill', 'Melbourne', '2015-01-01'),
(001, NULL, NULL, 'Sydney', '2015-03-01'),
(002, 'Brown', 'Mary', 'Melbourne', '2014-04-01'),
(002, 'Jones', NULL, 'Adelaide', '2014-05-01'),
(002, NULL, NULL, 'Sydney', '2015-01-01'),
(002, NULL, NULL, 'Perth', '2015-03-01');
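The idea of the following solution is to lag back from the current row by as many rows as there are NULLs so far in that column. Because each offset is a running count of NULLs over the whole partition, it lands on the right row only while a column's NULLs are never followed by a later non-NULL value for the same ID, which holds for the sample data: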
SELECT
G.ID,
Surname = LAG(G.Surname, G.SurnameLag) OVER (
PARTITION BY G.ID
ORDER BY G.[Date]),
FirstName = LAG(G.FirstName, G.FirstNameLag) OVER (
PARTITION BY G.ID
ORDER BY G.[Date]),
Office = LAG(G.Office, G.OfficeLag) OVER (
PARTITION BY G.ID
ORDER BY G.[Date]),
G.[Date]
FROM
(
-- Find the LAG offset per column
SELECT
S.ID,
S.Surname,
SurnameLag = SUM(IIF(S.Surname IS NULL, 1, 0)) OVER (
PARTITION BY S.ID
ORDER BY S.[Date]
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
S.FirstName,
FirstNameLag = SUM(IIF(S.FirstName IS NULL, 1, 0)) OVER (
PARTITION BY S.ID
ORDER BY S.[Date]
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
S.Office,
OfficeLag = SUM(IIF(S.Office IS NULL, 1, 0)) OVER (
PARTITION BY S.ID
ORDER BY S.[Date]
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
S.[Date]
FROM dbo.Staff AS S
) AS G
ORDER BY
G.ID, G.[Date];
The IIF expressions can also be written as CASE WHEN <column> IS NULL THEN 1 ELSE 0 END. Note that LAG, like IIF, requires SQL Server 2012 or later.
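If a column can go NULL and later receive a new value for the same ID, a running NULL count no longer equals the distance back to the last value. A minimal sketch of one alternative, which replaces the LAG offset with a group number built from COUNT (the *Grp column names are made up for illustration):

```sql
-- COUNT(column) ignores NULLs, so the running count only increases on rows
-- that supply a value; every NULL row that follows shares that group number.
SELECT
    G.ID,
    -- Each group holds at most one non-NULL value, so MAX simply picks it up
    Surname   = MAX(G.Surname)   OVER (PARTITION BY G.ID, G.SurnameGrp),
    FirstName = MAX(G.FirstName) OVER (PARTITION BY G.ID, G.FirstNameGrp),
    Office    = MAX(G.Office)    OVER (PARTITION BY G.ID, G.OfficeGrp),
    G.[Date]
FROM
(
    SELECT
        S.ID, S.Surname, S.FirstName, S.Office, S.[Date],
        SurnameGrp = COUNT(S.Surname) OVER (
            PARTITION BY S.ID
            ORDER BY S.[Date]
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        FirstNameGrp = COUNT(S.FirstName) OVER (
            PARTITION BY S.ID
            ORDER BY S.[Date]
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        OfficeGrp = COUNT(S.Office) OVER (
            PARTITION BY S.ID
            ORDER BY S.[Date]
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM dbo.Staff AS S
) AS G
ORDER BY
    G.ID, G.[Date];
```

On the sample data this should return the same six rows as the LAG version; the two only differ when NULLs and values alternate within a column.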
This will all become much easier when Microsoft implements LAG with the IGNORE NULLS option.
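Where IGNORE NULLS is available (SQL Server 2022, as far as I know), the fill-forward needs only one window function per column. The sketch below uses LAST_VALUE rather than LAG so that the current row's own value is part of the frame. Treat it as an untested assumption on your version rather than a drop-in replacement.

```sql
-- Assumes a version that accepts IGNORE NULLS on window functions;
-- on older versions this is a syntax error.
SELECT
    S.ID,
    Surname = LAST_VALUE(S.Surname) IGNORE NULLS OVER (
        PARTITION BY S.ID
        ORDER BY S.[Date]
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    FirstName = LAST_VALUE(S.FirstName) IGNORE NULLS OVER (
        PARTITION BY S.ID
        ORDER BY S.[Date]
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    Office = LAST_VALUE(S.Office) IGNORE NULLS OVER (
        PARTITION BY S.ID
        ORDER BY S.[Date]
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    S.[Date]
FROM dbo.Staff AS S
ORDER BY
    S.ID, S.[Date];
```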
For more options, see The Last non NULL Puzzle by Itzik Ben-Gan.
You could use a window MAX() to find the date of the last column value so far, for each column:
SELECT
*,
LastSurnameDate = MAX(CASE WHEN Surname IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
LastFirstNameDate = MAX(CASE WHEN FirstName IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
LastOfficeDate = MAX(CASE WHEN Office IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC)
FROM
dbo.Staff
The next, and last, step would be to use a window MAX() again to look up the values from the corresponding rows:
SELECT
ID,
Surname = MAX(Surname ) OVER (PARTITION BY ID, LastSurnameDate ORDER BY [Date] ASC),
FirstName = MAX(FirstName) OVER (PARTITION BY ID, LastFirstNameDate ORDER BY [Date] ASC),
Office = MAX(Office ) OVER (PARTITION BY ID, LastOfficeDate ORDER BY [Date] ASC),
[Date]
FROM
(
SELECT
*,
LastSurnameDate = MAX(CASE WHEN Surname IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
LastFirstNameDate = MAX(CASE WHEN FirstName IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
LastOfficeDate = MAX(CASE WHEN Office IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC)
FROM
dbo.Staff
) AS s
;
ORDER BY is not necessary in the second step, but it can reduce the search range.
Same method but using a CTE instead of a derived table:
WITH LastValues AS
(
SELECT
*,
LastSurnameDate = MAX(CASE WHEN Surname IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
LastFirstNameDate = MAX(CASE WHEN FirstName IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC),
LastOfficeDate = MAX(CASE WHEN Office IS NOT NULL THEN [Date] END) OVER (PARTITION BY ID ORDER BY [Date] ASC)
FROM
dbo.Staff
)
SELECT
ID,
Surname = MAX(Surname ) OVER (PARTITION BY ID, LastSurnameDate ORDER BY [Date] ASC),
FirstName = MAX(FirstName) OVER (PARTITION BY ID, LastFirstNameDate ORDER BY [Date] ASC),
Office = MAX(Office ) OVER (PARTITION BY ID, LastOfficeDate ORDER BY [Date] ASC),
[Date]
FROM
LastValues
;
The difference would be only in syntax and not in performance or results.
Output:
ID Surname FirstName Office Date
----------- ------- --------- --------- ----------
1 Smith Bill Melbourne 2015-01-01
1 Smith Bill Sydney 2015-03-01
2 Brown Mary Melbourne 2014-04-01
2 Jones Mary Adelaide 2014-05-01
2 Jones Mary Sydney 2015-01-01
2 Jones Mary Perth 2015-03-01
FYI, here's the answer I have at the moment using a while loop. Note that I've also added a column with the change ID, which can easily be generated using something like ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [Date]) - 1 or similar.
drop table if exists Staff;
CREATE TABLE Staff(
[Change] int,
[ID] int,
[Surname] varchar(5),
[FirstName] varchar(4),
[Office] varchar(9),
[Date] varchar(10)
);
drop table if exists Results;
-- Staff is still empty here, so this only copies its column structure into Results
select *
into Results
from Staff;
INSERT INTO Staff ([Change], [ID], [Surname], [FirstName], [Office], [Date])
VALUES
(0, 001, 'Smith', 'Bill', 'Melbourne', '2015-01-01'),
(1, 001, NULL, NULL, 'Sydney', '2015-03-01'),
(0, 002, 'Brown', 'Mary', 'Melbourne', '2014-04-01'),
(1, 002, 'Jones', NULL, 'Adelaide', '2014-05-01'),
(2, 002, NULL, NULL, 'Sydney', '2015-01-01'),
(3, 002, NULL, NULL, 'Perth', '2015-03-01');
declare @max int = (select max(Change) from Staff);
declare @i int = 0;
while @i <= @max
BEGIN
insert into Results (Change, ID, Surname, FirstName, Office, Date)
select cur.Change,
cur.ID,
case
when cur.Surname is null then prev.Surname
else cur.Surname
end,
case
when cur.FirstName is null then prev.FirstName
else cur.FirstName
end,
case
when cur.Office is null then prev.Office
else cur.Office
end,
cur.Date
from Staff cur
left outer join Results prev on cur.Change = prev.Change + 1 and cur.ID = prev.ID
where cur.Change = @i;
set @i = @i + 1;
END;
select * from Results
order by ID, Date;
While this does use a loop, the number of iterations is limited based on the maximum number of changes any one employee record has ever had... which isn't likely to be that many.