Remove duplicate from a staging file
03-07-2019
Question
I have a staging table which contains a whole series of rows of data which were taken from a data file.
Each row details a change to a row in a remote system; the rows are effectively snapshots of the source row taken after every change. Each row contains metadata timestamps for creation and updates.
I am now trying to build an update table from these data files which contains all of the updates. I need a way to remove rows with duplicate keys, keeping only the row with the latest "update" timestamp.
I am aware I can use the SSIS "sort" transform to remove duplicates by sorting on the key field and telling it to remove duplicates, but how do I ensure that the row it keeps is the one with the latest timestamp?
Solution
This will remove rows that match on Col1, Col2, etc. and have an UpdateDate that is NOT the most recent:
DELETE D
FROM MyTable AS D
JOIN MyTable AS T
ON T.Col1 = D.Col1
AND T.Col2 = D.Col2
...
AND T.UpdateDate > D.UpdateDate
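To make the effect of the self-join delete concrete, here is a minimal sketch against a throwaway temp table. The table name #Staging, its columns (KeyCol, Payload, UpdateDate) and the sample rows are all made up for illustration and are not from the original question.

```sql
-- Hypothetical sample data; names and values are for illustration only.
CREATE TABLE #Staging (
    KeyCol     INT         NOT NULL,
    Payload    VARCHAR(20) NULL,
    UpdateDate DATETIME    NOT NULL
);

INSERT INTO #Staging (KeyCol, Payload, UpdateDate) VALUES
    (1, 'first',  '2019-01-01T10:00:00'),
    (1, 'second', '2019-01-02T10:00:00'),
    (2, 'only',   '2019-01-01T09:00:00');

-- Delete every row D for which a newer row T with the same key exists.
DELETE D
FROM #Staging AS D
JOIN #Staging AS T
  ON T.KeyCol = D.KeyCol
 AND T.UpdateDate > D.UpdateDate;

-- Remaining rows: (1, 'second') and (2, 'only').
SELECT KeyCol, Payload, UpdateDate
FROM #Staging
ORDER BY KeyCol;

DROP TABLE #Staging;
```

Note that the comparison is a strict ">", so if two rows share a key and have exactly the same latest UpdateDate, neither is newer than the other and both are kept.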
If Col1 and Col2 need to be considered "matching" when they are both NULL, then you would need to use:
ON (T.Col1 = D.Col1 OR (T.Col1 IS NULL AND D.Col1 IS NULL))
AND (T.Col2 = D.Col2 OR (T.Col2 IS NULL AND D.Col2 IS NULL))
...
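For reference, the NULL-tolerant conditions slot into the full statement roughly as follows. This is only a sketch: the temp table #Staging and its sample rows are hypothetical, and it assumes exactly two key columns.

```sql
-- Hypothetical sample data with NULLs in the key columns.
CREATE TABLE #Staging (
    Col1       INT         NULL,
    Col2       VARCHAR(10) NULL,
    UpdateDate DATETIME    NOT NULL
);

INSERT INTO #Staging (Col1, Col2, UpdateDate) VALUES
    (NULL, 'A',  '2019-01-01T00:00:00'),
    (NULL, 'A',  '2019-01-02T00:00:00'),
    (1,    NULL, '2019-01-01T00:00:00'),
    (1,    NULL, '2019-01-03T00:00:00');

-- A plain "T.Col1 = D.Col1" is never true when either side is NULL,
-- so each key column gets an explicit "both NULL" alternative.
DELETE D
FROM #Staging AS D
JOIN #Staging AS T
  ON (T.Col1 = D.Col1 OR (T.Col1 IS NULL AND D.Col1 IS NULL))
 AND (T.Col2 = D.Col2 OR (T.Col2 IS NULL AND D.Col2 IS NULL))
 AND T.UpdateDate > D.UpdateDate;

-- Remaining rows: (NULL, 'A', 2019-01-02) and (1, NULL, 2019-01-03).
SELECT Col1, Col2, UpdateDate
FROM #Staging
ORDER BY UpdateDate;

DROP TABLE #Staging;
```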
Edit: If you need to make a case-sensitive test on a case-insensitive database, then on VARCHAR and TEXT columns use:
ON (T.Col1 = D.Col1 COLLATE Latin1_General_BIN
OR (T.Col1 IS NULL AND D.Col1 IS NULL))
...
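A quick way to see what the COLLATE clause changes is to compare two literals under both kinds of collation. This is just an illustrative query; Latin1_General_CI_AS is used here as a stand-in for a typical case-insensitive default and may not be what your database actually uses.

```sql
-- Compare the same two strings under a case-insensitive collation
-- and under the binary collation used in the answer above.
SELECT
    CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CI_AS
         THEN 'equal' ELSE 'different' END AS CaseInsensitiveTest,
    CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_BIN
         THEN 'equal' ELSE 'different' END AS BinaryTest;
-- CaseInsensitiveTest = 'equal', BinaryTest = 'different'
```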
OTHER TIPS
You can use the Sort Transform in SSIS to sort your data set by more than one column. Simply sort by your primary key (or ID field) followed by your timestamp column in descending order.
See the following article for more details on working with the Sort Transformation:
http://msdn.microsoft.com/en-us/library/ms140182.aspx
Make sense?
Cheers, John
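If you would rather do the equivalent of "sort by key, then by timestamp descending, and keep the first row per key" in T-SQL instead of inside the SSIS data flow, ROW_NUMBER() is one way to sketch it. This is a different technique from the Sort Transform itself, and the table #Staging, its columns and its sample rows are hypothetical.

```sql
-- Hypothetical sample data; names and values are for illustration only.
CREATE TABLE #Staging (
    KeyCol     INT         NOT NULL,
    Payload    VARCHAR(20) NULL,
    UpdateDate DATETIME    NOT NULL
);

INSERT INTO #Staging (KeyCol, Payload, UpdateDate) VALUES
    (1, 'first',  '2019-01-01T10:00:00'),
    (1, 'second', '2019-01-02T10:00:00'),
    (2, 'only',   '2019-01-01T09:00:00');

-- Number the rows within each key, newest first, then keep row 1 per key.
WITH Ranked AS (
    SELECT KeyCol, Payload, UpdateDate,
           ROW_NUMBER() OVER (PARTITION BY KeyCol
                              ORDER BY UpdateDate DESC) AS rn
    FROM #Staging
)
SELECT KeyCol, Payload, UpdateDate
FROM Ranked
WHERE rn = 1
ORDER BY KeyCol;
-- Returns (1, 'second') and (2, 'only').

DROP TABLE #Staging;
```

Unlike the join-based approaches, this returns exactly one row per key even when two rows tie on the latest timestamp; which of the tied rows wins is arbitrary unless you add a tie-breaker column to the ORDER BY.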
Does it make sense to just ignore the duplicates when moving from staging to final table?
You have to do this anyway, so why not issue one query against the staging table rather than two?
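INSERT final
([key], col1, col2)
SELECT
s.[key], s.col1, s.col2
FROM
staging s
JOIN
-- latest datetimestamp per key ([key] is bracketed because KEY is a reserved word in T-SQL)
(SELECT [key], MAX(datetimestamp) maxdt FROM staging GROUP BY [key]) ms
ON s.[key] = ms.[key] AND s.datetimestamp = ms.maxdt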