Question

I have a staging table which contains a whole series of rows of data which were taken from a data file.

Each row details a change to a row in a remote system; the rows are effectively snapshots of the source row taken after every change. Each row contains metadata timestamps for creation and update.

I am now trying to build an update table from these data files, which contain all of the updates. I need a way to remove rows with duplicate keys, keeping only the row with the latest "update" timestamp.

I am aware I can use the SSIS "Sort" transform to remove duplicates by sorting on the key field and telling it to remove duplicates, but how do I ensure that the row it keeps is the one with the latest timestamp?


Solution

This will remove rows that match another row on Col1, Col2, etc. and have an UpdateDate that is NOT the most recent:

DELETE D
FROM   MyTable AS D
       JOIN MyTable AS T
           ON T.Col1 = D.Col1
          AND T.Col2 = D.Col2
          ...
          AND T.UpdateDate > D.UpdateDate
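To try the idea without a SQL Server instance, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. It is not the statement above: SQLite has no DELETE ... FROM ... JOIN, so the same self-join is used to pick the SQLite-specific rowid of each superseded row. The table layout, the sample rows, and the helper name dedupe_keep_latest are assumptions made for illustration.

```python
import sqlite3

# Same self-join as the answer, rewritten for SQLite: the join selects the
# rowid of every row that has a newer row with the same Col1 and Col2.
DELETE_OLDER = """
DELETE FROM MyTable
WHERE rowid IN (
    SELECT D.rowid
    FROM   MyTable AS D
           JOIN MyTable AS T
               ON T.Col1 = D.Col1
              AND T.Col2 = D.Col2
              AND T.UpdateDate > D.UpdateDate
)
"""


def dedupe_keep_latest(rows):
    """Load rows into an in-memory table, delete superseded rows, return the rest."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical layout: two key columns plus an ISO-8601 text timestamp,
    # which sorts chronologically when compared as a string.
    conn.execute("CREATE TABLE MyTable (Col1 INTEGER, Col2 TEXT, UpdateDate TEXT)")
    conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?)", rows)
    conn.execute(DELETE_OLDER)
    remaining = conn.execute(
        "SELECT Col1, Col2, UpdateDate FROM MyTable ORDER BY Col1, Col2, UpdateDate"
    ).fetchall()
    conn.close()
    return remaining


sample = [
    (1, "a", "2020-01-01"),
    (1, "a", "2020-01-03"),
    (1, "a", "2020-01-02"),
    (2, "b", "2020-01-01"),
]
print(dedupe_keep_latest(sample))
```

With the sample rows, only the 2020-01-03 row for key (1, "a") and the single row for key (2, "b") remain. Because the comparison is a strict ">", two rows that share both the key and the latest UpdateDate are both kept, so a tie-breaker is still needed if that can happen in your data.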

If Col1 and Col2 need to be considered "matching" when they are both NULL, then you would need to use:

       ON (T.Col1 = D.Col1 OR (T.Col1 IS NULL AND D.Col1 IS NULL))
      AND (T.Col2 = D.Col2 OR (T.Col2 IS NULL AND D.Col2 IS NULL))
      ...
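The difference the NULL handling makes can be seen in a small sketch, again using Python's sqlite3 as a stand-in for SQL Server. The two ON conditions are taken from the answer; the table layout, the sample rows, and the helper name remaining_rows are assumptions for illustration, and rowid is SQLite-specific.

```python
import sqlite3

# Plain equality: NULL = NULL is not true, so rows with NULL keys never match.
PLAIN_MATCH = "T.Col1 = D.Col1 AND T.Col2 = D.Col2"

# NULL-aware version from the answer: two NULLs count as a match.
NULL_SAFE_MATCH = (
    "(T.Col1 = D.Col1 OR (T.Col1 IS NULL AND D.Col1 IS NULL)) "
    "AND (T.Col2 = D.Col2 OR (T.Col2 IS NULL AND D.Col2 IS NULL))"
)


def remaining_rows(rows, match):
    """Delete superseded rows using the given ON condition and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (Col1 INTEGER, Col2 TEXT, UpdateDate TEXT)")
    conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?)", rows)
    # `match` is one of the two trusted constants above, never user input.
    conn.execute(
        "DELETE FROM MyTable WHERE rowid IN ("
        " SELECT D.rowid FROM MyTable AS D JOIN MyTable AS T"
        f" ON {match} AND T.UpdateDate > D.UpdateDate)"
    )
    left = conn.execute(
        "SELECT Col1, Col2, UpdateDate FROM MyTable ORDER BY UpdateDate"
    ).fetchall()
    conn.close()
    return left


# Two snapshots of the same logical row whose second key column is NULL.
null_key_rows = [(1, None, "2020-01-01"), (1, None, "2020-01-02")]
print(remaining_rows(null_key_rows, PLAIN_MATCH))
print(remaining_rows(null_key_rows, NULL_SAFE_MATCH))
```

With plain equality both rows survive, because the NULL key columns never compare as equal; with the NULL-aware condition only the 2020-01-02 row is left.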

Edit: If you need to make a case-sensitive test on a case-insensitive database, then on VARCHAR and TEXT columns use:

       ON (T.Col1 = D.Col1  COLLATE Latin1_General_BIN 
           OR (T.Col1 IS NULL AND D.Col1 IS NULL))
       ...

OTHER TIPS

You can use the Sort Transform in SSIS to sort your data set by more than one column. Simply sort by your primary key (or ID field) followed by your timestamp column in descending order.

See the following article for more details on working with the Sort Transformation:

http://msdn.microsoft.com/en-us/library/ms140182.aspx

Make sense?

Cheers, John

Does it make sense to just ignore the duplicates when moving from staging to final table?

You have to do this anyway, so why not issue one query against the staging table rather than two?

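-- "key" is a placeholder column name; KEY is a reserved word in T-SQL,
-- so write it as [key] if that is the real name.
INSERT final
    (key, col1, col2)
SELECT
    s.key, s.col1, s.col2
FROM
    staging s
    JOIN
    -- latest timestamp per key
    (SELECT key, MAX(datetimestamp) maxdt FROM staging GROUP BY key) ms
        ON s.key = ms.key AND s.datetimestamp = ms.maxdt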
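
The single-query load can be tried end to end with a small sketch in Python's sqlite3, used here as a stand-in for SQL Server. The derived table groups by key to find each key's latest timestamp, and the join keeps only the staging rows that carry it. The column name item_key (in place of key), the table name final_table, the sample rows, and the helper name load_latest are assumptions for illustration.

```python
import sqlite3

# Insert only the newest staging row for each key into the final table.
LOAD_LATEST = """
INSERT INTO final_table (item_key, col1, col2)
SELECT s.item_key, s.col1, s.col2
FROM   staging AS s
       JOIN (SELECT item_key, MAX(datetimestamp) AS maxdt
             FROM   staging
             GROUP  BY item_key) AS ms
           ON s.item_key = ms.item_key
          AND s.datetimestamp = ms.maxdt
"""


def load_latest(staging_rows):
    """Fill staging, run the single-query load, and return the final table's rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE staging (item_key INTEGER, col1 TEXT, col2 TEXT, datetimestamp TEXT);
        CREATE TABLE final_table (item_key INTEGER, col1 TEXT, col2 TEXT);
        """
    )
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", staging_rows)
    conn.execute(LOAD_LATEST)
    loaded = conn.execute(
        "SELECT item_key, col1, col2 FROM final_table ORDER BY item_key"
    ).fetchall()
    conn.close()
    return loaded


staging_sample = [
    (1, "old", "x", "2020-01-01 10:00:00"),
    (1, "new", "y", "2020-01-02 09:30:00"),
    (2, "only", "z", "2020-01-01 08:00:00"),
]
print(load_latest(staging_sample))
```

For the sample rows, the final table ends up with the "new" row for key 1 and the single row for key 2. If two staging rows share both the key and the maximum timestamp, the join returns both of them, so they would both be inserted.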
Licensed under: CC-BY-SA with attribution