Question

I have data flowing from source tables to a destination table. To simplify the question, say there are two merge-joined source tables and one destination table. Primary keys let me identify each record.

The package runs every day. If a record is deleted from a source table, how can I tell which one was deleted so that I can delete it from the destination table?

(FYI: I already check whether a record exists in the destination table, updating it if so and inserting it otherwise, but I don't know how to find deleted data.)


Solution

Another possible approach:

Assuming you receive all records from the source on every run, not just inserts and updates:

  1. Amend the package to stamp each record it inserts or updates with a unique run id or the run datetime

  2. After the package run, find the destination records that were not inserted or updated in that run (their stamp is older than the current run). By a process of elimination, those records were not present in the source and should be deleted.

Again, this assumes that all records are sent, not just inserts and updates. If you don't receive all records, you cannot tell from the feed alone whether a record was deleted or simply left unchanged.
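
The stamp-and-sweep steps above can be sketched as follows. This is a minimal illustration, not the answerer's actual package: it uses Python's built-in `sqlite3` in place of the real destination database, and the table `dest`, the column `last_run_id`, and the function `load_and_sweep` are hypothetical names chosen for the example.

```python
import sqlite3


def load_and_sweep(conn, source_rows, run_id):
    """Upsert every source row stamped with run_id, then delete stale rows.

    Assumes source_rows is the FULL source extract as (id, name) tuples.
    Returns the number of destination rows deleted.
    """
    cur = conn.cursor()
    # Step 1: insert or update each incoming row, stamping it with this run.
    cur.executemany(
        "INSERT OR REPLACE INTO dest (id, name, last_run_id) VALUES (?, ?, ?)",
        [(row_id, name, run_id) for row_id, name in source_rows],
    )
    # Step 2: any row not stamped by this run was absent from the source.
    cur.execute("DELETE FROM dest WHERE last_run_id <> ?", (run_id,))
    deleted = cur.rowcount
    conn.commit()
    return deleted


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dest ("
        "id INTEGER PRIMARY KEY, name TEXT, last_run_id INTEGER NOT NULL)"
    )
    load_and_sweep(conn, [(1, "a"), (2, "b"), (3, "c")], run_id=1)
    # Record 2 has disappeared from the source by the second run.
    print(load_and_sweep(conn, [(1, "a"), (3, "c2")], run_id=2))
    print(conn.execute("SELECT id, name FROM dest ORDER BY id").fetchall())
```

In a real package the delete in step 2 could just as well set a status flag instead, if history needs to be kept.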

Other tips

The problem with comparing source to destination is that you have to compare every source row to the destination in every load, and as the number of rows increases that takes up more and more time.

As a result, the best way to handle this is probably on the source side. Two common approaches are a 'soft delete' where you set a flag column to mark the row as deleted; or a trigger that records the PK of the deleted row in a log table (or moves the entire row to an archive log table). Your ETL process then looks at the flags or the log/archive table to determine which rows were deleted since the last load.
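
As a rough sketch of the trigger-plus-log-table variant, assuming you can add a trigger on the source side: the example below uses Python's built-in `sqlite3`, and the names `source`, `dest`, `deleted_log`, `trg_source_delete`, and `apply_logged_deletes` are all hypothetical. For brevity the source and destination tables live in one database here, which is unlikely in a real ETL setup.

```python
import sqlite3


def make_demo_db():
    """Create source, destination, and delete-log tables plus the trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dest (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE deleted_log (
            id INTEGER,
            deleted_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- Record the primary key of every row deleted from the source.
        CREATE TRIGGER trg_source_delete AFTER DELETE ON source
        BEGIN
            INSERT INTO deleted_log (id) VALUES (OLD.id);
        END;
        """
    )
    return conn


def apply_logged_deletes(conn):
    """Delete logged keys from the destination, then clear the log."""
    cur = conn.cursor()
    cur.execute("DELETE FROM dest WHERE id IN (SELECT id FROM deleted_log)")
    removed = cur.rowcount
    cur.execute("DELETE FROM deleted_log")
    conn.commit()
    return removed


if __name__ == "__main__":
    conn = make_demo_db()
    rows = [(1, "a"), (2, "b"), (3, "c")]
    conn.executemany("INSERT INTO source VALUES (?, ?)", rows)
    conn.executemany("INSERT INTO dest VALUES (?, ?)", rows)
    conn.execute("DELETE FROM source WHERE id = 2")  # fires the trigger
    print(apply_logged_deletes(conn))
```

Running the logged deletes before the insert/update step means a key that was deleted and then re-created in the source ends up present in the destination again.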

Another possibility is that the source platform offers some built-in feature you can use to track deleted rows, e.g. CDC in SQL Server. But if you have no control at all over the source database (if it even is a database) then there may be no alternative to comparing the full data set.

One possible approach:

  1. Before running the package, delete all records from the destination table (for example, with a stored procedure)
  2. Import all source records into the destination table

Pros:

Your destination table will always mirror the incoming data, no need to check for deletions

Cons:

You won't have any historical information (if that is required)
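
A minimal sketch of this delete-and-reload approach, using Python's built-in `sqlite3` rather than the actual package and stored procedure. The table `dest` and the function `full_reload` are hypothetical names. Wrapping both steps in one transaction is an addition of this sketch, not something the answer specifies; it keeps the old rows in place if the import fails partway.

```python
import sqlite3


def full_reload(conn, source_rows):
    """Replace the whole destination table with the current source rows."""
    # "with conn" commits if the block succeeds and rolls back on an error,
    # so a failed import leaves the previous contents untouched.
    with conn:
        conn.execute("DELETE FROM dest")
        conn.executemany(
            "INSERT INTO dest (id, name) VALUES (?, ?)", source_rows
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dest (id INTEGER PRIMARY KEY, name TEXT)")
    full_reload(conn, [(1, "a"), (2, "b"), (3, "c")])
    full_reload(conn, [(1, "a"), (3, "c")])  # record 2 was deleted upstream
    print(conn.execute("SELECT id, name FROM dest ORDER BY id").fetchall())
```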

I had the same problem: how to mark my old/archived records as "deleted" once they no longer exist in the original data source.

Basically, I built two tables: a main table containing all the records that have ever come in from the original data source, and a temporary table that holds a fresh copy of the original data source every time I run my scripts.

MAIN TABLE

ID, NAME, SURNAME, DATE_MODIFIED, ORDERS_COUNT, etc
plus a STATUS column (1 for Active, 0 for Deleted)

TEMP TABLE (same columns as the MAIN TABLE, but without the STATUS column)

ID, NAME, SURNAME, DATE_MODIFIED, ORDERS_COUNT, etc

The key was to update the MAIN TABLE with STATUS = 0 wherever the ID in the MAIN TABLE was no longer in the TEMP TABLE, i.e. the source record has been deleted.

I did it like this:

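-- Flag main-table rows whose ID no longer appears in the temp table
UPDATE m
SET m.Status = 0
FROM tblMAIN AS m
    LEFT JOIN tblTEMP AS t
        ON t.ID = m.ID
-- No match in tblTEMP means the row was deleted at the source
WHERE t.ID IS NULL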
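
The `UPDATE ... FROM ... LEFT JOIN` form above is T-SQL syntax. On engines that lack it, the same anti-join can be written with `NOT EXISTS`. The sketch below shows that variant with Python's built-in `sqlite3`, reusing the table names from the answer. The function name `mark_missing_as_deleted` and the extra `Status = 1` filter (so only newly missing rows are touched and counted) are assumptions of this example.

```python
import sqlite3


def mark_missing_as_deleted(conn):
    """Set Status = 0 on active tblMAIN rows whose ID is absent from tblTEMP.

    Returns the number of rows newly flagged as deleted.
    """
    cur = conn.execute(
        "UPDATE tblMAIN SET Status = 0 "
        "WHERE Status = 1 AND NOT EXISTS ("
        "    SELECT 1 FROM tblTEMP AS t WHERE t.ID = tblMAIN.ID"
        ")"
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tblMAIN (ID INTEGER PRIMARY KEY, NAME TEXT, Status INTEGER)"
    )
    conn.execute("CREATE TABLE tblTEMP (ID INTEGER PRIMARY KEY, NAME TEXT)")
    conn.executemany(
        "INSERT INTO tblMAIN VALUES (?, ?, 1)", [(1, "a"), (2, "b"), (3, "c")]
    )
    # The latest source snapshot no longer contains ID 2.
    conn.executemany("INSERT INTO tblTEMP VALUES (?, ?)", [(1, "a"), (3, "c")])
    print(mark_missing_as_deleted(conn))
```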
Licensed under: CC-BY-SA with attribution