Custom Script Component SSIS to filter and merge duplicates

https://stackoverflow.com/questions/23685804

23-07-2023

Question

I have contact records from two different databases. I combined them with a Union All and sorted the result, so I now have a combined list that looks like the following:

contactid  add1             add2     city  phone         fullname         source
---------  ---------------  -------  ----  ------------  ---------------  ------
BOOG1      1598 Tree Drive  Apt:215  NYC   718-888-9989  Andrew Sample    DB1
NULL       NULL             Apt:215        718-888-9989  Andrew Sample    DB2

BOOG6      1598 Tree Drive  Apt:215  NYC   718-888-8888  Andria Toefield  DB1
NULL       NULL             Apt:215        718-888-9888  Andria Toefield  DB2
....
....
....

Basically, I want to use a Script Component that compares the rows for the same contact (for example, the two Andrew Sample rows). If a column is empty in one row, it should take the value from the row where it is not empty. If the two rows contain conflicting data, it should take the value from DB2. The end result should look like the following:

contactid  add1             add2     city  phone         fullname
---------  ---------------  -------  ----  ------------  ---------------
BOOG1      1598 Tree Drive  Apt:215  NYC   718-888-9989  Andrew Sample

BOOG6      1598 Tree Drive  Apt:215  NYC   718-888-9888  Andria Toefield
....
....
....

I'm not sure how to start scripting this in C#. I don't know how to select the row and then compare certain columns from the row.

Solution

I would not attempt this with a Script Component. Comparing values across rows is too difficult there.

I would add a Fuzzy Grouping Transformation to group on the name columns. This will add a _key_out column (amongst others). I would then write the results to a SQL table.

Then I would write a complex SQL query featuring a GROUP BY on the _key_out column, and CASE statements for each of the other columns to resolve your "is missing" and "conflicting" requirements.
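As a rough sketch of what such a query could look like, the example below uses Python's built-in sqlite3 module as a stand-in for the SQL table so it can be run anywhere. The table name `staged_contacts`, the `_key_out` values, and the choice to treat empty strings as missing are assumptions made for illustration, not part of the original setup. Each column takes the DB2 value when one is present and falls back to DB1 otherwise.

```python
import sqlite3

# Hypothetical staging table standing in for the Fuzzy Grouping output.
# The _key_out values are invented; in SSIS the transformation assigns them.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE staged_contacts (
        _key_out INTEGER, contactid TEXT, add1 TEXT, add2 TEXT,
        city TEXT, phone TEXT, fullname TEXT, source TEXT
    )
""")
conn.executemany(
    "INSERT INTO staged_contacts VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "BOOG1", "1598 Tree Drive", "Apt:215", "NYC", "718-888-9989", "Andrew Sample", "DB1"),
        (1, None, None, "Apt:215", "", "718-888-9989", "Andrew Sample", "DB2"),
        (2, "BOOG6", "1598 Tree Drive", "Apt:215", "NYC", "718-888-8888", "Andria Toefield", "DB1"),
        (2, None, None, "Apt:215", "", "718-888-9888", "Andria Toefield", "DB2"),
    ],
)

COLUMNS = ["contactid", "add1", "add2", "city", "phone", "fullname"]


def merged_column(col):
    """Build the expression for one column: prefer DB2, fall back to DB1.

    NULLIF turns empty strings into NULL so they count as missing.
    MAX ignores NULLs, so each CASE yields that source's value, if any.
    """
    return (
        f"COALESCE("
        f"MAX(CASE WHEN source = 'DB2' THEN NULLIF({col}, '') END), "
        f"MAX(CASE WHEN source = 'DB1' THEN NULLIF({col}, '') END)) AS {col}"
    )


query = (
    "SELECT " + ", ".join(merged_column(c) for c in COLUMNS)
    + " FROM staged_contacts GROUP BY _key_out ORDER BY _key_out"
)
merged = conn.execute(query).fetchall()

for row in merged:
    print(row)
```

`COALESCE`, `NULLIF`, and `MAX(CASE ...)` are also available in SQL Server, so the same shape should carry over. Note that if one source contributes more than one row to a group, `MAX` simply picks the greatest value, which is one reason to inspect such groups before trusting the merge.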

Once you point this at a real-world dataset of any scale, this design will really pay off. You will undoubtedly encounter more complex scenarios than your examples above, e.g. DB1 has 2 "John Smith" rows and DB2 has 3 "John Smith" rows. You will be able to tweak the Fuzzy Grouping parameters and/or add secondary Fuzzy Groupings to break ties.

Along the way you can interrogate the results in the intermediate SQL table to optimize the handling of these issues.
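For instance, two checks of this kind might look like the sketch below, again using sqlite3 as a stand-in for the intermediate table. The table name `staged_contacts`, the `_key_out` values, and the John Smith phone numbers are all made up for illustration. The first query lists groups where a single source contributed more than one row; the second lists groups whose rows disagree on the phone number.

```python
import sqlite3

# Hypothetical intermediate table; all _key_out values and the
# John Smith rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staged_contacts "
    "(_key_out INTEGER, fullname TEXT, phone TEXT, source TEXT)"
)
conn.executemany(
    "INSERT INTO staged_contacts VALUES (?, ?, ?, ?)",
    [
        (1, "Andrew Sample", "718-888-9989", "DB1"),
        (1, "Andrew Sample", "718-888-9989", "DB2"),
        (2, "Andria Toefield", "718-888-8888", "DB1"),
        (2, "Andria Toefield", "718-888-9888", "DB2"),
        (3, "John Smith", "212-555-0101", "DB1"),
        (3, "John Smith", "212-555-0102", "DB1"),
        (3, "John Smith", "212-555-0103", "DB2"),
    ],
)

# Groups where one source contributed several rows: likely distinct
# people sharing a name, so the grouping needs tightening.
crowded = conn.execute("""
    SELECT _key_out, source, COUNT(*) AS row_count
    FROM staged_contacts
    GROUP BY _key_out, source
    HAVING COUNT(*) > 1
    ORDER BY _key_out, source
""").fetchall()

# Groups whose non-empty phone values disagree: these are the rows
# the "conflicting data" rule will have to resolve.
conflicts = conn.execute("""
    SELECT _key_out, COUNT(DISTINCT phone) AS distinct_phones
    FROM staged_contacts
    WHERE phone IS NOT NULL AND phone <> ''
    GROUP BY _key_out
    HAVING COUNT(DISTINCT phone) > 1
    ORDER BY _key_out
""").fetchall()

print("multiple rows from one source:", crowded)
print("conflicting phone numbers:", conflicts)
```

Groups flagged by the first query are candidates for a secondary grouping or a tie-breaking rule, while groups flagged only by the second are ordinary conflicts that the DB2-wins rule can settle.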

License: CC-BY-SA with attribution

Not affiliated with StackOverflow