Informatica - Duplicate elimination
-
04-12-2019 - |
Question
I am having a flat file with 10 records, and out of it 5 records are duplicate records (unique key column: Customer_Id
and source_system
). This flat file has to load to a Oracle table and this is the first load.
How can I eliminate the duplicates in it?
Solution
There are quite a few ways to handle this:
- Load the data to a temp table and then just select the distinct records
- Use a sorter or aggregator transformation and then select the distinct option (In the properties tab if I remember correctly)
- Use the dynamic lookup which will only insert records into the cache if they have not been inserted already.
- Use variable port in expression transformation and by using filter we can delete the duplicate records even if it is relational table or flat files.since dynamic lookup is costly, this way is preferable.
OTHER TIPS
You can always use an Aggregator transform and group by all the data you want to keep it distinct for. So if you group by all the columns only those who are distinct will come in the end.
You can use sorter and check load distinct rows.
SRC-->SQ-->SRT-->EXP-->RTR-->TGT
You inout from source is
Col1 Col2
1 A
1 B
2 C
2 D
1 E
1 F
3 G
4 H
5 I
6 J
4 K
3 L
In sorter sort data by col1 and after sorter data looks like this
Col1 Col2
1 A
1 B
1 E
1 F
2 C
2 D
3 G
3 L
4 K
4 H
5 I
6 J
In exp you have two input ports
in_col1
in_col2
create variable ports and output ports in the order like below
v_FLAG= IIF(v_col1=in_col1,2,1)
v_col1=in_col1
out_FLAG=v_FLAG
Then out put of expression data looks like this
Col1 Col2 FLAG
1,A, 1
1, B, 2
1, E, 2
1, F, 2
2, C, 1
2, D, 2
3, G, 1
3, L, 2
4, K,1
4, H, 2
5, I, 1
6, J, 1
In router create two groups one for unique records and another one for duplicate records.
unique=(FLAG=1)
duplicate=(FLAG=2)
connect two groups to two targets.SRC-->SQ-->SRT-->EXP-->RTR-->TGT
You inout from source is
Col1 Col2
1 A
1 B
2 C
2 D
1 E
1 F
3 G
4 H
5 I
6 J
4 K
3 L
In sorter sort data by col1 and after sorter data looks like this
Col1 Col2
1 A
1 B
1 E
1 F
2 C
2 D
3 G
3 L
4 K
4 H
5 I
6 J
In exp you have two input ports
in_col1
in_col2
create variable ports and output ports in the order like below
v_FLAG= IIF(v_col1=in_col1,2,1)
v_col1=in_col1
out_FLAG=v_FLAG
Then out put of expression data looks like this
Col1 Col2 FLAG
1,A, 1
1, B, 2
1, E, 2
1, F, 2
2, C, 1
2, D, 2
3, G, 1
3, L, 2
4, K,1
4, H, 2
5, I, 1
6, J, 1
In router create two groups one for unique records and another one for duplicate records.
unique=(FLAG=1)
duplicate=(FLAG=2)
connect two groups to two targets.