Question

I am having a flat file with 10 records, and out of it 5 records are duplicate records (unique key column: Customer_Id and source_system). This flat file has to load to a Oracle table and this is the first load.

How can I eliminate the duplicates in it?

Was it helpful?

Solution

There are quite a few ways to handle this:

  1. Load the data to a temp table and then just select the distinct records
  2. Use a sorter or aggregator transformation and then select the distinct option (In the properties tab if I remember correctly)
  3. Use the dynamic lookup which will only insert records into the cache if they have not been inserted already.
  4. Use variable port in expression transformation and by using filter we can delete the duplicate records even if it is relational table or flat files.since dynamic lookup is costly, this way is preferable.

OTHER TIPS

You can always use an Aggregator transform and group by all the data you want to keep it distinct for. So if you group by all the columns only those who are distinct will come in the end.

You can use sorter and check load distinct rows.

SRC-->SQ-->SRT-->EXP-->RTR-->TGT

You inout from source is

Col1 Col2

1 A

1 B

2 C

2 D

1 E

1 F

3 G

4 H

5 I

6 J

4 K

3 L

In sorter sort data by col1 and after sorter data looks like this

Col1 Col2

1 A

1 B

1 E

1 F

2 C

2 D

3 G

3 L

4 K

4 H

5 I

6 J

In exp you have two input ports

in_col1

in_col2

create variable ports and output ports in the order like below

v_FLAG= IIF(v_col1=in_col1,2,1)

v_col1=in_col1

out_FLAG=v_FLAG

Then out put of expression data looks like this

Col1 Col2 FLAG

1,A, 1

1, B, 2

1, E, 2

1, F, 2

2, C, 1

2, D, 2

3, G, 1

3, L, 2

4, K,1

4, H, 2

5, I, 1

6, J, 1

In router create two groups one for unique records and another one for duplicate records.

unique=(FLAG=1)

duplicate=(FLAG=2)

connect two groups to two targets.SRC-->SQ-->SRT-->EXP-->RTR-->TGT

You inout from source is

Col1 Col2

1 A

1 B

2 C

2 D

1 E

1 F

3 G

4 H

5 I

6 J

4 K

3 L

In sorter sort data by col1 and after sorter data looks like this

Col1 Col2

1 A

1 B

1 E

1 F

2 C

2 D

3 G

3 L

4 K

4 H

5 I

6 J

In exp you have two input ports

in_col1

in_col2

create variable ports and output ports in the order like below

v_FLAG= IIF(v_col1=in_col1,2,1)

v_col1=in_col1

out_FLAG=v_FLAG

Then out put of expression data looks like this

Col1 Col2 FLAG

1,A, 1

1, B, 2

1, E, 2

1, F, 2

2, C, 1

2, D, 2

3, G, 1

3, L, 2

4, K,1

4, H, 2

5, I, 1

6, J, 1

In router create two groups one for unique records and another one for duplicate records.

unique=(FLAG=1)

duplicate=(FLAG=2)

connect two groups to two targets.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top