Question

I have collected the following file:

20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA

This is a semicolon-separated file with 4 columns. The combination of columns 2 and 3 must be unique, however. Since this dataset has millions of rows, I'm looking for an efficient way to keep only the first occurrence of each duplicate. I therefore need to match on the combination of columns 2 and 3 only and select the first line for each combination.

The expected outcome should be:

20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9 #REMOVED
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2 #REMOVED
20130308;343380;7591;NA #REMOVED

I have made a few attempts myself. The first one is:

grep -oE "\;(.*);" orders_20130304to20140219_v3.txt | uniq 

However, this selects only columns 2 and 3 and discards all other data. Furthermore, uniq only collapses adjacent duplicates, so it does not take into account a match that occurs later in the file. I can fix that by adding sort, but I prefer not to sort.

Another attempt is:

awk '!x[$0]++' test.txt

This does not require any sorting, but matches the complete line.

I think the second attempt is close, but it needs to be changed to look only at the second and third columns instead of the whole line. Does anyone know how to do this?


Solution

Here you go:

awk -F';' '!a[$2 FS $3]++' file

Test with your data:

kent$  awk -F';' '!a[$2 FS $3]++' f 
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
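
For readers who want to see what the one-liner is doing, here is a longer sketch of the same idea. The function name dedupe_first and the array name seen are made up for illustration and are not part of the original answer. The key is built as $2 FS $3 rather than $2 $3 so that, for example, columns "12" and "3" cannot collide with columns "1" and "23".

```shell
# Hypothetical helper: keep only the first line seen for each
# (column 2, column 3) pair in a semicolon-separated file.
# Reads the named files, or standard input when none are given.
dedupe_first() {
    awk -F';' '
        {
            key = $2 FS $3            # composite key: column 2, ";", column 3
            if (!(key in seen)) {     # this pair has not appeared before
                seen[key] = 1         # remember it
                print                 # print the whole original line
            }
        }
    ' "$@"
}
```

It would be called as dedupe_first file, and it should behave the same as awk -F';' '!a[$2 FS $3]++' file. No sorting is needed and the input order is preserved, but awk keeps one array entry per distinct key, so memory use grows with the number of unique column 2 and 3 combinations.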
Licensed under: CC-BY-SA with attribution