Question

I would like to remove all the lines in my data file that contain a value in column 2 that is repeated in column 2 in other lines.

I've sorted by the value in column 2, but can't figure out how to use uniq for just the values in one field as the values are not necessarily of the same length.

Alternatively, I can remove the duplicated lines using an awk one-liner like

awk -F"[,]" '!_[$2]++'

but this retains the line with the first occurrence of the repeated value in col 2 (see the comparison after the sample data below).

As an example, if my data is

a,b,c
c,b,a
d,e,f
h,i,j
j,b,h

I would like to remove ALL lines (including the first) where b occurs in the second column. Like this:

d,e,f
h,i,j

Thanks for any advice!!


Solution

If the order is not important, then the following should work:

awk -F, '
!seen[$2]++ {                  # first time this column-2 value appears: keep the line
    line[$2] = $0
}
END {
    for (val in seen)          # walk every column-2 value seen
        if (seen[val] == 1)    # print only those that occurred exactly once
            print line[val]
}' file

Output

h,i,j
d,e,f
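
If the input order does matter, a small variation on the same idea (a sketch, not part of the original answer) can also record the order in which each column-2 value first appears and replay it at the end:

awk -F, '
!seen[$2]++ {
    order[++n] = $2             # remember order of first appearance
    line[$2] = $0
}
END {
    for (i = 1; i <= n; i++)        # replay in input order
        if (seen[order[i]] == 1)    # keep only values seen exactly once
            print line[order[i]]
}' file

On the sample this prints d,e,f and h,i,j in their original order.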

OTHER TIPS

Solution with grep (the repeated value b is hardcoded in the pattern):

grep -v -E '\b,b,\b' text.txt

Content of the file:

$ cat text.txt 
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
a,n,b
b,c,f

$ grep -v -E '\b,b,\b' text.txt 
d,e,f
h,i,j
a,n,b
b,c,f

Hope it helps
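
The value b is hardcoded in the pattern above. As a rough sketch of how to build the pattern automatically (assuming the column-2 values contain no regex metacharacters and that at least one value really is repeated), the duplicated values can be collected with cut, sort and uniq -d and fed back into grep:

# column-2 values that occur more than once, joined with |
dups=$(cut -d, -f2 text.txt | sort | uniq -d | paste -sd'|' -)

# drop every line whose second field is one of them
grep -vE "^[^,]*,($dups)," text.txt

On the text.txt above this produces the same four lines as the hand-written pattern.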

Some different awk:

awk -F, '
   BEGIN {f=0}
   FNR==NR {_[$2]++; next}                 # first pass: count column-2 values
   f==0 {                                  # first record of the second pass
      f=1
      for(j in _) if(_[j]>1) delete _[j]   # drop values that repeat
   }
   $2 in _                                 # second pass: print lines whose value survived
' file file

Explanation

The awk passes through the file twice - that's why the file name appears twice at the end. On the first pass (when FNR==NR) I count the number of times each column-2 value appears in the array _[]. At the start of the second pass, I delete every element of _[] that was seen more than once. Then, on the second pass, I print the lines whose second field is still in _[].
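
Run against the sample data from the question, this prints the surviving lines in their original input order:

d,e,f
h,i,j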

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow