sort | uniq | xargs grep … where lines contain spaces

https://stackoverflow.com/questions/612439

03-07-2019

Question

I have a comma-delimited file "myfile.csv" where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).

I'm using a bash shell via Cygwin on WinXP.

$ cut -d, -f 5 myfile.csv | sort | uniq -d

correctly returns a list of the duplicate dates

01/01/2005 00:22
01/01/2005 00:37
[snip]    
02/29/2009 23:54

But I cannot figure out how to feed this to grep to give me all the rows. Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.

So, given that

 $ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

doesn't work... what can I do?

I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard commandline tools like sort, uniq, find, grep, cut, etc.

Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?

Solution

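sort -t, -k5,5 sorts on the 5th comma-separated field and avoids the cut;
uniq -f 4 ignores the first 4 fields when comparing;
adding -D to uniq prints all of the repeated lines (vs. -d, which prints just one per group);
but uniq splits fields on blanks (spaces and tabs) rather than commas, so convert the commas with tr ',' '\t' before uniq and back with tr '\t' ',' afterwards.

The problem is that any fields after #5 that differ will make otherwise duplicate rows look distinct. Are your dates all the same length? If so, add -w 16 (to include the time) or -w 10 (for just the date) to the uniq to limit how many characters are compared. With GNU uniq, -f leaves the separator in front of the next field in the comparison, so -s 1 is needed to skip that tab before -w starts counting. This also assumes the first four columns contain no spaces or tabs.

So:

sort -t, -k5,5 myfile.csv | tr ',' '\t' | uniq -f 4 -s 1 -w 16 -D | tr '\t' ','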
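
As a rough check of the sort/uniq approach, here is a minimal sketch that runs it on a tiny made-up file. The sample rows and the function name dup_rows are hypothetical. It assumes GNU coreutils (-D, -w and -s on uniq), that the first four columns contain no blanks, and that the tab left in front of column 5 has to be skipped with -s 1.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: column 5 holds the mm/dd/yyyy hh:mm stamp.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/01/2005 10:00,w
EOF

# Print every row whose 5th column also appears in at least one other row.
dup_rows() {
  sort -t, -k5,5 "$1" |        # bring equal stamps next to each other
    tr ',' '\t' |              # uniq splits fields on blanks, not commas
    uniq -f 4 -s 1 -w 16 -D |  # skip 4 fields and the tab, compare 16 chars
    tr '\t' ','                # turn the tabs back into commas
}

dup_rows "$sample"
```

On this sample the function should print rows 1 and 3, which share the stamp 01/01/2005 00:22.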

OTHER TIPS

The -z option of uniq expects NUL-separated input. You can get NUL-separated rows by filtering the output of cut through:

tr '\n' '\000'

sort, uniq and xargs all have options to handle that. Try something like:

cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
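
The NUL-separated pipeline can be tried on a small made-up file as sketched below. The sample rows and the name dup_rows_nul are hypothetical, and the -F and -- passed to grep are additions here so the stamp is matched as a literal string. Note that grep matches the stamp anywhere in a row, not only in column 5.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: column 5 holds the mm/dd/yyyy hh:mm stamp.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/01/2005 10:00,w
EOF

# List rows containing a stamp that occurs more than once in column 5.
dup_rows_nul() {
  cut -d, -f5 "$1" |
    tr '\n' '\000' |                 # newline-terminated -> NUL-terminated
    sort -z |                        # -z: records end in NUL
    uniq -z -d |                     # one copy of each repeated stamp
    xargs -0 -I{} grep -F -- {} "$1" # one grep per stamp, spaces intact
}

dup_rows_nul "$sample"
```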

You can tell xargs to use each line as an argument in its entirety using the -d option. Try:

cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
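
To see what -d '\n' changes, this small sketch counts how many arguments xargs makes out of a single stamp with and without it. It assumes GNU xargs (the -d option is a GNU extension); the variable names are made up for the demonstration.

```shell
#!/usr/bin/env bash
stamp='01/01/2005 00:22'

# Default splitting: blanks separate arguments, so echo runs once per word.
default_count=$(printf '%s\n' "$stamp" | xargs -n 1 echo | wc -l)

# With -d '\n' only newlines separate arguments, so the line stays whole.
line_count=$(printf '%s\n' "$stamp" | xargs -d '\n' -n 1 echo | wc -l)

echo "default splitting: $default_count argument(s)"
echo "with -d '\\n':      $line_count argument(s)"
```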

Try escaping the spaces with sed:

echo 01/01/2005 00:37 | sed 's/ /\\ /g'
cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv

(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop.)
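
The array idea in the parenthetical above might look roughly like this. It is only a sketch: the sample rows and the name dup_rows_loop are hypothetical, grep -F is an addition so the stamp is matched literally, and the unquoted command substitution assumes the stamps contain no glob characters (*, ?, [).

```shell
#!/usr/bin/env bash
# Hypothetical sample data: column 5 holds the mm/dd/yyyy hh:mm stamp.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/01/2005 10:00,w
EOF

dup_rows_loop() {
  local file=$1 stamp
  local -a stamps
  local IFS=$'\n'   # split the command output on newlines only, not spaces
  stamps=( $(cut -d, -f5 "$file" | sort | uniq -d) )
  for stamp in "${stamps[@]}"; do
    grep -F -- "$stamp" "$file"   # rows containing this repeated stamp
  done
}

dup_rows_loop "$sample"
```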

This is a good candidate for awk:

BEGIN { FS="," }
# split() numbers elements from 1, so A[1] is the date and A[2] the time
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }

Set the field separator to ',' (CSV).
Split the fifth field on the space and store the result in A.
Append the line number to the list already stored for that date.
Print out the line numbers for each date.
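
One way to run the awk steps from the shell is sketched below, on a small made-up file. The sample rows and the name date_lines are hypothetical. The date is taken from A[1], since awk's split() numbers elements from 1, and the output is piped through sort because the order of a for (i in date) loop is unspecified.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: column 5 holds the mm/dd/yyyy hh:mm stamp.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/01/2005 10:00,w
EOF

# Print "date: line numbers" for every date (time of day is ignored).
date_lines() {
  awk '
    BEGIN { FS = "," }
    { split($5, A, " "); date[A[1]] = date[A[1]] " " NR }
    END { for (i in date) print i ":" date[i] }
  ' "$1" | sort
}

date_lines "$sample"
```

Because the key is only the date part, rows 1 to 3 are grouped under 01/01/2005 even though row 2 has a different time; using $5 itself as the key would group by the full stamp instead.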

Licensed under: CC-BY-SA with attribution
