sort | uniq | xargs grep … where lines contain spaces
03-07-2019
Question
I have a comma-delimited file, "myfile.csv", whose 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).
I'm using a bash shell via cygwin on WinXP.
$ cut -d, -f 5 myfile.csv | sort | uniq -d
correctly returns a list of the duplicate dates
01/01/2005 00:22
01/01/2005 00:37
[snip]
02/29/2009 23:54
But I cannot figure out how to feed this to grep to give me all the rows.
Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.
So, given that
$ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
doesn't work... what can I do?
I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard commandline tools like sort, uniq, find, grep, cut, etc.
Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?
Solution
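- sort -k5,5, with -t set to the field delimiter, will do the sort on fields and avoid the cut (without -t, sort splits fields on blanks, not commas);
- uniq -f 4 will ignore the first 4 fields for the uniq;
- Plus a -D on the uniq (a GNU option) will get you all of the repeated lines (vs -d, which gets you just one per group);
- but uniq -f will expect blank-delimited (space or tab) fields instead of csv, so tr ',' '\t' first to fix that, and tr '\t' ',' afterwards to get the commas back.
Problem is if you have fields after #5 that are different, because uniq compares the whole rest of the line. Are your dates all the same length? You can add a -w to limit the comparison. uniq counts the separator in front of a field as part of that field, so that would be -w 17 (to include time), or -w 11 (for just dates). This also assumes the first four fields contain no spaces and are never empty, since uniq -f would miscount the fields otherwise.
So:
tr ',' '\t' < myfile.csv | sort -t$'\t' -k5,5 | uniq -f 4 -D -w 17 | tr '\t' ','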
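A minimal sketch of this field-based approach on a made-up sample file, assuming GNU sort and uniq (-D and -w are GNU options). The function name dup_rows and the sample rows are hypothetical. The commas are turned into tabs before uniq -f counts fields, and the width is 17 rather than 16 because uniq includes the separator in front of the field in the comparison.

```shell
# dup_rows FILE: print every row of FILE whose 5th comma-separated field
# (a 16-character "mm/dd/yyyy hh:mm" stamp) occurs more than once.
# Assumptions: GNU sort/uniq, no tabs in the data, and no spaces or
# empty values in the first four fields.
dup_rows() {
  tab=$(printf '\t')
  # 1. commas -> tabs, so that uniq -f can count fields
  # 2. sort on field 5 so equal stamps become adjacent
  # 3. skip 4 fields, compare 17 chars (1 separator + 16 stamp chars),
  #    and print all members of each duplicate group (-D)
  # 4. tabs -> commas again
  tr ',' '\t' < "$1" |
    sort -t "$tab" -k5,5 |
    uniq -f 4 -D -w 17 |
    tr '\t' ','
}

# Hypothetical sample data for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,01/01/2005 00:29,w
EOF

dup_rows "$sample"
rm -f "$sample"
```

On this sample only rows 1 and 3 are printed. Row 4 (00:29) differs from them only in its last character, which is exactly the character a width of 16 would leave out of the comparison.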
OTHER TIPS
The -z option of uniq needs the input to be NUL separated. You can filter the output of cut through:
tr '\n' '\000'
To get zero-separated rows. Then sort, uniq and xargs have options to handle that. Try something like:
cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
Edit: the position of tr in the pipe was wrong.
You can tell xargs to use each line as an argument in its entirety using the -d option. Try:
cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
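As a rough illustration of the newline-delimited xargs approach, here is a sketch on a made-up sample file. It assumes GNU xargs (the -d option is a GNU extension). The function name rows_with_dup_stamps and the sample rows are hypothetical, and the -F and -- passed to grep are additions that make it treat each stamp as a literal string instead of a regular expression.

```shell
# rows_with_dup_stamps FILE: run one grep per duplicated date/time stamp.
# Assumption: GNU xargs, which accepts -d '\n' to split input on newlines
# only, so a stamp such as "01/01/2005 00:22" stays a single argument.
rows_with_dup_stamps() {
  cut -d, -f5 "$1" | sort | uniq -d |
    xargs -d '\n' -I '{}' grep -F -- '{}' "$1"
}

# Hypothetical sample data for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,01/01/2005 00:37,w
5,a,b,c,01/02/2005 10:00,v
EOF

rows_with_dup_stamps "$sample"
rm -f "$sample"
```

The rows come out grouped by stamp (1 and 3, then 2 and 4) and row 5 is skipped. Note that grep matches the stamp anywhere in the line, not only in the 5th column.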
Try escaping the spaces with sed:
echo 01/01/2005 00:37 | sed 's/ /\\ /g'
cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv
(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop.)
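The loop idea could look roughly like the sketch below. It sets IFS to a newline so that the list of duplicate stamps is split per line, then greps for each one. The function name rows_per_dup_stamp and the sample rows are hypothetical. The list is kept in a plain variable so the sketch also runs under a POSIX sh; in bash the same list could be read into a real array with mapfile -t.

```shell
# rows_per_dup_stamp FILE: loop over the duplicated stamps, one grep each.
rows_per_dup_stamp() {
  file=$1
  dup_stamps=$(cut -d, -f5 "$file" | sort | uniq -d)

  # Split on newlines only, so "01/01/2005 00:22" stays one item.
  old_ifs=$IFS
  IFS='
'
  set -f                       # no filename expansion while splitting
  for stamp in $dup_stamps; do
    # -F: match the stamp literally, not as a regular expression
    grep -F -- "$stamp" "$file"
  done
  set +f
  IFS=$old_ifs
}

# Hypothetical sample data for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,01/01/2005 00:37,w
5,a,b,c,01/02/2005 10:00,v
EOF

rows_per_dup_stamp "$sample"
rm -f "$sample"
```

This reads the file once per duplicated stamp, which is fine for a quick check but slow when there are many duplicates.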
This is a good candidate for awk:
BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
- Set field separator to ',' (CSV).
- Split fifth field on the space, stick result in A.
- Concatenate the line number to the list of what we have already stored for that date.
- Print out the line numbers for each date.
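The script above reports line numbers per date. If the rows themselves are wanted, one possible variation (not the script above, and keyed on the full date/time stamp in the 5th field instead of the date alone) is a two-pass awk that reads the file twice. The function name dup_rows_awk and the sample rows are hypothetical.

```shell
# dup_rows_awk FILE: print rows whose 5th comma-separated field repeats.
# The file is passed to awk twice: the first pass counts each stamp,
# the second pass prints the rows whose stamp was seen more than once.
dup_rows_awk() {
  awk -F, '
    NR == FNR { count[$5]++; next }   # first pass: tally every stamp
    count[$5] > 1                     # second pass: print repeated ones
  ' "$1" "$1"
}

# Hypothetical sample data for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,01/01/2005 00:37,w
5,a,b,c,01/02/2005 10:00,v
EOF

dup_rows_awk "$sample"
rm -f "$sample"
```

Rows 1 to 4 are printed in their original file order and row 5 is dropped. No sorting is needed, and fields after the 5th do not affect the comparison.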