awk -F'\t' '{print $4}' infile.txt | sort | uniq -c | sort -nr | awk '{print $2}' | xargs -I % grep % infile.txt > outfile.txt
Sorting a column by number of occurrences
Question
I have some data separated by tabs
8/1/12 15:22 622070509 Pig 123123123
8/1/12 15:27 569038096 Monkey 123123123
8/1/12 15:21 389549550 CatDog 123123
8/1/12 15:26 558161100 Monkey 1231245
8/1/12 15:28 274990777 CatDog 112312
8/1/12 15:22 274990777 CatDog 12341
I want to sort column four by number of occurrences, in descending order, so the output would look like this:
8/1/12 15:22 274990777 CatDog 12341
8/1/12 15:28 274990777 CatDog 112312
8/1/12 15:21 389549550 CatDog 123123
8/1/12 15:26 558161100 Monkey 1231245
8/1/12 15:27 569038096 Monkey 123123123
8/1/12 15:22 622070509 Pig 123123123
So far:
sort -t$'\t' -k4 file.txt
Sorts by alphabetical order just fine, but I'm not seeing a parameter for sort by # of occurrences.
Solution 2
OTHER TIPS
Learn to think algorithmically. How would you process the data by hand?
- Count the number of occurrences of each value in the fourth column, giving you a pair {Name, Count}.
- Join the main data with the {Name, Count} data, giving you an extra column that tells you the number of occurrences.
- Sort the augmented data by descending Count, and within equal counts by Name.
- Drop the Count column from the output.
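The four steps above can be sketched with standard tools. This is only one possible arrangement, not the only way to do it: the scratch directory, the file names (data.tsv, counts.tsv, joined.tsv) and the tie-break on column 3 and the time are assumptions made for illustration, and it assumes the values in column four contain no spaces.

```shell
tab=$(printf '\t')    # a literal tab, portable to shells without $'\t'
work=$(mktemp -d)     # scratch directory for the intermediate files (assumed layout)

# Sample rows from the question, tab-separated.
printf '%s\t%s\t%s\t%s\t%s\n' \
  8/1/12 15:22 622070509 Pig 123123123 \
  8/1/12 15:27 569038096 Monkey 123123123 \
  8/1/12 15:21 389549550 CatDog 123123 \
  8/1/12 15:26 558161100 Monkey 1231245 \
  8/1/12 15:28 274990777 CatDog 112312 \
  8/1/12 15:22 274990777 CatDog 12341 > "$work/data.tsv"

export LC_ALL=C       # sort and join must agree on the collation order

# 1. Count occurrences of column 4, written as Name<TAB>Count, sorted by Name.
cut -f4 "$work/data.tsv" | sort | uniq -c |
  awk -v OFS='\t' '{ print $2, $1 }' > "$work/counts.tsv"

# 2. Join the data (sorted on column 4) with the counts; -o puts Count first.
sort -t"$tab" -k4,4 "$work/data.tsv" |
  join -t"$tab" -1 4 -2 1 -o 2.2,1.1,1.2,1.3,1.4,1.5 - "$work/counts.tsv" \
  > "$work/joined.tsv"

# 3. Sort by Count (descending), then Name, then column 3 and the time.
# 4. Drop the Count column again.
staged=$(sort -t"$tab" -k1,1nr -k5,5 -k4,4n -k3,3 "$work/joined.tsv" | cut -f2-)

printf '%s\n' "$staged"
rm -rf "$work"
```

On the sample data from the question, this should list the three CatDog rows first, then the two Monkey rows, then the Pig row.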
There are Unix tools to support all those operations with greater or lesser degrees of difficulty. There are, indeed, multiple ways to do each step. You can do it all in Perl or Python (or, indeed, awk). Or you can do it in stages, using awk, join, sort, and perhaps sed.
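As one example of the awk-centred route, awk can read the file twice: once to count, once to prepend the count to each row, leaving only the sort and the final column drop to other tools. This is a minimal sketch, assuming the input fits this two-pass pattern; the temporary file and the tie-break keys (column 3, then the time) are illustrative assumptions.

```shell
tab=$(printf '\t')    # a literal tab, portable to shells without $'\t'
infile=$(mktemp)      # temporary copy of the question's sample data

printf '%s\t%s\t%s\t%s\t%s\n' \
  8/1/12 15:22 622070509 Pig 123123123 \
  8/1/12 15:27 569038096 Monkey 123123123 \
  8/1/12 15:21 389549550 CatDog 123123 \
  8/1/12 15:26 558161100 Monkey 1231245 \
  8/1/12 15:28 274990777 CatDog 112312 \
  8/1/12 15:22 274990777 CatDog 12341 > "$infile"

# Pass 1 (NR == FNR): count each value of column 4.
# Pass 2: print every row with its count prepended as a new first column.
# Then sort by count (descending), name, column 3 and time, and drop the count.
result=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { count[$4]++; next }
    { print count[$4], $0 }
  ' "$infile" "$infile" |
  LC_ALL=C sort -t"$tab" -k1,1nr -k5,5 -k4,4n -k3,3 |
  cut -f2-)

printf '%s\n' "$result"
rm -f "$infile"
```

Passing the same file name twice is what lets awk see the complete counts before it prints any row; the trade-off is that the input must be a regular file rather than a pipe.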
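To compare a column by its numeric value rather than alphabetically, set the flag for numerical comparison (-n). Note that this orders rows by the value itself, not by how often it occurs: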
sort -t$'\t' -k 4 -n file.txt
You can also define the second sorting column like this:
sort -t$'\t' -k4n,4 -k3,3 file.txt
This will sort first by the 4th column numerically, and when it finds equal items, it will sort by the 3rd column alphabetically.
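To see what the numeric flag changes, here is a small sketch on made-up two-column rows (the letters and numbers are hypothetical, not taken from the question). It sorts the same rows once with a plain key and once with a numeric key, using column 1 as the tie-breaker in both cases.

```shell
tab=$(printf '\t')    # a literal tab, portable to shells without $'\t'

# Hypothetical rows: a label in column 1, a number in column 2.
rows=$(printf '%s\t%s\n' b 10 a 9 c 9 d 100)

# Plain key on column 2: compared as text, so "10" and "100" sort before "9".
lexical=$(printf '%s\n' "$rows" | LC_ALL=C sort -t"$tab" -k2,2 -k1,1)

# Numeric key on column 2; rows with equal numbers fall back to column 1.
numeric=$(printf '%s\n' "$rows" | LC_ALL=C sort -t"$tab" -k2,2n -k1,1)

printf 'lexical:\n%s\n\nnumeric:\n%s\n' "$lexical" "$numeric"
```

The plain key should list the labels in the order b, d, a, c, while the numeric key should give a, c, b, d.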