Sorting a column by number of occurrences

https://stackoverflow.com/questions/23160249

05-07-2023

Question

I have some data separated by tabs

 8/1/12 15:22   622070509   Pig 123123123
 8/1/12 15:27   569038096   Monkey  123123123
 8/1/12 15:21   389549550   CatDog  123123
 8/1/12 15:26   558161100   Monkey  1231245
 8/1/12 15:28   274990777   CatDog  112312
 8/1/12 15:22   274990777   CatDog  12341

I want to sort column four by number of occurrences, in descending order, so the output would look like this:

8/1/12 15:22    274990777   CatDog  12341
8/1/12 15:28    274990777   CatDog  112312
8/1/12 15:21    389549550   CatDog  123123
8/1/12 15:26    558161100   Monkey  1231245
8/1/12 15:27    569038096   Monkey  123123123
8/1/12 15:22    622070509   Pig 123123123

So far:

sort -t$'\t' -k4 file.txt

This sorts alphabetically just fine, but I don't see a sort option for ordering by number of occurrences.

Solution 2

awk -F'\t' '{print $4}' infile.txt | sort | uniq -c | sort -nr | awk '{print $2}' | xargs -I % grep % infile.txt > outfile.txt
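To see what the ranking stage of this pipeline produces before the lines are pulled back out with grep, the counting part can be run on its own. This is only a sketch: the function name column_frequencies is made up for illustration, and it assumes tab-separated input with the name in column 4 unless another column number is passed.

```shell
# Hypothetical helper (not part of the original answer):
# print "count value" pairs for one tab-separated column, most frequent first.
column_frequencies() {
    # $1 = input file, $2 = column number (assumed default: 4)
    cut -f"${2:-4}" "$1" |
        sort |               # group identical values together for uniq
        uniq -c |            # prefix each distinct value with its count
        sort -k1,1nr -k2     # highest count first, ties ordered by value
}
```

Calling column_frequencies infile.txt on the sample data should list CatDog first, then Monkey, then Pig. Keep in mind that the final grep in the one-liner treats each name as a regular expression and matches it anywhere on the line, not only in column four.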

OTHER TIPS

Learn to think algorithmically. How would you process the data by hand?

Count the number of occurrences of each value in the fourth column, giving you a pair {Name, Count}.
Join the main data with the {Name, Count} data, giving you an extra column that tells you the number of occurrences.
Sort the augmented data by descending Count, and within equal counts by Name.
Drop the Count column from the output.

There are Unix tools to support all those operations, with greater or lesser degrees of difficulty, and there are multiple ways to do each step. You can do it all in Perl or Python (or, indeed, awk). Or you can do it in stages, using awk, join, sort, and perhaps sed.
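One possible way to stage the four steps above uses awk, sort, and cut. Treat it as a sketch under stated assumptions rather than the definitive answer: the function name sort_by_occurrence is hypothetical, the input is assumed to be tab-separated, and the column defaults to 4. The awk program reads the file twice, so it needs a regular file rather than a pipe.

```shell
# Hypothetical helper: sort_by_occurrence FILE [COLUMN]
# Prints the lines of FILE so that the most frequent COLUMN value comes first.
sort_by_occurrence() {
    file=$1
    col=${2:-4}                 # assumed default: the fourth column
    tab=$(printf '\t')          # literal tab for sort -t

    # Steps 1 and 2: the first pass (NR == FNR) counts each value;
    # the second pass prepends "count<TAB>" to every line.
    awk -F'\t' -v c="$col" '
        NR == FNR { count[$c]++; next }
        { print count[$c] "\t" $0 }
    ' "$file" "$file" |
        # Step 3: descending count, then the name, which moved one column right.
        sort -t "$tab" -k1,1nr -k"$((col + 1)),$((col + 1))" |
        # Step 4: drop the helper count column.
        cut -f2-
}
```

It would be called as sort_by_occurrence file.txt 4 > outfile.txt. Lines that tie on both count and name fall back to sort's whole-line comparison, so their relative order may differ from the order shown in the question.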

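If what you need is to order a column by its numeric value rather than by how often a value occurs, you have to set the flag for numerical comparison (-n):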

sort -t$'\t' -k 4 -n file.txt

You can also define the second sorting column like this:

sort -t$'\t' -k4n,4 -k3,3 file.txt

This will sort first by the 4th column numerically, and when it finds equal items, it will sort by the 3rd column alphabetically.
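The difference between lexical and numeric keys is easy to check on a few made-up rows. The sample data and variable names below are invented for illustration only; the fourth column holds the number and the third the name.

```shell
tab=$(printf '\t')   # portable way to pass a literal tab to sort -t

# Made-up sample rows: id, code, name, value.
sample=$(printf '1\ta\tPig\t10\n2\tb\tMonkey\t9\n3\tc\tCatDog\t10\n')

# Lexical comparison on column 4: the string "10" sorts before "9".
lexical=$(printf '%s\n' "$sample" | sort -t "$tab" -k4,4 | cut -f3)

# Numeric comparison on column 4, with ties broken by column 3.
numeric=$(printf '%s\n' "$sample" | sort -t "$tab" -k4,4n -k3,3 | cut -f3)

printf 'lexical: %s\nnumeric: %s\n' "$(echo $lexical)" "$(echo $numeric)"
```

With the numeric key, the Monkey row (value 9) comes first, and the two rows with value 10 are then ordered by name.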

Licensed under: CC-BY-SA with attribution
