Question

I have a multi-field text file. I'd like a command that combines the behavior of both sort -n -u -k and uniq -c: that is, sort the file on a certain key field and prepend (or append) the number of duplicates to the original line. At the moment, I can either sort on the key field and obtain the first of the duplicated lines, without the number of duplicates, with sort -n -u -k, or count the number of duplicates with uniq -c after extracting the key field.

Can you suggest a command that implements both behaviors?

An example of the file (the key column can be any of the columns shown):

       4549              1       22656489       63452157           3235           1116            612         532275        6009800         534075        6012488         477375        5995844         533175        6011144        8388615            236
       4549              2       22656489       63452158           3214           1116            613         532275        6009825         534075        6012488         477375        5995831         533175        6011157        8388615            236
       4549              3       22656489       63452159           3193           1116            614         532275        6009850         534075        6012488         477375        5995819         533175        6011169        8388615            236
       4549              4       22656489       63452160           3173           1116            615         532275        6009875         534075        6012488         477375        5995806         533175        6011182        8388615            235
       4549              5       22656489       63452161           3152           1116            616         532275        6009900         534075        6012488         477375        5995794         533175        6011194        8388615            235
       4549              6       22656489       63452162           3131           1116            617         532275        6009925         534075        6012488         477375        5995781         533175        6011207        8388615            235
       4549              7       22656489       63452163           3111           1116            618         532275        6009950         534075        6012488         477375        5995769         533175        6011219        8388615            235
       4549              8       22656489       63452164           3091           1116            619         532275        6009975         534075        6012488         477375        5995756         533175        6011232        8388615            234
       4549              9       22656489       63452165           3070           1116            620         532275        6010000         534075        6012488         477375        5995744         533175        6011244        8388615            234
       4549             10       22656489       63452166           3050           1116            621         532275        6010025         534075        6012488         477375        5995731         533175        6011257        8388615            234
       4549             11       22656489       63452167           3030           1116            622         532275        6010050         534075        6012488         477375        5995719         533175        6011269        8388615            234
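
For reference, the two separate commands described above might look like this (just a sketch, assuming the key is column 17 and the file is named data):

sort -n -u -k17,17 data
awk '{print $17}' data | sort -n | uniq -c

The first keeps one line per key value but loses the counts; the second gives the counts but loses the rest of the line.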

Solution

Using decorate-sort-undecorate, you can append to the data the fields you want to base your processing on, do the processing, and then remove the extra fields. For example, to sort on fields 17 and 5:

awk '{print $0 OFS $17 OFS $5}' test_s  | sort -n -k18 -k19  | uniq -c -f17 | awk '{NF=18;print}'

You first append the key fields, then sort and uniq on them, and finally keep only the count added by uniq plus the original fields.
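
For instance, applying the same decorate-sort-undecorate pattern with just column 17 as the key (a sketch based on the command above and the sample data):

awk '{print $0 OFS $17}' test_s | sort -n -k18 | uniq -c -f17 | awk '{NF=18;print}'

This should prepend the counts 4, 4 and 3 to the first surviving line for the key values 234, 235 and 236, respectively.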

OTHER TIPS

As I currently understand it, you want to specify one or more columns to use as a key and obtain a result with each output line showing the multiplicity for that key. In that case, suppose your data is in a file called "data" and we want column 17 as the key:

$ awk '{print $17}' data  | sort -n | uniq -c
  4 234
  4 235
  3 236

Thus, the value of 236 appears in column 17 a total of 3 times in your test data. Or, suppose you wanted columns 6, 8, 1, and 3 as the key (and in that order):

$ awk '{print $6,$8,$1,$3}' data  | sort -n | uniq -c
 11 1116 532275 4549 22656489

For this key, all 11 lines are duplicates.

This approach has three steps. First, awk selects the columns you want, in the order you want. Second, sort -n sorts the keys numerically. Finally, uniq -c counts the duplicates.
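
As a side note, the same counting can be done in a single awk pass, without sort and uniq (a sketch; the groups come out in no particular order, since awk arrays are unordered):

awk '{count[$6 OFS $8 OFS $1 OFS $3]++} END {for (k in count) print count[k], k}' data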

UPDATE: Suppose, as above, we want to use columns 6, 8, 1, and 3 as the key but, as per your comment, we want to keep one of the original lines. In this case, we instruct awk to put the original 17 columns before the key, tell sort to sort on the key (columns 18 and beyond), and then instruct uniq to ignore those first 17 columns:

awk '{print $0,$6,$8,$1,$3}' data  | sort -k18 -n | uniq -f 17 -c

For your sample data, this results in:

     11        4549             10       22656489       63452166           3050           1116            621         532275        6010025         534075        6012488         477375        5995731         533175        6011257        8388615            234 1116 532275 4549 22656489

If you only want the count and the original 17 columns printed, then we can use perl to keep just those and crop off the key:

awk '{print $0,$6,$8,$1,$3}' data  | sort -k18 -n | uniq -f 17 -c | perl -nle '@a=split;print join " ", @a[0..17]'

which results in:

11 4549 10 22656489 63452166 3050 1116 621 532275 6010025 534075 6012488 477375 5995731 533175 6011257 8388615 234
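
Alternatively, the trailing key columns can be cropped with awk, as in the solution above (a sketch; like the perl version, it collapses the original spacing to single spaces):

awk '{print $0,$6,$8,$1,$3}' data | sort -k18 -n | uniq -f 17 -c | awk '{NF=18;print}'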
Licensed under: CC-BY-SA with attribution