summation or sorting

https://stackoverflow.com/questions/13092263

awk
grep

14-07-2021

Question

I have a data file like:

82 DEX26_28_h
82 DEX26_28_h
873 DEX34_h
89 DEX37_h
1 DEX34_h

I intend to sort on $2 so that identical items in column two (there are 17 unique values) end up next to each other. Then I would like to sum all the numbers in $1 for each value of $2.

Ideal result for the test file above:

164 DEX26_28_h
874 DEX34_h
89 DEX37_h

Make sense? Basically, I need to sum the total number of sequences ($1) that occur for each sample ($2) and keep one line per unique $2 along with its sum, so that the end result is 17 lines in total.

Should I just grep out each of the 17 identifiers in $2 and then sum them using awk?

What do you guys think?

Solution

You can use an array in awk to do the summation:

awk '{arr[$2]+=$1} END {for (i in arr) {print arr[i],i}}'

Then you can pipe it to sort afterwards.
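As a minimal sketch of the full pipeline, the sample from the question can be fed through the awk command and then sorted on the key. The variable names input and result are only for illustration, and sorting on field 2 is an assumption about the order you want. The sort step matters because awk does not guarantee any particular order for the keys in a for (i in arr) loop.

```shell
# Sample data from the question, held in a shell variable for illustration.
input='82 DEX26_28_h
82 DEX26_28_h
873 DEX34_h
89 DEX37_h
1 DEX34_h'

# Sum column 1 per key in column 2, then sort the summary lines by the key.
result=$(printf '%s\n' "$input" \
  | awk '{arr[$2]+=$1} END {for (i in arr) {print arr[i],i}}' \
  | sort -k2,2)

printf '%s\n' "$result"
```

With a real file you would pass its name to awk instead of piping from printf. To order by the sums rather than the sample names, sort -k1,1nr would put the largest total first.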

What this does:

arr[$2]+=$1: Adds the value of $1 to the element of the array arr whose index (key) is $2. An element that has not been set yet is treated as 0 in a numeric context, so you can use += without checking whether the key already exists. If you're unfamiliar with arrays, this is basically building a lookup table in memory keyed on your $2 field.
END {...}: Runs this block once, after all input has been processed.
for (i in arr): For every key in the array arr, assigns that key to i and runs the code in the following block.
print arr[i],i: Prints the value stored in arr under key i, then the key i itself.
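To see the lookup table being built, here is a small sketch that prints the running total after each input line instead of waiting for END. The extra print statement and the variable names input and trace are additions for illustration and are not part of the original answer.

```shell
# Sample data from the question, held in a shell variable for illustration.
input='82 DEX26_28_h
82 DEX26_28_h
873 DEX34_h
89 DEX37_h
1 DEX34_h'

# After each line, show which key was updated and its running sum.
trace=$(printf '%s\n' "$input" \
  | awk '{arr[$2]+=$1; print "after line " NR ": arr[" $2 "] = " arr[$2]}')

printf '%s\n' "$trace"
# after line 1: arr[DEX26_28_h] = 82
# after line 2: arr[DEX26_28_h] = 164
# after line 3: arr[DEX34_h] = 873
# after line 4: arr[DEX37_h] = 89
# after line 5: arr[DEX34_h] = 874
```

The first time a key appears, its element starts out empty and counts as 0, so the first += simply stores $1.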

Licensed under: CC-BY-SA with attribution
