Question
I have a data file like:
82 DEX26_28_h
82 DEX26_28_h
873 DEX34_h
89 DEX37_h
1 DEX34_h
And I intend to sort by $2 so that the rows for each item (17 unique values) in column two are next to each other. Then I would like to sum up all the numbers in $1, grouped by $2.
Ideal result for the test file above:
164 DEX26_28_h
874 DEX34_h
89 DEX37_h
Make sense? Basically need to sum up the total number of sequences ($1) that occur for each sample ($2) and uniq only $2 while keeping the sum. So that the end result becomes 17 total lines.
Should I just grep out each of the 17 identifiers in $2 and then sum them using awk?
What do you guys think?
Solution
You can use an array in awk to do the summation:
awk '{arr[$2]+=$1} END {for (i in arr) {print arr[i],i}}'
Then you can pipe it to sort afterwards.
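As a rough sketch of the full pipeline, assuming the sample data from the question is saved in a file called counts.txt (a made-up name for illustration), it could look like this. The sort step matters because awk does not guarantee any particular order when looping with for (i in arr).

```shell
# Sample data from the question, written to a hypothetical file name.
cat > counts.txt <<'EOF'
82 DEX26_28_h
82 DEX26_28_h
873 DEX34_h
89 DEX37_h
1 DEX34_h
EOF

# Sum column 1 per key in column 2, then sort the output by the key.
result=$(awk '{ arr[$2] += $1 } END { for (i in arr) print arr[i], i }' counts.txt | sort -k2,2)
printf '%s\n' "$result"
```

With the five sample lines this should give one line per sample: 164 DEX26_28_h, 874 DEX34_h and 89 DEX37_h. Use sort -k1,1n instead if you would rather order by the totals.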
What this does:
arr[$2]+=$1 : Adds the value of $1 to the element in the array arr with index (key) $2. (Previously undefined elements are conveniently treated as 0 for you, so yes, you can do a += here without worrying whether the key "exists" or not.) If you're unfamiliar with arrays, this is basically creating a lookup table in memory based on your $2 field.
END ... : Do this once at the end of processing.
for (i in arr) : For every key in the array arr, assign that key to i and run the code in the following block.
print arr[i],i : Prints first the value in arr with key i, then the key i itself.
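If you want to convince yourself of the "starts at 0" behaviour, here is a small standalone check. It needs no input file, and the array name total is just picked for the demo.

```shell
zero_demo=$(awk 'BEGIN {
    total["DEX34_h"] += 873    # key did not exist yet, so this starts from 0
    total["DEX34_h"] += 1      # same key again, adds to the running sum
    print total["DEX34_h"]
    # "in" tests for a key without creating it
    print (("DEX37_h" in total) ? "present" : "absent")
}')
printf '%s\n' "$zero_demo"
```

This should print 874 and then absent, since DEX37_h was never assigned in this snippet.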