Question

I have a data file like:

82 DEX26_28_h
82 DEX26_28_h
873 DEX34_h
89 DEX37_h
1 DEX34_h

I want to sort on $2 so that identical items in column two (there are 17 unique values) end up next to each other. Then I would like to sum all the numbers in $1 for each value of $2.

Ideal result for the test file above:

164 DEX26_28_h
874 DEX34_h
89 DEX37_h

Make sense? Basically, I need to sum the total number of sequences ($1) that occur for each sample ($2), keeping one line per unique $2 along with its sum, so that the end result is 17 lines in total.

Should I just grep out each of the 17 identifiers in $2 and then sum them using awk?

What do you guys think?


Solution

You can use an array in awk to do the summation:

awk '{arr[$2]+=$1} END {for (i in arr) {print arr[i],i}}'

The order in which for (i in arr) visits the keys is unspecified, so pipe the output to sort afterwards if you need it ordered.
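As a rough sketch of the whole pipeline, here is the one-liner run on the sample data from the question and sorted on the second column. The variable names input and result, and the use of LC_ALL=C to get a predictable byte-wise sort order, are my own assumptions and not part of the original answer.

```shell
# Sample data from the question, held in a variable for illustration.
input='82 DEX26_28_h
82 DEX26_28_h
873 DEX34_h
89 DEX37_h
1 DEX34_h'

# Sum column 1 per key in column 2, then sort on the key (column 2).
result=$(printf '%s\n' "$input" \
  | awk '{arr[$2]+=$1} END {for (i in arr) {print arr[i],i}}' \
  | LC_ALL=C sort -k2,2)

printf '%s\n' "$result"
```

With a real file you would pass the file name to awk instead of piping the variable in.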


What this does:

  • arr[$2]+=$1: Adds the value of $1 to the element in the array arr with index (key) $2. (Elements that have not been set yet are conveniently treated as 0 in arithmetic, so yes, you can do a += here without worrying whether the key "exists" or not.) If you're unfamiliar with arrays, this is basically creating a lookup table in memory based on your $2 field.

  • END...: Do this once at the end of processing

  • for (i in arr): For every key in the array arr, assign that key to i and run the code in the following block.

  • print arr[i],i: Prints first the value in arr with key i, then the key i itself.
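The pieces described above can also be written out as a multi-line awk program with one comment per step. This is only an expanded sketch of the same one-liner; the variable names input and expanded are assumptions for the sake of the example.

```shell
# Sample data from the question, held in a variable for illustration.
input='82 DEX26_28_h
82 DEX26_28_h
873 DEX34_h
89 DEX37_h
1 DEX34_h'

expanded=$(printf '%s\n' "$input" | awk '
{
    # Runs for every input line: $2 is the key, $1 is the number to add.
    # An element that was never set counts as 0 in arithmetic.
    arr[$2] += $1
}
END {
    # Runs once after the last line; the key order here is unspecified.
    for (i in arr) {
        print arr[i], i
    }
}')

printf '%s\n' "$expanded"
```

The output has one line per unique value of $2, so the real 17-sample file should produce 17 lines.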

Licensed under: CC-BY-SA with attribution