Well, this turned out to be quite challenging. I couldn't find a way to do it with a single awk
command, though.
awk -v const=5000000 -v max=150 \
'{a[$1,int($4/const)]++; b[$1]}
 END{for (i in b)
       {for (j=0; j<max; j++)
          print i, j*const +1, (j+1)*const, a[i,j]
       }
    }' file
And then, to keep only the blocks that actually have a count, pipe the output through:
awk 'NF==4'
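For what it's worth, the filtering step can probably be folded into the END block by testing whether the (i, j) index exists in the array before printing. This is only a sketch of that idea, not something tested against the original data; the wrapper function name bin_counts is made up for illustration, and the order in which first-column values come out still depends on awk's unspecified for-in order.

```shell
# bin_counts FILE: hypothetical wrapper, same counting as above but in one awk call.
bin_counts() {
    awk -v const=5000000 -v max=150 '
        { a[$1, int($4/const)]++; b[$1] }    # count per (1st field, block)
        END {
            for (i in b)
                for (j = 0; j < max; j++)
                    if ((i, j) in a)         # skip blocks that were never counted
                        print i, j*const + 1, (j+1)*const, a[i, j]
        }' "$1"
}
```

Calling bin_counts file should then give the same lines as the two-step pipeline, without the second awk.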
Explanation
-v const=5000000 -v max=150
sets the variables. const is the 5 million block size used to split the results. max is the number of blocks the END block loops over, so positions up to max*const are covered.

a[$1,int($4/const)]++
creates an array indexed by (1st field, block number of the 4th field). The second part, int($4/const), maps 23432 --> 0, 6000000 --> 1, etc. That is, it tells which block of values each 4th-column value falls into.

b[$1]
keeps track of the first-column values that have been seen.

END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}}
prints the values: for every first-column value and every block, the first-column value, the start and end of the block, and the count (empty if nothing fell into that block).

awk 'NF==4'
prints only those lines that have 4 columns. This way it outputs only the cases in which there were matches.
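To see the block arithmetic on its own, here is a small sketch that prints the block index and the range reported for a single position. The helper name block_range and its optional second argument are invented for this illustration; the formulas are the ones used in the command above.

```shell
# block_range POS [CONST]: hypothetical helper showing int(POS/CONST) and the printed range.
block_range() {
    awk -v pos="$1" -v const="${2:-5000000}" 'BEGIN {
        j = int(pos / const)                        # block index, as in a[$1,int($4/const)]
        print pos, j, j*const + 1, (j+1)*const      # position, index, block start, block end
    }'
}

block_range 23432      # prints: 23432 0 1 5000000
block_range 6000000    # prints: 6000000 1 5000001 10000000
```

One thing this makes visible: a position that is an exact multiple of const (say 5000000) gets index 1, whose printed range starts at 5000001. If the upper bound should be inclusive, int(($4-1)/const) would be one way to shift it.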
In case you want to store the values in new files, one per value of the first column, you can do
awk 'NF==4 {print > ("OutputChr" $1 ".txt")}'
Sample output
$ awk -v const=5000000 -v max=150 '{a[$1,int($4/const)]++; b[$1]} END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}}' a | awk 'NF==4'
1 1 5000000 2
1 20000001 25000000 3
2 155000001 160000000 2
2 255000001 260000000 1
2 355000001 360000000 1
2 455000001 460000000 1