Question

I have a file that looks like this:

1   rs531842    503939  61733   G   A
1   rs10494103  35025   114771  C   T
1   rs17038458  254490  21116837    G   A
1   rs616378    525783  21127670    T   C
1   rs3845293   432526  21199392    A   C
2   rs16840461  233620  157112959   A   G
2   rs1560628   224228  157113214   T   C
2   rs17200880  269314  257145829   C   T
2   rs10497165  35844   357156412   C   T
2   rs7607531   624696  457156575   T   C

...with column 1 running from 1 up to 22, and several thousand entries in total.

I want to create a file that lists bins of 5 million from column 4 which have data, separating by column 1.

Basically, all but columns 1 and 4 can be discarded. A simple input would look like this:

InputChr1:

61733
114771
21116837
21127670
21199392

InputChr2:

157112959
157113214
257145829 
357156412
457156575

So, for the example above, I would want to get two files that look like this:

OutputChr1.txt

Start End Occurrences
1 5000000 2
20000001 25000000 3

OutputChr2.txt

Start End Occurrences
155000001 160000000 2 
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1

Any ideas? It seems like something that should be doable with lapply in R, but I can't get the for loops to work...

EDIT: Actually, I made this look much harder than it needed to be - basically, I want to split the original file by column 1, extract the data in column 4, and then count the instances in bins of 5 million.

(Apologies for slightly random tags, just trying to think of which tools might be best!)


Solution

Well, this turned out to be quite challenging. I couldn't find a way to do it with a single awk command, though.

awk -v const=5000000 -v max=150 \
    '{a[$1,int($4/const)]++; b[$1]}
      END{for (i in b)
             {for (j=0; j<max; j++)
                   print i, j*const +1, (j+1)*const, a[i,j]
             }
         }' file

And then to get only the results:

awk 'NF==4'

Explanation

  • -v const=5000000 -v max=150 set the variables: const is the 5-million bin size, and max is the largest bin index the END block will look up.
  • a[$1,int($4/const)]++ builds an array indexed by (1st field, bin number). int($4/const) maps 23432 --> 0, 6000000 --> 1, etc.; that is, it tells which 5-million block each 4th-column value falls into.
  • b[$1] keeps track of which first-column values (chromosomes) have been seen.
  • END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}} prints, for every chromosome and every possible bin, the chromosome, bin start, bin end, and the count (empty when the bin had no data).
  • awk 'NF==4' prints only those lines that have 4 fields. This way it outputs just the bins in which there were matches.
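The int($4/const) binning arithmetic can be sanity-checked on its own. This small pipeline (values fed on stdin for illustration) prints each value, its bin index, and the bin's start and end:

```shell
# For each value: value, bin index, bin start, bin end.
printf '61733\n21116837\n157112959\n' |
awk -v const=5000000 '{ b = int($1/const); print $1, b, b*const+1, (b+1)*const }'
```

Here 21116837 lands in bin 4 (20000001-25000000) and 157112959 in bin 31 (155000001-160000000), matching the desired output files.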

In case you want to store the values into one file per chromosome (dropping the chromosome column, as in the desired output), you can do

awk 'NF==4 {print $2, $3, $4 > ("OutputChr" $1 ".txt")}'

Sample output

$ awk -v const=5000000 -v max=150 '{a[$1,int($4/const)]++; b[$1]} END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}}' a | awk 'NF==4'
1 1 5000000 2
1 20000001 25000000 3
2 155000001 160000000 2
2 255000001 260000000 1
2 355000001 360000000 1
2 455000001 460000000 1

OTHER TIPS

All in one

awk '{ v=int($4/const)
       a[$1 FS v]++
       min[$1]=(min[$1]=="" || v<min[$1])?v:min[$1]   # track the minimum bin of column $4 for group $1
       max[$1]=max[$1]>v?max[$1]:v                    # track the maximum bin of column $4 for group $1
     }END{  for (i in min)
               for (j=min[i];j<=max[i];j++)  # loop from the minimum to the maximum bin
                      if (a[i FS j]!="") print j*const+1,(j+1)*const,a[i FS j] > ("OutputChr" i ".txt")   # if the bin has data, print to file "OutputChr" i ".txt"
          }' const=5000000 file

result:

$ cat OutputChr1.txt
1 5000000 2
20000001 25000000 3

$ cat OutputChr2.txt
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
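The counting step can also be handed off to sort and uniq -c instead of awk arrays. A sketch of the same binning (assuming the input is named file, as above): the first awk emits (chromosome, bin) pairs, uniq -c counts the duplicates, and the last awk formats each line into its per-chromosome file.

```shell
# Emit "chromosome bin" pairs, count duplicates, format into files.
# In the final awk, uniq -c puts the count first, so:
#   $1 = count, $2 = chromosome, $3 = bin index.
awk -v const=5000000 '{ print $1, int($4/const) }' file |
sort -k1,1n -k2,2n | uniq -c |
awk -v const=5000000 '{ print $3*const+1, ($3+1)*const, $1 > ("OutputChr" $2 ".txt") }'
```

This avoids holding all bins in memory at once, at the cost of an extra sort over the data.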
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow