Question

I reckon my question wasn't all clear. So, time for another approach in explaining my quest.

  • I've got a single data file that contains about 27,500 data points. Every datapoint has a unique integers between 0 and 46041637 in the first column and a description in column 2 - 6.
  • I would like to know how the integers are distributed across the 46M possibilities. For instance, how many (and which) datapoints fall within the range 1-1000, how many between 1001-2000 and so on until 46M is reached.
  • In the previous example, roughly 46K (46M/1000) smaller datasets (or bins) will be created. Some bins will contain several datapoints while a lot of them won't contain any datapoints at all.
  • In the example above, the size/length of the bin (I would call it windowsize) is 1000. I do not know yet what the best windowsize is to suit my purposes, so, I would like to be able to have a script that has an 'adjustable' windowsize.
  • Furthermore, the example 1-1000, 1001, 2001, [...], clearly doesn't show any overlap in the bins. However, that means that I lose a lot of 'sensitivity' and therefore knowledge/information. So I would like to be able to create windows/bins that overlap. For instance, 1-1000,501-1500, 1501-2500, 2001-3000. These bins have an overlap of 500. I would to be able to set the amount of overlap between the bins.
  • I would like to write every bin to its own file, even if the bin doesn't contain any datapoints.

Here is the explanation I gave before, which wasn't very good apparently


Using AWK I'm trying to 'slide a window' over a list of integers. If I would split up this dataset how many datapoints would there be in every (possible overlapping) bin? I like to set the binsize (or windowsize) and overlap between the bins. This approach enables me to get an idea of the local datapoint density. --> I have a little bit of AWK experience and I've been told tha AWK should be able to do the job, I prefer to use AWK. However, I'm also open to other ideas (Python for example).

  • I've got a single data file that contains about 27,500 integers between 0 and 46041637 (46 million) and a description for every data point.
  • As I would like to have an idea of the effects of 'changing' the resolution I would like to play a little with different window-sizes and the overlap between individual windows.
  • I like to write all the contents of every single 'window' to a separate file and name the file according to the 'windows-startpoint'.

I've prepared some example files which are attached below. However, to make it easier to get an idea here's another very very simplified example:

Datarange 1 - 10
Integers in dataset: 2,4,5,6,9
####
####Example 1: Windowsize=5,overlap=2####
####
file name ="1" contents are: (range is 1 - 5)
2
4
5
file name="3" contents are: (range is 4 - 8, that is, two overlap with previous range)
4
5
6
file name="7" contents are: (range is 7 - 10, if the range was larger, it would be 7 - 11)
9
####
####Example 2: Windowsize=3,overlap=0####
####
file name="1" contents are (range 1 - 3)
2
file name="4" contents are (range 4 - 6)
4
5
6
file name="7" contents are (range 7 - 9)
9
file name="9" contents are (range 10 - 10)
<none>

Example input file

    3579                    
    3661                    
    3752    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    3947    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    6734    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6865    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6915    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    8961                    
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1
    167942  5P_UTR  167930  167956  -   Solyc06g005140.2.1.0Solyc06g005140.2.1
    168020  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1
    168153  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1

Example output file with different windowsizes and overlap

> AWK -v windowsize=50000 -v overlap=0 -f awkscript input.file
> ls
1      50001
100001 150001
> cat 1
    3579                    
    3661                                        
    3752    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    3947    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    6734    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6865    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6915    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    8961                    
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1                  
    3752    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    3947    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    6734    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6865    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6915    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    8961                    
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1
 > cat 50001

 > cat 100001

 > cat 150001
    167942  5P_UTR  167930  167956  -   Solyc06g005140.2.1.0Solyc06g005140.2.1
    168020  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1
    168153  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1

> #And with some different paramenters
> AWK -v windowsize=160000 -v overlap=10000 -f awkscript input.file
> ls
1    10001
> cat 1
    3579                    
    3661                    
    3752    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    3947    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    6734    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6865    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6915    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    8961                    
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1
> cat 10001
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1
    167942  5P_UTR  167930  167956  -   Solyc06g005140.2.1.0Solyc06g005140.2.1
    168020  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1
    168153  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1

Thank you so much for all your help!


Little adjustment of my initial question because I requires way more computing time than I anticipated for.

Is it possible that, instead of writing all the records that fall in a particular window to its own file, write the 'statistics' of each window to a row in a table? With statistics I mean, how many records does a particular window contain and how many of each type. Applied to the example above this would look like this:

> python script.py 160000 10000 file (using the script from sidharth c nadhan)
> cat result
window | total | exons | intron | 3P_UTR | 5P_UTR
1      |  17   |   6   |   5    |    1   |   0
10001  |  12   |   4   |   4    |    1   |   1 
Was it helpful?

Solution

Try this :

import sys,os,collections
list1,set1=list(),set()
dict1 = collections.defaultdict(list)
dict2 = collections.defaultdict(int)
wind , overl, maxim= int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[4])
for line in open(sys.argv[3]):
  try : set1.add(line.split()[1])
  except : pass
  for i in xrange(1,maxim,wind-overl):
    if int(line.split()[0]) in xrange(i,wind+i): dict1[i].append(line)
print "Window\tTotal\t",
for entri in set1 : print entri,"\t",
print "\n",
for j in xrange(1,maxim,wind-overl):
  dict2.clear()
  for line in dict1[j]:
    try : dict2[line.split()[1]] +=1
    except: pass
  print j,"\t",len(dict1[j]),"\t",
  for entri in set1:
    print dict2[entri],"\t",
  print "\n",

Usage : python script.py window overlap file maxim

Where maxim refers to the largest integer in the input file. maxim = 168153 in the sample input file. Giving it as a command line argument improves computational speed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top