I reckon my question wasn't entirely clear. So, time for another approach to explaining my quest.
- I've got a single data file that contains about 27,500 data points. Every data point has a unique integer between 0 and 46041637 in the first column and a description in columns 2-6.
- I would like to know how the integers are distributed across the 46M possibilities. For instance, how many (and which) datapoints fall within the range 1-1000, how many between 1001-2000 and so on until 46M is reached.
- In the previous example, roughly 46K (46M/1000) smaller datasets (or bins) will be created. Some bins will contain several datapoints while a lot of them won't contain any datapoints at all.
- In the example above, the size/length of the bin (I would call it windowsize) is 1000. I do not know yet what the best windowsize is to suit my purposes, so, I would like to be able to have a script that has an 'adjustable' windowsize.
- Furthermore, the example above (1-1000, 1001-2000, 2001-3000, [...]) clearly doesn't show any overlap between the bins. However, that means that I lose a lot of 'sensitivity' and therefore knowledge/information. So I would like to be able to create windows/bins that overlap, for instance 1-1000, 501-1500, 1001-2000, 1501-2500. These bins have an overlap of 500. I would like to be able to set the amount of overlap between the bins.
- I would like to write every bin to its own file, even if the bin doesn't contain any datapoints.
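The windowing described in the list above can be sketched in Python roughly as follows. This is only a minimal sketch under two assumptions that are not stated in the question: consecutive windows start `windowsize - overlap` apart, and no further window is opened once a window has reached the upper bound. The name `window_ranges` and its parameters are made up for illustration.

```python
def window_ranges(lo, hi, windowsize, overlap=0):
    """Yield inclusive (start, end) pairs that cover lo..hi.

    Assumption: consecutive windows start `windowsize - overlap` apart,
    and the last window is clipped at `hi`.
    """
    if windowsize <= 0 or not 0 <= overlap < windowsize:
        raise ValueError("need windowsize > 0 and 0 <= overlap < windowsize")
    if lo > hi:
        return
    step = windowsize - overlap
    start = lo
    while True:
        end = start + windowsize - 1
        yield start, min(end, hi)
        if end >= hi:  # this window already reaches the upper bound
            break
        start += step


# Non-overlapping bins of 1000 over 1..3000
print(list(window_ranges(1, 3000, 1000)))
# Bins of 1000 that overlap by 500
print(list(window_ranges(1, 3000, 1000, overlap=500)))
```

For the full data set the call would be something like `window_ranges(1, 46041637, 1000, 500)`, which only generates the ranges and does not touch the data file yet.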
Here is the explanation I gave before, which apparently wasn't very good:
Using AWK I'm trying to 'slide a window' over a list of integers. If I were to split up this dataset, how many datapoints would there be in every (possibly overlapping) bin? I would like to set the binsize (or windowsize) and the overlap between the bins. This approach enables me to get an idea of the local datapoint density.
--> I have a little bit of AWK experience and I've been told that AWK should be able to do the job, so I prefer to use AWK. However, I'm also open to other ideas (Python for example).
- I've got a single data file that contains about 27,500 integers between 0 and 46041637 (46 million) and a description for every data point.
- As I would like to have an idea of the effects of 'changing' the resolution I would like to play a little with different window-sizes and the overlap between individual windows.
- I would like to write all the contents of every single 'window' to a separate file and name the file according to the window's start point.
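Since Python is also an option, here is a rough sketch of the 'one file per window' idea. It assumes that the first whitespace-separated field of every line is the integer, that windows start `windowsize - overlap` apart, and that empty windows still get an (empty) file. The function name `write_windows` and the output directory are hypothetical.

```python
import bisect
import os
import tempfile


def write_windows(lines, outdir, lo, hi, windowsize, overlap=0):
    """Write the records of each window to outdir/<window start>."""
    if windowsize <= 0 or not 0 <= overlap < windowsize:
        raise ValueError("need windowsize > 0 and 0 <= overlap < windowsize")
    step = windowsize - overlap
    # Sort once by position so every window is a contiguous slice.
    records = sorted((int(line.split()[0]), line.rstrip("\n"))
                     for line in lines if line.strip())
    positions = [pos for pos, _ in records]
    os.makedirs(outdir, exist_ok=True)
    start = lo
    while start <= hi:
        end = start + windowsize - 1
        first = bisect.bisect_left(positions, start)   # first record >= start
        last = bisect.bisect_right(positions, end)     # one past last record <= end
        with open(os.path.join(outdir, str(start)), "w") as out:
            out.writelines(line + "\n" for _, line in records[first:last])
        if end >= hi:  # this window already reaches the upper bound
            break
        start += step


# Try it on the simplified data set (range 1 - 10) in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    write_windows(["2", "4", "5", "6", "9"], tmp, lo=1, hi=10,
                  windowsize=5, overlap=2)
    print(sorted(os.listdir(tmp), key=int))
```

Sorting once and slicing with `bisect` means each window costs a binary search rather than a full pass over the data, which matters when there are tens of thousands of windows.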
I've prepared some example files which are attached below. However, to make it easier to get an idea here's another very very simplified example:
Datarange 1 - 10
Integers in dataset: 2,4,5,6,9
####
####Example 1: Windowsize=5,overlap=2####
####
file name ="1" contents are: (range is 1 - 5)
2
4
5
file name="3" contents are: (range is 4 - 8, that is, two overlap with previous range)
4
5
6
file name="7" contents are: (range is 7 - 10, if the range was larger, it would be 7 - 11)
9
####
####Example 2: Windowsize=3,overlap=0####
####
file name="1" contents are (range 1 - 3)
2
file name="4" contents are (range 4 - 6)
4
5
6
file name="7" contents are (range 7 - 9)
9
file name="10" contents are (range 10 - 10)
<none>
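The two simplified examples can also be worked out the other way round: instead of looping over the windows, compute for each integer which windows contain it. The sketch below assumes windows start at 1, 1 + step, 1 + 2*step, ... with step = windowsize - overlap, and that no new window is opened once a window has reached the end of the range. The function name `window_starts_for` is made up for illustration.

```python
def window_starts_for(pos, lo, hi, windowsize, overlap=0):
    """Return the start points of every window that contains `pos`."""
    if windowsize <= 0 or not 0 <= overlap < windowsize:
        raise ValueError("need windowsize > 0 and 0 <= overlap < windowsize")
    if not lo <= pos <= hi:
        return []
    step = windowsize - overlap
    offset = pos - lo
    # Index of the last window: ceil((hi - lo - windowsize + 1) / step), at least 0.
    last = max(0, -((windowsize - 1 - (hi - lo)) // step))
    # Earliest window whose end still reaches pos.
    k_min = max(0, -((windowsize - 1 - offset) // step))
    # Latest window that starts at or before pos.
    k_max = min(offset // step, last)
    return [lo + k * step for k in range(k_min, k_max + 1)]


data = [2, 4, 5, 6, 9]
for windowsize, overlap in [(5, 2), (3, 0)]:
    bins = {}
    for pos in data:
        for start in window_starts_for(pos, 1, 10, windowsize, overlap):
            bins.setdefault(start, []).append(pos)
    print(windowsize, overlap, bins)
```

This makes a single pass over the data, but windows without any datapoints never show up in `bins`, so empty files would have to be created separately.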
Example input file
3579
3661
3752 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
3947 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
6734 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6865 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6915 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
8961
13471 EXON 13449 13532 + Solyc06g005020.1.1.2Solyc06g005020.1.1
13561 INTRON 13533 13710 + Solyc06g005020.1.1.2Solyc06g005020.1.1
22226 EXON 22106 22261 + Solyc06g005030.1.1.1Solyc06g005030.1.1
22516
22556
36903 INTRON 36836 36915 + Solyc06g005060.2.1.1Solyc06g005060.2.1
37377 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37605 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1
167942 5P_UTR 167930 167956 - Solyc06g005140.2.1.0Solyc06g005140.2.1
168020 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
168153 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
Example output file with different windowsizes and overlap
> awk -v windowsize=50000 -v overlap=0 -f awkscript input.file
> ls
1 50001
100001 150001
> cat 1
3579
3661
3752 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
3947 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
6734 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6865 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6915 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
8961
13471 EXON 13449 13532 + Solyc06g005020.1.1.2Solyc06g005020.1.1
13561 INTRON 13533 13710 + Solyc06g005020.1.1.2Solyc06g005020.1.1
22226 EXON 22106 22261 + Solyc06g005030.1.1.1Solyc06g005030.1.1
22516
22556
36903 INTRON 36836 36915 + Solyc06g005060.2.1.1Solyc06g005060.2.1
37377 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37605 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1
> cat 50001
> cat 100001
> cat 150001
167942 5P_UTR 167930 167956 - Solyc06g005140.2.1.0Solyc06g005140.2.1
168020 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
168153 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
> #And with some different parameters
> awk -v windowsize=160000 -v overlap=10000 -f awkscript input.file
> ls
1 10001
> cat 1
3579
3661
3752 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
3947 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
6734 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6865 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6915 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
8961
13471 EXON 13449 13532 + Solyc06g005020.1.1.2Solyc06g005020.1.1
13561 INTRON 13533 13710 + Solyc06g005020.1.1.2Solyc06g005020.1.1
22226 EXON 22106 22261 + Solyc06g005030.1.1.1Solyc06g005030.1.1
22516
22556
36903 INTRON 36836 36915 + Solyc06g005060.2.1.1Solyc06g005060.2.1
37377 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37605 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1
> cat 10001
13471 EXON 13449 13532 + Solyc06g005020.1.1.2Solyc06g005020.1.1
13561 INTRON 13533 13710 + Solyc06g005020.1.1.2Solyc06g005020.1.1
22226 EXON 22106 22261 + Solyc06g005030.1.1.1Solyc06g005030.1.1
22516
22556
36903 INTRON 36836 36915 + Solyc06g005060.2.1.1Solyc06g005060.2.1
37377 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37605 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1
167942 5P_UTR 167930 167956 - Solyc06g005140.2.1.0Solyc06g005140.2.1
168020 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
168153 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
Thank you so much for all your help!
A little adjustment to my initial question, because it requires way more computing time than I anticipated.
Is it possible, instead of writing all the records that fall in a particular window to their own file, to write the 'statistics' of each window to a row in a table? By statistics I mean how many records a particular window contains, and how many of each type. Applied to the example above, this would look like this:
> python script.py 160000 10000 file (using the script from sidharth c nadhan)
> cat result
window | total | exons | intron | 3P_UTR | 5P_UTR
1      | 17    | 6     | 5      | 1      | 0
10001  | 12    | 4     | 4      | 1      | 1
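A per-window statistics table along these lines could be sketched in Python as below. This is not the script mentioned above, just an illustration under some assumptions: column 1 is the integer, column 2 (when present) is the record type, windows start `windowsize - overlap` apart, and the list of types is fixed up front. The function name `window_stats` is hypothetical.

```python
import bisect
from collections import Counter

TYPES = ("EXON", "INTRON", "3P_UTR", "5P_UTR")


def window_stats(lines, lo, hi, windowsize, overlap=0, types=TYPES):
    """Return one row (start, total, count per type...) for each window."""
    if windowsize <= 0 or not 0 <= overlap < windowsize:
        raise ValueError("need windowsize > 0 and 0 <= overlap < windowsize")
    step = windowsize - overlap
    fields = (line.split() for line in lines)
    # (position, type) pairs; records without a description get type "".
    records = sorted((int(f[0]), f[1] if len(f) > 1 else "") for f in fields if f)
    positions = [pos for pos, _ in records]
    rows = []
    start = lo
    while start <= hi:
        end = start + windowsize - 1
        first = bisect.bisect_left(positions, start)
        last = bisect.bisect_right(positions, end)
        counts = Counter(kind for _, kind in records[first:last])
        rows.append((start, last - first) + tuple(counts[t] for t in types))
        if end >= hi:  # this window already reaches the upper bound
            break
        start += step
    return rows


sample = [
    "3579",
    "3752 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1",
    "6734 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1",
    "37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1",
    "167942 5P_UTR 167930 167956 - Solyc06g005140.2.1.0Solyc06g005140.2.1",
]
print(" | ".join(("window", "total") + TYPES))
for row in window_stats(sample, 1, 200000, 50000):
    print(" | ".join(str(value) for value in row))
```

Because only counts are kept per window, nothing is written to disk until the final table, which should take far less time than creating tens of thousands of files.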