Question

Currently I'm struggling with an AWK problem that I haven't been able to solve yet. I have one huge file (30GB) with genomic data that holds a list with positions (declared in col 1 and 2) and a second list that holds a number of ranges (declared in col 3, 4 and 5). I want to extract all lines in the first file where the position falls within the range declared in the seconds file. As the position is only unique within a certain chromosome (chr) first it has to be tested if the chr's are identical (ie. col1 in file 1 matches col3 in file2)

file 1

chromosome position another....hundred.....columns
chr1       816 .....
chr1       991 .....
chr2       816 .....
chr2       880 .....
chr2       18768 .....
...
chr22      9736286 .....

file 2

name    identifier chromosome   start    end
GENE1   ucsc.86    chr1         800      900
GENE2   ucsc.45    chr2         700      1700
GENE3   ucsc.46    chr2         18000    19000

expected output

chromosome position another....hundred.....columns
chr1       816 .....
chr2       816 .....
chr2       880 .....
chr2       18768 .....

A summery of what I intend to do in (half coded):

(if $1(in file 1) matches $3(in file 2){            ##test if in the correct chr
   if ($2(in file 1) >= $4 && =< $5 (in file 2){    ##test if pos is in the range
         print $0 (in file 1)                       ##if so print the row from file1
   }
}

I kind if understand how to solve this problem by putting file1 in an array and using position as the index but then I still have a problem with the chr and besides that file1 is way to big to put in an array (although I have 128GB of RAM). I've tried some things with multi-dimensional arrays but couldn't really figure out how to do that either.

Thanks a lot for all your help.

Update 8/5/14 Added a third line in file 2 containing another range in the same chrom. as on the second line. This line is skipped in the script below.

Was it helpful?

Solution 2

The change in your data set actually modified the question greatly. You introduced an element which was used as a key and since keys have to be unique it got overwritten.

For your data set, you are better off making composite keys. Something like:

awk '
NR==FNR{ range[$3,$4,$5]; next }
FNR==1
{
    for(x in range) {
        split(x, check, SUBSEP); 
        if($1==check[1] && $2>=check[2] && $2<=check[3]) print $0
    }
}    
' file2 file1
chromosome position another....hundred.....columns
chr1       816 .....
chr2       816 .....
chr2       880 .....
chr2       18768

OTHER TIPS

It'd be something like this, untested:

awk '
NR==FNR{ start[$3] = $4; end[$3] = $5; next }
(FNR==1) || ( ($1 in start) && ($2 >= start[$1]) && ($2 <= end[$1]) )
' file2 file1
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top