Well, first of all, you are looping on `a`, but you assigned a value to `a` inside the loop, so it's not likely to get very far.
Second of all, I believe that `strip().split()` is redundant. You don't need `strip()` because `split()` called with no arguments already ignores leading and trailing whitespace.
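A quick way to convince yourself of this, as a minimal sketch (the sample line is made up for illustration and is not taken from your files):

```python
# Hypothetical input line with leading/trailing whitespace and a newline.
line = "  chr1\t1161693\tsomething  \n"

# With no arguments, split() splits on runs of whitespace and
# ignores whitespace at both ends of the string.
with_strip = line.strip().split()
without_strip = line.split()

print(with_strip == without_strip)

# With an explicit separator the behaviour differs: surrounding
# whitespace is kept, so strip() would matter in that case.
by_tab = line.split("\t")
print(by_tab)
```

The redundancy only holds for the no-argument form; if you ever switch to `split('\t')`, you would need `strip()` (or `rstrip('\n')`) again.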
Third of all, you should only `split` each line in the master file once. You are doing that for each line of input, which is bound to increase processing time a bit.
I am not entirely certain I understand your requirements from your code, but it seems to me something along these lines should help you:
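    import sys
    from collections import defaultdict

    # Read the master file once: first column -> list of (low, high) ranges.
    master = defaultdict(list)
    with open('Pt') as Pt:
        for entry in Pt:
            n, low, high = entry.split()
            # tuple(...) so the pair can be unpacked repeatedly
            # (map() alone is a one-shot iterator in Python 3).
            master[n].append(tuple(map(int, (low, high))))

    with open('a') as a:
        for line in a:
            n, i = line.split()[:2]
            i = int(i)  # convert once per line, not once per range
            # Only scan the ranges recorded for this line's first column.
            for low, high in master[n]:
                if low <= i <= high:
                    sys.stdout.write(line)
                    break  # one match is enough; don't print the line twice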
To explain: first, read and process all the data in the master file just once. Storing the master data in a `defaultdict` is handy here because it lets you scan only the rows that match the first column. `map(int, ...)` converts the bounds to ints.
When processing the input file, we use the first value to retrieve the ranges against which to compare the second value. Since `master` is a `defaultdict(list)`, if there are no matches for the first column, we simply end up iterating over an empty list.
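To illustrate the empty-list behaviour, here is a small sketch with made-up keys and ranges (the names `chrX` and `chrY` are only examples):

```python
from collections import defaultdict

# Hypothetical master data: first column -> list of (low, high) ranges.
master = defaultdict(list)
master["chr1"].append((100, 200))

# A key that was never stored yields an empty list, so the loop body
# never runs; no KeyError and no special-casing needed.
hits = [(low, high) for low, high in master["chrX"]]
print(hits)

# Side effect worth knowing: the lookup itself inserts the missing key.
print("chrX" in master)

# If that matters, .get() with a default leaves the dict untouched.
misses = list(master.get("chrY", ()))
print(misses, "chrY" in master)
```

For a one-off script the extra empty entries are harmless; `master.get(n, ())` is only worth the trouble if the input has a very large number of distinct unmatched keys.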
Note that your original code using `range()` would have been equivalent to the condition `low <= i < high`. You'll have to adjust the comparison operators as needed.
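The difference only shows up at the upper bound. A short sketch with made-up bounds, assuming the original test was a membership check against `range(low, high)`:

```python
low, high = 10, 13  # hypothetical bounds
candidates = [9, 10, 12, 13, 14]

# Membership in range(low, high) excludes the upper bound...
via_range = [i for i in candidates if i in range(low, high)]
# ...which is the same as a half-open comparison.
half_open = [i for i in candidates if low <= i < high]
# The closed comparison also accepts i == high.
closed = [i for i in candidates if low <= i <= high]

print(via_range == half_open)
print(closed)
```

Which of the two you want depends on whether the end coordinate in your master file is meant to be inclusive.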
UPDATE: Oops, I put the `break` outside the condition. After fixing it I get the following three items:
chr1 1161693 chr1uGROUPERuDELu0u832 TGCTCTTTCCAGAAACCCTCAACCCTGTACGGTCAGGAGGAAACATGGCACCTCCCCTCTGGGG T 63 NormalSupport;MinSampleCount;LowSomaticScore CLUSTER_NUM=5454;CONTIG=GGTGCAGGGAAGCAGGAAGGAAGTGAAGCTCAAAAGCCCCTAGGACAGGGCACCTCCCCTCTGGATGCTCTTTCCAGAAACCCTCAACCTTGTACGGTCAGGAGAAAACACATCCCACAAG;CONTIG_NUM=5840;DOWNSTREAM=GCTCTTTCCAGAAACCCTCAACCCTGTACGGTCAGGAGAAAACACATCCCACAAG;END=1161756;NS=1;READSOURCES=(0:3:0,1:2:13);SOMATICSCORE=19;SVLEN=-63;SVTYPE=DEL;UPSTREAM=GGTGCAGGGAAGCGGGAAGGAAGTGAAGCTCAAAAGCCCCTAGGACAGGGCACCTCCCCTCTGGAT;ensembl_gene_id=ENSG00000078808 GT:GQ 1/.:.
chr1 158851689 chr1uGROUPERuDELu3u4452 GGGGAGTAATTCTTATTCATGATATGAAAACTCTAATGTGTTTCTTATTCCAGAAAA G 100 NormalSupport CLUSTER_NUM=25182;CONTIG=CATATTTTGCTATATCTCACATCATTGTTCATCTGATAATATATGAAAACTACAATGTGTTTCTTATTCCAGAAAGGGGAGTAATTCTTATTCATGAATAAACACTGAAGGAGAAAGATTATGGATCATAGTGGGAAAAGCCACAATACCATCTACATTC;CONTIG_NUM=24300;DOWNSTREAM=GGGAGTAATTCTTATTCATGAATAAACACTGACGGAGAAAGATTATGGATCATAGTGGGAAAAGCCACAATACCATCTACATTC;END=158851745;NS=1;READSOURCES=(0:11:0,1:3:18);SOMATICSCORE=55;SVLEN=-56;SVTYPE=DEL;UPSTREAM=CATATTTTGCTATATCTCACATCATTGTTCATCTGATAATATATGAAAACTCCAATGTGTTTCTTATTCCAGAAAG;ensembl_gene_id=ENSG00000229849 GT:GQ 1/.:.
chr1 165014865 chr1uGROUPERuDELu3u7344 ACTGGCATTAGCTATGCTTCCTTAGGCAGACAGCATGTTGAGAAATTCACATTCATCAG A 100 NormalSupport CLUSTER_NUM=40249;CONTIG=CTCCAGTAAAGAGCATCTTTTAATGAAGTGTATCTGCCTGGGCTAGAAAGGCAGCTGCCTCCACTAAAGCAGGGCTGGTCCAGAAATATTACCACTTGCCTAATCCTTATAGTAATCCTAACTGGCAGGTATTATTATATCCCAATTCACACACTTAGAGG;CONTIG_NUM=38845;DOWNSTREAM=CTTGCCTAATCCTTATAGTAATCCTAACTGGCAGGTATTATTATATCCCAATTCACACACTTAGAGG;END=165014923;NS=1;READSOURCES=(0:32:0,1:9:18);SOMATICSCORE=60;SVLEN=-58;SVTYPE=DEL;UPSTREAM=CTCCAGTAAAGAGCATCTTTTAATGAAGTGTATCTGCCTGGGCTAGAAAGGCAGCTGCCTCCACTAAAGCAGGGCTGGTCCAGAAATATTACCA GT:GQ 1/.:.