Question

I have an input file like this:

         id_per start end   s_len
con1 P1  95.27   1    148    148    
con2 P2  89.86   4    148    148    
con3 P5  76.67   1    512    516

For every con I have a P (protein). I want to find the proteins that are covered over their full length, which is possible because I know the start site, the end site and the length of every P. The script below does this. My question now is: how can I find the full-length proteins while also allowing a tolerance of ±10 units at both the start and the end site?

import re
output=open('res.txt','w')
output2=open('res2.txt','w')
f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    id_per=float(new_list[2])
    s_start=int(new_list[3])
    s_end=int(new_list[4])
    s_len=int(new_list[5])
    if s_start == 1 and s_end == s_len and id_per >= 30:
        new_list.append(s_start)
        new_list.append(s_end)
        new_list.append(s_len)
        new_list.append(id_per)
        output.writelines(line)
    else:
        output2.write(line)
f.close()
output.close()
output2.close()

Solution

If I understand you correctly, your condition can be rewritten as |distance_from_start_to_end - stated_length| < 10, where distance_from_start_to_end is end - start + 1. Here is how to express this in Python:

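with open('example.txt', 'r') as infile, \
        open('output.txt', 'w') as outfile, \
        open('errors.txt', 'w') as errfile:
    for line in infile:
        # Fields 2-5 hold id_per, start, end and the stated length
        id_per, s_start, s_end, s_len = (line.split()[i] for i in [2, 3, 4, 5])
        # Number of positions covered, counting both ends
        start_to_end = (int(s_end) - int(s_start)) + 1
        # Keep the line if the covered span is within 10 of the stated length
        if abs(int(s_len) - start_to_end) < 10:
            outfile.write(line)
        else:
            errfile.write(line)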
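
Note that the check above compares the total covered span with the stated length, so the offsets at the two ends are effectively added together. If the ±10 tolerance is instead meant to apply to each end separately, a minimal sketch could look like the following. The helper name is_full_length, the tol parameter and the inclusive <= comparison are my assumptions, not something taken from your script:

```python
def is_full_length(s_start, s_end, s_len, tol=10):
    """Hypothetical helper: True if the hit starts within `tol` of
    position 1 and ends within `tol` of the stated length."""
    return abs(s_start - 1) <= tol and abs(s_len - s_end) <= tol


# The three sample rows from the question (without the header line)
rows = [
    "con1 P1  95.27   1    148    148",
    "con2 P2  89.86   4    148    148",
    "con3 P5  76.67   1    512    516",
]

for row in rows:
    fields = row.split()
    # Fields 3, 4 and 5 are start, end and stated length
    s_start, s_end, s_len = (int(x) for x in fields[3:6])
    print(fields[0], is_full_length(s_start, s_end, s_len))
```

With this variant a hit with start 8 and end 140 on a protein of length 148 would pass (offsets 7 and 8), whereas the single-difference check would reject it (difference 15). Which behaviour you want depends on how you define the tolerance.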

There are other improvements in this snippet with respect to your original code:

  • use with, a context manager, so the file handles are closed automatically and you do not have to close them explicitly
  • you do not need the re module; str.split() with no arguments splits on any run of whitespace, tabs included
  • use tuple unpacking (comma-separated names on the left of =) to assign the tokens split from the line in one step
  • fields you do not need can be ignored by assigning them to _
  • removed the new_list variable and the append calls, because the appended values do not seem to be used. Maybe I misunderstood your snippet?
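
To illustrate the unpacking and the _ convention from the list above, here is a small sketch. The function name parse_line is hypothetical, and it assumes every data line has exactly six whitespace-separated fields, as in your sample:

```python
def parse_line(line):
    """Hypothetical helper: split one data line and convert the numeric fields."""
    # The first two fields (con and P identifiers) are not needed here,
    # so they are assigned to the throwaway name _
    _, _, id_per, s_start, s_end, s_len = line.split()
    return float(id_per), int(s_start), int(s_end), int(s_len)


id_per, s_start, s_end, s_len = parse_line("con1\tP1\t95.27\t1\t148\t148\n")
print(id_per, s_start, s_end, s_len)
```

Unpacking this way raises a ValueError for any line that does not have exactly six fields, so if your real file starts with a header line you would need to skip it first.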
License: CC-BY-SA with attribution
Not affiliated with StackOverflow