Finding/Replacing substrings with annotations in an ASCII file in Python

https://stackoverflow.com/questions/5870012

28-10-2019
|

문제

I'm having a little coding issue in a bioinformatics project I'm working on. Basically, my task is to extract motif sequences from a database and use the information to annotate a sequence alignment file. The alignment file is plain text, so the annotation will not be anything elaborate, at best simply replacing the extracted sequences with asterisks in the alignment file itself.

I have a script which scans the database file, extracts all sequences I need, and writes them to an output file. What I need is, given a query, to read these sequences and match them to their corresponding substrings in the ASCII alignment files. Finally, for every occurrence of a motif sequence (substring of a very large string of characters) I would replace motif sequence XXXXXXX with a sequence of asterisks *.

The code I am using goes like this (11SGLOBULIN is the name of the protein entry in the database):

motif_file = open('/users/myfolder/final motifs_11SGLOBULIN','r')
align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+') 
finalmotifs = motif_file.readlines()
seqalign = align_file.readlines() 


for line in seqalign:
    if motif[i] in seqalign:  # I have stored all motifs in a list called "motif"
        replace(motif, '*****')

But instead of replacing each string with a sequence of asterisks, it deletes the entire file. Can anyone see why this is happening?

I suspect that the problem may lie in the fact that my ASCII file is basically just one very long list of amino acids, and Python cannot know how to replace a particular substring hidden within a very long string.

해결책

Something like the following should do the trick. I've made assumptions about your input data as you've not posted samples and that you're running python 2.7.

motifs = [ x.strip() for x in open('final motifs_11SGLOBULIN','r') ]
redact = '*****'

with open('11sglobulin.seqs','r') as data_in, open('11sglobulin.seqs.new','w') as data_out:
  for seq in data_in:
    for motif in motifs:
      while True:
        x = seq.find(motif)
        if x >= 0:
          seq = seq[:x] + redact + seq[x+len(motif):]
        else:
          break
  data_out.write(seq)

다른 팁

You are misunderstanding the w+ file mode. Using mode w+ with open will truncate the file (that is delete everything in it) see: http://docs.python.org/library/functions.html#open. Your seq data are gone as soon as you call:

align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+')

Also replace is going to operate on strings read from the file. You need to explicitly write the altered strings back out.

Your best bet is to use a third file to store your results. If you really want to you can copy the resulting file over the original align_file when you are done.

You could simplify this a little more by changing the innermost while loop from:

while True:
    x = seq.find(motif)
    if x >= 0:
      seq = seq[:x] + redact + seq[x+len(motif):]
    else:
      break

to:

if motif in seq:
  seq = seq.replace(motif, redact)

Thanks everyone, I really appreciate the responses, sorry for the dealy in answering. So basically what I should have been doing was, as many pointed out, open the file to annotate and write those annotations to a new file. This bit of code did the trick:

align_file_rmode = open('/Users/spyros/folder1/python/printsmotifs/alignfiles/query, 'r') 
align_file_amode = open('/Users/spyros/folder1/python/printsmotifs/alignfiles/query, 'a+')

finalmotifs = motif_file.readlines()
seqalign = align_file_rmode.readlines() 

for line in seqalign: 
   for item in finalmotifs:
      item = item.strip().upper()
      if item in line:
         line = line.replace(item, '$' * len(item)) 
         align_file_amode.write(line) 

motif_file.close()
align_file_rmode.close()
align_file_amode.close()

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow