Question

I use Python 2.7 and the regex module. I use this expression to find a short sequence in a longer DNA sequence:

output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH)

The parameters are:

  • probe: a short string I look for in the genome
  • genome: a long string (passed as sequence in the call above)
  • mismatches: the number of differences I allow between the probe and a snippet of the genome

Is there a way to get the positions of all the sequences in the genome that match the regex? Does this script find overlapping matches? It works pretty well, but then I decided to try, say:

probe = "TTGACAT" 
genome = "TTGACATTGACATATAAT" 
mismatches = 0

I got:

['TTGACAT']

With the same parameters, but mismatches = 10, I got:

['TTGACAT','GACATAT']

So I do not know whether the script finds 'TTGACAT' only once because it overlaps with the second occurrence, or whether it actually finds 'TTGACAT' twice and shows the result only once...

Thanks

Solution

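It is found only once because it overlaps with the second occurrence: the second 'TTGACAT' starts at index 6, on the last character of the first match, and by default findall returns only non-overlapping matches.

If you want all overlapping results, use the same pattern with the overlapped=True keyword argument: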

output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH, overlapped=True)

If you want to know the sequence position:

for m in regex.finditer(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH, overlapped=True):
    print('%d: %s' % (m.start(), m.group()))
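
To sanity-check what the overlapped search returns, the same question can be answered without the regex module. The sketch below is not the regex-based method above; it is a plain sliding-window comparison, and it assumes the probe is a literal string with no regex metacharacters and that only substitutions count as mismatches, which is what {s<...} allows. The function name hamming_matches is hypothetical.

```python
def hamming_matches(probe, sequence, mismatches):
    """Return (start, window) for every window of len(probe) characters
    that differs from probe in at most `mismatches` positions."""
    hits = []
    width = len(probe)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Count the positions where the window and the probe differ.
        distance = sum(1 for a, b in zip(probe, window) if a != b)
        if distance <= mismatches:
            hits.append((start, window))
    return hits


probe = "TTGACAT"
genome = "TTGACATTGACATATAAT"
print(hamming_matches(probe, genome, 0))
# [(0, 'TTGACAT'), (6, 'TTGACAT')]
```

Every window is tested independently, so overlapping hits are reported by construction. On a long genome this is much slower than the regex module, so it is better suited to verifying a few cases than to production use.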

As an aside: a limitation of overlapping results

If I use these three parameters:

probe = "ACTG.*ACTG"
sequence = "ACTGTTGACATTGAACTGCATATAATACTG"
mismatches = 0

I will find only two results, ['ACTGTTGACATTGAACTGCATATAATACTG', 'ACTGCATATAATACTG'], instead of three, because two results cannot start at the same position in the string. The missing result is 'ACTGTTGACATTGAACTG', which also starts at position 0.
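
If all three results are needed, one possible workaround for this particular shape of pattern (a fixed anchor, anything, the same anchor again) is to locate every anchor occurrence and pair them up by hand. This is only a sketch under that assumption, with exact matching and no mismatches; the helper name anchored_spans is hypothetical and is not part of the regex module.

```python
def anchored_spans(sequence, anchor):
    """Return every (begin, end) span that starts with `anchor` and ends
    with a later, non-overlapping occurrence of `anchor`."""
    # Start index of every occurrence of the anchor, overlapping ones included.
    starts = [i for i in range(len(sequence) - len(anchor) + 1)
              if sequence.startswith(anchor, i)]
    spans = []
    for first in starts:
        for last in starts:
            # The closing anchor must begin at or after the end of the opening one.
            if last >= first + len(anchor):
                spans.append((first, last + len(anchor)))
    return spans


sequence = "ACTGTTGACATTGAACTGCATATAATACTG"
for begin, end in anchored_spans(sequence, "ACTG"):
    print("%d-%d: %s" % (begin, end, sequence[begin:end]))
# 0-18: ACTGTTGACATTGAACTG
# 0-30: ACTGTTGACATTGAACTGCATATAATACTG
# 14-30: ACTGCATATAATACTG
```

Two of the three spans start at position 0, which is exactly the case a one-match-per-start search cannot report.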

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow