Python regex : overlapping sequences position

https://stackoverflow.com/questions/22116596

18-10-2022

Question

I use Python 2.7 and the regex module. I use this expression to find a short sequence in a longer DNA sequence:

output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH)

The parameters are:

probe: a short string I look for in the genome
genome: a long string (passed as sequence in the call above)
mismatches: how many differences I allow between the probe and the snippet from the genome.

Is there a way to get the positions of all the sequences in the genome that match the regex? Does this script find overlapping matches? It works pretty well, but then I decided to try, say:

probe = "TTGACAT" 
genome = "TTGACATTGACATATAAT" 
mismatches = 0

I got :

['TTGACAT']

With the same parameters but mismatches = 10

I got :

['TTGACAT','GACATAT']

So I do not know whether the script finds 'TTGACAT' only once because it overlaps with the second occurrence, or whether it actually finds 'TTGACAT' twice and shows the result only once...

Thanks

Solution

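'TTGACAT' is reported only once because the second occurrence overlaps the first. By default, findall returns non-overlapping matches: after a match, the search resumes at the end of that match.

If you want all overlapping results, use the same pattern and pass the keyword argument overlapped=True: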

output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH, overlapped=True)
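The string concatenation above is easier to check once it is expanded. The sketch below uses a hypothetical helper, build_fuzzy_pattern (it is not part of the regex module), to produce the same pattern string. The commented-out call only shows where the pattern would go.

```python
def build_fuzzy_pattern(probe, mismatches):
    """Return the fuzzy pattern used above: probe with at most
    `mismatches` substitutions allowed."""
    # {s<N} means "fewer than N substitutions", hence the +1.
    return '(?:%s){s<%d}' % (probe, int(mismatches) + 1)

pattern = build_fuzzy_pattern("TTGACAT", 0)
print(pattern)  # (?:TTGACAT){s<1}

# Assumed usage with the third-party regex module (not executed here):
# output = regex.findall(pattern, sequence, regex.BESTMATCH, overlapped=True)
```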

If you also want the position of each match, use finditer and read m.start():

for m in regex.finditer(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH, overlapped=True):
    print('%d: %s' % (m.start(), m.group()))
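As a cross-check that does not depend on the regex module, the positions can also be found with a plain sliding window that counts substitutions (the Hamming distance). This is a different technique from the fuzzy regex above and it handles substitutions only. It also reports every window within the threshold, which is not the same as BESTMATCH. The function name find_with_mismatches is made up for this sketch.

```python
def find_with_mismatches(probe, sequence, mismatches=0):
    """Return (position, substring) for every window of len(probe) in
    sequence that differs from probe in at most `mismatches` characters.
    Windows may overlap."""
    hits = []
    width = len(probe)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Count the positions where the window and the probe differ.
        distance = sum(1 for a, b in zip(probe, window) if a != b)
        if distance <= mismatches:
            hits.append((start, window))
    return hits

hits = find_with_mismatches("TTGACAT", "TTGACATTGACATATAAT", 0)
for position, match in hits:
    print('%d: %s' % (position, match))
# 0: TTGACAT
# 6: TTGACAT
```

The two hits share the T at index 6, which is why a non-overlapping search reports only the first one.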

As an aside: a limitation of overlapping results

If I use these three parameters:

probe = "ACTG.*ACTG"
sequence = "ACTGTTGACATTGAACTGCATATAATACTG"
mismatches = 0

With overlapped=True I will find only two results, ['ACTGTTGACATTGAACTGCATATAATACTG', 'ACTGCATATAATACTG'], instead of three, because two results cannot start at the same position in the string. The missing one, 'ACTGTTGACATTGAACTG', also starts at position 0.
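To see which candidate goes missing, the possible matches of ACTG.*ACTG can be enumerated by hand. The sketch below is plain Python rather than the regex module: it lists every ACTG occurrence and pairs them up. The names find_all, anchor and candidates are illustrative only.

```python
from itertools import combinations

def find_all(text, word):
    """Return every start index of word in text, overlapping occurrences included."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        start = text.find(word, start + 1)
    return positions

sequence = "ACTGTTGACATTGAACTGCATATAATACTG"
anchor = "ACTG"
starts = find_all(sequence, anchor)

# Each pair of non-overlapping ACTG occurrences delimits one candidate
# match for ACTG.*ACTG.
candidates = [(i, sequence[i:j + len(anchor)])
              for i, j in combinations(starts, 2)
              if j - i >= len(anchor)]
for position, text in candidates:
    print('%d: %s' % (position, text))
# 0: ACTGTTGACATTGAACTG
# 0: ACTGTTGACATTGAACTGCATATAATACTG
# 14: ACTGCATATAATACTG
```

Two of the three candidates start at position 0, so an overlapped search, which reports at most one match per start position, returns only one of them.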

Licensed under: CC-BY-SA with attribution