How to select only certain Substrings
문제
from a string say dna = 'ATAGGGATAGGGAGAGAGCGATCGAGCTAG' i got substring say dna.format = 'ATAGGGATAG','GGGAGAGAG' i only want to print substring whose length is divisible by 3 how to do that? im using modulo but its not working !
import re
if mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
print re.findall("ATA"(.*?)"AGA" , mydna)
if len(mydna)%3 == 0
print mydna
corrected code
import re
mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
re.findall("ATA"(.*?)"AGA" , mydna.format)
if len(mydna.format)%3 == 0:
print mydna.format
this still doesnt give me substring with length divisible by three . . any idea whats wrong ?
im expecting only substrings which has length divisible by three to be printed
해결책
You can also use the regular expression for that:
re.findall('ATA((...)*?)AGA', mydna)
the inner braces match 3 letters at once.
다른 팁
For including overlap substrings, I have the following lengthy version. The idea is to find all starting and ending marks and calculate the distance between them.
mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
[mydna[start.start():end.start()+3] for start in re.finditer('(?=ATA)',mydna) for end in re.finditer('(?=AGA)',mydna) if end.start()>start.start() and (end.start()-start.start())%3 == 0]
['ATAGGGATAGGG', 'ATAGGG']
Show all substrings, including overlapping ones:
[mydna[start.start():end.start()+3] for start in re.finditer('(?=ATA)',mydna) for end in re.finditer('(?=AGA)',mydna) if end.start()>start.start()]
['ATAGGGATAGGG', 'ATAGGGATAGGGAG', 'ATAGGGATAGGGAGAGAGC', 'ATAGGG', 'ATAGGGAG', 'ATAGGGAGAGAGC']
Using modulo is the correct procedure. If it's not working, you're doing it wrong. Please provide an example of your code in order to debug it.
re.findAll() will return you an array of matching strings, You need to iterate on each of those and do a modulo on those strings to achieve what you want.