Question

ORF_sequences = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)  #thanks to @Martin Pieters and @nneonneo

I have a line of code that finds any instance of A or G followed by 2 characters and then ATG, followed (when read in units of 3) by TAA, TAG or TGA. It only matches when A|G-xx-ATG-xxx-TAA|TAG|TGA is 30 elements or longer.

I want to add a criterion: the ATG must be followed by a G.

So: A|G-xx-ATG-Gxx-xxx-TAA|TGA|TAG  # at least 30 elements long

Example:

GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would work

GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would not work, because the A|G is followed by only one character (not 2) before the ATG, and the ATG is not followed by a G

I hope this makes sense.

I tried

ORF_sequences = re.findall(r'ATGG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)

but it seemed to read in windows of 3 starting after the last G of ATGG, rather than after the ATG.

Basically, I need that code where the first part is A|G-xx-ATG and the next triplet is G-xx.


Solution

It'll be easier if you use a character class, [AG]; there is no need to group the two 'free' characters:

 ORF_sequences2 = re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)

Alternatively, group the A|G explicitly:

 ORF_sequences2 = re.findall(r'(?:A|G)..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)

Applying the first form to your examples:

>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCATGGGGTTTTGA')
[]

In your attempt, the expression matches either a lone A, or the expression G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA), because the | symbol applies to everything that precedes or follows it within the same group. As it is not grouped, it splits the whole expression instead:

>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'A')
['A']
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
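
To see the precedence issue in isolation, here is a minimal sketch comparing the ungrouped alternation with the character-class form. The input string is made up for illustration and does not come from the question:

```python
import re

# Ungrouped: '|' splits the WHOLE pattern into "A" OR "G..ATG...<stop>".
ungrouped = r'A|G..ATG(?:...)*?(?:TAA|TAG|TGA)'
# Character class: only the first character may be A or G.
grouped = r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)'

# Hypothetical input: starts with A, followed by CC, ATG, one triplet, TGA.
seq = 'ACCATGGGGTGA'

ungrouped_hits = re.findall(ungrouped, seq)
grouped_hits = re.findall(grouped, seq)

print(ungrouped_hits)  # every lone 'A' is reported as its own match
print(grouped_hits)    # the full A..ATG...TGA region is reported once
```

With the ungrouped pattern, each A in the input satisfies the left-hand alternative on its own, so the intended region is never returned as a whole.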

If you need a minimum number of characters in the whole match, tailor the 3-character (?:...) groups to match a minimum number of times:

 ORF_sequences2 = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)

This matches A or G followed by 2 characters, then ATGG with another 2 characters, then at least 7 groups of 3 characters (21 in total), then one of TAA, TAG or TGA, for a total of at least 33 characters from the first to the last character. The extra .. completes the triplet that begins with the G after ATG, so the rest of the match stays in steps of 3 counted from the ATG. It matches the example from your comment:

>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA']

as well as correctly handling the examples given in your question:

>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA']
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
[]
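
The length arithmetic can be checked with a short sketch. The helper `make_candidate` and its filler of repeated TTT triplets are assumptions made for illustration, chosen so that the length of each test string is explicit rather than counted by eye:

```python
import re

pattern = re.compile(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)')

def make_candidate(filler_triplets):
    # Hypothetical helper: [AG].. prefix + ATG + G-triplet + filler + terminator.
    return 'GCC' + 'ATG' + 'GGG' + 'TTT' * filler_triplets + 'TGA'

short = make_candidate(6)  # 3 + 3 + 3 + 18 + 3 = 30 characters
exact = make_candidate(7)  # 3 + 3 + 3 + 21 + 3 = 33 characters

short_hits = pattern.findall(short)
exact_hits = pattern.findall(exact)

print(len(short), short_hits)  # too short for {7,}: no match
print(len(exact), exact_hits)  # exactly the 33-character minimum: one match
```

Lowering the quantifier to {6,} would accept the 30-character string as well, if 30 rather than 33 is the intended minimum.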

OTHER TIPS

To ensure you get at least 30 characters, use the {n,} quantifier:

r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)'

This requires at least 9 triplets (27 characters) between the ATG opening and the TAA|TGA|TAG terminator, so the shortest possible match is 36 characters once the [AG].. prefix and the terminator are counted.
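
If the minimum should be exactly 30 characters for the whole match, the triplet count can be derived rather than guessed. The sketch below assumes the minimum covers the [AG].. prefix through the terminator; `build_orf_pattern` and its parameters are hypothetical names, not part of any library:

```python
import math
import re

def build_orf_pattern(min_length, require_g=False):
    """Hypothetical helper: compile a pattern whose shortest match is
    at least min_length characters, prefix and terminator included."""
    head = r'[AG]..ATG'
    fixed = 3 + 3 + 3          # [AG].. + ATG + terminator
    if require_g:
        head += 'G..'          # triplet after ATG must start with G
        fixed += 3
    # Free triplets needed to reach min_length, rounded up to a whole triplet.
    triplets = max(0, math.ceil((min_length - fixed) / 3))
    return re.compile(head + r'(?:...){%d,}?(?:TAA|TAG|TGA)' % triplets)

pattern = build_orf_pattern(30, require_g=True)
print(pattern.pattern)  # [AG]..ATGG..(?:...){6,}?(?:TAA|TAG|TGA)

ok = 'GCC' + 'ATG' + 'GGG' + 'TTT' * 6 + 'TGA'       # 30 characters
too_short = 'GCC' + 'ATG' + 'GGG' + 'TTT' * 5 + 'TGA'  # 27 characters

ok_hits = pattern.findall(ok)
too_short_hits = pattern.findall(too_short)
```

The same helper with require_g=False and a minimum of 30 yields {7,}, which is lower than the {9,} shown above because {9,} sets the minimum at 36 characters.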

Licensed under: CC-BY-SA with attribution