Question

I'm trying to write a function that finds the sequence 'ATG' in a string and then moves along the string in units of 3 until it finds a 'TAA', 'TAG', or 'TGA' (ATG-xxx-xxx-TAA|TAG|TGA).

To do this, I wrote this line of code (where fdna is the input sequence):

ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
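As a quick check of what this pattern does, here is a minimal sketch on a made-up input. The `fdna` value below is hypothetical (it is not from the question) and was chosen so that the first in-frame stop codon is easy to spot.

```python
import re

# Hypothetical input: ATG, two codons (CCC, GGG), then the stop codon TAA.
fdna = "CCATGCCCGGGTAACC"

# The lazy codon repetition (?:...)*? stops at the first in-frame stop codon.
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)', fdna)
print(ORF_sequences)
```

The match starts at the ATG and advances three characters at a time, so a stop codon that is out of frame would be skipped.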

I then wanted to add 3 requirements:

  1. Total length must be at least 30
  2. Two places before the ATG there must be either an A or a G to be detected (A|G-x-x-A-T-G-x-x-x)
  3. The next place after the ATG would have to be a G (A-T-G-G-x-x)

To execute this part, I changed my code to:

ORF_sequence_finder = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
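To see why this pattern is too strict, the sketch below runs it on three made-up sequences. The names `strict`, `filler`, `both`, `req2_only` and `req3_only` are illustrative assumptions, and the sequences are built codon by codon so the reading frame is explicit.

```python
import re

# The pattern from the question: upstream [AG].. AND a G right after ATG.
strict = re.compile(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)')

# Hypothetical sequences (not from the question).
filler = "CCC" * 8
both = "GCC" + "ATG" + "GCC" + filler + "TAA"       # requirements 2 and 3
req2_only = "ACC" + "ATG" + filler + "TGA"          # requirement 2 only
req3_only = "TCC" + "ATG" + "GCC" + filler + "TAG"  # requirement 3 only

for name, seq in [("both", both), ("req2_only", req2_only), ("req3_only", req3_only)]:
    print(name, strict.findall(seq))
```

Only the sequence meeting both requirements is matched; the other two are dropped even though each is long enough.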

Instead of enforcing all of these limits at once, I want to keep requirement 1 (greater than or equal to 30 characters) and then require EITHER requirement 2 (A|G-x-x-A-T-G-x-x-x) OR requirement 3 (A-T-G-G-x-x) OR both of those.

If I split the above pattern into two and append the results of both to a list, the matches end up out of order and contain repeats.

Here are a few examples of the different cases:

sequence1 = 'AGCCATGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGAAAA'
sequence2 = 'ATCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'
sequence3 = 'AGCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'    
sequence4 = 'ATGGGGTGA'

sequence1 = 'A**G**CC*ATG*TGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'

sequence1 would be accepted because it follows requirement 2 (A|G-x-x-A-T-G-x-x-x) and its length is >= 30.

sequence2 = 'ATCC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TAG*'

sequence2 would be accepted because it follows requirement 3 (A-T-G-G-x-x) and its length is >= 30.

sequence3 = 'A**G**CC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'

sequence3 would be accepted because it fulfills both requirements 2 and 3 while also having >= 30 characters.

sequence4 = 'ATGGGGTGA'

sequence4 would NOT be accepted because it's not >= 30 characters long and does not follow requirement 2; the G after its ATG does satisfy requirement 3, but the length requirement still rules it out.

So basically, I want it to accept sequences that follow requirement 2, requirement 3, or both, while also satisfying requirement 1.

How can I split this up without adding duplicates (in cases where both occur) and without getting the results out of order?


Solution

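You don't need to split the pattern: a single regex with an alternation is scanned left to right and returns non-overlapping matches, so the results stay in order and contain no duplicates. If the optional [AG].. prefix should count toward the length requirement, you can use: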

r'(?x) (?: [AG].. ATG | ATG G.. )  (?:...){7,}? (?:TAA|TAG|TGA)'
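A minimal sketch of how this pattern behaves, assuming made-up test sequences: everything except the pattern itself and `too_short` (sequence4 from the question) is hypothetical and built codon by codon so the frame is explicit.

```python
import re

# Pattern from the answer; (?x) is verbose mode, so the spaces are ignored.
either = re.compile(
    r'(?x) (?: [AG].. ATG | ATG G.. )  (?:...){7,}? (?:TAA|TAG|TGA)'
)

# Hypothetical sequences (not from the question).
filler = "CCC" * 8
req2_only = "ACC" + "ATG" + filler + "TGA"          # A three bases upstream of ATG
req3_only = "TCC" + "ATG" + "GCC" + filler + "TAG"  # G right after ATG
both = "GCC" + "ATG" + "GCC" + filler + "TAA"       # both requirements
too_short = "ATGGGGTGA"                             # sequence4 from the question

# One pass over a concatenated input: matches come back left to right,
# and a sequence meeting both requirements is reported only once.
fdna = req2_only + req3_only + both + too_short
matches = either.findall(fdna)
for m in matches:
    print(len(m), m)
```

Note that for `req3_only` the match starts at the ATG, because its upstream base is a T and so is not part of the match.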

Or, if you don't want to include [AG].. in the match (so that the length is counted from the ATG), you could use lookarounds:

r'(?x) ATG (?: (?<=[AG].. ATG) | (?=G) ) (?:...){8,}? (?:TAA|TAG|TGA)'
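The difference between the two patterns shows up in how the length is counted. The sketch below compares them on two made-up sequences; the names `with_prefix`, `from_atg`, `long_orf` and `short_orf` are illustrative assumptions.

```python
import re

# Both patterns from the answer, in verbose mode.
with_prefix = re.compile(
    r'(?x) (?: [AG].. ATG | ATG G.. )  (?:...){7,}? (?:TAA|TAG|TGA)'
)
from_atg = re.compile(
    r'(?x) ATG (?: (?<=[AG].. ATG) | (?=G) ) (?:...){8,}? (?:TAA|TAG|TGA)'
)

# Hypothetical sequences (not from the question).
long_orf = "ACC" + "ATG" + "CCC" * 8 + "TGA"   # 30 characters from ATG to stop
short_orf = "ACC" + "ATG" + "CCC" * 7 + "TGA"  # 27 from ATG, 30 with the prefix

print(from_atg.findall(long_orf))      # match starts at ATG, prefix left out
print(with_prefix.findall(short_orf))  # prefix counts toward the 30
print(from_atg.findall(short_orf))     # too short when counted from ATG
```

Which variant fits depends on whether the 30-character minimum is meant to include the three upstream bases.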
Licensed under: CC-BY-SA with attribution