The findall method returns a list of matches. If the pattern contains capturing groups, then each match is a tuple of the strings matched by each capturing group in the pattern.
From the documentation:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Your pattern contains three capturing groups. The groups are nested. The first (and outermost) group is the entire pattern, (ATG((\w\w\w)*)(?!TAA)). The second group is ((\w\w\w)*). The third group is (\w\w\w).
Note that the negative lookahead assertion, (?!TAA), is not a capturing group.
In essence, your pattern says to match the codon ATG, followed by as many codons as possible, provided the match is not immediately followed by TAA. Since * is greedy, it consumes every complete codon it can, so your pattern will match a TAA codon in the middle, and also one at the end of the input string. By the time the lookahead is checked, fewer than three word characters follow the match, so (?!TAA) can never fail and has no effect on the result.
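A quick check of the greedy behaviour, as a minimal sketch. The input string here is made up for illustration and is not the one from your question:

```python
import re

# Hypothetical input: a TAA codon sits in the middle, in frame with ATG.
seq = "ATGCCCTAAGGG"

# The pattern from the question, written as a raw string.
m = re.search(r"(ATG((\w\w\w)*)(?!TAA))", seq)

# group(0) is the whole match; the TAA in the middle is swallowed by *.
whole = m.group(0)
print(whole)  # ATGCCCTAAGGG
```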
Because of your capturing groups, your pattern says that each returned match should contain three strings: the entire sequence of matched codons, the sequence of matched codons excluding the initial ATG, and the last matched codon in the sequence.
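To make the tuple structure concrete, here is a small sketch. The input string is an assumption chosen for illustration:

```python
import re

# Hypothetical input, chosen only to show what each group captures.
seq = "ATGCCTAAG"

# The pattern from the question: three nested capturing groups.
pat = re.compile(r"(ATG((\w\w\w)*)(?!TAA))")

result = pat.findall(seq)
# One tuple per match: (whole match, codons after ATG, last codon).
print(result)  # [('ATGCCTAAG', 'CCTAAG', 'AAG')]
```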
You can mark a group as non-capturing using (?:...), like this:
In [5]: pat = re.compile(r"(?:ATG(?:(?:\w\w\w)*)(?!TAA))")
If your pattern contains no capturing groups, then findall returns each match as a single string, instead of as a tuple.
In [6]: pat.findall(s)
Out[6]: ['ATGCCTAAG']
If you want to stop at the first TAA, but go to the end of the string if there is no TAA at all, you need to check each codon, by putting your negative lookahead assertion inside the repetition:
pat = re.compile(r"ATG(?:(?!TAA)\w\w\w)*")
This asserts, at each codon after the initial ATG, that it should not match a TAA codon.
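As a rough illustration of the per-codon check, the sketch below runs that pattern on two made-up inputs (both strings are assumptions, not taken from your question):

```python
import re

# Lookahead inside the repetition: checked before every codon.
pat = re.compile(r"ATG(?:(?!TAA)\w\w\w)*")

# Hypothetical inputs for illustration.
with_stop = "ATGCCCTAAGGG"     # in-frame TAA after one codon
without_stop = "ATGCCCGGGAAA"  # no TAA codon at all

stopped = pat.findall(with_stop)
ran_to_end = pat.findall(without_stop)

print(stopped)     # ['ATGCCC']
print(ran_to_end)  # ['ATGCCCGGGAAA']
```

The match stops just before the first in-frame TAA, and runs to the end of the string when there is none.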
If you want to stop at the first TAA codon, even if that codon is not aligned with the ATG, you can do it like this:
In [7]: pat = re.compile(r"ATG(?:(?!.{0,2}TAA)\w\w\w)*")
In [8]: pat.findall(s)
Out[8]: ['ATG']
In [10]: pat.findall('ATGCCTGAATATAAG')
Out[10]: ['ATGCCTGAA']