Question

Why does my pattern produce this result? I expect it to find ATG followed by a sequence of three-letter codons that does not include TAA.

In [102]: s = 'GATGCCTAAG'
In [103]: pat = re.compile("(ATG((\w\w\w)*)(?!TAA))")
In [104]: pat.findall(s)
Out[104]: [('ATGCCTAAG', 'CCTAAG', 'AAG')]

Solution

The findall method returns a list of matches. If the pattern contains capturing groups, then each match is a tuple of the strings matched by each capturing group in the pattern.

From the documentation:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Your pattern contains three capturing groups. The groups are nested. The first (and outermost) group is the entire pattern, (ATG((\w\w\w)*)(?!TAA)). The second group is ((\w\w\w)*). The third group is (\w\w\w).

Note that the negative lookahead assertion, (?!TAA), is not a capturing group.

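In essence, your pattern says to match the codon ATG, followed by as many codons as possible, and then to check that the next three characters are not TAA. The lookahead is checked only once, after the repetition has finished. Since * is greedy and \w\w\w matches TAA like any other codon, the repetition consumes every TAA it meets, whether in the middle or at the end of the input. By the time the lookahead runs, the next three characters cannot all be word characters, so it never sees a TAA and never rejects anything.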
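
A quick way to see the greedy repetition swallow a TAA codon is the minimal sketch below. The input string `middle` is made up for illustration and is not from the question.

```python
import re

# Same pattern as in the question, written as a raw string.
pat = re.compile(r"(ATG((\w\w\w)*)(?!TAA))")

# Hypothetical input with TAA sitting in-frame right after ATG.
middle = "ATGTAACCC"

match = pat.search(middle)
whole = match.group(0)

# The greedy (\w\w\w)* consumes TAA like any other codon.
print(whole)  # ATGTAACCC
```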

Because of your capturing groups, your pattern says that each returned match should contain three strings: the entire sequence of matched codons, the sequence of matched codons excluding the initial ATG, and the last matched codon in the sequence.
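
To inspect the three groups one by one, a sketch using `search` and `groups()` on the string from the question could look like this (the variable names `m` and `groups` are only illustrative):

```python
import re

s = 'GATGCCTAAG'
pat = re.compile(r"(ATG((\w\w\w)*)(?!TAA))")

m = pat.search(s)
groups = m.groups()

print(m.group(1))  # group 1: the entire matched sequence -> ATGCCTAAG
print(m.group(2))  # group 2: everything after the initial ATG -> CCTAAG
print(m.group(3))  # group 3: only the last repetition of (\w\w\w) -> AAG
```

A repeated group keeps only its final iteration, which is why the third element is the last codon rather than a list of all codons.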

You can mark a group as non-capturing using (?:...), like this:

In [5]: pat = re.compile(r"(?:ATG(?:(?:\w\w\w)*)(?!TAA))")

If your pattern contains no capturing groups, then findall returns each match as a single string, instead of as a tuple.

In [6]: pat.findall(s)
Out[6]: ['ATGCCTAAG']

If you want to stop at the first TAA, but go to the end of the string if there is no TAA at all, you need to check each codon, by putting your negative lookahead assertion inside the repetition:

pat = re.compile(r"ATG(?:(?!TAA)\w\w\w)*")

This asserts, before consuming each codon after the initial ATG, that the codon is not TAA.
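
As a sketch of how the per-codon check behaves, compare the string from the question, where TAA is not aligned with ATG, against a made-up string where TAA is in frame. The second input and the variable names are assumptions for illustration.

```python
import re

pat = re.compile(r"ATG(?:(?!TAA)\w\w\w)*")

# In the question's string the codons after ATG are CCT and AAG,
# so the TAA that straddles them is never tested as a codon.
not_aligned = pat.findall('GATGCCTAAG')

# Hypothetical input: codons after ATG are CCT, TAA, GGG.
aligned = pat.findall('ATGCCTTAAGGG')

print(not_aligned)  # ['ATGCCTAAG']
print(aligned)      # ['ATGCCT']
```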

If you want to stop at the first TAA codon, even if that codon is not aligned with the ATG, you can do it like this:

In [7]: pat = re.compile(r"ATG(?:(?!.{0,2}TAA)\w\w\w)*")

In [8]: pat.findall(s)
Out[8]: ['ATG']

In [10]: pat.findall('ATGCCTGAATATAAG')
Out[10]: ['ATGCCTGAA']

Other tips

In addition to what @rob mayoff wrote: in the re module, * also matches zero repetitions of the preceding item.

From the docs:

*

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

I think that your best solution would be to make a very simple regular expression that does capture TAA, and then apply a filter that strips out TAA patterns.
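
One possible reading of this suggestion, sketched under assumptions: match ATG plus any codons with a plain pattern, then split the match into codons and cut it at the first TAA in plain Python. The helper name `orf_before_stop` and its `stop` parameter are hypothetical, not part of the re module.

```python
import re

def orf_before_stop(seq, stop="TAA"):
    """Hypothetical helper: return ATG plus the following in-frame
    codons up to, but not including, the first stop codon."""
    m = re.search(r"ATG(?:\w\w\w)*", seq)
    if m is None:
        return None
    matched = m.group(0)
    # The match length is always a multiple of three, so this
    # slicing yields whole codons only.
    codons = [matched[i:i + 3] for i in range(0, len(matched), 3)]
    kept = []
    for codon in codons:
        if codon == stop:
            break
        kept.append(codon)
    return "".join(kept)

print(orf_before_stop('ATGCCTTAAGGG'))  # ATGCCT
print(orf_before_stop('GATGCCTAAG'))    # ATGCCTAAG
```

This keeps the regular expression trivial and moves the stop-codon logic into ordinary code, which may be easier to extend to other stop codons.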

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow