BioPython: How to search for a motif in a collection of Seq objects

https://stackoverflow.com/questions/19553651

01-07-2022

Question

I have a list of Seq objects from BioPython and I want to search for an amino acid sequence motif within these sequences. What is the best way to search them? The motif is something like GxxxG, but the gap may be longer or shorter, and the match should stop at the first G after the initial G. A regular expression such as G.*G instead matches from the first G, through any number of amino acids, to the last G found.

# Some example code
from Bio.Seq import Seq
import re

# Bio.Alphabet was removed in Biopython 1.78, so no alphabet argument is passed.
records = Seq("WALLLLFWLGWLGMLAGAVVIIVR")

search = re.search("F.*G", str(records))
print(search.group())
# Want: FWLG
# Get:  FWLGWLGMLAG

Solution

You want a lazy match.

A.*B given ABBBBBBBBBBBBBE can be thought of as first trying to match the whole string:

ABBBBBBBBBBBBBE
^-------------^

It decides "that doesn't match" and tries one letter less:

ABBBBBBBBBBBBBE
^------------^

It decides "that does match" and returns it.

A lazy match A.*?B will try and match as little as possible. In this case:

ABBBBBBBBBBBBBE
^^

That is A, then 0 characters, then B: it decides "that's a match" and returns just AB.

? usually means optional, but because * is already a quantifier (0 or more), a ? placed directly after it makes it lazy instead.
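The two roles of ? can be checked directly with Python's re module. This is only a small sketch on the toy string from above; the variable names are chosen for illustration.

```python
import re

text = "ABBBBBBBBBBBBBE"

# Greedy: .* grabs as much as possible, then backtracks to the last B.
greedy = re.search(r"A.*B", text).group()

# Lazy: the ? after * makes the quantifier take as little as possible.
lazy = re.search(r"A.*?B", text).group()

# Optional: here ? follows a plain character, so it means "zero or one B".
optional = re.search(r"AB?", text).group()

print(greedy)
print(lazy)
print(optional)
```

The greedy match runs up to the final B, while the lazy match stops at the first one.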

You want F.*?G
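Applied to the sequence from the question, the difference looks like this. The sketch uses a plain string, which is what str() returns for a Seq object, so it runs without Biopython installed.

```python
import re

# The protein sequence from the question, as a plain string.
sequence = "WALLLLFWLGWLGMLAGAVVIIVR"

# Greedy pattern: extends to the last G in the sequence.
greedy_hit = re.search(r"F.*G", sequence).group()

# Lazy pattern: stops at the first G after the F.
lazy_hit = re.search(r"F.*?G", sequence).group()

print(greedy_hit)  # FWLGWLGMLAG
print(lazy_hit)    # FWLG
```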

Other tips

A lazy quantifier tends to be the slower method, because the engine has to check for G after every character it consumes. To stop at the first occurrence of G, you can use a negated character class instead of the dot. Example:

F[^G]*G

[^G] matches any single character except G.

Then you can use a greedy quantifier.
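For a collection of sequences, as in the question, the negated character class can be applied in a loop. The following is a minimal sketch: find_first_motif is a hypothetical helper, not part of Biopython, and the second and third sequences are made-up test data. Plain strings stand in for Seq objects; the helper calls str() on each item, so a list of Seq objects should work the same way.

```python
import re

def find_first_motif(sequences, pattern):
    """Return (index, start, matched_text) for the first hit in each sequence."""
    compiled = re.compile(pattern)
    hits = []
    for index, seq in enumerate(sequences):
        match = compiled.search(str(seq))  # str() also converts a Seq object
        if match:
            hits.append((index, match.start(), match.group()))
    return hits

# First entry is from the question; the other two are invented examples.
sequences = ["WALLLLFWLGWLGMLAGAVVIIVR", "MKTAYIAKQR", "AAGLIVGAA"]

# G, then any run of non-G residues, then the next G.
hits = find_first_motif(sequences, r"G[^G]*G")
for index, start, text in hits:
    print(index, start, text)
```

Sequences without a match are simply skipped. To collect every non-overlapping hit rather than only the first, re.finditer could replace the search call.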

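To get an idea of the speed gain, you can profile the different patterns with this code:

import cProfile
import re

# Build a long test string: a G-free block doubled 14 times, prefixed with F.
s = 'ACATCATCTATCTATACAATAAAAACTATCCCCTAACTACTACACTACTATCATCACATCATATCACTTTATATCCTAC'
for i in range(1, 15):
    s = s + s

s = 'F' + s
# Append a block that contains G, followed by a second copy of the string.
s = s + 'ATCTATCTATACAATAATCTATCTATACAATAATCTATCGATCTATCTATACAATAATCTATCTATACAATATCG' + s

# Swap in r"F.*?G" here to profile the lazy pattern for comparison.
cProfile.run('re.search(r"F[^G]+G", s)')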
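As an alternative to cProfile, the two patterns can be timed side by side with timeit. This is a rough sketch under assumptions: the string length and repeat count are arbitrary choices, and the timings depend on the machine and Python version, so no figures are given here.

```python
import re
import timeit

# G-free filler taken from the profiling code above, repeated to get a long string.
filler = "ACATCATCTATCTATACAATAAAAACTATCCCCTAACTACTACACTACTATCATCACATCATATCACTTTATATCCTAC" * 1000
text = "F" + filler + "G" + filler

patterns = {"lazy": r"F.*?G", "negated class": r"F[^G]*G"}
results = {}

for name, pattern in patterns.items():
    compiled = re.compile(pattern)
    # Time 20 searches of the same string with this pattern.
    seconds = timeit.timeit(lambda: compiled.search(text), number=20)
    results[name] = compiled.search(text).group()
    print(f"{name:>14}: {seconds:.4f} s for 20 searches")
```

Both patterns return the same match on this input, so only the running time differs.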

License: CC-BY-SA with attribution

Not affiliated with StackOverflow