Question

Given an arbitrary sequence, how can I check whether it is a protein sequence?

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_prot = Seq("TGEKPYVCQECGKAFNCSSYLSKHQR")
my_prot


my_prot.alphabet #How to make a check here ??

Solution

If your Seq object has an assigned alphabet, you can check if that alphabet is a protein alphabet:

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC, ProteinAlphabet
my_prot = Seq("TGEKPYVCQECGKAFNCSSYLSKHQR", alphabet=IUPAC.IUPACProtein())

print(isinstance(my_prot.alphabet, ProteinAlphabet))

However, if the alphabet is not known, you'll have to use heuristics to guess whether it's a protein sequence. This could be as simple as checking whether the sequence consists entirely of the nucleotide letters A, C, G, and T (or U for RNA), or whether it uses other letter codes.

But this isn't perfect. For instance, the sequence "ATCG" could be alanine, threonine, cysteine, glycine (i.e. a protein), or it could be adenine, thymine, cytosine, guanine (DNA). Similarly, "ACG" could be a protein, RNA, or DNA. From the letters alone, it is impossible to be sure that such a sequence is DNA and not a protein. However, if you have a SeqRecord or other context for the Seq, you may be able to tell from that context whether it's a protein sequence.
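As a rough illustration of that heuristic and its limits, here is a minimal sketch that reports every molecule type whose alphabet covers all letters in a string. The function name `candidate_types` and the three letter sets are assumptions made for this example (standard unambiguous codes only, no IUPAC ambiguity letters); they are not part of Biopython.

```python
# Hypothetical helper, not a Biopython API: plain set comparisons only.
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def candidate_types(sequence):
    """Return the set of molecule types whose alphabet covers every letter."""
    letters = set(sequence.upper())
    if not letters:
        return set()  # an empty string tells us nothing
    candidates = set()
    if letters <= DNA_LETTERS:
        candidates.add("dna")
    if letters <= RNA_LETTERS:
        candidates.add("rna")
    if letters <= PROTEIN_LETTERS:
        candidates.add("protein")
    return candidates


print(candidate_types("TGEKPYVCQECGKAFNCSSYLSKHQR"))  # only the protein alphabet fits
print(candidate_types("ATCG"))  # both DNA and protein fit
print(candidate_types("ACG"))   # DNA, RNA, and protein all fit
```

A result with more than one entry is exactly the ambiguous case described above, so the caller still needs outside context to decide.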

Other tips

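Bio.Alphabet was removed in Biopython 1.78, so the isinstance check above only works with older versions.

The following is adapted from https://www.biostars.org/p/102/ and validates the sequence string with regular expressions instead: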


import re

from Bio.Seq import Seq

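def validate(seq, alphabet='dna'):
    """Return True if seq contains only letters from the named alphabet."""
    # Case-insensitive patterns; note that an empty string also matches.
    alphabets = {'dna': re.compile('^[acgtn]*$', re.I),
                 'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}

    return alphabets[alphabet].search(seq) is not None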



dataz = 'AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG'

pippo = Seq(dataz)

print(pippo, type(pippo))

print(validate(str(pippo), 'dna'))

print(validate(str(pippo), 'protein'))

dataz = 'atg'

pippo = Seq(dataz)

print(pippo, type(pippo))

print(validate(str(pippo), 'dna'))

print(validate(str(pippo), 'protein'))

output:

AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG <class 'Bio.Seq.Seq'>
False
True
atg <class 'Bio.Seq.Seq'>
True
True
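
The second result shows the limitation: 'atg' passes both checks. One possible tie-breaker, sketched below, is to look at letter composition and treat a sequence made up almost entirely of nucleotide letters as nucleotide. The function name `looks_like_nucleotide` and the 0.9 threshold are assumptions chosen for illustration, not values taken from Biopython or from the linked post, and a short peptide such as "ATCG" will still be misclassified.

```python
def looks_like_nucleotide(sequence, threshold=0.9):
    """Guess whether a sequence is nucleotide from its letter composition.

    Hypothetical heuristic: the threshold is an arbitrary assumption.
    """
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    # Count letters that are valid nucleotide codes (N = any base).
    nucleotide_count = sum(ch in "ACGTUN" for ch in letters)
    return nucleotide_count / len(letters) >= threshold


print(looks_like_nucleotide("atg"))
print(looks_like_nucleotide("AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG"))
```

The first call reports the sequence as nucleotide-like; the second does not, because D, R, and E make up a large share of its letters.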
License: CC-BY-SA with attribution
Not affiliated with StackOverflow