BioPython: How to convert the amino acid alphabet to

https://stackoverflow.com/questions/19552897

01-07-2022

Question

When discussing how to import sequence data using Bio.SeqIO.parse(), the BioPython cookbook states that:

There is an optional argument alphabet to specify the alphabet to be used. This is useful for file formats like FASTA where otherwise Bio.SeqIO will default to a generic alphabet.

How do I add this optional argument? I have the following code:

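from os.path import abspath
from Bio import SeqIO

# f_path holds the path to the FASTA file (defined elsewhere).
# The "rU" mode was removed in Python 3.11; the default text mode is enough.
with open(f_path) as handle:
    records = list(SeqIO.parse(handle, "fasta"))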

This imports a large list of FASTA records from a UniProt database. The problem is that the sequences come back with the generic SingleLetterAlphabet class. How do I convert from SingleLetterAlphabet to ExtendedIUPACProtein?

The ultimate goal is to search through these sequences for a motif such as GxxxG.

Solution

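Pass the alphabet as the third argument to SeqIO.parse. This applies to Biopython releases before 1.78; Bio.Alphabet was removed in 1.78, and later releases no longer accept the argument.

# Import the required alphabet (Biopython < 1.78 only)
from Bio.Alphabet import IUPAC

# Pass the imported alphabet as the third argument to `SeqIO.parse`:
records = list(SeqIO.parse(handle, 'fasta', IUPAC.extended_protein))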
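For the stated goal of finding a motif such as GxxxG, the alphabet may not be needed at all, since the search can run on the sequence as a plain string. The following is a minimal sketch using only the standard library, assuming that "x" means any single residue. The helper name find_gxxxg and the example sequence are hypothetical, not part of the original answer or of Biopython.

```python
import re

# "G", any three residues, then "G". The lookahead keeps each match zero-width
# so that overlapping hits (e.g. in "GAAAGAAAG") are all reported.
GXXXG = re.compile(r"(?=(G.{3}G))")


def find_gxxxg(sequence):
    """Return (start, matched_text) pairs for every GxxxG hit (0-based start)."""
    # str() lets the function accept a plain string or a sequence object
    # that converts to its letters via str().
    text = str(sequence).upper()
    return [(m.start(), m.group(1)) for m in GXXXG.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sequence, not taken from the UniProt data in the question.
    for start, hit in find_gxxxg("MKGAAAGLLGVVVGA"):
        print(start, hit)
```

With records loaded as in the question, this could be applied as find_gxxxg(record.seq) for each record in records.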

Licensed under: CC-BY-SA with attribution
