Fetching genomic sequence efficiently in Python?

https://stackoverflow.com/questions/3191774

02-10-2019
|

Question

How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.

Analogously, is there a programmatic python interface for getting phastCons scores?

thanks.

Solution

See my answer to your question over at Biostar:

http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc

Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:

region = rec.seq[start:end]

to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.

OTHER TIPS

Retrieving sequence data from large human chromosome files can be inefficient memory-wise, so if you're looking for computational efficiency you can format the sequence data to a packed binary string and lookup based on byte location. I wrote routines to do this in perl (available here ), and python has the same pack and unpack routines - so it can be done, but only worth it if you're running in to trouble with large files on a limited machine. Otherwise use biopython SeqIO

Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.

pyfasta is the module you're looking for. From the description

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files

https://github.com/brentp/pyfasta

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow