Fetching genomic sequence efficiently in Python?
-
02-10-2019 - |
Question
How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.
Analogously, is there a programmatic python interface for getting phastCons scores?
thanks.
Solution
See my answer to your question over at Biostar:
Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:
region = rec.seq[start:end]
to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.
OTHER TIPS
Retrieving sequence data from large human chromosome files can be inefficient memory-wise, so if you're looking for computational efficiency you can format the sequence data to a packed binary string and lookup based on byte location. I wrote routines to do this in perl (available here ), and python has the same pack and unpack routines - so it can be done, but only worth it if you're running in to trouble with large files on a limited machine. Otherwise use biopython SeqIO
Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.
pyfasta is the module you're looking for. From the description
fast, memory-efficient, pythonic (and command-line) access to fasta sequence files