Fetching genomic sequence efficiently in Python?

https://stackoverflow.com/questions/3191774

02-10-2019
|

문제

How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.

Analogously, is there a programmatic python interface for getting phastCons scores?

thanks.

해결책

See my answer to your question over at Biostar:

http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc

Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:

region = rec.seq[start:end]

to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.

다른 팁

Retrieving sequence data from large human chromosome files can be inefficient memory-wise, so if you're looking for computational efficiency you can format the sequence data to a packed binary string and lookup based on byte location. I wrote routines to do this in perl (available here ), and python has the same pack and unpack routines - so it can be done, but only worth it if you're running in to trouble with large files on a limited machine. Otherwise use biopython SeqIO

Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.

pyfasta is the module you're looking for. From the description

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files

https://github.com/brentp/pyfasta

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow