Fetching genomic sequence efficiently in Python?

https://stackoverflow.com/questions/3191774

02-10-2019
Question

How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) that returns the sequence [start, end] on the given chromosome on the specified strand.

Analogously, is there a programmatic Python interface for getting phastCons scores?

Thanks.

Solution

See my answer to your question over at Biostar:

http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc

Use SeqIO with FASTA files and you will get back a record object for each item in the file. Then you can do:

region = rec.seq[start:end]

to pull out slices. The nice thing about using a standard library is that you don't have to worry about the line breaks in the original FASTA file.
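To make the slicing idea concrete, here is a minimal stdlib-only sketch of the fetch_seq interface from the question. It is not Biopython's implementation: read_fasta is a hypothetical helper that joins the wrapped lines of each record, and the extra records argument, the 0-based half-open coordinates (matching rec.seq[start:end]) and the "+"/"-" strand convention are all assumptions made for illustration.

```python
def read_fasta(path):
    """Read a FASTA file into a dict of {record name: sequence string}.

    Hypothetical helper: loads everything into memory, so it suits
    small-to-medium files only.
    """
    records = {}
    name = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record.
                if name is not None:
                    records[name] = "".join(chunks)
                # Use the first word of the header as the record name.
                name = (line[1:].split() or [""])[0]
                chunks = []
            else:
                # Wrapped sequence lines are simply concatenated.
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Complement table for the reverse strand (N stays N).
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def fetch_seq(records, chrom, strand, start, end):
    """Return bases [start, end) (0-based, half-open; an assumption).

    For strand "-" the reverse complement of the region is returned.
    """
    region = records[chrom][start:end]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region
```

With Biopython installed, SeqIO.to_dict(SeqIO.parse(path, "fasta")) would play the role of read_fasta, and rec.seq[start:end].reverse_complement() would cover the minus-strand case.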

Other tips

Retrieving sequence data from large human chromosome files can be memory-inefficient, so if you're after computational efficiency you can convert the sequence data to a packed binary string and look up regions by byte location. I wrote routines to do this in Perl (available here), and Python has the same pack and unpack routines (in its struct module), so it can be done, but it is only worth it if you're running into trouble with large files on a limited machine. Otherwise use Biopython's SeqIO.
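As a rough illustration of the packed-binary idea, the sketch below stores four bases per byte and reads back only the bytes that cover the requested region. It is not a port of the Perl routines mentioned above: the 2-bit encoding, the function names pack_sequence and fetch_packed, and the restriction to plain A/C/G/T (no N) are assumptions, and it uses plain bit shifting rather than struct.pack.

```python
# 2 bits per base, so four bases fit in one byte. Ambiguity codes such
# as N cannot be represented in this simple scheme (an assumption of
# this sketch).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_sequence(seq):
    """Pack an upper-case ACGT string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so that base k of a byte
        # always occupies the same two bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def fetch_packed(path, start, end):
    """Return bases [start, end) (0-based) from a packed file.

    Only the bytes covering the region are read, so memory use does
    not depend on the chromosome size.
    """
    if end <= start:
        return ""
    first_byte = start // 4
    last_byte = (end - 1) // 4
    with open(path, "rb") as handle:
        handle.seek(first_byte)
        data = handle.read(last_byte - first_byte + 1)
    bases = []
    for pos in range(start, end):
        byte = data[pos // 4 - first_byte]
        shift = 2 * (3 - pos % 4)
        bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases)
```

A file would be prepared once per chromosome by writing pack_sequence(seq) in binary mode; after that, fetch_packed(path, start, end) touches roughly (end - start) / 4 bytes per query.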

Take a look at Biopython, which supports several sequence file formats, including FASTA and GenBank, to name a couple.

pyfasta is the module you're looking for. From the description:

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files

https://github.com/brentp/pyfasta

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow