>>> from lxml import etree
>>> doc = etree.parse("http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060",parser=etree.XMLParser())
>>> doc.xpath('SEQUENCE/DNA/text()')
['\natagtggcacatgtctgttgtcctagctcctcggggaaactcaggtggga\ngagtcccttgaactgggaggaggaggtttgcagtgagccagaatcattcc\nactgtactccagcctaggtgacagagcaagactcatctcaaaaaaaaaaa\naaaaaaaaaaaaaagacaatccgcacacataaaggctttattcagctgat\ngtaccaaggtcactctctcagtcaaaggtgggaagcaaaaaaacagagta\naaggaaaaacagtgatagatgaaaagagtcaaaggcaagggaaacaaggg\naccttctatctcatctgtttccattcttttacagacctttcaaatccgga\ngcctacttgttaggactgatactgtctcccttctttctgctttgtgtcag\ngtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc\ntccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg\ncgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc\ntttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc\nacactctatcaataaacacctctggctga\n']
Python: fetch a sequence from DAS by coordinates
Question
The UCSC DAS server returns DNA sequences by coordinate.
URL: http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060
Sample file:
<DASDNA>
<SEQUENCE id="chr20" start="30037832" stop="30038060" version="1.00">
<DNA length="229">
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
</DNA>
</SEQUENCE>
</DASDNA>
What I want is this part:
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
I want to get the sequence part from thousands of URLs of this kind. How should I do it? I tried writing the data to a file and parsing the file, which worked, but is there a way to parse the XML-like string directly? I tried some examples from other posts, but they didn't work.
I have added my solutions here, thanks to the two answers below.
Solution 1:
from lxml import etree

def getSequence2(chromosome, start, end):
    base = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
    url = base + chromosome + ':' + str(start) + ',' + str(end)
    doc = etree.parse(url, parser=etree.XMLParser())
    # xpath() returns a list, which is empty when there is no DNA text
    result = doc.xpath('SEQUENCE/DNA/text()')
    if result:
        sequence = result[0].replace('\n', '')
    else:
        sequence = 'THE SEQUENCE DOES NOT EXIST FOR GIVEN COORDINATES'
    return sequence
Solution 2:
import urllib2
from xml.dom.minidom import parse

def getSequence1(chromosome, start, end):
    base = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
    url = base + chromosome + ':' + str(start) + ',' + str(end)
    xml = urllib2.urlopen(url).read()
    if xml != '':
        # write the response to a temporary file, then parse that file
        w = open('temp.xml', 'w')
        w.write(xml)
        w.close()
        dom = parse('temp.xml')
        data = dom.getElementsByTagName('DNA')
        sequence = data[0].firstChild.nodeValue.replace('\n', '')
    else:
        sequence = 'THE SEQUENCE DOES NOT EXIST FOR GIVEN COORDINATES'
    return sequence
Both functions need the relevant imports: etree from lxml for Solution 1, and urllib2 plus parse from xml.dom.minidom for Solution 2.
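The temporary file in Solution 2 can probably be avoided, since xml.dom.minidom also provides parseString, which parses XML that is already in memory. The following is only a minimal sketch: extract_sequence is a hypothetical helper name, and the input is the sample response from the question rather than a live download.

```python
from xml.dom.minidom import parseString

# Sample response copied from the question (not fetched from the server).
SAMPLE = """<DASDNA>
<SEQUENCE id="chr20" start="30037832" stop="30038060" version="1.00">
<DNA length="229">
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
</DNA>
</SEQUENCE>
</DASDNA>"""


def extract_sequence(xml_text):
    """Return the DNA text with newlines removed, or None if it is absent."""
    dom = parseString(xml_text)  # parse the string directly, no temp file
    nodes = dom.getElementsByTagName('DNA')
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.nodeValue.replace('\n', '')


sequence = extract_sequence(SAMPLE)
```

In getSequence1, the string returned by read() could be passed to such a helper in place of the write-then-parse steps.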
Solution
Other tips
Use a Python XML parsing library like lxml, load the XML file with that parser, and then use a selector (e.g. using XPath) to grab the node/element that you need.
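As a rough illustration of the parser-plus-selector approach, here is a sketch that uses the standard library's xml.etree.ElementTree instead of lxml, so that it runs without extra packages (lxml offers etree.fromstring as well). The name sequence_from_xml is hypothetical, and RESPONSE is a shortened, hand-made string in the same shape as the sample file, not real server output.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-made response in the same shape as the sample file.
RESPONSE = """<DASDNA>
<SEQUENCE id="chr20">
<DNA>
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
</DNA>
</SEQUENCE>
</DASDNA>"""


def sequence_from_xml(xml_text):
    """Select SEQUENCE/DNA below the root and strip all whitespace."""
    root = ET.fromstring(xml_text)        # root is the <DASDNA> element
    text = root.findtext('SEQUENCE/DNA')  # path is relative to the root
    if text is None:
        return None
    return ''.join(text.split())


sequence = sequence_from_xml(RESPONSE)
```

For thousands of URLs, the same function could be called once per downloaded response inside a loop.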