Question

With the lxml.etree Python library, is it more efficient to parse XML directly from a link to an online XML file, or is it better to use a different library (such as urllib2) to return a string and then parse that? Or does it make no difference at all?

Method 1 - Parse directly from link

from lxml import etree as ET

parsed = ET.parse(url_link)

Method 2 - Parse from string

from lxml import etree as ET
import urllib2

xml_string = urllib2.urlopen(url_link).read()
parsed = ET.fromstring(xml_string)

# note: I do not have access to Python
# at the moment, so not sure whether
# the .fromstring() function is correct

Or is there a more efficient method than either of these, e.g. saving the XML to a .xml file on the desktop and then parsing from that file?


Solution

I ran the two methods with a simple timing wrapper.
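The timing wrapper itself is not shown in this answer. A minimal sketch of what such a decorator might look like follows, written for Python 3 (the snippets in this thread are Python 2). The names timing, samples_ms and average_ms are assumptions made for illustration, not the code that produced the numbers below.

```python
import functools
import time


def timing(func):
    """Hypothetical stand-in for the @timing wrapper used in this answer.

    Records the wall-clock duration of every call, in milliseconds,
    on the wrapper's samples_ms list.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        wrapper.samples_ms.append(elapsed_ms)
        return result

    wrapper.samples_ms = []  # one entry per call
    return wrapper


def average_ms(timed_func):
    """Mean duration in milliseconds over all recorded calls."""
    samples = timed_func.samples_ms
    return sum(samples) / len(samples)
```

With something like this in place, calling a decorated function 100 times and then calling average_ms on it would give a figure comparable to the "Average of 100" values reported here.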

Method 1 - Parse XML Directly From Link

from lxml import etree as ET

@timing
def parseXMLFromLink():
    parsed = ET.parse(url_link)
    print parsed.getroot()

for n in range(0,100):
    parseXMLFromLink()

Average of 100 = 98.4035 ms

Method 2 - Parse XML From String Returned By Urllib2

from lxml import etree as ET
import urllib2

@timing
def parseXMLFromString():
    xml_string = urllib2.urlopen(url_link).read()
    parsed = ET.fromstring(xml_string)
    print parsed

for n in range(0,100):
    parseXMLFromString()

Average of 100 = 286.9630 ms

So, anecdotally, parsing directly from the link with lxml appears to be the faster method. It's not clear whether it would be faster to download large XML documents and then parse them from the hard drive, but unless the document is huge and the parsing task more intensive, parseXMLFromLink() would presumably remain quicker, since it is urllib2 that seems to slow the second function down.

I ran this a few times and the results stayed the same.

Other answers

If by 'effective' you mean 'efficient', I'm relatively certain you will see no difference between the two at all (unless ET.parse(link) is horribly implemented).

The reason is that the network time is going to be the most significant part of parsing an online XML file, a lot longer than storing the file to disk or keeping it in memory, and a lot longer than actually parsing it.
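One way to check this claim is to time the download and the parse separately. The sketch below is only an illustration under stated assumptions: time_stages and fake_fetch are hypothetical names, the fetch is simulated with a sleep rather than a real request, and the code is Python 3. It uses lxml if it is installed and otherwise falls back to the standard library's ElementTree, which offers the same fromstring() call.

```python
import time

try:
    from lxml import etree as ET  # the parser discussed in this thread
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with the same fromstring() call


def time_stages(fetch, parse=ET.fromstring):
    """Run fetch() then parse(data); return (fetch_ms, parse_ms, root)."""
    t0 = time.perf_counter()
    data = fetch()
    t1 = time.perf_counter()
    root = parse(data)
    t2 = time.perf_counter()
    return (t1 - t0) * 1000.0, (t2 - t1) * 1000.0, root


def fake_fetch(latency_s=0.05):
    """Hypothetical stand-in for a download: sleeps, then returns XML bytes."""
    time.sleep(latency_s)  # assumed latency, not a measured value
    return b"<root><item>1</item><item>2</item></root>"


if __name__ == "__main__":
    fetch_ms, parse_ms, root = time_stages(fake_fetch)
    print("fetch: %.2f ms, parse: %.2f ms" % (fetch_ms, parse_ms))
```

To measure a real document, pass a callable that performs the request as fetch, for example lambda: urllib.request.urlopen(url_link).read(), the Python 3 counterpart of urllib2.urlopen. The actual split between the two stages will depend on the network and on the size of the document.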

Licensed under: CC-BY-SA with attribution