Question

In answering another question, someone showed me the following tutorial, in which the author claims to have used iterparse to parse a ~100 MB XML file in under 3 seconds:

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/

I am trying to parse an ~90 MB XML file, and I have the following code:

from xml.etree.cElementTree import *
count = 0

for event, elem in iterparse('foo.xml'):        
    if elem.tag == 'identifier' and elem.text == 'bar':
        count += 1
    elem.clear() # discard the element

print count

It is taking about thirty seconds, which is not even the same order of magnitude as the time reported in the tutorial, even though I am using a similarly sized file, a similar algorithm, and the same package.

Could someone please inform me what might be wrong with my code, or what differences I might not be noticing between my situation and the tutorial?

I am using Python 2.7.3.
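
(For reference, one reproducible way to time just the parsing loop is to wrap it in a command-line timeit run; this is only a sketch, assuming the same foo.xml and the same 'bar' value as in the code above.)

$ python -m timeit -n 1 'from xml.etree.cElementTree import iterparse
count = 0
for event, elem in iterparse("foo.xml"):
    if elem.tag == "identifier" and elem.text == "bar":
        count += 1
    elem.clear()
print count'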

Addendum:

I am also using a reasonably powerful machine, in case anyone suspects the hardware might be the issue.


Solution

As TJD mentioned, comparing XML files by size alone may not be very informative. However, I happen to have files with the same structure but different sizes:

With a 79 MB file:

$ python -m timeit -n 1 -c 'from xml.etree.cElementTree import iterparse
count = 0
for event, elem in iterparse("..../QT20060217_S_18mix23-2500_01.mzML"):
    if elem.tag.endswith("spectrum"): count += 1
    elem.clear()
print count'
6126
6126
6126
1 loops, best of 3: 950 msec per loop

With a 3.8 GB file, the timeit output is:

1 loops, best of 3: 22.3 sec per loop

Also, compare with lxml: changing xml.etree.cElementTree to lxml.etree in the first line of the script (see the sketch below the timings), I get:

for the first file: 730 msec per loop

for the second file: 11.4 sec per loop
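
To make the one-line change concrete, the lxml version of the benchmark would look like this (a sketch; only the import differs, and the file path is the same placeholder as above):

$ python -m timeit -n 1 -c 'from lxml.etree import iterparse
count = 0
for event, elem in iterparse("..../QT20060217_S_18mix23-2500_01.mzML"):
    if elem.tag.endswith("spectrum"): count += 1
    elem.clear()
print count'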
