Showing progress of Python's XML parser when loading a huge file

StackOverflow https://stackoverflow.com/questions/1001871

Question

I'm using Python's built-in XML parser to load a 1.5 GB XML file, and it takes all day.

from xml.dom import minidom
xmldoc = minidom.parse('events.xml')

I need to know how to get inside that and measure its progress so I can show a progress bar. Any ideas?

minidom has another method called parseString() that returns a DOM tree, assuming the string you pass it is valid XML. If I were to split up the file myself into chunks and pass them to parseString() one at a time, could I merge all the DOM trees back together at the end?

Solution

Your use case calls for a SAX parser rather than DOM. DOM loads the whole document into memory, whereas SAX parses the file incrementally and calls the handlers you write for the events you care about. That is far more efficient here, and it also lets you write a progress indicator.

I also recommend trying the expat parser sometime; it is very useful: http://docs.python.org/library/pyexpat.html
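As a rough sketch of the expat route: the parser can be fed one chunk at a time, so progress is simply the number of bytes fed so far. The function name `count_elements`, the `on_progress` callback, and the chunk size are made up for this illustration; only `ParserCreate()`, `StartElementHandler`, and `Parse()` come from the expat module.

```python
import io
import xml.parsers.expat


def count_elements(stream, total_bytes, on_progress, chunk_size=64 * 1024):
    """Feed a binary stream to expat in chunks and report percent done.

    `on_progress` is a hypothetical callback that receives an int 0..100.
    Returns a dict mapping element names to how often they occurred.
    """
    counts = {}

    def start_element(name, attrs):
        # Called by expat for every opening tag.
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element

    bytes_read = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)   # False: more data will follow
        bytes_read += len(chunk)
        on_progress(100 * bytes_read // total_bytes)

    parser.Parse(b"", True)          # True: tell expat the input is finished
    return counts


# Small in-memory demo; a real file would be opened with open(path, "rb")
# and sized with os.path.getsize(path).
demo_data = b"<events>" + b"<event/>" * 1000 + b"</events>"
demo_progress = []
demo_counts = count_elements(
    io.BytesIO(demo_data), len(demo_data), demo_progress.append, chunk_size=1024
)
```

Chunks may end in the middle of a tag; expat buffers the incomplete part until the next call, so no special splitting is needed.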

For progress using SAX:

Since SAX reads the file incrementally, you can wrap the file object you pass in with your own and keep track of how much has been read.
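A minimal sketch of that wrapper idea, assuming the file is opened in binary mode so the byte counts are comparable to the file size. `ProgressFile`, `EventCounter`, and the `on_progress` callback are hypothetical names invented for this example; `xml.sax.parse()` and `xml.sax.ContentHandler` are the standard library pieces.

```python
import io
import xml.sax


class ProgressFile:
    """Hypothetical wrapper that counts the bytes the parser reads."""

    def __init__(self, raw, total_bytes, on_progress):
        self._raw = raw
        self._total = total_bytes
        self._on_progress = on_progress
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        if self._total:
            self._on_progress(100 * self.bytes_read // self._total)
        return data

    def close(self):
        # Recent Python versions close the source when parsing finishes.
        self._raw.close()


class EventCounter(xml.sax.ContentHandler):
    """Counts <event> elements instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = 0

    def startElement(self, name, attrs):
        if name == "event":
            self.events += 1


# In-memory demo; for a real file use open(path, "rb") and os.path.getsize(path).
data = b"<events>" + b"<event id='1'/>" * 5000 + b"</events>"
percentages = []
source = ProgressFile(io.BytesIO(data), len(data), percentages.append)
handler = EventCounter()
xml.sax.parse(source, handler)
```

In a GUI, `on_progress` would update the progress bar rather than append to a list. The granularity depends on how much the parser reads per call, so the updates arrive in steps rather than continuously.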

Edit: I also don't like the idea of splitting the file yourself and joining the DOM trees at the end; at that point you are practically writing your own XML parser. I recommend using a SAX parser instead. I also wonder what your purpose is in reading a 1.5 GB file into a DOM tree. It looks like SAX would be a better fit here.

OTHER TIPS

Have you considered other means of parsing XML? Building a tree from such a big XML file will always be slow and memory-intensive. If you don't need the whole tree in memory, stream-based parsing will be much faster. It can be a little daunting if you're used to tree-based XML manipulation, but it will pay off in the form of a huge speed increase (minutes instead of hours).

http://docs.python.org/library/xml.sax.html

I have something very similar for PyGTK, not PyQt, using the pulldom API. It gets called a little bit at a time using Gtk idle events (so the GUI doesn't lock up) and Python generators (to save the parsing state).

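import os
import xml.dom.pulldom

def idle_handler(fn):
    fsize = os.path.getsize(fn)
    position = 0

    # Binary mode, so fh.tell() is a byte offset comparable to fsize.
    with open(fn, 'rb') as fh:
        doc = xml.dom.pulldom.parse(fh)

        for event, node in doc:
            # The parser reads in blocks, so the position only changes now and then.
            if position != fh.tell():
                position = fh.tell()
                # update status: position * 100 / fsize

            # if event == ....

            yield True   # idle handler stays until False is returned

    yield False

def main():
    # add_idle_handler stands in for the toolkit's idle-callback registration.
    add_idle_handler(idle_handler, filename)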

Merging the trees at the end would be pretty easy. You could just create a new DOM and append the individual trees to it one by one. This would give you pretty fine-grained control over the progress of the parsing too. You could even parallelize it if you wanted by spawning different processes to parse each section. You just have to make sure you split it intelligently (not splitting in the middle of a tag, etc.), since each chunk has to be well-formed XML on its own for parseString() to accept it.
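A minimal sketch of the merge step, assuming the file has already been split into chunks that are each a complete, well-formed XML document with a single root element. The function name `merge_fragments` and the default root name `events` are assumptions made for this example; `parseString()`, `importNode()`, and `appendChild()` are standard minidom calls.

```python
from xml.dom import minidom


def merge_fragments(fragments, root_name="events"):
    """Parse each fragment separately and collect them under one new root."""
    merged = minidom.Document()
    root = merged.createElement(root_name)
    merged.appendChild(root)

    for text in fragments:
        part = minidom.parseString(text)
        # importNode makes a deep copy owned by the merged document.
        root.appendChild(merged.importNode(part.documentElement, True))
        part.unlink()  # release the temporary tree

    return merged


fragments = ["<event id='1'/>", "<event id='2'><name>a</name></event>"]
merged = merge_fragments(fragments)
event_ids = [e.getAttribute("id") for e in merged.getElementsByTagName("event")]
```

Progress can be reported after each fragment in the loop. Note that the merged tree still ends up fully in memory, so this helps with the progress bar but not with the memory cost of DOM.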

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow