Question

I am trying to parse a very large XML file, so I decided to use lxml.iterparse as explained here.

So my code looks like this:

import sys
from lxml import etree

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def launchArticleProcessing(elem):
    print(elem)

context = etree.iterparse(sys.argv[1], events=('end',), tag='text')

fast_iter(context, launchArticleProcessing)

And I call it this way: python lxmlwtf.py "/path/to/my/file.xml"

The memory just fills up (until I kill the process, since the file would never fit into it) and nothing gets printed. What am I missing here?


Solution 2

My bad, as explained in my comment: lxml loads the file into memory until it finds an item matching the given tag.

If the tag is never found (for instance because lxml prepends the namespace to it), it just keeps loading the file into memory indefinitely, hence the issue.

So the fix is to provide the correct tag! I found the proper one by running a regular parser on a subset of my file.
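
For illustration, here is a minimal sketch of how one might discover the fully qualified tag name from a small excerpt of the file (the file names and the namespace URI are hypothetical; your document defines its own):

from lxml import etree

# Parse a small, representative excerpt of the big file
tree = etree.parse('subset.xml')
root = tree.getroot()

# lxml reports tags in Clark notation: '{namespace-uri}localname'
for child in root:
    print(child.tag)  # e.g. '{http://example.com/ns}text'

# iterparse then needs that same namespaced form
context = etree.iterparse('file.xml', events=('end',),
                          tag='{http://example.com/ns}text')

Recent lxml versions (3.0+) should also accept a namespace wildcard, e.g. tag='{*}text', if you do not care which namespace the element lives in.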

OTHER TIPS

I answered a very similar question here: lxml and fast_iter eating all the memory. The main reason is that lxml.etree still keeps in memory all the elements that are not caught explicitly, so you need to clear them manually.

What I did was not filter events by the tag I was looking for:

context = etree.iterparse(open(filename, 'rb'), events=('end',))

And instead checked the tag manually inside the loop, clearing every element afterwards:

for event, elem in context:
    if elem.tag == 'text':
        # do things here
        pass

    # every element gets cleared here
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context
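
If you want a progress indicator over the loop, a library such as tqdm (an assumption here; it is a separate install, not part of lxml) can wrap any iterable:

from tqdm import tqdm

# Same loop as above, with a console progress bar
for event, elem in tqdm(context):
    # same body as above
    ...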

In my experience, calling the garbage collector regularly can help quite a bit.

Something like this could do the trick:

import gc
import sys
from lxml import etree

def fast_iter(context, func):
    for i, (event, elem) in enumerate(context):
        # Garbage collect after every 100 elements
        if i % 100 == 0:
            gc.collect()

        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def launchArticleProcessing(elem):
    print(elem)

context = etree.iterparse(sys.argv[1], events=('end',), tag='text')

fast_iter(context, launchArticleProcessing)
Licensed under: CC-BY-SA with attribution