Question

I am trying to parse a very large XML file, so I decided to use lxml.iterparse as explained here.

So my code looks like this:

import sys
from lxml import etree

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def launchArticleProcessing(elem):
    print(elem)

context = etree.iterparse(sys.argv[1], events=('end',), tag='text')

fast_iter(context, launchArticleProcessing)

And I call it this way: python lxmlwtf.py "/path/to/my/file.xml"

The memory just fills up (until I kill the process, since the file would never fit into it) and nothing gets printed. What am I missing here?


Solution 2

My bad, as explained in my comment: lxml loads the file into memory until it finds an item matching the given tag.

If the tag is never found (for instance because lxml prepends the namespace to it), it just keeps loading the file into memory indefinitely, hence the issue.

So the fix is to provide the correct tag! I found the proper one by running a regular parser on a subset of my file.
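
For illustration, here is a minimal sketch of how one might discover the fully qualified tag name from a small excerpt of the file (the file names and the namespace URI are hypothetical; your document defines its own):

from lxml import etree

# Parse a small, representative excerpt of the big file
tree = etree.parse('subset.xml')
root = tree.getroot()

# lxml reports tags in Clark notation: '{namespace-uri}localname'
for child in root:
    print(child.tag)  # e.g. '{http://example.com/ns}text'

# iterparse then needs that same namespaced form
context = etree.iterparse('file.xml', events=('end',),
                          tag='{http://example.com/ns}text')

Recent lxml versions (3.0+) should also accept a namespace wildcard, e.g. tag='{*}text', if you do not care which namespace the element lives in.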

OTHER TIPS

I answered a very similar question here: lxml and fast_iter eating all the memory. The main reason is that lxml.etree still keeps in memory all the elements that are not caught explicitly, so you need to clear them manually.

What I did was not filter events by the tag I was looking for:

context = etree.iterparse(open(filename, 'rb'), events=('end',))

And instead checked the tag manually inside the loop, clearing every element afterwards:

for event, elem in context:
    if elem.tag == 'text':
        # do things here
        pass

    # every element gets cleared here
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context
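
If you want a progress indicator over the loop, a library such as tqdm (an assumption here; it is a separate install, not part of lxml) can wrap any iterable:

from tqdm import tqdm

# Same loop as above, with a console progress bar
for event, elem in tqdm(context):
    # same body as above
    ...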

In my experience, calling the garbage collector regularly can help quite a bit.

Something like this could do the trick:

import gc
import sys
from lxml import etree

def fast_iter(context, func):
    for i, (event, elem) in enumerate(context):
        # Garbage collect after every 100 elements
        if i % 100 == 0:
            gc.collect()

        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def launchArticleProcessing(elem):
    print(elem)

context = etree.iterparse(sys.argv[1], events=('end',), tag='text')

fast_iter(context, launchArticleProcessing)
Licensed under: CC-BY-SA with attribution