Domanda

I'm trying to do incremental parsing via iterparse() of an XML doc that is (designed to be) too large to fit in memory. I find doing even a no-op pass over the doc exhausts process memory and causes my system to start swapping.

Is it wrong to expect xml.etree.ElementTree.iterparse() to run in constant memory independent of size of XML doc? If so, what is the recommended package for doing incremental parsing of arbitrarily long XML docs? And if not, WTF is wrong with my code?

Here is the code: Note that I request 'start' events only (so parser doesn't try to buffer up all the body elements before returning the end tag of the document root element (<osm> in my case). and I explicitly del() the loop variables to force them to be freed.

Thinking that the garbage collector might not be getting a chance to run since the loop doesn't yield, I added explicit call to gc.collect() and time.sleep() every million iterations. But it doesn't help.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET
import pprint

import gc
import time
import os
import psutil

def gcStats(myProc):
    # return human readable gc.stats for 3 generations

    extmem = myProc.memory_info_ex()

    a =  "extmem: rss {:12n}, vms {:12n}, shared{:12n}, text{:12n}, lib {:12n}, data{:12n}, dirty{:12n}".format(
                    extmem.rss, extmem.vms, extmem.shared, extmem.text, extmem.lib, extmem.data, extmem.dirty)

    return a + "\tgc enabled {}, sumCount {:n}, lenGarbage {:n}".format( gc.isenabled(), sum(gc.get_count()), len(gc.garbage)) 

# the misbehaving code:    
def count_tags(filename):
    retVal = {}
    iterCount = 0
    sleepTime = 2.0
    myProc = psutil.Process()

    print("Starting: gc.isenabled() == {}\n{}".format(gc.isenabled(), gcStats( myProc)))

    for event, element in ET.iterparse(filename, ('start',)):
        assert event == 'start'
        if iterCount % 1000000 == 0:
            print('{} iterations, sleeping {} sec...'.format(iterCount, sleepTime))
            time.sleep( sleepTime)
            print('{}\nNow starting gc pass...'.format( gcStats( myProc)))
            gcr = gc.collect()
            print('gc returned {}'.format( gcr))
        iterCount += 1
        del element
        del event
    return retVal


if __name__ == "__main__":
    tags = count_tags('/home/bobhy/MOOC_Data/' + 'chicago.osm')

Here is a sample of the doc. It is well-formed OSM data.

<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6" generator="Osmosis 0.43.1">
  <bounds minlon="-88.50500" minlat="41.33900" maxlon="-87.06600" maxlat="42.29700" origin="http://www.openstreetmap.org/api/0.6"/>
  <node id="219850" version="54" timestamp="2011-04-06T05:17:15Z" uid="207745" user="NE2" changeset="7781188" lat="41.7585879" lon="-87.9101245">
    <tag k="exit_to" v="Joliet Road"/>
    <tag k="highway" v="motorway_junction"/>
    <tag k="ref" v="276C"/>
  </node>
  <node id="219851" version="47" timestamp="2011-04-06T05:18:47Z" uid="207745" user="NE2" changeset="7781188" lat="41.7593116" lon="-87.9076432">
    <tag k="exit_to" v="North I-294 ; Tri-State Tollway;  Wisconsin"/>
    <tag k="highway" v="motorway_junction"/>
    <tag k="ref" v="277A"/>
  </node>
  <node id="219871" version="1" timestamp="2006-04-15T00:34:03Z" uid="229" user="LA2" changeset="3725" lat="41.932278" lon="-87.9179332"/>
  <node id="700724" version="14" timestamp="2009-04-13T11:21:51Z" uid="18480" user="nickvet419" changeset="485405" lat="41.7120272" lon="-88.0158606"/>

. . . and so on for 1.8 GB . . .

  <relation id="3366425" version="1" timestamp="2013-12-07T21:37:35Z" uid="239998" user="Sundance" changeset="19330301">
    <member type="way" ref="250651738" role="outer"/>
    <member type="way" ref="250651748" role="inner"/>
    <tag k="type" v="multipolygon"/>
  </relation>
  <relation id="3378994" version="1" timestamp="2013-12-14T22:24:26Z" uid="371121" user="AndrewSnow" changeset="19456337">
    <member type="way" ref="251850076" role="outer"/>
    <member type="way" ref="251850073" role="inner"/>
    <member type="way" ref="251850074" role="inner"/>
    <member type="way" ref="251850075" role="inner"/>
    <tag k="type" v="multipolygon"/>
  </relation>
  <relation id="3382796" version="1" timestamp="2013-12-17T03:21:18Z" uid="567034" user="Umbugbene" changeset="19492258">
    <member type="way" ref="252225400" role="outer"/>
    <member type="way" ref="252225404" role="inner"/>
    <tag k="type" v="multipolygon"/>
  </relation>
</osm>

And here is the output:

Starting: gc.isenabled() == True
extmem: rss      9097216, vms     37199872, shared     3145728, text     3301376, lib            0, data     5820416, dirty           0 gc enabled True, sumCount 410, lenGarbage 0
0 iterations, sleeping 2.0 sec...
extmem: rss      9097216, vms     37335040, shared     3145728, text     3301376, lib            0, data     5955584, dirty           0 gc enabled True, sumCount 87, lenGarbage 0
Now starting gc pass...
gc returned 0
1000000 iterations, sleeping 2.0 sec...
extmem: rss   1234309120, vms   1262891008, shared     3280896, text     3301376, lib            0, data  1231511552, dirty           0 gc enabled True, sumCount 372, lenGarbage 0
Now starting gc pass...
gc returned 0
2000000 iterations, sleeping 2.0 sec...
extmem: rss   2495262720, vms   2524073984, shared     3280896, text     3301376, lib            0, data  2492694528, dirty           0 gc enabled True, sumCount 37, lenGarbage 0
Now starting gc pass...
gc returned 0
3000000 iterations, sleeping 2.0 sec...
extmem: rss   3781947392, vms   3812208640, shared     3280896, text     3301376, lib            0, data  3780829184, dirty           0 gc enabled True, sumCount 262, lenGarbage 0
Now starting gc pass...
gc returned 0
4000000 iterations, sleeping 2.0 sec...
extmem: rss   5067837440, vms   5096787968, shared     3280896, text     3301376, lib            0, data  5065408512, dirty           0 gc enabled True, sumCount 241, lenGarbage 0
Now starting gc pass...
gc returned 0
5000000 iterations, sleeping 2.0 sec...
extmem: rss   6345998336, vms   6375632896, shared     3063808, text     3301376, lib            0, data  6344253440, dirty           0 gc enabled True, sumCount 333, lenGarbage 0
Now starting gc pass...
gc returned 0
6000000 iterations, sleeping 2.0 sec...
extmem: rss   7266795520, vms   7665147904, shared     1060864, text     3301376, lib            0, data  7633768448, dirty           0 gc enabled True, sumCount 877, lenGarbage 0
Now starting gc pass...

I interpret the output to show process virtual memory is growing by about 1 000 B / iteration (that is, per XML tag parsed). I think the garbage collection stats do not show a monotonic increase in allocated objects, so I don't know where the memory growth is coming from. Garbage collection is indeed enabled.

È stato utile?

Soluzione 2

Close reading of the doc for iterparse() convinces me that the above is intended behavior. The doc says that it returns a full-fledged Element with no restrictions on child access, so it must be retaining the (incrementally growing) document tree in memory.

Since my problem did not need parent or child element access, just an event for each tag encountered, I was able to solve my problem nicely with xml.etree.ElementTree.XMLParser().

Altri suggerimenti

You need to explicitly clear the elements that you no longer need by calling the method element.clear(), otherwise it will still linger in memory. This means that you probably want to listen also to the 'end' events and to call clear() once you reach the end of the encapsulating element where you know that you no longer need any of it's content.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top