How can I know the parent of an element when using the iterparse method of cElementTree?

StackOverflow https://stackoverflow.com/questions/9334881

Question

I want to loop through the elements of an XML file and yield every element, unless its parent is a feature.

So in pseudocode

    for event, element in cElementTree.iterparse('../test.xml'):
        if parentOf_element != 'feature':
            yield element

How can I get the parent of the element? I know it's possible with the tree.getiterator() function, but I don't want to build the full tree because the xml files are a few gigs big.

Solution

If you enable start events, you can track ancestor nodes by using a stack. If you really mean to suppress all descendants of a <feature>, instead of just children, you can use a simple flag as demonstrated in another answer.

You can use root.clear() to blow away all finished-with elements. Read this.

Code:

import xml.etree.cElementTree as et
# Produces identical answers with import lxml.etree as et
import cStringIO

def normtext(t):
    return repr("" if t is None else t.strip())

def dump(el):
    print el.tag, normtext(el.text), normtext(el.tail), el.attrib

def my_filtered_elements(source, skip_parent_tag="feature"):
    # get an iterable
    context = et.iterparse(source, events=("start", "end"))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = context.next()
    tag_stack = [None, root.tag]
    for event, elem in context:
        # print event, elem.tag, tag_stack
        if event == "start":
            tag_stack.append(elem.tag)
        else:
            assert event == "end"
            my_tag = tag_stack.pop()
            assert my_tag == elem.tag
            parent_tag = tag_stack[-1]
            if parent_tag is not None and parent_tag != skip_parent_tag:
                dump(elem)
                # yield elem
            root.clear()

def other_filtered_elements(source, skip_parent_tag="feature"):            
    in_feature_tag = False
    for event, element in et.iterparse(source, events=('start', 'end')):
        if element.tag == skip_parent_tag:
            in_feature_tag = event == 'start'
        if event == 'end' and not in_feature_tag:
            dump(element)            

test_input = """
<top>
    <lev1 guff="1111">
        <lev2>aaaaa</lev2>
        <lev2>bbbbb</lev2>
    </lev1>
    <feature>
        feat text 1
        <fchild>fcfcfcfc
            <fgchild>ggggg</fgchild>    
        </fchild>
        feat text 2
    </feature>
    <lev1 guff="2222">
        <lev2>ccccc</lev2>c-tail
        <lev2>ddddd</lev2>d-tail
        <notext1></notext1>e-tail
        <notext2 />f-tail
     </lev1>g-tail
</top>
"""

print "=== me ==="
my_filtered_elements(cStringIO.StringIO(test_input))
print "=== other ==="
other_filtered_elements(cStringIO.StringIO(test_input))

Output is below. You'll notice from the lev1 nodes that root.clear() doesn't blow away elements that haven't been fully parsed yet. This means that the amount of memory used is O(depth of tree), not O(total number of elements in the tree).

=== me ===
lev2 'aaaaa' '' {}
lev2 'bbbbb' '' {}
lev1 '' '' {'guff': '1111'}
fgchild 'ggggg' '' {}          <<<=== do you want this?
feature 'feat text 1' '' {}
lev2 'ccccc' 'c-tail' {}
lev2 'ddddd' 'd-tail' {}
notext1 '' 'e-tail' {}
notext2 '' 'f-tail' {}
lev1 '' 'g-tail' {'guff': '2222'}
=== other ===
lev2 'aaaaa' '' {}
lev2 'bbbbb' '' {}
lev1 '' '' {'guff': '1111'}
feature 'feat text 1' '' {}
lev2 'ccccc' 'c-tail' {}
lev2 'ddddd' 'd-tail' {}
notext1 '' 'e-tail' {}
notext2 '' 'f-tail' {}
lev1 '' 'g-tail' {'guff': '2222'}
top '' '' {}                           <<<=== do you want this?
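
The code above is Python 2 (print statements, cStringIO, context.next()). As a minimal sketch, assuming Python 3 and the standard xml.etree.ElementTree module, the same stack idea can be written as a generator that hands back each element together with its parent's tag. The name iter_with_parent and the small sample document are illustrative only and are not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET


def iter_with_parent(source, skip_parent_tag="feature"):
    """Yield (parent_tag, element) for every finished element whose
    parent is not skip_parent_tag. The root (no parent) is not yielded."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)        # the first event is the root's "start"
    tag_stack = [None, root.tag]   # None stands for "no parent" of the root
    for event, elem in context:
        if event == "start":
            tag_stack.append(elem.tag)
            continue
        tag_stack.pop()            # remove this element's own tag
        parent_tag = tag_stack[-1]
        if parent_tag is not None and parent_tag != skip_parent_tag:
            yield parent_tag, elem
        root.clear()               # drop subtrees that are finished with


# Hypothetical sample input, shaped like the test document above.
sample = (b"<top><lev1><lev2>a</lev2></lev1>"
          b"<feature><fchild><fgchild>g</fgchild></fchild></feature></top>")
pairs = [(parent, elem.tag) for parent, elem in iter_with_parent(io.BytesIO(sample))]
print(pairs)
```

Because root.clear() runs only after the caller asks for the next item, each yielded element should be fully processed before the loop advances.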

Other tips

You can do this with lxml. It has getparent().

Alternatively, it's possible to handle start and end events and skip feature children with cElementTree:

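# cElementTree was removed in Python 3.9; use xml.etree.ElementTree there
from xml.etree import cElementTree as etree

def elements_outside_feature(source='test.xml'):
    # True while the parser is between <feature> and </feature>
    in_feature_tag = False
    for event, element in etree.iterparse(source, events=('start', 'end')):
        if element.tag == 'feature':
            in_feature_tag = event == 'start'
        if event == 'end' and not in_feature_tag:
            yield element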
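
One caveat with the boolean flag: it is reset by the first closing feature tag, so if feature elements can be nested inside each other (an assumption; the question does not say whether they can), elements after the inner closing tag would be yielded again. A minimal Python 3 sketch that counts nesting depth instead is shown here; the name iter_outside and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_outside(source, skip_tag="feature"):
    """Yield finished elements that are not inside any skip_tag element.
    Like the flag version, the outermost skip_tag element itself and the
    root are still yielded."""
    depth = 0  # number of currently open skip_tag elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == skip_tag:
                depth += 1
            continue
        if elem.tag == skip_tag:
            depth -= 1
        if depth == 0:
            yield elem


# Hypothetical input with a feature nested inside another feature.
sample = b"<top><a/><feature><feature><b/></feature><c/></feature><d/></top>"
tags = [elem.tag for elem in iter_outside(io.BytesIO(sample))]
print(tags)
```

With this input, c is suppressed even though it comes after the inner closing feature tag, which is the case the boolean flag would miss.
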
Licensed under: CC-BY-SA with attribution