Is there a better way to give elements knowlege of their parents and xpath in xml.etree.ElementTree

https://stackoverflow.com/questions/20133316

03-08-2022
|

Question

I have the following code which works:

import xml.etree.ElementTree as etree


def get_path(self):
    parent =  ''
    path = self.tag
    sibs = self.parent.findall(self.tag)
    if len(sibs) > 1:
        path = path + '[%s]'%(sibs.index(self)+1)
    current_node = self
    while True:
        parent = current_node.parent
        if not parent:
            break
        ptag = parent.tag
        path = ptag + '/' + path
        current_node = parent
    return path

etree._Element.get_path = get_path
etree._Element.parent = None

class XmlDoc(object):
    def __init__(self):
        self.root = etree.Element('root')
        self.doc = etree.ElementTree(self.root)

    def SubElement(self, parent, tag):
        new_node = etree.SubElement(parent, tag)
        new_node.parent = parent
        return new_node

doc = XmlDoc()
a1 = doc.SubElement(doc.root, 'a')
a2 = doc.SubElement(doc.root, 'a')
b = doc.SubElement(a2, 'b')
print etree.tostring(doc.root), '\n'
print 'element:'.ljust(15), a1
print 'path:'.ljust(15), a1.get_path()
print 'parent:'.ljust(15), a1.parent, '\n'
print 'element:'.ljust(15), a2
print 'path:'.ljust(15), a2.get_path()
print 'parent:'.ljust(15), a2.parent, '\n'
print 'element:'.ljust(15), b
print 'path:'.ljust(15), b.get_path()
print 'parent:'.ljust(15), b.parent

Which results in this output:

<root><a /><a><b /></a></root> 

element:        <Element a at 87e3d6c>
path:           root/a[1]
parent:         <Element root at 87e3cec> 

element:        <Element a at 87e3fac>
path:           root/a[2]
parent:         <Element root at 87e3cec> 

element:        <Element b at 87e758c>
path:           root/a/b
parent:         <Element a at 87e3fac>

Now this is drastically changed from the original code, but I'm not allowed to share that.

The functions aren't too inefficient but there is a dramatic performance decrease when switching from cElementTree to ElementTree which I expected, but from my experiments it seems like monkey patching cElementTree is impossible so I had to switch.

What I need to know is whether there is either a way to add a method to cElementTree or if there is a more efficient way of doing this so I can gain some of my performance back.

Just to let you know I am thinking of as a last resort implementing selected static typing and to compile with cython, but for certain reasons I really don't want to do that.

Thanks for taking a look.

EDIT: Sorry for the wrong use of the term late binding. Sometimes my vocabulary leaves something to be desired. What I meant was "monkey patching."

EDIT: @Corley Brigman, Guy: Thank you very much for your answers which do address the question, however (and I should have stated this in the original post) I had completed this project before using lxml which is a wonderful library that made coding a breeze but due to new requirements (This needs to be implemented as an addon to a product called Splunk) which ties me to the python 2.7 interpreter shipped with Splunk and eliminates the possibility of adding third party libraries with the exception of django.

Solution

If you need parents, use lxml instead - it tracks parents internally, and is still C behind the scenes so it's very fast.

However... be aware that there is a tradeoff in tracking parents, in that a given node can only have a single parent. This isn't usually a problem, however, if you do something like the following, you will get different results in cElementTree vs. lxml:

p = Element('x')
q = Element('y')
r = SubElement(p, 'z')
q.append(r)

cElementTree:

dump(p)
<x><z /></x>
dump(q)
<y><z /></y>

lxml:

dump(p)
<x/>
dump(q)
<y>
  <z/>
</y>

Since parents are tracked, a node can only have one parent, obviously. As you can see, the element r is copied to both trees in cElementTree, and reparented/moved in lxml.

There are probably only a small number of use cases where this matters, but something to keep in mind.

OTHER TIPS

you can just use xpath, for example:

import lxml.html

def get_path():
    for e in doc.xpath("//b//*"):
        print e

should work, didn't test it though...

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow