Question

I have an XML document to parse that is proving really tricky for me.

<bundles>
  <bundle>
    <bitstreams>
      <bitstream>
        <id>1234</id>
      </bitstream>
    </bitstreams>
    <name>FOO</name>
  </bundle>
  <bundle> ... </bundle>
</bundles>

I would like to iterate through this XML and locate all the id values inside of the bitstreams for a bundle where the name element's value is 'FOO'. I'm not interested in any bundles not named 'FOO', and there may be any number of bundles and any number of bitstreams in the bundles.

I have been using tree.findall('./bundle/name') to find the FOO bundle but this just returns a list that I can't step through for the id values:

for node in tree.findall('./bundle/name'):
    if node.text == 'FOO':
        id_values = tree.findall('./bundle/bitstreams/bitstream/id')
        for value in id_values:
            print(value.text)

This prints out all the id values, not those of the bundle 'FOO'.

How can I iterate through this tree, locate the bundle with the name FOO, take this bundle node and collect the id values nested in it? Is the XPath argument incorrect here?

I'm working in Python, with lxml bindings - but any XML parser I believe would be alright; these aren't large XML trees.


Solution

You can use XPath for this. The following Python code works:

import libxml2
data = """
<bundles>
  <bundle>
    <bitstreams>
      <bitstream>
        <id>1234</id>
      </bitstream>
    </bitstreams>
    <name>FOO</name>
  </bundle>
</bundles>
"""
doc = libxml2.parseDoc(data)
for node in doc.xpathEval('/bundles/bundle/name[.="FOO"]/../bitstreams/bitstream/id'):
    print(node)

Or, using lxml (data is the same string as in the example above):

from lxml import etree

bundles = etree.fromstring(data)

for node in bundles.xpath('bundle/name[.="FOO"]/../bitstreams/bitstream/id'):
    print(node.text)

This outputs:

1234

If the <bitstreams> element always precedes the <name> element, you can also use this XPath expression, which avoids stepping up to the parent and may be more efficient:

'bundle/name[.="FOO"]/preceding-sibling::bitstreams/bitstream/id'

Other tips

One of your questions was "Is the XPath argument incorrect here?". Well, findall() doesn't accept full XPath expressions. It supports only a simplified subset called ElementPath. Also, your second call to findall() starts from tree again and is not related in any way to the result of the first one, so it returns the ids of all bundles.
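
Even the simplified ElementPath syntax can express this filter, because it supports a child-text predicate of the form [tag='text']. Here is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml; the sample data (a second bundle named BAR and the extra id values) is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two bundles, only one of them named FOO.
DATA = """<bundles>
  <bundle>
    <bitstreams>
      <bitstream><id>1234</id></bitstream>
      <bitstream><id>5678</id></bitstream>
    </bitstreams>
    <name>FOO</name>
  </bundle>
  <bundle>
    <bitstreams>
      <bitstream><id>9999</id></bitstream>
    </bitstreams>
    <name>BAR</name>
  </bundle>
</bundles>"""

root = ET.fromstring(DATA)

# [name='FOO'] keeps only <bundle> elements that have a <name> child
# whose text is exactly 'FOO'; the rest of the path walks down to <id>.
foo_ids = [
    elem.text
    for elem in root.findall("./bundle[name='FOO']/bitstreams/bitstream/id")
]

print(foo_ids)  # ['1234', '5678']
```

Because the predicate sits on the bundle step, there is no need to select the name element first and then step back up to its parent.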

A slight modification to your code should also work (it's basically the same as the XPath expression):

for node in tree.findall('./bundle/name'):
    if node.text != 'FOO':
        continue
    id_values = node.getparent().findall('./bitstreams/bitstream/id')
    for value in id_values:
        print(value.text)
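
getparent() is specific to lxml. If you want the same loop to run on the standard library's xml.etree.ElementTree as well, one option is to iterate over the bundle elements themselves and search relative to each one. The following is only a sketch: the helper name ids_for_bundle and the sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two bundles, only one of them named FOO.
SAMPLE = """<bundles>
  <bundle>
    <bitstreams>
      <bitstream><id>1234</id></bitstream>
    </bitstreams>
    <name>FOO</name>
  </bundle>
  <bundle>
    <bitstreams>
      <bitstream><id>9999</id></bitstream>
    </bitstreams>
    <name>BAR</name>
  </bundle>
</bundles>"""


def ids_for_bundle(bundles_root, bundle_name):
    """Collect the id text of every bitstream in bundles with the given name.

    ids_for_bundle is a hypothetical helper, not part of any library.
    """
    ids = []
    for bundle in bundles_root.findall('bundle'):
        # findtext() returns the text of the first matching child, or None.
        if bundle.findtext('name') != bundle_name:
            continue
        # The search starts at this bundle, so ids from other bundles
        # can never be picked up.
        for id_elem in bundle.findall('bitstreams/bitstream/id'):
            ids.append(id_elem.text)
    return ids


bundles_root = ET.fromstring(SAMPLE)
print(ids_for_bundle(bundles_root, 'FOO'))  # ['1234']
```

The key difference from the code in the question is that the inner findall() is called on the matching bundle, not on the whole tree.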

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow