Python libxml2: querying xml using xpath

https://stackoverflow.com/questions/21584519

07-10-2022

Question

I am trying to read an XML file from a command line argument. I am new to using libxml2 and XPath in general. I want to query using XPath.

XML:

<?xml version="1.0"?>
<xmi:XMI xmlns:cas="http:///text/cas.ecore" xmlns:audioform="http:something" xmlns:xmi="http://blahblah" xmlns:lib="http://blahblah" xmlns:solr="http:blahblah" xmlns:tcas="http:///blah" xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <cas:Sofa xmi:id="9" Num="1" ID="First" Type="text" String="play a song"/>
  <cas:Sofa xmi:id="63" Num="2" ID="Second" Type="text" String="Find a contact"/>
  <cas:Sofa xmi:id="72" Num="3" ID="Third" Type="text" String="Send a message"/>
  <lib:Confidence xmi:id="1" sofa="9" begin="0" end="1" key="context" value="" confidence="1.0"/>
</xmi:XMI>

Code:

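import sys
import libxml2

def main(argv):
    xmlfile = argv[0]
    doc = libxml2.parseFile(xmlfile)  # raises libxml2.parserError on bad input
    try:
        root2 = doc.children

        print(root2)  # This prints everything but <?xml version="1.0"?>
        result = root2.xpathEval("//*")

        for node in result:
            print(node)
            print(node.nodePath(), node.name, node.content)
    finally:
        doc.freeDoc()  # libxml2 documents must be freed explicitly

if __name__ == "__main__":
    main(sys.argv[1:])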

I want to go further and do some processing on this file.

1.) How do I get values like 63 from xmi:id="63" using XPath?
2.) Find String where xmi:id = "72". The result should be "Send a message".
3.) Find String where xmi:id = "72" and ID = "Third". The result should be "Send a message".

I tried node.nodePath(), node.name and node.content on this node:
```
<cas:Sofa xmi:id="9" Num="1" ID="First" Type="text" String="play a song"/>
```
The results are /xmi:XMI/cas:Sofa[1] for nodePath() and Sofa for name; content prints nothing.

How do I go about getting 1, 2 and 3?

Solution

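Map each namespace prefix to its URI and pass that mapping to the XPath call. This example uses lxml rather than the libxml2 binding: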

>>> from lxml import etree
>>> doc = etree.parse('in.html')
>>> names = {'cas':'http:///text/cas.ecore', 'xmi': 'http://blahblah'}
>>> doc.xpath('//cas:Sofa[@xmi:id="63"]', namespaces=names)
[<Element {http:///text/cas.ecore}Sofa at 0x10550a5f0>]
>>> doc.xpath('//cas:Sofa[@xmi:id="63"]/@String', namespaces=names)
['Find a contact']
>>> doc.xpath('//cas:Sofa[@xmi:id="72" and @ID="Third"]/@String', namespaces=names)
['Send a message']
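
For readers who cannot install lxml, the same two lookups can be sketched with only the standard library. This is a swapped-in technique, not the answer's method: xml.etree.ElementTree implements only a subset of XPath, so it has no `and` operator and cannot select an attribute with `/@String`. The sketch below chains two predicates and reads the attribute with `.get()` instead. The sample XML is trimmed from the question, and `find_string` and `NAMES` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Sample trimmed from the question; only the namespaces used here are declared.
XML = """<?xml version="1.0"?>
<xmi:XMI xmlns:cas="http:///text/cas.ecore" xmlns:xmi="http://blahblah" xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <cas:Sofa xmi:id="9" Num="1" ID="First" Type="text" String="play a song"/>
  <cas:Sofa xmi:id="63" Num="2" ID="Second" Type="text" String="Find a contact"/>
  <cas:Sofa xmi:id="72" Num="3" ID="Third" Type="text" String="Send a message"/>
</xmi:XMI>
"""

# Prefix-to-URI map; the prefixes only need to match the ones used in the paths.
NAMES = {"cas": "http:///text/cas.ecore", "xmi": "http://blahblah"}


def find_string(root, xmi_id, id_attr=None):
    """Return the String attribute of every cas:Sofa matching the filters."""
    path = ".//cas:Sofa[@xmi:id='%s']" % xmi_id
    if id_attr is not None:
        # ElementTree has no XPath 'and'; a second predicate narrows the match.
        path += "[@ID='%s']" % id_attr
    return [elem.get("String") for elem in root.findall(path, NAMES)]


root = ET.fromstring(XML)
print(find_string(root, "63"))           # ['Find a contact']
print(find_string(root, "72", "Third"))  # ['Send a message']
```

With lxml available, the single-expression form shown in the session above is shorter, since it supports full XPath 1.0.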

OTHER TIPS

I'm not familiar with Python, but the following XPaths should do:

1.) //*/@xmi:id

2.) //*[@xmi:id='72']/@String

3.) //*[@xmi:id='72' and @ID='Third']/@String

Attributes are selected with @, and conditions (predicates) go in square brackets ([]).

Be aware that your XML uses namespaces. Instead of selecting everything (//*), consider more specific XPaths such as /xmi:XMI/cas:Sofa, and register the prefixes with a namespace manager.
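
As a rough Python illustration of the first query and of the narrower path, here is a sketch using the standard-library xml.etree.ElementTree (an assumption on my part; it is neither libxml2 nor lxml, and it supports only a subset of XPath). A plain dict stands in for the namespace manager. Because ElementTree cannot return attribute nodes, the sketch selects the elements and then reads the attribute by its fully qualified name. `NAMES`, `XMI_ID`, `all_ids` and `sofa_ids` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Sample from the question, with the unused namespace declarations left out.
XML = """<?xml version="1.0"?>
<xmi:XMI xmlns:cas="http:///text/cas.ecore" xmlns:xmi="http://blahblah" xmlns:lib="http://blahblah" xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <cas:Sofa xmi:id="9" Num="1" ID="First" Type="text" String="play a song"/>
  <cas:Sofa xmi:id="63" Num="2" ID="Second" Type="text" String="Find a contact"/>
  <cas:Sofa xmi:id="72" Num="3" ID="Third" Type="text" String="Send a message"/>
  <lib:Confidence xmi:id="1" sofa="9" begin="0" end="1" key="context" value="" confidence="1.0"/>
</xmi:XMI>
"""

# The dict acts as the namespace manager: prefix -> namespace URI.
NAMES = {"cas": "http:///text/cas.ecore", "xmi": "http://blahblah"}

# ElementTree stores namespaced attributes under "{uri}localname".
XMI_ID = "{%s}id" % NAMES["xmi"]

root = ET.fromstring(XML)

# Broad query: every descendant element that carries an xmi:id attribute.
all_ids = [elem.get(XMI_ID) for elem in root.findall(".//*[@xmi:id]", NAMES)]

# Narrower query: only the cas:Sofa children of the root element.
sofa_ids = [elem.get(XMI_ID) for elem in root.findall("cas:Sofa", NAMES)]

print(all_ids)   # ['0', '9', '63', '72', '1']
print(sofa_ids)  # ['9', '63', '72']
```

The narrower path skips cas:NULL and lib:Confidence, which is usually what you want when only the Sofa entries matter.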

Licensed under: CC-BY-SA with attribution
