Parsing XML with namespace

https://stackoverflow.com/questions/20728856

20-09-2022

Question

With this XML

<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
    <subject>Reference rates</subject>
    <Sender>
        <name>European Central Bank</name>
    </Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</Envelope>

I can get the inner Cube tags like this

from xml.etree.ElementTree import ElementTree

t = ElementTree()
t.parse('eurofxref-daily.xml')
day = t.find('Cube/Cube')
print 'Day:', day.attrib['time']
for currency in day:
    print currency.items()

Day: 2013-12-20
[('currency', 'USD'), ('rate', '1.3655')]
[('currency', 'JPY'), ('rate', '142.66')]

The problem is that the XML above is a cleaned-up version of the original file, which defines namespaces

<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <gesmes:Sender>
        <gesmes:name>European Central Bank</gesmes:name>
    </gesmes:Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>

When I try to get the first Cube tag, I get None

t = ElementTree()
t.parse('eurofxref-daily.xml')
print t.find('Cube')

None

The root tag includes the namespace

root = t.getroot()
print 'root.tag:', root.tag

root.tag: {http://www.gesmes.org/xml/2002-08-01}Envelope

So do its children

for e in root:  # iterate directly; getchildren() is deprecated and was removed in Python 3.9
    print 'e.tag:', e.tag

e.tag: {http://www.gesmes.org/xml/2002-08-01}subject
e.tag: {http://www.gesmes.org/xml/2002-08-01}Sender
e.tag: {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube

I can get the Cube tags if I include the namespace in the tag

day = t.find('{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube/{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube')
print 'Day: ', day.attrib['time']

Day:  2013-12-20

But that is really ugly. Apart from cleaning the file before processing or doing string manipulation, is there an elegant way to handle it?

Solution

There's a more elegant way than including the whole namespace URI in the text of the query. For a Python version that does not support the namespaces argument on ElementTree.find, lxml provides the missing functionality and is "mostly compatible" with xml.etree:

from lxml.etree import ElementTree

t = ElementTree()
t.parse('eurofxref-daily.xml')
namespaces = { "exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref" }
day = t.find('exr:Cube', namespaces)
print day

You define the namespaces dictionary once, and from then on you only use its prefixes in your queries.

Here is the output:

$ python test.py
<Element '{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube' at 0x7fe0f95e3290>

If you find prefixes inelegant, then you have to work on a file without namespaces. There may also be other tools out there that "cheat" and match on local-name() even when namespaces are in effect, but I don't use them.
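As an illustration of matching on the local name only: since Python 3.8, the standard library's xml.etree.ElementTree accepts a {*} wildcard in place of the namespace URI, so {*}Cube matches a Cube element in any namespace. The following is a minimal sketch, assuming Python 3.8 or later; the inline XML string is a trimmed copy of the file from the question, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the namespaced document from the question (assumption:
# inlined here so the example does not need eurofxref-daily.xml on disk).
WILDCARD_XML = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>"""

wildcard_root = ET.fromstring(WILDCARD_XML)

# "{*}Cube" ignores the namespace and matches on the local name only.
# This wildcard syntax requires Python 3.8 or later.
wildcard_day = wildcard_root.find("{*}Cube/{*}Cube")
wildcard_currencies = [c.get("currency") for c in wildcard_day.findall("{*}Cube")]

print("Day:", wildcard_day.get("time"))
print(wildcard_currencies)
```

The trade-off is that the wildcard also matches a Cube element from an unrelated namespace, so the prefix approach is the safer one when a document mixes vocabularies.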

In Python 2.7, or Python 3.3 or higher, you can use the same code as above with xml.etree instead of lxml, because support for the namespaces argument was added in those versions.
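A minimal sketch of the standard-library variant, written for Python 3 and reproducing the day and currency lookup from the question. It assumes the XML is supplied as an inline string (a trimmed copy of the question's file) rather than read from eurofxref-daily.xml; the prefix exr and the variable names are arbitrary choices, not anything the file itself defines.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the namespaced document from the question (assumption:
# inlined so the example runs without the original file).
XML = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>"""

# Map a prefix of our own choosing to the document's default namespace URI.
NAMESPACES = {"exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref"}

root = ET.fromstring(XML)

# The outer Cube is a child of the root; the dated Cube is its child.
day = root.find("exr:Cube/exr:Cube", NAMESPACES)

# Collect each currency Cube's attributes into a currency -> rate mapping.
rates = {c.get("currency"): c.get("rate") for c in day.findall("exr:Cube", NAMESPACES)}

print("Day:", day.get("time"))
print(rates)
```

The prefix in the dictionary does not have to match any prefix used in the file. Only the URI is compared, which is why exr works here even though the document declares that URI as its default namespace.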

Licensed under: CC-BY-SA with attribution
