Question

I am using ElementTree to parse an XML file. Some fields contain HTML data. For example, consider a declaration as follows:

<Course>
    <Description>Line 1<br />Line 2</Description>
</Course>

Now, suppose _course is an Element variable that holds this Course element. I want to access this course's description, so I do:

desc = _course.find("Description").text;

But then desc only contains "Line 1". I read something about the .tail attribute, so I tried also:

desc = _course.find("Description").tail;

And I get the same output. What should I do to make desc be "Line 1<br />Line 2" (or literally anything between <Description> and </Description>)? In other words, I'm looking for something similar to the .innerText property in C# (and many other languages I guess).


Solution

Do you have any control over the creation of the XML file? When an element's content contains tags (or other markup characters such as '<'), that content should be encoded to avoid this problem. You can do this with any of:

  • a CDATA section
  • Base64 or some other encoding (which doesn't include xml reserved characters)
  • Entity encoding ('<' == '&lt;')

If you can't make these changes, and ElementTree can't ignore tags not included in the xml schema, then you will have to pre-process the file. Of course, you're out of luck if the schema overlaps html.
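As a rough sketch of the entity-encoding option, assuming you control the code that writes the file: the standard library's xml.sax.saxutils.escape turns the markup into plain character data, and ElementTree decodes it again on the way in. The html_fragment and document names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html_fragment = "Line 1<br />Line 2"

# Producer side: escape '<', '>' and '&' so the fragment is stored as text,
# not as child elements of <Description>.
document = "<Course><Description>%s</Description></Course>" % escape(html_fragment)

# Consumer side: the parser decodes the entities, so .text holds the whole fragment.
course = ET.fromstring(document)
desc = course.find("Description").text
print(desc)
```

With the content encoded this way, Description has no child elements, so .text alone is enough.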

OTHER TIPS

You are trying to read the tail attribute from the wrong element. Try

desc = _course.find("Description/br").tail;

The tail attribute is used to store trailing text nodes when reading mixed-content XML files; text that follows directly after an element is stored in the tail attribute for that element:

    <tag><elem>this goes into elem's
    text attribute</elem>this goes into
    elem's tail attribute</tag>
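Applied to the markup from the question, this could look like the following minimal sketch (the variable names are only illustrative):

```python
import xml.etree.ElementTree as ET

course = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)
description = course.find("Description")
br = description.find("br")

first_line = description.text  # text before the first child element
second_line = br.tail          # text after <br />, stored on the <br /> element

print(first_line)
print(second_line)
# Nothing follows </Description> here, so its own tail is empty.
print(description.tail)
```

This is why reading .tail on Description itself does not return "Line 2": the tail belongs to the br element, not to its parent.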

Simple code snippet to print text and tail attributes from all elements in xml/xhtml.

import xml.etree.ElementTree as ET

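def processElem(elem):
    # Text that appears before the element's first child
    if elem.text is not None:
        print(elem.text)
    for child in elem:
        # Recurse into the child, then print the text that follows it
        processElem(child)
        if child.tail is not None:
            print(child.tail)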

xml = '''<Course>
    <Description>Line 1<br />Line 2 <span>child text </span>child tail</Description>
    </Course>'''

root = ET.fromstring(xml)
processElem(root)

Output (whitespace-only lines omitted):

Line 1
Line 2 
child text 
child tail

See http://code.activestate.com/recipes/498286-elementtree-text-helper/ for a better solution. It can be modified to suit.
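For something closer to .innerText, one option (assuming Python 3, or Python 2.7 where itertext() is also available) is Element.itertext(), which walks text and tail values in document order. The inner_text helper below is a hypothetical variation, not the linked recipe, that additionally renders each br as a newline.

```python
import xml.etree.ElementTree as ET

def inner_text(elem, br_as="\n"):
    """Hypothetical helper: join all text inside elem, rendering <br /> as br_as."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "br":
            parts.append(br_as)
        else:
            parts.append(inner_text(child, br_as))
        # The tail is the text that follows the child inside elem
        parts.append(child.tail or "")
    return "".join(parts)

course = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2 "
    "<span>child text </span>child tail</Description></Course>"
)
description = course.find("Description")

flat = "".join(description.itertext())  # plain concatenation, no line break
with_breaks = inner_text(description)   # <br /> becomes "\n"
print(flat)
print(with_breaks)
```

Note that itertext() on its own drops the br entirely, so "Line 1" and "Line 2" run together; the helper is only needed if the line break matters.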

P.S. I changed my name from user839338 as quoted in the next post

Characters like "<" and "&" are illegal as literal text inside XML elements.

"<" will generate an error because the parser interprets it as the start of a new element.

"&" will generate an error because the parser interprets it as the start of a character entity.

Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors, script code can be defined as CDATA.

Everything inside a CDATA section is treated as plain character data; the parser does not interpret it as markup.

A CDATA section starts with "<![CDATA[" and ends with "]]>".

More information on: http://www.w3schools.com/xmL/xml_cdata.asp
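A small sketch of how this plays out with ElementTree, assuming the producer of the file wraps the description in a CDATA section:

```python
import xml.etree.ElementTree as ET

document = """<Course>
    <Description><![CDATA[Line 1<br />Line 2]]></Description>
</Course>"""

course = ET.fromstring(document)
# The CDATA content arrives as ordinary text, markup characters included.
desc = course.find("Description").text
print(desc)
```

The br inside the CDATA section never becomes an element, so there is no tail to chase.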

Hope this helps!

Inspired by user839338's answer, I went and looked for a reasonable solution, which looks a bit like this.

>>> from xml.etree import ElementTree as etree
>>> corpus = '''<Course>
...     <Description>Line 1<br />Line 2</Description>
... </Course>'''
>>> 
>>> doc = etree.fromstring(corpus)
>>> desc = doc.find("Description")
>>> desc.tag = 'html'
>>> etree.tostring(desc)
'<html>Line 1<br/>Line 2</html>\n'
>>> 

There's no simple way to eliminate the surrounding tag (originally <Description>), but it's easily modified into something that could be used as needed, for instance <div> or <span>.
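If the wrapper really has to go, one possible workaround is to serialize the pieces yourself. This is only a sketch, assuming Python 3 (where tostring accepts encoding="unicode"); inner_xml is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def inner_xml(elem):
    """Hypothetical helper: serialize elem's content without elem's own tag."""
    # Leading text is stored unescaped, so re-escape it for output
    parts = [escape(elem.text or "")]
    for child in elem:
        # tostring() emits the child's markup followed by its tail text
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

doc = ET.fromstring("""<Course>
    <Description>Line 1<br />Line 2</Description>
</Course>""")
inner = inner_xml(doc.find("Description"))
print(inner)
```

The result is the markup between the opening and closing Description tags, ready to be dropped into whatever container is needed.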

Licensed under: CC-BY-SA with attribution