HTML inside node using ElementTree

https://stackoverflow.com/questions/1088476

23-08-2019

Question

I am using ElementTree to parse an XML file. Some fields contain HTML data. For example, consider a declaration as follows:

<Course>
    <Description>Line 1<br />Line 2</Description>
</Course>

Now, suppose _course is an Element variable which holds this Course element. I want to access this course's description, so I do:

desc = _course.find("Description").text;

But then desc only contains "Line 1". I read something about the .tail attribute, so I tried also:

desc = _course.find("Description").tail;

And I get the same output. What should I do to make desc be "Line 1<br />Line 2" (or literally anything between <Description> and </Description>)? In other words, I'm looking for something similar to the .innerText property in C# (and many other languages I guess).

Solution

Do you have any control over the creation of the XML file? If the content of an XML element is itself markup (XML tags or similar) or contains markup characters ('<', etc.), it should be encoded to avoid this problem. You can do this with any of:

a CDATA section
Base64 or some other encoding (which doesn't include xml reserved characters)
Entity encoding ('<' == '&lt;')

If you can't make these changes, and ElementTree can't ignore tags not included in the xml schema, then you will have to pre-process the file. Of course, you're out of luck if the schema overlaps html.
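
As a minimal sketch of the entity-encoding option, assuming you control how the file is written: the standard library's xml.sax.saxutils.escape turns '<', '>' and '&' into entities, and ElementTree decodes them again on parsing, so .text returns the whole HTML fragment. The names html_fragment and document are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical HTML fragment that should survive as plain text.
html_fragment = "Line 1<br />Line 2"

# Writer side: escape markup characters before embedding the fragment.
document = "<Course><Description>{}</Description></Course>".format(
    escape(html_fragment)
)

# Reader side: the parser decodes the entities, so .text holds the fragment.
root = ET.fromstring(document)
desc = root.find("Description").text
print(desc)
```

With this encoding, Description has no child elements at all, so nothing ends up in a tail attribute.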

OTHER TIPS

You are trying to read the tail attribute from the wrong element. The text "Line 2" belongs to the <br /> element, which is a child of Description, so try

desc = _course.find("Description/br").tail

The tail attribute is used to store trailing text nodes when reading mixed-content XML files; text that follows directly after an element is stored in the tail attribute for that element:

    <tag><elem>this goes into elem's
    text attribute</elem>this goes into
    elem's tail attribute</tag>
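
To make this concrete for the question's document, here is a small sketch showing where each piece of text lands. It also uses Element.itertext() (available in Python 2.7 and 3.2+) to join all text inside Description, which is one way to approximate an innerText-style property. The variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)
description = root.find("Description")
br = description.find("br")

print(repr(description.text))  # text before the first child element
print(repr(br.text))           # <br /> is empty, so it has no text
print(repr(br.tail))           # text that follows <br />

# itertext() walks the element and its descendants, yielding text and tails
# in document order; joining them gives all the text inside Description.
inner_text = "".join(description.itertext())
print(inner_text)
```

Note that the <br /> element contributes no characters of its own, so the joined string contains no line break between the two parts.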

A simple code snippet that prints the text and tail attributes of every element in an XML/XHTML document:

import xml.etree.ElementTree as ET

def processElem(elem):
    # Print the element's own text, skipping indentation-only whitespace.
    if elem.text and elem.text.strip():
        print(elem.text)
    for child in elem:
        # Recurse into the child, then print the text that follows it.
        processElem(child)
        if child.tail and child.tail.strip():
            print(child.tail)

xml = '''<Course>
    <Description>Line 1<br />Line 2 <span>child text </span>child tail</Description>
    </Course>'''

root = ET.fromstring(xml)
processElem(root)

Output:

Line 1
Line 2 
child text 
child tail

See http://code.activestate.com/recipes/498286-elementtree-text-helper/ for a better solution. It can be modified to suit.

P.S. I changed my name from user839338 as quoted in the next post

Characters like "<" and "&" are illegal in XML elements.

"<" will generate an error because the parser interprets it as the start of a new element.

"&" will generate an error because the parser interprets it as the start of a character entity.

Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors, script code can be defined as CDATA.

Everything inside a CDATA section is treated as plain character data by the parser rather than being parsed as markup.

A CDATA section starts with "<![CDATA[" and ends with "]]>".

More information on: http://www.w3schools.com/xmL/xml_cdata.asp
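
As a hedged illustration of the CDATA approach with ElementTree, assuming the file's author wraps the HTML in a CDATA section: the parser hands the section's content back as ordinary text, so .text contains the complete fragment. The sample document below is invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical version of the question's document with the HTML in CDATA.
cdata_document = """<Course>
    <Description><![CDATA[Line 1<br />Line 2]]></Description>
</Course>"""

cdata_root = ET.fromstring(cdata_document)
cdata_desc = cdata_root.find("Description").text

# The <br /> inside CDATA is not parsed as an element; it is part of the text.
print(cdata_desc)
```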

Hope this helps!

Inspired by user839338's answer, I went and looked for a reasonable solution, which looks a bit like this.

>>> from xml.etree import ElementTree as etree
>>> corpus = '''<Course>
...     <Description>Line 1<br />Line 2</Description>
... </Course>'''
>>> 
>>> doc = etree.fromstring(corpus)
>>> desc = doc.find("Description")
>>> desc.tag = 'html'
>>> etree.tostring(desc, encoding="unicode")
'<html>Line 1<br />Line 2</html>\n'
>>>

There's no simple built-in way to eliminate the surrounding tag (originally <Description>), but it is easily renamed to something that can be used as needed, for instance <div> or <span>.
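
If the wrapper tag really is unwanted, one possible workaround is to serialize the pieces by hand: take the element's leading text and append each child serialized with tostring(), which includes that child's tail. This is only a sketch; inner_xml is a hypothetical helper, not part of ElementTree, and it assumes Python 3, where encoding="unicode" makes tostring() return str.

```python
import xml.etree.ElementTree as ET

def inner_xml(elem):
    """Return elem's content as markup, without elem's own start and end tags."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child element followed by its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

course = ET.fromstring(
    "<Course><Description>Line 1<br />Line 2</Description></Course>"
)
inner = inner_xml(course.find("Description"))
print(inner)
```

The exact spelling of the empty tag in the result depends on the serializer, so compare the output loosely if you test against it.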

Licensed under: CC-BY-SA with attribution
