Question

I have the following input XML file. I read the rel_notes tag and print its text, but I run into the following error.

Input XML:

<rel_notes>
    •   Please move to this build for all further test and development activities 
    •   Please use this as base build to verify compilation and sanity before any check-in happens

</rel_notes>

Sample python code:

file = open('data.xml', 'r')
from xml.etree import cElementTree as etree
tree = etree.parse(file)
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))

OUTPUT

   print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
 File "C:\python2.7.3\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2022' in position 9: character maps to <undefined>

Solution

The issue is with printing Unicode to the Windows console: the character '•' (U+2022) can't be represented in cp437, the code page your console uses.

To reproduce the problem, try:

print u'\u2022'
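The same failure can be checked without a console by encoding explicitly. This is only a minimal sketch; can_encode is a hypothetical helper name, not something from the standard library, and the print(...) form is used so it runs on both Python 2.7 and Python 3:

```python
bullet = u'\u2022'  # the character named in the traceback

def can_encode(text, encoding):
    """Return True if every character of text exists in the given codec.

    Hypothetical helper, for illustration only.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode(bullet, 'cp437'))  # cp437 has no mapping for U+2022
print(can_encode(bullet, 'utf-8'))  # UTF-8 can encode any Unicode character
```

If the first check fails for your console's code page, print will fail for that character too.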

You could set the PYTHONIOENCODING environment variable to tell Python to replace every unrepresentable character with the corresponding XML character reference:

T:\> set PYTHONIOENCODING=cp437:xmlcharrefreplace
T:\> python your_script.py

Or encode the text to bytes before printing:

print u'\u2022'.encode('cp437', 'xmlcharrefreplace')
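If you don't want to hard-code cp437, the encoding step could be wrapped in a small function that asks the console for its encoding. A rough sketch, assuming a hypothetical helper named console_safe and falling back to ASCII when sys.stdout.encoding is not set (for example when output is piped):

```python
import sys

def console_safe(text, encoding=None):
    """Encode text for the console, replacing characters the codec lacks
    with XML character references such as &#8226;.

    Hypothetical helper, for illustration only.
    """
    # sys.stdout.encoding may be None (e.g. piped output); fall back to ASCII.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'xmlcharrefreplace')

# cp437 is passed explicitly here to mirror the console from the question.
encoded = console_safe(u'\u2022 Please move to this build', 'cp437')

# Decode back to text so print() behaves the same on Python 2.7 and 3.
print(encoded.decode('cp437'))  # prints: &#8226; Please move to this build
```

The bullet comes out as &#8226; instead of raising UnicodeEncodeError; the rest of the line is unchanged.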

Answer to your initial question

To print the text of each <build_location/> element:

import sys
from xml.etree import cElementTree as etree

input_file = sys.stdin # filename or file object
tree = etree.parse(input_file)
print('\n'.join(elem.text for elem in tree.iter('build_location')))

If the input file is large, iterparse() could be used:

import sys
from xml.etree import cElementTree as etree

input_file = sys.stdin
context = iter(etree.iterparse(input_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
    if event == 'end' and elem.tag == 'build_location':
        print(elem.text)
        root.clear() # free memory
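The iterparse() loop can be combined with the encoding fix for the rel_notes element from the question. This is a sketch under a few assumptions: the XML is held in memory instead of data.xml, a <root> wrapper element is made up for the example, iter_text is a hypothetical helper, and plain ElementTree is imported because cElementTree no longer exists in recent Python 3 releases:

```python
import io
from xml.etree import ElementTree as etree

# In-memory stand-in for data.xml (UTF-8 bytes); the <root> wrapper is assumed.
xml_bytes = (u'<root><rel_notes>\u2022 Please move to this build'
             u'</rel_notes></root>').encode('utf-8')

def iter_text(source, tag):
    """Yield the text of each element named tag, freeing memory as we go.

    Hypothetical helper, for illustration only.
    """
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem.text
            root.clear()  # drop already-processed children

notes = list(iter_text(io.BytesIO(xml_bytes), 'rel_notes'))
for note in notes:
    # Replace what cp437 cannot represent, then decode back to text for print().
    print(note.encode('cp437', 'xmlcharrefreplace').decode('cp437'))
```

The parsed text still contains the real bullet character; only the printed form is degraded to &#8226;.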

OTHER TIPS

I don't think the entire snippet above is completely helpful. But a UnicodeEncodeError usually happens when non-ASCII characters aren't handled properly.

Example:

unicode_str = html.decode(<source encoding>)

encoded_str = unicode_str.encode("utf8")

It's already explained clearly in this answer: Python: Convert Unicode to ASCII without errors

This should at least solve the UnicodeEncodeError.
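A concrete version of the decode/encode pattern might look like the sketch below. The source encoding here is assumed to be cp1252 purely for illustration (the bullet is byte 0x95 there); use whatever encoding your input really has:

```python
# Raw bytes as they might arrive from a file or socket (cp1252 is an assumption).
raw = b'\x95 Please move to this build'

# Step 1: decode bytes into a Unicode string using the source encoding.
text = raw.decode('cp1252')

# Step 2: encode to the target encoding. UTF-8 can represent every character.
utf8_bytes = text.encode('utf-8')

# For an ASCII-only target, pass an error handler so encoding never raises.
ascii_bytes = text.encode('ascii', 'replace')

print(repr(utf8_bytes))
print(repr(ascii_bytes))
```

Note that 'replace' substitutes a question mark for each unencodable character, so it loses information; 'xmlcharrefreplace' keeps the code point recoverable.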

Licensed under: CC-BY-SA with attribution