Ignore unicode in xml with python and lxml?

https://stackoverflow.com//questions/9674910

12-12-2019

Question

I'd like to either ignore the Unicode characters in my XML or convert them somehow while processing the output.

My Python:

import urllib2, os, zipfile 
from lxml import etree

doc = etree.XML(item)
docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
target = doc.xpath('//references-cited/citation/nplcit/*/text()')
#target = '-'.join(target).replace('\n-','')
print "docID:    {0}\nCitation: {1}\n".format(docID,target) 
outFile.write(str(docID) +"|"+ str(target) +"\n")

This produces the following output:

docID:    US-D0607176-S1-20100105
Citation: [u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009. Jazzholic. Apr. 22, 2009 <http://www

But if I try to add back the '-'.join(target).replace('\n-','') line, I get this error for both print and outFile.write:

Traceback (most recent call last):
  File "C:\Documents and Settings\mine\Desktop\test_lxml.py", line 77, in <module>
    print "docID:    {0}\nCitation: {1}\n".format(docID,target)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

How can I ignore the Unicode characters so that I can write target out as a string with outFile.write?

Solution

You are getting this error because you have a unicode string containing non-ASCII characters, and you are trying to output it using the ASCII character set. When you print the list, you get the repr of the list and of the strings inside it, which escapes those characters and so avoids the problem.

You need to either encode to a different character set (UTF-8, for instance), or strip out or replace the characters that cannot be represented when encoding.

I recommend reading Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), followed by the relevant chapters on encoding and decoding strings in the Python docs.
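As a minimal sketch of those options, the snippet below encodes a shortened sample of the citation from the question in four ways. The variable names are illustrative, not from the original script, and the unicode literals are written so that the code runs on both Python 2 and Python 3.

```python
# Shortened sample of the citation text shown in the question's output.
citation = u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009."

# Option 1: encode to a character set that can represent every character.
utf8_bytes = citation.encode("utf-8")

# Option 2: stay with ASCII and silently drop what cannot be encoded.
dropped = citation.encode("ascii", "ignore")

# Option 3: stay with ASCII and substitute '?' for each such character.
replaced = citation.encode("ascii", "replace")

# Option 4: stay with ASCII and emit XML numeric character references.
escaped = citation.encode("ascii", "xmlcharrefreplace")
```

The second argument to encode is the error handler; the default, "strict", is what raises the UnicodeEncodeError seen in the traceback.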

Here's a small hint to get you started:

print "docID:    {0}\nCitation: {1}\n".format(docID.encode("UTF-8"),
                                              target.encode("UTF-8"))
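For the outFile.write part of the question, one possible approach is to let the file object do the encoding instead of calling encode on every value. This is only a sketch: it assumes the output file can be opened with io.open, and the sample values, the temporary path, and the variable names are made up for illustration.

```python
import io
import os
import tempfile

# Sample values standing in for the xpath results from the question.
doc_id = u"US-D0607176-S1-20100105"
target = [u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009."]

# Join the list into a single unicode string before formatting the line.
citation = u"-".join(target)
line = u"{0}|{1}\n".format(doc_id, citation)

# Hypothetical output location; the original script's path is not shown.
out_path = os.path.join(tempfile.mkdtemp(), "citations.txt")

# io.open encodes unicode text to UTF-8 as it is written.
with io.open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(line)

# Reading the file back with the same encoding returns the original text.
with io.open(out_path, "r", encoding="utf-8") as in_file:
    round_trip = in_file.read()
```

Keeping everything as unicode until the moment of writing avoids the str() calls that trigger the implicit ASCII encoding.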

OTHER TIPS

print "docID: {0}\nCitation: {1}\n".format(docID.encode("utf-8"), target.encode("utf-8"))

Every character outside the ASCII character set is then encoded as a multi-byte UTF-8 sequence, which shows up as hex escapes in the repr of the resulting byte string: for example, u"\u201c" becomes "\xe2\x80\x9c". If this is unacceptable then you can do:

docID = "".join([a if ord(a) < 128 else '.' for a in docID])

which will replace all non-ASCII characters with a '.'.
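A slightly gentler variant of the same idea, sketched here as an assumption rather than as part of the original answer, first maps common typographic punctuation to its ASCII equivalent and only then falls back to the '.' placeholder. The to_ascii helper and the PUNCTUATION table are hypothetical names.

```python
# Hypothetical lookup table: typographic punctuation -> ASCII equivalent.
PUNCTUATION = {
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
    u"\u2013": u"-",   # en dash
    u"\u2014": u"-",   # em dash
}


def to_ascii(text, placeholder=u"."):
    """Return text with every non-ASCII character mapped or replaced."""
    chars = []
    for ch in text:
        # Prefer a readable ASCII equivalent when one is listed.
        ch = PUNCTUATION.get(ch, ch)
        # Anything still outside ASCII becomes the placeholder.
        chars.append(ch if ord(ch) < 128 else placeholder)
    return u"".join(chars)


cleaned = to_ascii(u"\u201cJazzholic\u201d caf\u00e9")
```

Here the curly quotes become straight quotes, while the accented letter, which has no entry in the table, is replaced by the placeholder.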

Licensed under: CC-BY-SA with attribution