Question

I'm trying to get all the text nodes in an HTML document using lxml, but I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128). However, when I check the encoding of this page (encoding = chardet.detect(response)['encoding']), it reports utf-8. It seems odd that a single page would involve both utf-8 and ascii. In fact, this:

fromstring(response).text_content().encode('ascii', 'replace')

solves the problem.

Here is my code:

from lxml.html import fromstring
import urllib2
import chardet
request = urllib2.Request(my_url)
request.add_header('User-Agent',
                   'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')   
request.add_header("Accept-Language", "en-us")
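response = urllib2.urlopen(request).read()
# Guess the encoding of the raw bytes that were downloaded
encoding = chardet.detect(response)['encoding']

print encoding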
print fromstring(response).text_content()

Output:

utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)

What can I do to solve this issue? Keep in mind that I want to do this with several other pages, so I don't want to handle the encoding page by page.

UPDATE:

Maybe there is something else going on here. When I run this script in the terminal, I get the correct output, but when I run it inside SublimeText, I get the UnicodeEncodeError.

UPDATE2:

It also happens when I write this output to a file. .encode('ascii', 'replace') works, but I'd like a more general solution.

Regards

Solution

Can you try wrapping your string with repr()? This article might help.

print repr(fromstring(response).text_content())
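To illustrate why this avoids the error: in Python 2, repr() of a unicode string returns an ASCII-only escaped form, so nothing is left for the 'ascii' codec to fail on. A minimal sketch of the same idea, using the 'backslashreplace' error handler so that it behaves identically on Python 2.7 and Python 3; the sample string is an assumption standing in for the page text:

```python
from __future__ import print_function

# Hypothetical sample standing in for fromstring(response).text_content()
text = u'caf\xe9'

# Characters that ASCII cannot represent become backslash escapes
# instead of raising UnicodeEncodeError
escaped = text.encode('ascii', 'backslashreplace')

print(escaped)
```

This is mainly useful for debugging, since it shows which characters are present; the result is an escaped form, not the readable text itself.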

OTHER TIPS

As for writing the output to a file, as mentioned in your edit, I would recommend opening the file with the codecs module:

import codecs
output_file = codecs.open('filename.txt','w','utf8')
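A minimal sketch of how such a file object might be used, written to run on both Python 2.7 and Python 3. The sample text and the temporary path are assumptions made for the demonstration; in the question's script the text would come from text_content().

```python
import codecs
import os
import tempfile

# Hypothetical sample standing in for the extracted page text
text = u'r\xe9sum\xe9'

# Temporary location used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'filename.txt')

# The file object encodes unicode text to UTF-8 as it is written
with codecs.open(path, 'w', 'utf8') as output_file:
    output_file.write(text)

# Reading with the same encoding returns the original unicode text
with codecs.open(path, 'r', 'utf8') as input_file:
    round_trip = input_file.read()
```

Because the encoding is fixed when the file is opened, the same code works for every page without a per-page .encode() call.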

I don't know SublimeText, but it seems to be trying to read your output as ASCII, hence the encoding error.

Based on your first update, I would say that the terminal told Python to output utf-8, while SublimeText indicated that it expects ascii. So I think the solution lies in finding the right settings in SublimeText.
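One way to check this guess is to print the encoding Python associates with standard output in each environment. The helper below is a hypothetical sketch, not part of any library; it assumes that a stream with no declared encoding falls back to the interpreter's default encoding, which is ASCII on Python 2.

```python
from __future__ import print_function
import sys

def effective_encoding(stream):
    """Return the stream's declared encoding, or the interpreter default."""
    # A stream without a declared encoding (e.g. a pipe under Python 2)
    # is assumed to fall back to sys.getdefaultencoding()
    return getattr(stream, 'encoding', None) or sys.getdefaultencoding()

print('stdout encoding:', effective_encoding(sys.stdout))
```

Running this in the terminal and in SublimeText should show whether the two really differ. If the editor's build settings allow environment variables, setting PYTHONIOENCODING to utf-8 is the interpreter's documented way to override the stdout encoding; whether SublimeText then displays the result correctly is something to verify.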

However, if you cannot change what SublimeText expects, it is better to put the encode call you already used into a separate function:

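import sys

def smartprint(text):
    # sys.stdout.encoding is None when Python cannot detect the output
    # encoding (e.g. piped output); fall back to ASCII in that case so
    # that printing never raises UnicodeEncodeError
    encoding = sys.stdout.encoding or 'ascii'
    print text.encode(encoding, 'replace')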

You can use this function instead of print. Keep in mind that your program's output in SublimeText will then differ from its output in the terminal: because of 'replace', any character that the output encoding cannot represent is replaced by a question mark, e.g. é will be shown as ?.
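As a more general alternative to calling a helper for every print, the output stream itself can be wrapped so that everything written to it is encoded as UTF-8. This is a sketch under the assumption that whatever displays the output can read UTF-8; utf8_writer is a hypothetical helper name, and an in-memory byte stream stands in for the real stdout.

```python
import codecs
import io

def utf8_writer(byte_stream):
    """Wrap a byte stream so unicode text written to it is UTF-8 encoded."""
    return codecs.getwriter('utf-8')(byte_stream)

# Demonstration on an in-memory byte stream instead of the real stdout
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write(u'caf\xe9\n')
```

On Python 2, putting sys.stdout = utf8_writer(sys.stdout) at the top of the script sends every later print of unicode text through the wrapper; on Python 3 the byte stream to wrap would be sys.stdout.buffer. Unlike 'replace', this keeps the accented characters intact.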

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow