Question

I am using lxml.html to parse an HTML file and get the text from the page. But now I have a string containing a ' character, for example Florian's, and printing the output raises a traceback:

parent_link_id_text =  parent_link_id.xpath('./td[@width="400"]/text()')
print (SGS_Mid[0]+";"+"External"+";"+str(link_id_num[0])+";"+parent_link_id_text[0]+";"+parent_link_link[0], file = log_file_1)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-58: ordinal not in range(128)

Then I tried this:

print (SGS_Mid[0]+";"+"PublicFreeUrl"+";"+str(link_id_num[0])+";"+unicode(parent_link_id_text[0],"utf-8")+";"+parent_link_link[0], file = log_file_1)

and I get a traceback:

TypeError: decoding Unicode is not supported

How can I print the string with the Unicode character intact?


Solution

I'm not sure this solves your problem, but it may point you in the right direction.

Without seeing the code you use to actually get the data, I can only guess at the cause, so treat what follows as speculation.

Please see the following code:

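import lxml.html as lh
import urllib2

url = 'http://loremipsum.net/about.html'

# lh.parse accepts a file-like object, so the open HTTP response is passed in directly.
doc = lh.parse(urllib2.urlopen(url))

# text() returns a list of strings; take the first match.
value = doc.xpath('//p/strong/text()')[0]

print value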

Printed result:

What is 'lorem ipsum'?

By reading the about page on the lorem ipsum site, you can see that the text returned indeed has the ' in it.
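As for the traceback in the question, here is a minimal sketch, assuming log_file_1 was opened with the plain open() under Python 2, so every unicode value is pushed through the ASCII codec on write. The text that lxml returned was most likely already a unicode object, which would explain why unicode(..., "utf-8") raised "decoding Unicode is not supported": the value needs to be encoded on the way out, not decoded again. The field values, the write_row helper, and the log path below are hypothetical stand-ins for the question's variables.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import io
import os
import tempfile

# Hypothetical stand-ins for the values the question extracts with XPath.
# \u2019 is a typographic apostrophe, which the ASCII codec cannot encode.
fields = [u"SGS-1", u"External", u"42", u"Florian\u2019s page", u"http://example.com/"]


def write_row(path, values):
    """Append one semicolon-separated row to path, encoded as UTF-8."""
    # io.open encodes the text on write, so it never goes through ASCII.
    with io.open(path, "a", encoding="utf-8") as log_file:
        log_file.write(u";".join(values) + u"\n")


log_path = os.path.join(tempfile.mkdtemp(), "log.txt")
write_row(log_path, fields)

# Read the row back to confirm the apostrophe survived the round trip.
with io.open(log_path, encoding="utf-8") as log_file:
    line = log_file.read()
```

On Python 3 the built-in open() takes the same encoding argument, so io.open is only needed on Python 2. If the existing print(..., file=log_file_1) call has to stay, opening log_file_1 this way should be enough, provided every piece being concatenated is unicode text.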

I hope this points you in the right direction.

Licensed under: CC-BY-SA with attribution