How to parse Japanese HTML with lxml and produce readable output?

https://stackoverflow.com/questions/22208812

09-06-2023

Question

I have been using Beautiful Soup, and BS4 does a wonderful job with just this:

soup = BeautifulSoup(response.read(), from_encoding='Shift_JIS')

With that, the whole HTML prints nicely in Japanese, both in my terminal and when written to files:

<p>PR検索</p>

I tried an equivalent approach in lxml, based on other people's questions:

tree = etree.HTML(res.text, parser=etree.HTMLParser(encoding='shift-jis'))

However, the Japanese characters come out as numeric character references:

<p>PR&#26908;&#32034;</p>

I tried the following as well, but the result is the same:

tree = etree.HTML(what.text, parser=etree.HTMLParser(encoding='utf-8'))

As a workaround, I parse the page with lxml.html first and then pass it to BS4 to get the readable output I want. I would still like to know how to get the correct output without BS4. Any help is appreciated!

Solution

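Here's what works for me. The key is to pass an encoding to etree.tostring; by default it serializes as ASCII and escapes every non-ASCII character as a numeric character reference: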

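# -*- coding: utf-8 -*-
from lxml import etree
from lxml.html import fromstring, HTMLParser


# Shift_JIS bytes, as they would arrive from the server
data = "<p>PR検索</p>".encode('shift-jis')

# Decode the input as Shift_JIS while parsing
tree = fromstring(data, parser=HTMLParser(encoding='shift-jis'))
# Serialize as Shift_JIS bytes, then decode them so print shows the text
print(etree.tostring(tree, encoding='shift-jis', method="html").decode('shift-jis'))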

prints:

<p>PR検索</p>
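To make the difference visible, here is a minimal Python 3 sketch that serializes the same tree three ways. The `raw` bytes are a stand-in for whatever `response.read()` would return, and the variable names are only illustrative; this is one way to compare the outputs, not the only one.

```python
from lxml import etree
from lxml.html import fromstring, HTMLParser

# Stand-in for the raw Shift_JIS bytes of a page (assumption: in real code
# these would come from something like response.read()).
raw = "<p>PR検索</p>".encode("shift_jis")

# Tell the parser how to decode the incoming bytes.
tree = fromstring(raw, parser=HTMLParser(encoding="shift_jis"))

# No encoding given: output is ASCII bytes, so the Japanese characters
# are written as numeric character references.
escaped = etree.tostring(tree, method="html")

# encoding="unicode": output is a str with the characters intact.
readable = etree.tostring(tree, encoding="unicode", method="html")

# A named encoding: output is bytes in that encoding, suitable for
# writing to a file opened in binary mode.
sjis_bytes = etree.tostring(tree, encoding="shift_jis", method="html")

print(escaped)
print(readable)
print(sjis_bytes.decode("shift_jis"))
```

For printing to a terminal, `encoding="unicode"` is usually the most convenient choice, since the result is already a `str`. A named encoding is more useful when the bytes go straight into a file.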

Licensed under: CC-BY-SA with attribution
