I have two strings, and I want to compare them.

  1. "H&#7895; tr&#7907; ng&#244;n ng&#7919;" (I think this is ISO-8859-1 encoding)
  2. u'H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef' (unicode)

The two strings have the same content, and I want to compare them. How can I convert the first string to the same encoding as the second?

Solution

You have HTML entities; simply use the HTMLParser module (Python 2) to unescape them:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape("H&#7895; tr&#7907; ng&#244;n ng&#7919;")
u'H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef'
>>> print h.unescape("H&#7895; tr&#7907; ng&#244;n ng&#7919;")
Hỗ trợ ngôn ngữ
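
The session above is Python 2. As a minimal sketch for Python 3 (assuming version 3.4 or later, where the standard library provides `html.unescape`), the same unescape-then-compare step might look like this. The variable names are illustrative only, and the escaped input is assumed to carry the decimal entities from the question:

```python
import html

# Assumption: the first string arrives with decimal HTML entities,
# as in the question.
escaped = "H&#7895; tr&#7907; ng&#244;n ng&#7919;"

# The second string from the question, written as a Python 3 str literal.
expected = "H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef"

# html.unescape turns each numeric entity into the corresponding character.
decoded = html.unescape(escaped)

# Once both sides are plain Unicode strings, == compares them directly.
same = decoded == expected
print(same)
```

No encoding step is involved here: both values are Unicode text, so nothing has to be converted between UTF-8 and ISO-8859-1 before comparing.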

These HTML entities use decimal numbers, not hexadecimal: 7895 is 1ed7 in hexadecimal, and so on. They encode Unicode code points directly; neither UTF-8 nor ISO-8859-1 is involved. ISO-8859-1 (Latin-1) could not even encode this text (Vietnamese for 'Language Support', according to Google Translate): of the four non-ASCII characters, only ô (U+00F4) falls within its range.
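
To make the decimal-to-hexadecimal relationship concrete, here is a small Python 3 sketch. The list of entity values is taken from the four numeric entities in the question; the helper `latin1_encodable` is a hypothetical name introduced only for this illustration:

```python
# Decimal values of the four numeric entities in the question's string.
entity_values = [7895, 7907, 244, 7919]

# Each value is a Unicode code point; chr() gives the character.
chars = [chr(n) for n in entity_values]

# The same numbers in hexadecimal match the \u escapes in the unicode literal.
hex_values = [format(n, "x") for n in entity_values]


def latin1_encodable(ch):
    """Return True if the character can be encoded as ISO-8859-1 (hypothetical helper)."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# ISO-8859-1 only covers code points up to 255 (0xff).
encodable = [latin1_encodable(c) for c in chars]

print(hex_values)
print(encodable)
```

Only code points up to 255 fit in ISO-8859-1, which is why the string as a whole cannot be an ISO-8859-1 byte string.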

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow