Question

I have two strings, and I want to compare them.

  1. "H&#7895; tr&#7907; ng&#244;n ng&#7919;" I think this is iso-8859-1 encoding
  2. u'H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef' unicode.

Both strings have the same content, and I want to compare them. How can I convert the first string to the same encoding as the second string?


Solution

You have HTML entities. Use the HTMLParser module (Python 2) to unescape them:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape("H&#7895; tr&#7907; ng&#244;n ng&#7919;")
u'H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef'
>>> print h.unescape("H&#7895; tr&#7907; ng&#244;n ng&#7919;")
Hỗ trợ ngôn ngữ
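
For readers on Python 3, where HTMLParser.unescape is no longer available, a minimal sketch of the same comparison using the standard library's html.unescape might look like this. The variable names (escaped, expected, decoded) are only illustrative and do not come from the question:

```python
import html

# The entity-encoded text from the question (decimal character references).
escaped = "H&#7895; tr&#7907; ng&#244;n ng&#7919;"

# The already-decoded text it should be compared against.
expected = "H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef"

# Unescape first, then compare the two str objects directly.
decoded = html.unescape(escaped)
print(decoded == expected)
```

Once both values are plain Unicode strings, an ordinary == comparison is enough; no encoding step is involved.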

These HTML entities use decimal numbers, not hexadecimal; 7895 is 1ed7 in hexadecimal, and so on. They encode Unicode codepoints directly, so neither UTF-8 nor ISO-8859-1 is involved. ISO-8859-1, or Latin-1, cannot even encode most of these codepoints: only ô (U+00F4) falls within its range. (The text is Vietnamese for 'Language Support', according to Google Translate.)
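
To check these claims yourself, here is a small Python 3 sketch (again with illustrative names only). It assumes nothing beyond the standard library and simply verifies that the decimal and hexadecimal forms name the same codepoint, and that the text cannot be encoded as ISO-8859-1:

```python
import html

# Decimal 7895 and hexadecimal 1ed7 are the same number, hence the same codepoint.
same_number = (7895 == 0x1ED7)

# A decimal reference and a hexadecimal reference decode to the same character.
same_char = html.unescape("&#7895;") == html.unescape("&#x1ed7;")

text = "H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef"

# Trying to encode the text as ISO-8859-1 fails, because codepoints
# above U+00FF have no representation in that encoding.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(same_number, same_char, latin1_ok)
```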

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow