ISO-8859-1 and UTF-8 in Python

https://stackoverflow.com/questions/16146771

11-04-2022

Question

I have two strings, and I want to compare them.

"H&#7895; tr&#7907; ng&#244;n ng&#7919;" (I think this is ISO-8859-1 encoded)
u'H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef' (Unicode)

The two strings have the same content. How can I convert the first string to the same encoding as the second so that I can compare them?

Solution

You have HTML entities; use the HTMLParser module (Python 2) to unescape them:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape("H&#7895; tr&#7907; ng&#244;n ng&#7919;")
u'H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef'
>>> print h.unescape("H&#7895; tr&#7907; ng&#244;n ng&#7919;")
Hỗ trợ ngôn ngữ
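
The session above is Python 2. On Python 3 the same step can be sketched with the standard library's html.unescape function (available since Python 3.4). The variable names below are illustrative only and do not come from the original answer:

```python
import html

# The entity-encoded string from the question (decimal numeric references).
escaped = "H&#7895; tr&#7907; ng&#244;n ng&#7919;"

# The Unicode string it should be compared against.
expected = "H\u1ed7 tr\u1ee3 ng\xf4n ng\u1eef"

# html.unescape replaces each reference with the code point it names.
decoded = html.unescape(escaped)

# After unescaping, an ordinary string comparison works.
matches = decoded == expected

print(decoded)
print(matches)
```

Once the entities are unescaped, both values are plain Unicode strings, so no encoding conversion is needed before comparing them.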

These HTML entities are numeric character references written in decimal, not hexadecimal: 7895 is 1ed7 in hexadecimal, and so on. They name Unicode code points directly; no UTF-8 or ISO-8859-1 encoding is involved. ISO-8859-1 (Latin-1) cannot even encode most of these code points: of the non-ASCII characters here, only ô (U+00F4) falls within its range. (The text is Vietnamese for 'Language Support', according to Google Translate.)
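
The decimal-to-hexadecimal relationship and the Latin-1 limitation can be checked with a short Python 3 sketch. The helper fits_latin1 is a hypothetical name introduced here for illustration; it is not part of any library:

```python
# Decimal numbers taken from the entities in the question.
code_points = [7895, 7907, 244, 7919]

# The same numbers in hexadecimal, as they appear in the \u escapes.
hex_values = [format(cp, "x") for cp in code_points]

# The characters those code points name.
chars = [chr(cp) for cp in code_points]


def fits_latin1(text):
    """Return True if text can be encoded as ISO-8859-1 (hypothetical helper)."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


latin1_ok = [fits_latin1(ch) for ch in chars]

print(hex_values)  # ['1ed7', '1ee3', 'f4', '1eef']
print(latin1_ok)   # [False, False, True, False]
```

Only code point 244 (U+00F4) is below 256, which is why it is the one character here that ISO-8859-1 can represent.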

Licensed under: CC-BY-SA with attribution
