Getting exact symbol using HTMLParser

https://stackoverflow.com/questions/10826954

11-06-2021

Question

HTMLParser.unescape behaves like this:

>>> import HTMLParser
>>> h= HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'

What should I do to get the exact beta symbol instead of \u03b2?

Thanks

Solution

\u03b2 is "the exact beta symbol".

You must learn to distinguish between a thing and a representation of that thing.

Your string consists of lowercase letter a, lowercase letter l, lowercase letter p, lowercase letter h, lowercase letter a, space, less-than sign, space, and beta.

The u'...' sequence is a representation of a string. It shows you one possible sequence of characters that you could type into a Python source file in order to express the concept of that string. u'foo' is the string foo. So is u'\x66\x6f\x6f'. So is u'\u0066\u006f\u006f'. When you ask Python to display the representation of any of those, it will display u'foo', because that's what Python considers to be the simplest representation of that string.
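As a small illustration of the point above, here is a sketch written for Python 3 (an assumption; the question uses Python 2, where these literals would carry a u prefix and repr would show u'foo'). The variable names are invented for the example.

```python
# Three different source-code spellings of one and the same string.
plain = 'foo'
hex_escaped = '\x66\x6f\x6f'
unicode_escaped = '\u0066\u006f\u006f'

# All three literals produce equal strings.
all_equal = plain == hex_escaped == unicode_escaped
print(all_equal)

# repr() gives a source-code representation; Python picks the simplest spelling.
print(repr(unicode_escaped))

# Printing the string itself gives the text, with no quotes around it.
print(unicode_escaped)
```

The escapes exist only in the source file; once the literal is evaluated, the string holds the characters f, o, o and nothing else.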

When you print u'\u0066\u006f\u006f', you will see foo, with no u prefix and no quotes - because now you are asking for a text representation, instead of a source code representation. You can do the same with the string you have in your program: print h.unescape('alpha &lt; &beta;'), and if your terminal is currently capable of displaying β, you should see alpha < β. If it isn't, you'll typically get a UnicodeEncodeError, as Python attempts to send a byte representation of the string to your terminal (using some kind of string encoding to turn the characters into bytes), and the encoding isn't designed to handle β. For that problem, please see "Python, Unicode, and the Windows console".
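The difference between the string, its escaped representation, and its byte encodings can be sketched as follows. This assumes Python 3, where unescape lives in the html module rather than on HTMLParser; the variable names are hypothetical, and the explicit encode calls only stand in for what happens when Python writes to a terminal.

```python
import html  # Python 3 location of unescape()

text = html.unescape('alpha &lt; &beta;')

# ascii() is a source-code style representation with non-ASCII characters escaped.
escaped_form = ascii(text)
print(escaped_form)

# The string itself has nine characters; beta is a single character, not six.
print(len(text))

# An encoding that covers beta turns the string into bytes without trouble.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)

# An encoding that does not cover beta fails, much like a limited terminal would.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```

Only the escaped representation contains a backslash; the string holds the single character β, and whether it can be shown depends on the encoding used for output.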

License: CC-BY-SA with attribution

Not affiliated with StackOverflow