HTML Entity Codes to Text [duplicate]

https://stackoverflow.com/questions/663058

20-08-2019

Question

This question already has an answer here:

Decode HTML entities in Python string? 5 answers

Does anyone know an easy way in Python to convert a string with HTML entity codes (e.g. &lt; &amp;) to a normal string (e.g. < &)?

cgi.escape() will escape strings (poorly), but there is no unescape().

Solution

The standard library's HTMLParser class has an unescape() method that does this. Unfortunately, it is undocumented:

(Python2 Docs)

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'

(Python 3 Docs; this method was deprecated in Python 3.4 and removed in Python 3.9)

>>> import html.parser
>>> h = html.parser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
'alpha < \u03b2'
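On Python 3.4 and later, the documented module-level function html.unescape() covers the same ground without instantiating a parser. A minimal sketch, with an example string chosen only for illustration:

```python
import html

# Example input mixing named and numeric character references (illustrative only).
text = "alpha &lt; &beta; &amp; &#169;"

# html.unescape() decodes named, decimal, and hexadecimal references.
decoded = html.unescape(text)
print(decoded)  # alpha < β & ©
```

If the code has to run on both old and new interpreters, one option is to try importing html.unescape first and fall back to the HTMLParser method only when the import fails.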

htmlentitydefs (html.entities in Python 3) is documented, but requires you to do a lot of the work yourself.
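To give a sense of the work involved, here is a rough sketch of a hand-rolled decoder built on the entity table. It assumes Python 3, where the module is named html.entities. The function name unescape_entities and the regular expression are hypothetical choices for this example, not part of any library:

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7B; style references (hypothetical pattern).
_ENTITY_RE = re.compile(r"&(#[xX]?[0-9a-fA-F]+|\w+);")


def unescape_entities(text):
    """Replace known character references; leave unknown ones untouched."""

    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or invalid code point: keep the original text.
            return match.group(0)

    return _ENTITY_RE.sub(replace, text)
```

Because the substitution is a single pass, a string such as "&amp;lt;" decodes to "&lt;" rather than being unescaped twice.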

If you only need the XML predefined entities (lt, gt, amp, quot, apos), you could use minidom to parse them. If you only need the predefined entities and no numeric character references, you could even just use a plain old string replace for speed.
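The plain string-replace approach could look like the sketch below. The helper name unescape_xml_predefined is made up for this example, and it handles only the five predefined entities, with no numeric references:

```python
# Replacement order matters: "&amp;" must come last so that input such as
# "&amp;lt;" becomes "&lt;" instead of being decoded twice into "<".
_PREDEFINED = (
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&quot;", '"'),
    ("&apos;", "'"),
    ("&amp;", "&"),
)


def unescape_xml_predefined(text):
    """Decode only the five XML predefined entities."""
    for entity, char in _PREDEFINED:
        text = text.replace(entity, char)
    return text


print(unescape_xml_predefined("a &lt; b &amp;&amp; c &gt; d"))  # a < b && c > d
```

The standard library's xml.sax.saxutils.unescape() does something similar for &amp;, &lt; and &gt;, and accepts an extra dictionary of replacements if more are needed.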

OTHER TIPS

I forgot to tag it at first, but I'm using BeautifulSoup.

Digging around in the documentation, I found that this (BeautifulSoup 3 API):

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

does exactly what I was hoping for.

There is nothing built into the Python stdlib to unescape HTML, but there's a short script you can tailor to your needs at http://www.w3.org/QA/2008/04/unescape-html-entities-python.html.

Use the htmlentitydefs module. This is my old code. It worked, but I'm sure there is a cleaner and more Pythonic way to do it:

import htmlentitydefs
# Map each named entity (e.g. '&lt;') to its Unicode character (Python 2).
e2c = dict(('&%s;' % k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items())

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow