Convert XML/HTML entities into Unicode strings in Python [duplicate]

https://stackoverflow.com/questions/57708

09-06-2019

Question

This question already has an answer here:

Decode HTML entities in a Python string? 5 answers

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example, I get back:

&#x01ce;

which represents an "ǎ" with a tone mark. In binary, this is stored as the 16-bit value 01ce (hexadecimal). I want to convert the HTML entity into the value u'\u01ce'.

Solution

The standard library's HTMLParser has an undocumented function unescape() which does exactly what you think it does:

import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'

Other answers

Python has the htmlentitydefs module, but it doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hexadecimal and named entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

Use the built-in unichr; BeautifulSoup is not necessary:

>>> entity = '&#x01ce'
>>> unichr(int(entity[3:],16))
u'\u01ce'
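
On Python 3, where unichr no longer exists, the same idea can be sketched with chr. The helper name decode_numeric_ref is hypothetical and not part of the answer above; it assumes the input is exactly one numeric character reference and tolerates a missing trailing semicolon.

```python
def decode_numeric_ref(entity):
    """Decode one numeric character reference such as '&#x01ce;' or '&#462'.

    Hypothetical helper for illustration; assumes `entity` is exactly one
    numeric reference, with or without the trailing semicolon.
    """
    body = entity.rstrip(";")
    if body[:3].lower() == "&#x":
        return chr(int(body[3:], 16))   # hexadecimal form
    if body[:2] == "&#":
        return chr(int(body[2:]))       # decimal form
    raise ValueError("not a numeric character reference: %r" % entity)


print(ascii(decode_numeric_ref("&#x01ce")))   # '\u01ce'
print(ascii(decode_numeric_ref("&#462;")))    # '\u01ce'
```

Named entities such as &copy; are outside the scope of this sketch; they need a lookup table.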

An alternative, if you have lxml:

>>> import lxml.html
>>> lxml.html.fromstring('&#x01ce').text
u'\u01ce'

If you have Python 3.4 or later, you can simply use html.unescape:

import html

s = html.unescape(s)
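
As a quick check of what html.unescape covers, the sketch below runs it on the inputs that appear elsewhere on this page. The names samples and decoded are only for illustration.

```python
import html

# Inputs taken from the question and the answers on this page.
samples = ["&copy; 2010", "&#169; 2010", "&#x01ce;", "&#462;"]

decoded = [html.unescape(s) for s in samples]
for raw, text in zip(samples, decoded):
    # ascii() shows non-ASCII characters as escapes, e.g. '\xa9 2010'
    print(raw, "->", ascii(text))
```

Named, decimal and hexadecimal references are all handled in one call. Note that the HTMLParser unescape() method was deprecated in Python 3.4 and removed in Python 3.9, so html.unescape is the one to reach for on current Python 3.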

You might find an answer here: Getting international characters from a web page?

EDIT: It seems that BeautifulSoup does not convert entities written in hexadecimal form. This can be fixed:

import copy, re
from BeautifulSoup import BeautifulSoup

hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
# replace hexadecimal character reference by decimal one
hexentityMassage += [(re.compile('&#x([^;]+);'), 
                     lambda m: '&#%d;' % int(m.group(1), 16))]

def convert(html):
    return BeautifulSoup(html,
        convertEntities=BeautifulSoup.HTML_ENTITIES,
        markupMassage=hexentityMassage).contents[0].string

html = '<html>&#x01ce;&#462;</html>'
print repr(convert(html))
# u'\u01ce\u01ce'

EDIT:

The unescape() function mentioned by @dF, which uses the standard htmlentitydefs module and unichr(), might be more appropriate in this case.

This is a function that should help you get it right and convert entities back to UTF-8 characters.

def unescape(text):
   """Removes HTML or XML character references 
      and entities from a text string.
   @param text The HTML (or XML) source text.
   @return The plain text, as a Unicode string, if necessary.
   from Fredrik Lundh
   2008-01-03: input only unicode characters string.
   http://effbot.org/zone/re-sub.htm#unescape-html
   """
   def fixup(m):
      text = m.group(0)
      if text[:2] == "&#":
         # character reference
         try:
            if text[:3] == "&#x":
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            print "Value Error"
            pass
      else:
         # named entity
         # reescape the reserved characters.
         try:
            if text[1:-1] == "amp":
               text = "&amp;amp;"
            elif text[1:-1] == "gt":
               text = "&amp;gt;"
            elif text[1:-1] == "lt":
               text = "&amp;lt;"
            else:
               print text[1:-1]
               text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
         except KeyError:
            print "keyerror"
            pass
      return text # leave as is
   return re.sub("&#?\w+;", fixup, text)

I'm not sure why the Stack Overflow thread doesn't include the ';' in the search/replace (that is, lambda m: '&#%d;' with the trailing semicolon). If you leave it out, BeautifulSoup can barf because the adjacent character may be interpreted as part of the HTML code (for example, &#39B for &#39Blackout).

This worked better for me:

import re
from BeautifulSoup import BeautifulSoup

html_string='<a href="/cgi-bin/article.cgi?f=/c/a/2010/12/13/BA3V1GQ1CI.DTL"title="">&#x27;Blackout in a can; on some shelves despite ban</a>'

hexentityMassage = [(re.compile('&#x([^;]+);'), 
lambda m: '&#%d;' % int(m.group(1), 16))]

soup = BeautifulSoup(html_string, 
convertEntities=BeautifulSoup.HTML_ENTITIES, 
markupMassage=hexentityMassage)

int(m.group(1), 16) converts the number (given in base 16) back to an integer.
m.group(0) returns the entire match; m.group(1) returns the regex capture group.
Basically, using markupMassage is the same as:
html_string = re.sub('&#x([^;]+);', lambda m:'&#%d;' % int (m.group (1), 16), html_string)
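
The substitution above can be tried on its own, without BeautifulSoup. The following is a minimal Python 3 sketch; the names HEX_REF and hex_refs_to_decimal are hypothetical and chosen only for this example.

```python
import re

# Matches hexadecimal character references such as &#x01ce;
HEX_REF = re.compile(r"&#x([^;]+);")


def hex_refs_to_decimal(markup):
    """Rewrite &#xHH; references as &#DD;, keeping the trailing semicolon."""
    return HEX_REF.sub(lambda m: "&#%d;" % int(m.group(1), 16), markup)


print(hex_refs_to_decimal("<html>&#x01ce;&#462;</html>"))
# <html>&#462;&#462;</html>
```

Because the pattern accepts any characters up to the semicolon, a malformed reference such as &#xZZ; would make int() raise ValueError; a stricter pattern limited to hexadecimal digits would avoid that.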

Another solution is the built-in library xml.sax.saxutils (for both HTML and XML). However, it will only convert &gt;, &amp; and &lt;.

from xml.sax.saxutils import unescape

escaped_text = unescape(text_to_escape)
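
To make the limitation concrete, here is a small Python 3 sketch. The sample string is made up for illustration; the optional second argument of unescape is a dictionary of extra replacements.

```python
from xml.sax.saxutils import unescape

text = "&lt;b&gt; &amp; &copy;"

# Only the three XML-reserved entities are decoded by default,
# so &copy; is left untouched here.
default_result = unescape(text)

# Extra replacements can be supplied as a dict of entity -> replacement.
extended_result = unescape(text, {"&copy;": "\xa9"})

print(default_result)
print(ascii(extended_result))
```

Supplying a dictionary works for a handful of known entities, but numeric references such as &#x01ce; still need one of the other approaches.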

Here is the Python 3 version of dF's answer:

import re
import html.entities

def unescape(text):
    """
    Removes HTML or XML character references and entities from a text string.

    :param text:    The HTML (or XML) source text.
    :return:        The plain text, as a Unicode string, if necessary.
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                else:
                    return chr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = chr(html.entities.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

The main changes concern htmlentitydefs, which is now html.entities, and unichr, which is now chr. See the Python 3 porting guide.
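
For reference, the lookup that the named-entity branch relies on can be sketched in isolation. The variable names below are only for illustration.

```python
import html.entities

# name2codepoint maps an entity name (without & and ;) to its code point.
codepoint = html.entities.name2codepoint["copy"]
character = chr(codepoint)

print(codepoint, ascii(character))
```

An unknown name raises KeyError, which is why the function above catches that exception and leaves the text as is.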

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow