Convert XML/HTML Entities into Unicode String in Python [duplicate]

https://stackoverflow.com/questions/57708

09-06-2019

Question

This question already has an answer here:

Decode HTML entities in Python string? 5 answers

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example:

I get back:

&#x01ce;

which represents an "ǎ" with a tone mark. In binary, this is represented as the 16-bit value 01ce (hexadecimal). I want to convert the HTML entity into the value u'\u01ce'.

Solution

The standard library's very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'

Other tips

Python has the htmlentitydefs module, but it doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hexadecimal and named entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

Use the builtin unichr -- BeautifulSoup isn't necessary:

>>> entity = '&#x01ce'
>>> unichr(int(entity[3:],16))
u'\u01ce'

Alternatively, if you have lxml:

>>> import lxml.html
>>> lxml.html.fromstring('&#x01ce').text
u'\u01ce'

If you are on Python 3.4 or newer, you can simply use html.unescape:

import html

s = html.unescape(s)
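As a quick check, here is a minimal sketch that runs html.unescape on the entity from the question and on the two strings used in the HTMLParser answer. The names samples and decoded are only illustrative.

```python
import html

# Entities taken from the question and from the answers above:
# a hexadecimal reference, a decimal reference and a named entity.
samples = ["&#x01ce;", "&#462;", "&copy; 2010"]

# html.unescape handles all three forms in one call.
decoded = [html.unescape(s) for s in samples]

for raw, text in zip(samples, decoded):
    print(repr(raw), "->", repr(text))
```

Both numeric forms decode to the same character, so no separate hexadecimal handling should be needed on Python 3.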

You could find an answer here -- Getting international characters from a web page?

EDIT: It seems that BeautifulSoup doesn't convert entities written in hexadecimal form. This can be fixed:

import copy, re
from BeautifulSoup import BeautifulSoup

hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
# replace hexadecimal character reference by decimal one
hexentityMassage += [(re.compile('&#x([^;]+);'), 
                     lambda m: '&#%d;' % int(m.group(1), 16))]

def convert(html):
    return BeautifulSoup(html,
        convertEntities=BeautifulSoup.HTML_ENTITIES,
        markupMassage=hexentityMassage).contents[0].string

html = '<html>&#x01ce;&#462;</html>'
print repr(convert(html))
# u'\u01ce\u01ce'

EDIT:

The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.

This is a function that should help you get it right and convert entities back to UTF-8 characters.

import re, htmlentitydefs

def unescape(text):
   """Removes HTML or XML character references 
      and entities from a text string.
   @param text The HTML (or XML) source text.
   @return The plain text, as a Unicode string, if necessary.
   from Fredrik Lundh
   2008-01-03: input only unicode characters string.
   http://effbot.org/zone/re-sub.htm#unescape-html
   """
   def fixup(m):
      text = m.group(0)
      if text[:2] == "&#":
         # character reference
         try:
            if text[:3] == "&#x":
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            print "Value Error"
            pass
      else:
         # named entity
         # reescape the reserved characters.
         try:
            if text[1:-1] == "amp":
               text = "&amp;amp;"
            elif text[1:-1] == "gt":
               text = "&amp;gt;"
            elif text[1:-1] == "lt":
               text = "&amp;lt;"
            else:
               print text[1:-1]
               text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
         except KeyError:
            print "keyerror"
            pass
      return text # leave as is
   return re.sub("&#?\w+;", fixup, text)

I'm not sure why the Stack Overflow thread does not include the ';' in the search/replace (i.e. lambda m: '&#%d;'). If you leave it out, BeautifulSoup can barf because the adjacent character can be interpreted as part of the HTML code (e.g. &#39B for &#39Blackout).

This worked better for me:

import re
from BeautifulSoup import BeautifulSoup

html_string='<a href="/cgi-bin/article.cgi?f=/c/a/2010/12/13/BA3V1GQ1CI.DTL"title="">&#x27;Blackout in a can; on some shelves despite ban</a>'

hexentityMassage = [(re.compile('&#x([^;]+);'),
                     lambda m: '&#%d;' % int(m.group(1), 16))]

soup = BeautifulSoup(html_string,
                     convertEntities=BeautifulSoup.HTML_ENTITIES,
                     markupMassage=hexentityMassage)

int(m.group(1), 16) converts the number (given in base 16) back to an integer.
m.group(0) returns the entire match; m.group(1) returns the regexp capturing group.
Basically, using markupMassage is the same as:
html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)
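The substitution described above can be tried on its own, without BeautifulSoup. This is a minimal Python 3 sketch; the helper name hex_refs_to_decimal is hypothetical, and the pattern is tightened to hexadecimal digits only, which is an assumption rather than what the answer uses.

```python
import re

# Assumption: match only hex digits instead of the looser [^;]+ used above.
HEX_REF = re.compile(r"&#x([0-9a-fA-F]+);")

def hex_refs_to_decimal(markup):
    """Rewrite hexadecimal character references as decimal ones.

    The trailing ';' is kept in the replacement so the reference
    cannot run into the character that follows it.
    """
    return HEX_REF.sub(lambda m: "&#%d;" % int(m.group(1), 16), markup)

converted = hex_refs_to_decimal("&#x27;Blackout &#x01ce;")
print(converted)
```

Here m.group(1) is the hex digits only, so int(..., 16) turns 27 into 39 and 01ce into 462.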

Another solution is the builtin library xml.sax.saxutils (for both HTML and XML). However, it will convert only &gt;, &amp; and &lt;.

from xml.sax.saxutils import unescape

escaped_text = unescape(text_to_escape)
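To illustrate that limitation, here is a small Python 3 sketch. By default only the three XML entities are converted; the optional second argument of unescape is a dictionary of extra replacements. The sample string and the variable names are only illustrative.

```python
from xml.sax.saxutils import unescape

text = "&lt;b&gt; &amp; &copy;"

# Default behaviour: only &lt;, &gt; and &amp; are converted,
# so &copy; is left untouched.
default_result = unescape(text)

# Extra entities can be supplied as a mapping from the
# escaped form to the replacement string.
extended_result = unescape(text, {"&copy;": "\xa9"})

print(default_result)
print(extended_result)
```

Supplying a full mapping this way gets tedious quickly, so this approach is probably best kept for XML-style input that only uses the three basic entities.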

Here's the Python 3 version of dF's answer:

import re
import html.entities

def unescape(text):
    """
    Removes HTML or XML character references and entities from a text string.

    :param text:    The HTML (or XML) source text.
    :return:        The plain text, as a Unicode string, if necessary.
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                else:
                    return chr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = chr(html.entities.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

The main changes concern htmlentitydefs, which is now html.entities, and unichr, which is now chr. See this Python 3 porting guide.
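A short sketch of the two renamed pieces in isolation, assuming Python 3. The variable names are only illustrative.

```python
import html.entities

# html.entities.name2codepoint replaces htmlentitydefs.name2codepoint:
# it maps an entity name (without '&' and ';') to its code point.
code_point = html.entities.name2codepoint["copy"]

# chr replaces unichr and accepts any valid Unicode code point.
named = chr(code_point)
numeric = chr(int("01ce", 16))

print(code_point, repr(named), repr(numeric))
```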

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow