Completely remove bad tags with html5lib.sanitizer

https://stackoverflow.com/questions/6032457

14-11-2019

Question

I'm trying to use html5lib.sanitizer to clean user input, as suggested in the docs.

The problem is that I want to remove bad tags completely, not just escape them (which seems like a bad idea anyway).

The workaround suggested in the patch here doesn't work as expected (it keeps the inner content of a <tag>content</tag>).

Specifically, I want to do something like this:

Input:

<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world</h1>
Lorem ipsum

Output:

<h1>Hello world</h1>
Lorem ipsum

Any ideas on how to achieve this? I've tried BeautifulSoup, but it doesn't seem to work well, and lxml inserts <p></p> tags in very strange places (e.g. around src attributes). So far, html5lib seems to be the best fit for the purpose, if only I could get it to remove tags instead of escaping them.

Solution

The challenge is to also strip unwanted nested tags. It isn't pretty but it's a step in the right direction:

from lxml.html import fromstring
from lxml import etree

html = '''
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world<script>bad_thing();</script></h1>
Lorem ipsum
<script>bad_thing();</script>
<b>Bold Text</b>
'''

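parts = []
doc = fromstring(html)
# Keep only the whitelisted tags and rebuild each one without its children
for el in doc.xpath(".//h1|.//b"):
    clean = etree.Element(el.tag)
    # Copy the leading text and the trailing text; nested tags are dropped
    clean.text, clean.tail = el.text, el.tail
    # encoding="unicode" makes tostring() return str rather than bytes
    parts.append(etree.tostring(clean, encoding="unicode"))

print(''.join(parts))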

Which outputs:

<h1>Hello world</h1>
Lorem ipsum
<b>Bold Text</b>
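
For comparison, the opposite approach (a blacklist that drops chosen tags together with their content and passes everything else through) can be sketched with the standard library alone. This is not the html5lib sanitizer and not the lxml code above; it swaps in Python's html.parser, and the names DROP_TAGS, TagDropper and drop_tags are hypothetical. It is only a rough sketch: it leaves attributes untouched (so it is not a security sanitizer), and it silently discards comments and doctype declarations.

```python
from html.parser import HTMLParser

# Hypothetical names for this sketch: DROP_TAGS, TagDropper, drop_tags.
DROP_TAGS = {"script", "style"}


class TagDropper(HTMLParser):
    """Re-emit HTML, skipping DROP_TAGS elements together with their content."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through unchanged
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # get_starttag_text() returns the tag as written in the source
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no content to skip
        if self.skip_depth == 0 and tag not in DROP_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append("&#%s;" % name)


def drop_tags(html):
    """Return html with every DROP_TAGS element and its content removed."""
    parser = TagDropper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    "<script>bad_thing();</script>\n"
    "<style>* { background: #000; }</style>\n"
    "<h1>Hello world<script>bad_thing();</script></h1>\n"
    "Lorem ipsum\n"
)
print(drop_tags(sample).strip())
```

With the sample input, the script and style elements disappear along with their content, including the script nested inside the h1, while the h1 and the trailing text are kept.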

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow