Using html5lib.sanitizer to completely remove a bad tag

https://stackoverflow.com/questions/6032457

14-11-2019

Question

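I'm trying to use html5lib.sanitizer to clean up documents.

The problem is that I want to remove bad tags completely, not just escape them (which seems like a bad idea anyway).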

The workaround suggested in the patch here doesn't work as expected (it keeps the inner content of the tag).

Specifically, I want to do something like this:

Input:

<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world</h1>
Lorem ipsum

Output:

<h1>Hello world</h1>
Lorem ipsum

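Any ideas on how to achieve this? I've tried BeautifulSoup, but it doesn't seem to work well, and lxml inserts tags in very strange places (e.g. around src attrs). So far, html5lib seems to be the best fit for the purpose, if I can just get it to remove tags instead of escaping them.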

Solution

The challenge is to also strip unwanted nested tags. It isn't pretty but it's a step in the right direction:

from lxml.html import fromstring
from lxml import etree

html = '''
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world<script>bad_thing();</script></h1>
Lorem ipsum
<script>bad_thing();</script>
<b>Bold Text</b>
'''

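parts = []
doc = fromstring(html)
# Whitelist approach: keep only the <h1> and <b> elements.
for el in doc.xpath(".//h1|.//b"):
    # Rebuild each element without its children, so nested tags are dropped.
    clean = etree.Element(el.tag)
    # Copy the leading text and the tail (the text following the closing tag).
    clean.text, clean.tail = el.text, el.tail
    # encoding='unicode' makes tostring() return str instead of bytes on Python 3.
    parts.append(etree.tostring(clean, encoding='unicode'))

print(''.join(parts))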

Which outputs:

<h1>Hello world</h1>
Lorem ipsum
<b>Bold Text</b>
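
The snippet above is a whitelist: it keeps only h1 and b and discards everything else. If a blacklist is closer to what the question asks for (drop script and style together with their contents, keep the rest), here is a minimal sketch of that idea. It deliberately uses only the standard library's html.parser rather than html5lib or lxml, and the names BAD_TAGS, BadTagStripper and strip_bad_tags are made up for this example. It also drops comments and doctype declarations, and it is not a vetted security sanitizer, so treat it as an illustration only.

```python
from html.parser import HTMLParser

# Hypothetical blacklist for this sketch; adjust as needed.
BAD_TAGS = {"script", "style"}


class BadTagStripper(HTMLParser):
    """Re-emit the input, skipping blacklisted elements and everything inside them."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a blacklisted element

    def handle_starttag(self, tag, attrs):
        if tag in BAD_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Raw text of the start tag, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, never as an open/close pair.
        if tag not in BAD_TAGS and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BAD_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append("&#%s;" % name)


def strip_bad_tags(html):
    parser = BadTagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = (
        "<script>bad_thing();</script>\n"
        "<style>* { background: #000; }</style>\n"
        "<h1>Hello world<script>bad_thing();</script></h1>\n"
        "Lorem ipsum\n"
        "<script>bad_thing();</script>\n"
        "<b>Bold Text</b>\n"
    )
    print(strip_bad_tags(sample))
```

Unlike the whitelist version, this keeps any tag that is not listed in BAD_TAGS, so it only makes sense when the set of unwanted tags is known in advance.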

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow