Using html5lib.sanitizer to remove bad tags completely

https://stackoverflow.com/questions/6032457

14-11-2019

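Question

I'm trying to use html5lib.sanitizer to clean user input, as suggested in the docs.

The problem is that I want to remove bad tags completely rather than just escape them (escaping seems like a bad idea anyway).

The workaround suggested in the patch does not work as expected: it keeps the inner content of a <tag>content</tag>.

Specifically, I want to do something like this.

Input: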

<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world</h1>
Lorem ipsum

Output:

<h1>Hello world</h1>
Lorem ipsum

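Any ideas on how to achieve this? I've tried BeautifulSoup, but it doesn't seem to work well, and lxml inserts <p></p> tags in very strange places (e.g. around src attributes). So far html5lib seems to be the best fit for the purpose, provided I can get it to remove tags instead of escaping them.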

Answer

The challenge is to also strip unwanted nested tags. It isn't pretty, but it's a step in the right direction:

from lxml.html import fromstring
from lxml import etree

html = '''
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world<script>bad_thing();</script></h1>
Lorem ipsum
<script>bad_thing();</script>
<b>Bold Text</b>
'''

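parts = []
doc = fromstring(html)
# Rebuild each allowed element from its tag, leading text, and tail only,
# so nested children such as <script> are left out.
for el in doc.xpath(".//h1|.//b"):
    clean = etree.Element(el.tag)
    clean.text, clean.tail = el.text, el.tail
    # encoding="unicode" makes tostring() return str instead of bytes
    parts.append(etree.tostring(clean, encoding="unicode"))

print(''.join(parts))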

This outputs:

<h1>Hello world</h1>
Lorem ipsum
<b>Bold Text</b>
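
For comparison, here is a minimal sketch of the same idea as a deny-list: drop the listed tags together with everything inside them and pass all other markup through unchanged. It uses the standard library's html.parser rather than html5lib or lxml, so it is a swapped-in technique and not the sanitizer's own API. The names DROP_WITH_CONTENT, DropTagsParser and strip_tags_with_content are hypothetical, chosen for illustration. It is not a complete sanitizer: attributes such as onclick are left untouched, and comments and doctype declarations are silently dropped.

```python
from html.parser import HTMLParser

# Assumption: tags whose entire content should be discarded.
# They must be non-void elements (ones that have a closing tag).
DROP_WITH_CONTENT = {"script", "style"}


class DropTagsParser(HTMLParser):
    """Re-emit HTML while skipping the listed tags and everything inside them."""

    def __init__(self, drop=DROP_WITH_CONTENT):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.drop = set(drop)
        self.depth = 0   # greater than 0 while inside a dropped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1
        elif self.depth == 0:
            # Re-emit the start tag exactly as it appeared in the input.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if self.depth == 0 and tag not in self.drop:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop:
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append("&#%s;" % name)


def strip_tags_with_content(html, drop=DROP_WITH_CONTENT):
    """Return html with every tag in `drop` removed along with its content."""
    parser = DropTagsParser(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling strip_tags_with_content(html) on the sample input above keeps the h1 and b elements and the surrounding text, while the script and style elements disappear along with their contents, including the script nested inside the h1.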

License: CC-BY-SA with attribution