Remove contents of <style>…</style> tags using html5lib or bleach

https://stackoverflow.com/questions/7538600

27-01-2021

Question

I've been using the excellent bleach library for removing bad HTML.

I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:

<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

Running this through bleach (with the style tag implicitly disallowed) leaves me with:

st1:*{behavior:url(#ieooui) }

That isn't helpful. Bleach only seems to have options to:

Escape tags;
Remove the tags (but not their contents).

I'm looking for a third option - remove the tags and their contents.

Is there any way to use bleach or html5lib to completely remove the style tag and its contents? The documentation for html5lib isn't really a great deal of help.

Solution

It turned out lxml was a better tool for this task:

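# Note: since lxml 5.2 the cleaner lives in the separate lxml_html_clean
# package; with it installed, this import path should still work.
from lxml.html.clean import Cleaner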

def clean_word_text(text):
    # The only thing I need Cleaner for is to clear out the contents of
    # <style>...</style> tags
    cleaner = Cleaner(style=True)
    return cleaner.clean_html(text)
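
If adding lxml isn't an option, the same idea (drop the style tag together with its contents) can be sketched with the standard library's html.parser. This is not the approach from the answer above, only a minimal illustration; the names StyleStripper and strip_style are made up for this example, and the output is not sanitized in any other way, so it would still need to go through bleach afterwards.

```python
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit HTML, dropping <style> elements and everything inside them.

    Hypothetical helper for illustration; it does no other sanitizing.
    """

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; reach
        # handle_entityref/handle_charref and can be re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <STYLE> is matched too.
        if tag == "style":
            self.in_style = True
        elif not self.in_style:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <style/> is dropped.
        if tag != "style" and not self.in_style:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif not self.in_style:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside <style> is delivered here too; skip it.
        if not self.in_style:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.in_style:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_style:
            self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.in_style:
            self.parts.append(f"<!--{data}-->")


def strip_style(html_text):
    """Return html_text with <style> elements and their contents removed."""
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

For the Word snippet from the question, strip_style('<p>Hello</p><STYLE> st1:*{behavior:url(#ieooui) } </STYLE>') should leave only the paragraph, which can then be handed to bleach as usual.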

Licensed under: CC-BY-SA with attribution
