Question

I'm using the lxml library in Python to clean HTML pages of potentially harmful code and other parts I don't want. I noticed strange behavior in clean_html: when given an empty <li> node, it removes the closing </li> tag but not the opening one.

For example,

from lxml.html.clean import Cleaner
text = '<ul><li></li><li>FooBar</li></ul>'
cleaner = Cleaner()
print(cleaner.clean_html(text))

will output <ul><li><li>FooBar</li></ul>...

As far as I can tell this only happens with <li> tags. Is this a bug in the lxml library? Am I doing something wrong?

Any insight would be appreciated. Thanks!


Solution

The closing tag for <li> is optional in HTML, so it's not a bug, though it may not be the behavior you want.
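To see that the cleaned output still means the same thing, you can parse it with Python's standard-library html.parser: the markup without the first </li> still contains two <li> start tags, so any HTML parser will recover two list items. A minimal sketch (the LiCounter class is just an illustrative helper, not part of lxml):

```python
from html.parser import HTMLParser

class LiCounter(HTMLParser):
    """Count <li> start tags to show the cleaned markup still
    contains two list items even without the first </li>."""
    def __init__(self):
        super().__init__()
        self.li_starts = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.li_starts += 1

parser = LiCounter()
parser.feed('<ul><li><li>FooBar</li></ul>')
print(parser.li_starts)  # prints 2
```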

You could force a closing tag by printing it as XML:

from lxml.html.clean import Cleaner
import lxml.html as LH

text = '<ul><li></li><li>FooBar</li></ul>'
cleaner = Cleaner()
root = LH.fromstring(cleaner.clean_html(text))
# encoding='unicode' makes tostring return a str rather than bytes
print(LH.tostring(root, method='xml', encoding='unicode'))

yields

<ul><li/><li>FooBar</li></ul>
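A side benefit of the XML serialization is that every element is explicitly closed, so the result can also be consumed by strict XML tooling. A quick standard-library check on the output above:

```python
import xml.etree.ElementTree as ET

# The XML-serialized output is well-formed: <li/> is an
# explicit, self-closed empty element.
root = ET.fromstring('<ul><li/><li>FooBar</li></ul>')
items = root.findall('li')
print(len(items))     # prints 2
print(items[1].text)  # prints FooBar
```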
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow