html.parser odd behavior
27-10-2019
Question
Using Python 3.2, I attempted the example straight from the html.parser
documentation:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser(strict=False)
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
Instead of the result shown in the documentation, I get:
Encountered some data : <html>
Encountered some data : <head>
Encountered some data : <title>
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered some data : <body>
Encountered some data : <h1>
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
For some reason, it reports the start tags as data, but only if strict=False. If strict=True, I get the correct result:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Solution
This was a bug that has since been fixed (http://bugs.python.org/issue13273). In fact, the log at http://hg.python.org/cpython/log/9ce5d456138b/Lib/html/parser.py contains a whole lot of messages about problems with strict=False; it almost feels like this mode should still be considered beta.
If you take the most recent version of the file (http://hg.python.org/cpython/raw-file/9ce5d456138b/Lib/html/parser.py) and use that, at least the example from the documentation works again. Still, personally I would be a bit wary of trusting strict=False to work in "critical applications" at the moment.
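To check whether a given interpreter still shows the problem, a small self-test along these lines may help. It is only a sketch: EventCollector, collect_events and start_tags_misreported are names made up for this illustration, not part of html.parser. It records the parser callbacks instead of printing them and flags any data chunk that looks like an unparsed start tag. Note that the strict argument was later deprecated and then removed from HTMLParser, so the sketch passes no strict argument by default; on an affected 3.2 build you could call collect_events(DOC, strict=False) to reproduce the behavior from the question.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parser callbacks instead of printing them."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect_events(markup, **parser_kwargs):
    """Feed markup to a fresh parser and return the recorded events.

    parser_kwargs is forwarded to HTMLParser, e.g. strict=False on an
    old interpreter that still accepts that argument.
    """
    parser = EventCollector(**parser_kwargs)
    parser.feed(markup)
    parser.close()
    return parser.events


def start_tags_misreported(events):
    """Return True if any data chunk looks like an unparsed tag (the symptom above)."""
    return any(kind == "data" and text.startswith("<") for kind, text in events)


# Same document as in the question.
DOC = ('<html><head><title>Test</title></head>'
       '<body><h1>Parse me!</h1></body></html>')

events = collect_events(DOC)
for kind, value in events:
    print(kind, value)
print("start tags misreported as data:", start_tags_misreported(events))
```

If the last line reports True, the interpreter's html.parser still has the bug and start tags are being passed to handle_data.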