Question

Using Python 3.2, I attempted the example straight from the html.parser documentation:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser(strict=False)
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Instead of getting the result shown on the documentation i get:

Encountered some data  : <html>
Encountered some data  : <head>
Encountered some data  : <title>
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered some data  : <body>
Encountered some data  : <h1>
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

For some reason, it treats some tags as data BUT only if strict=False. If strict=True i get the correct result:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Was it helpful?

Solution

This was a bug that has been fixed (http://bugs.python.org/issue13273). actually when you look at http://hg.python.org/cpython/log/9ce5d456138b/Lib/html/parser.py, there is a whole lot of log messages about problems with Strict=False; it almost feels like this should still be considered beta.

If you take the most recent version of the file (http://hg.python.org/cpython/raw-file/9ce5d456138b/Lib/html/parser.py) and use that, at least the example from the documentation works again. Still, personally I would be a bit weary for trusting Strict=False to work in "critical applications" at the moment.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top