Question

I am writing a web crawler in Python, but sometimes it throws an HTMLParseError:

junk characters in start tag: u'\u201dTPL_password_1\u201d\r\n\t\t', at line 21285, column 6

The message says the error was found at line 21285. Does that mean line 21285 of the HTML source? If not, how can I find the HTML code that triggers the error? And how can I tell which URL was being parsed at the time?

My parser class can be simplified as follows:

from HTMLParser import HTMLParser, HTMLParseError

class ParsePage(HTMLParser):

    """Parse the given page content using HTMLParser"""

    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):

        # Here I tried to add `try...except` to inspect the current tag and attrs, but Python never enters the except block. Why? The error message says the error was found in a start tag, so why is the except block never reached?

        try:
            pass  # Some code handling the start tag...
        except HTMLParseError, e:
            print "e: ", e, '\n'
            print 'tag: ', tag, '\n'
            print 'attrs: ', attrs, '\n'
            exit(1)

    def handle_endtag(self, tag):
        pass  # Some code handling end tags...


geturl = ParsePage()

# Here I can catch the HTMLParseError if I wrap the following line in `try...except`, but I don't know how to get useful information out of the exception once I catch it
geturl.feed(cur_page)

Thanks for any help.

Solution

How can I find the HTML code that triggers the error?

junk characters in start tag: u'\u201dTPL_password_1\u201d\r\n\t\t', at line 21285, column 6

The line number refers to the HTML page being parsed: the error is at line 21285 of the current page.
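To look at the markup behind a reported position, one option is to slice the page text yourself. The following is a minimal sketch, assuming the page is still available as a string. `error_context` and `radius` are hypothetical names, not part of HTMLParser. It splits on "\n" only, because that is how HTMLParser counts lines, so the numbers line up with the error message.

```python
def error_context(page, lineno, radius=2):
    """Return (line_number, text) pairs around a 1-based line number."""
    # Split on "\n" only (not splitlines()), so a lone "\r" or other
    # separator does not shift the numbering relative to the parser's count.
    lines = page.split("\n")
    first = max(lineno - 1 - radius, 0)      # index of first line to show
    last = min(lineno + radius, len(lines))  # one past the last line to show
    return [(n + 1, lines[n]) for n in range(first, last)]
```

In Python 2, a caught HTMLParseError exposes the position as its `lineno` and `offset` attributes, so `error_context(cur_page, e.lineno)` returns the offending line and its neighbours. Printing each line with repr() makes stray characters visible, such as the u'\u201d' in the message above, which is a typographic closing quote where the parser expected a plain one.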

And which URL is currently being parsed?

Which link are you parsing?

geturl.feed(cur_page)

cur_page is the page currently being parsed, so the current URL is the one you fetched cur_page from.

OTHER TIPS

Well, it told you the line the error was found in. What else do you need?

Also, what does the URL have to do with this? You pass your HTML page as a string to feed - HTMLParser has no idea where it came from.
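Since only the calling code knows the URL, the place to report it is around the call to feed. That is also why the `try...except` inside handle_starttag never fires: the parser raises the error while it is still scanning the tag, before it calls handle_starttag. The following is a rough sketch under those assumptions. `feed_page` and its `error_types` parameter are hypothetical names; under Python 2 you would pass `error_types=(HTMLParseError,)`.

```python
def feed_page(parser, url, page, error_types=(Exception,)):
    """Feed one page to parser.

    Returns None on success, or a dict describing the failure.
    """
    try:
        parser.feed(page)
    except error_types as exc:
        # Python 2's HTMLParseError carries lineno and offset attributes;
        # getattr keeps this sketch usable with exceptions that lack them.
        return {
            "url": url,
            "error": exc,
            "lineno": getattr(exc, "lineno", None),
            "offset": getattr(exc, "offset", None),
        }
    return None
```

Using a fresh parser instance for each page (or calling reset() between pages) keeps the reported line numbers relative to that page, because the line count otherwise continues across successive feed calls.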

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow