Question

I created a class that extends SGMLParser:

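from sgmllib import SGMLParser  # Python 2 standard library

class URLLister(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)

    # Called when the parser meets an opening <title> tag.
    def start_title(self, attrs):
        pass

    # Called for every run of plain text, whatever tag encloses it.
    def handle_data(self, data):
        print data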

The code is very simple. As I understand it, start_title is invoked when the parser encounters a <title> tag, and handle_data is invoked when it encounters ordinary text. Now I want to extract the text between <title> and </title>, e.g.

<html><head><title>Webpage title</title></head><body>Simple text</body></html>

I want to print only "Webpage title", the text inside the <title> tag, but handle_data outputs all of the text, both "Webpage title" and "Simple text". How can I output only the text between <title> and </title>?


Solution

Really, you could just add a hard-coded check in handle_data like so:

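def handle_data(self, data):
    # Text of the most recently opened start tag, e.g. '<title>';
    # None if no start tag has been seen yet.
    starttag = self.get_starttag_text()
    if starttag is None:
        return
    tag_words = starttag.strip("<>").split()
    # Only print text whose nearest preceding start tag is <title>.
    if tag_words and tag_words[0].lower() == "title":
        print data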

I'm not sure if this is what you wanted exactly, and I'm sure there's a more elegant answer.
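
One possibly more elegant option is to track whether the parser is currently inside <title> with a flag: set it when the tag opens, clear it when the tag closes, and only collect text in handle_data while it is set. The sketch below is an illustration, not the original code. It assumes Python 3, where sgmllib is no longer available, so it uses html.parser.HTMLParser instead; the class name TitleParser and the attributes in_title and title_parts are made up for the example. With SGMLParser the same idea would go in start_title and end_title.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True only while inside <title>...</title>
        self.title_parts = []   # text may arrive in more than one chunk

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lower case.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Ignore all text outside the title element.
        if self.in_title:
            self.title_parts.append(data)

    @property
    def title(self):
        return "".join(self.title_parts).strip()


parser = TitleParser()
parser.feed("<html><head><title>Webpage title</title></head>"
            "<body>Simple text</body></html>")
parser.close()
print(parser.title)
```

Unlike checking the last start tag, the flag is cleared as soon as </title> is seen, so text that follows the closing tag is not mistaken for part of the title.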

Licensed under: CC-BY-SA with attribution