How to extract specific text from HTML using SGMLParser
Question
I created a class that extends SGMLParser:
    from sgmllib import SGMLParser

    class URLLister(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)

        def start_title(self, attrs):
            pass

        def handle_data(self, data):
            print data
The code is very simple. As I understand it, start_title is invoked when the parser comes across a <title> tag, and handle_data is invoked when it comes across normal text. Now I want to extract the text between <title> and </title>, e.g.:
<html><head><title>Webpage title</title></head><body>Simple text</body></html>
I want to print only Webpage title, the text inside the <title> tag, but handle_data outputs all of the text, including both Webpage title and Simple text. How can I output just the text between the <title> tags?
Solution
You could simply add a hard-coded check in handle_data, like so:
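    def handle_data(self, data):
        # Raw text of the most recently opened start tag, e.g. '<title>'
        tag = (self.get_starttag_text() or "").replace("<", "").replace(">", "")
        tag_words = tag.split(" ")
        # Print only when that start tag is <title>
        if len(tag_words) > 0 and tag_words[0].lower() == "title":
            print data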
I'm not sure if this is what you wanted exactly, and I'm sure there's a more elegant answer.
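One possibly cleaner option is to keep a flag that is set when <title> opens and cleared when it closes, and to collect text in handle_data only while the flag is set. The sketch below is an assumption-laden illustration, not the original code: sgmllib was removed in Python 3, so it uses the standard library's html.parser.HTMLParser instead, and the names TitleExtractor, extract_title, in_title and title_parts are made up for this example. With SGMLParser the same idea should carry over by setting the flag in start_title and clearing it in an end_title method.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects only the text that appears inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while the parser is inside <title>
        self.title_parts = []   # text chunks seen inside <title>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # handle_data fires for all text, so keep it only inside <title>
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the text inside <title>, or an empty string if there is none."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()


page = "<html><head><title>Webpage title</title></head><body>Simple text</body></html>"
print(extract_title(page))
```

Unlike the check on the last start tag, the flag is cleared as soon as </title> is seen, so text that follows the closing tag is not picked up by accident.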