How to extract specific text from HTML using SGMLParser

https://stackoverflow.com/questions/9450571

13-11-2019
Question

I created a class that extends SGMLParser:

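from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

class URLLister(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)

    def start_title(self, attrs):
        # Called when the parser meets an opening <title> tag.
        pass

    def handle_data(self, data):
        # Called for every run of plain text, whatever tag encloses it.
        print data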

Very simple code. As I understand it, start_title is called when the parser encounters a <title> tag, and handle_data is called when it encounters plain text. Now I want to extract only the text between <title> and </title>, for example:

<html><head><title>Webpage title</title></head><body>Simple text</body></html>

I want to print only Webpage title, the text inside the <title> tag, but handle_data receives all of the plain text, including both Webpage title and Simple text. How can I output just the text between the <title> tags?
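The behaviour described in the question can be reproduced with a small trace. This is only a sketch, and it assumes Python 3, where sgmllib no longer exists, so it swaps in html.parser.HTMLParser from the standard library. The class name EventTracer and its events list are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class EventTracer(HTMLParser):
    """Hypothetical helper that records every parser callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Receives every run of text, with no hint of the enclosing tag.
        self.events.append(("data", data))


tracer = EventTracer()
tracer.feed("<html><head><title>Webpage title</title></head>"
            "<body>Simple text</body></html>")
tracer.close()

# Keep only the text that reached handle_data.
data_events = [value for kind, value in tracer.events if kind == "data"]
print(data_events)
```

Both the title text and the body text reach handle_data, which is why the handler alone cannot tell them apart.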

Solution

Actually, you could simply add a hard-coded check inside handle_data, like this:

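def handle_data(self, data):
    # get_starttag_text() returns the most recently opened start tag,
    # or None if no start tag has been seen yet.
    tag = (self.get_starttag_text() or "").replace("<","").replace(">","")
    tag_words = tag.split(" ")
    if len(tag_words) > 0 and tag_words[0].endswith("title"):
        print data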

I'm not sure this is exactly what you wanted, and I'm sure there is a more elegant answer.
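One common alternative is to track whether the parser is currently inside <title> with a flag set by the start and end handlers. The sketch below is not the answer's method. It assumes Python 3 and uses html.parser.HTMLParser in place of SGMLParser. The names TitleExtractor, in_title, and title_parts are hypothetical.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Hypothetical helper that collects only the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while between <title> and </title>
        self.title_parts = []   # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Ignore text unless the flag says we are inside <title>.
        if self.in_title:
            self.title_parts.append(data)

    @property
    def title(self):
        return "".join(self.title_parts)


parser = TitleExtractor()
parser.feed("<html><head><title>Webpage title</title></head>"
            "<body>Simple text</body></html>")
parser.close()
print(parser.title)
```

With SGMLParser on Python 2, the same idea would presumably be written by setting the flag in start_title and clearing it in end_title, since sgmllib dispatches to start_ and end_ methods named after the tag.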

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow