dynamically extracting data from HTML page

https://stackoverflow.com/questions/13760909

05-12-2021

Question

I'm working on a script to extract strings from an HTML document (a Nagios status page, in this case) using this custom class:

## tagLister.py

from sgmllib import SGMLParser
class TAGLister(SGMLParser):

    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_td(self, attrs):
        CLS = [ v for k, v in attrs if k == 'class' ]
        if CLS:
            self.urls.extend(CLS)

Whenever a <td> tag is found, SGMLParser calls start_td, which looks for the class attribute.

>>> import urllib, tagLister
>>> usock = urllib.urlopen("http://www.someurl.com/test/test_page.html")
>>> parser = tagLister.TAGLister()
>>> parser.feed(usock.read())  
>>> for url in parser.urls: print url
>>> ...

The above lists every value of the class attribute found on <td> tags. Is there a way to set the tag (the td in start_td) and the attribute name (the string that k is compared against) dynamically, so that both can be passed on the command line via optparse, like this:

tagLister.py -t td -k class

rather than hard-coding them? I intend to reuse this class for any tag (e.g. <a>, <div>) and its associated attributes (e.g. href, id) from the command line. Any help would be greatly appreciated.

Solution

One option is to switch to lxml.html and use XPath. The result is already a list, and since an XPath expression is just a string, it is easier to build from a tag and attribute name than a solution based on class inheritance.

>>> tag = 'a'
>>> attr = 'href'
>>> xpq = '//{}/@{}'.format(tag, attr)
>>> a = '<a href="test-or-something">hello</a><a>No href here</a><a href="something-else">blah</a>'
>>> import lxml.html
>>> lxml.html.fromstring(a).xpath(xpq)
['test-or-something', 'something-else']
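
Because the tag and attribute would come from the command line, it may be worth checking them before formatting them into the expression. The following is only a sketch: build_xpath and the name pattern are hypothetical, and the pattern assumes plain names such as td or class.

```python
import re

# Assumption: tag and attribute names are plain identifiers such as "td" or "class".
_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")


def build_xpath(tag, attr):
    """Return an XPath string selecting attribute `attr` on every `tag` element."""
    for name in (tag, attr):
        if not _NAME.match(name):
            raise ValueError("not a plain name: {!r}".format(name))
    return "//{}/@{}".format(tag, attr)


print(build_xpath("td", "class"))
```

The returned string can be passed to .xpath() in the same way as xpq above.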

If you have to stick to the standard library, you could do something similar with HTMLParser (in Python 3 the module is named html.parser):

from HTMLParser import HTMLParser

class ListTags(HTMLParser):
    def __init__(self, tag, attr):
        HTMLParser.__init__(self)
        self.tag = tag
        self.attr = attr
        self.matches = []
    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            ad = dict(attrs)
            if self.attr in ad:
                self.matches.append(ad[self.attr])

>>> lt = ListTags('a', 'href')
>>> lt.feed(a)
>>> lt.matches
['test-or-something', 'something-else']
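
To get the `tagLister.py -t td -k class` behaviour asked for in the question, the class above could be wired to optparse roughly as follows. This is a sketch written for Python 3: the -t and -k flags come from the question, while the long option names, the defaults, and the helpers parse_options and list_matches are assumptions made for illustration.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser
from optparse import OptionParser


class ListTags(HTMLParser):
    """Collect the values of one attribute on one kind of tag."""

    def __init__(self, tag, attr):
        HTMLParser.__init__(self)
        self.tag = tag
        self.attr = attr
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            ad = dict(attrs)
            if self.attr in ad:
                self.matches.append(ad[self.attr])


def parse_options(argv):
    # -t and -k are taken from the question; long names and defaults are assumptions.
    op = OptionParser(usage="%prog -t TAG -k ATTR")
    op.add_option("-t", "--tag", dest="tag", default="td")
    op.add_option("-k", "--key", dest="attr", default="class")
    options, _args = op.parse_args(argv)
    return options


def list_matches(html, tag, attr):
    # Feed the whole document and return the collected attribute values.
    lister = ListTags(tag, attr)
    lister.feed(html)
    lister.close()
    return lister.matches


# Example run with a hard-coded argument list and a made-up HTML snippet.
sample = '<td class="statusOK">up</td><td>plain</td><td class="statusWARN">slow</td>'
opts = parse_options(["-t", "td", "-k", "class"])
print(list_matches(sample, opts.tag, opts.attr))
```

In a real script, parse_options would receive sys.argv[1:] and the HTML would come from the downloaded page rather than a literal string.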

Licensed under: CC-BY-SA with attribution
