Python: Get data between html tags if an attribute matches and put it into a list

https://stackoverflow.com/questions/17242183

01-06-2022

Question

This is part of an HTML file.

......
......
<tr>
<td color="white" style="color:black; bgcolor="#ffff00">Adam</th>
<td color="white" style="color:white; bgcolor="#ff9900">450231</th>
<td color="white" style="color:black; bgcolor="#cc0000">658902</th>
</tr>
.......
.......
<tr>
<td color="white" style="color:black; bgcolor="#ffff00">John</th>
<td color="white" style="color:white; bgcolor="#ff9900">8734658</th>
<td color="white" style="color:black; bgcolor="#cc0000">90865</th>
</tr>
.......
.......

If bgcolor="#ff9900", I need to extract the cell contents (here 450231 and 8734658) and put them into a list.

So far, I've done this:

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.recording = 0
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            for name, value in attrs:
                if name == 'bgcolor' and value == '#993399':
                    self.recording = 1 

    def handle_endtag(self,tag):
        if tag == 'th':
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)
.
.
.
        y = urllib2.urlopen(x)   # x gets the html file
        html = y.read()
        parser = MyHTMLParser()
        parser.feed(html)
        print parser.data
        parser.close()

parser.data contains ['\n', 'Adam', '\n', '450231', '\n', '658902\n', '\n', '\n', '\n'....] when it should contain only ['450231', '8734658']. I'm not sure where I'm going wrong.

Answer

Your recording flag seems to be always on once it has been set; it is zero only at initialization. You need to reset it to zero at the appropriate point. Because the flag stays set for all subsequent tags, data always gets appended to the list. This is mainly because of the 'th' tag your end-tag handler checks for, which the HTML never opens (the cells open with td). Correct the HTML first.
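One way to reset the flag, as a minimal sketch rather than a definitive fix: decide the flag afresh on every start tag and clear it on every end tag. This is written for Python 3 (html.parser instead of the Python 2 HTMLParser module used in the question). The class name CellParser and the sample string are hypothetical, and the sample assumes bgcolor is a well-formed attribute of its own, which is not the case in the markup quoted in the question, where it sits inside an unterminated style value.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collect text from <td> cells whose bgcolor equals a target value."""

    def __init__(self, target_color):
        super().__init__()
        self.target_color = target_color
        self.recording = False
        self.data = []

    def handle_starttag(self, tag, attrs):
        # Re-evaluate on every start tag so the flag cannot stay on by accident.
        self.recording = tag == 'td' and ('bgcolor', self.target_color) in attrs

    def handle_endtag(self, tag):
        # Reset on any end tag: the cells here are closed with </th>, not </td>.
        self.recording = False

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only chunks such as the newlines between tags.
        if self.recording and text:
            self.data.append(text)


# Hypothetical cleaned-up input (assumption: bgcolor is a real attribute).
sample = '''<tr>
<td color="white" bgcolor="#ffff00">Adam</th>
<td color="white" bgcolor="#ff9900">450231</th>
<td color="white" bgcolor="#cc0000">658902</th>
</tr>
<tr>
<td color="white" bgcolor="#ffff00">John</th>
<td color="white" bgcolor="#ff9900">8734658</th>
<td color="white" bgcolor="#cc0000">90865</th>
</tr>'''

parser = CellParser('#ff9900')
parser.feed(sample)
parser.close()
print(parser.data)
```

Resetting on any end tag, instead of only on 'th' or 'td', keeps the sketch indifferent to which closing tag the page happens to use.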

EDIT: I just read that the HTML is not under your control. I am not sure whether the 'th' check in the end-tag handler would succeed. Try printing the tag inside handle_endtag; if it is never 'th', control will never reach the reset. How about using regular expressions to match it instead? If Beautiful Soup cannot parse it, you might need to resort to regex.


import re
pattern = r'<td.*?bgcolor="#ff9900".*?>(.*?)</th>'  # the cells are closed with </th> in this markup
re.findall(pattern, html)

This should give you the result.
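To check the regular-expression approach against the rows quoted in the question, here is a small self-contained sketch. The variable names html and values are only illustrative, and the string holds just the two sample rows, not a full page. Note that without re.DOTALL the dot does not match newlines, so each match stays within a single line; that assumption holds for this sample but may not hold for other pages.

```python
import re

# Sample rows copied from the question, including the mismatched </th> closers.
html = '''<tr>
<td color="white" style="color:black; bgcolor="#ffff00">Adam</th>
<td color="white" style="color:white; bgcolor="#ff9900">450231</th>
<td color="white" style="color:black; bgcolor="#cc0000">658902</th>
</tr>
<tr>
<td color="white" style="color:black; bgcolor="#ffff00">John</th>
<td color="white" style="color:white; bgcolor="#ff9900">8734658</th>
<td color="white" style="color:black; bgcolor="#cc0000">90865</th>
</tr>'''

# Non-greedy pieces keep each match inside one <td ...>...</th> cell.
pattern = r'<td.*?bgcolor="#ff9900".*?>(.*?)</th>'
values = re.findall(pattern, html)
print(values)  # ['450231', '8734658']
```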

License: CC-BY-SA with attribution

Not affiliated with StackOverflow