Question

I am trying to read through an HTML document using Python and gather all of the table rows into a single list. (I am aware of specialized tools for this purpose, but I must use regex.) Here is my code so far:

import urllib
import re
URL = 'http://www.xpn.org/events/concert-calendar'
sock = urllib.urlopen( URL )
doc = sock.read()
sock.close()
patString = r'''
    <tr(.*?)>
    (.*?)
    </tr>
    '''
pattern = re.compile(patString, re.VERBOSE)
concerts = re.findall(pattern, doc)
print (concerts)

However, the print statement only prints an empty list. I have tried a few different patterns, but all of them have produced the same result. I'm pretty sure that the issue is the pattern, but I'm not entirely certain (I am still getting acquainted with Python while writing this). The table rows I am trying to find have the format <tr class="odd/even"> other data </tr>, and I would like to capture all of this data and place it into a list for use later in the script.

Any help is appreciated. Thanks


Solution

The pattern below matches your sample data just fine. If a row spans multiple lines, pass the re.DOTALL flag so that . also matches \n; without it, . stops at the end of each line.

<tr(.*?)>(.*?)</tr>
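As a minimal sketch of the point above, the snippet below keeps the verbose layout from the question and combines re.VERBOSE with re.DOTALL using |. The sample markup in doc is made up for illustration (the real page may look different), and the code is written for Python 3 rather than the Python 2 used in the question.

```python
import re

# Hypothetical sample markup; the real page's HTML may differ.
doc = """<table>
<tr class="odd">
  <td>Band A</td>
</tr>
<tr class="even"><td>Band B</td></tr>
</table>"""

# Same layout as in the question. re.VERBOSE ignores the whitespace
# inside the pattern; re.DOTALL lets . match newlines as well.
pat_string = r'''
    <tr(.*?)>
    (.*?)
    </tr>
    '''
pattern = re.compile(pat_string, re.VERBOSE | re.DOTALL)
rows = pattern.findall(doc)

# Without re.DOTALL, the row that spans several lines is skipped.
rows_without_dotall = re.findall(r'<tr(.*?)>(.*?)</tr>', doc)

print(len(rows))                 # both rows are found
print(len(rows_without_dotall))  # only the single-line row is found
```

Each element of rows is a tuple of the two captured groups: the attribute text after <tr, and the row content.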

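The ? qualifier on the middle group (which makes .* non-greedy) is pretty important; without it, a single match could swallow everything from the first <tr> to the last </tr>, with whole <tr></tr> blocks ending up inside the data part.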
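To see the difference, here is a small sketch comparing the non-greedy and greedy forms on a made-up one-line sample (the string html is an assumption, not data from the real page):

```python
import re

# Hypothetical sample with two rows on one line.
html = '<tr class="odd">first</tr><tr class="even">second</tr>'

# Non-greedy: each row is matched separately.
lazy = re.findall(r'<tr(.*?)>(.*?)</tr>', html)

# Greedy middle group: one match runs through to the last </tr>.
greedy = re.findall(r'<tr(.*?)>(.*)</tr>', html)

print(lazy)
print(greedy)
```

With the greedy version, the second row ends up inside the data captured for the first one.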

This is easy because you are not really parsing HTML; you are only extracting certain tags in a very specific case.

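Things will get ugly if you have a <tr> nested inside another <tr>, for example: the non-greedy match stops at the first </tr> it finds, which belongs to the inner row.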
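A short sketch of that failure mode, using a made-up nested table (the string nested is purely illustrative):

```python
import re

# Hypothetical markup: a row whose cell contains another table.
nested = '<tr><td><table><tr><td>inner</td></tr></table></td></tr>'

rows = re.findall(r'<tr(.*?)>(.*?)</tr>', nested, re.DOTALL)

# The outer row is cut off at the inner </tr>, and the inner row
# is never reported as a row of its own.
print(rows)
```

The regex has no notion of nesting depth, so it cannot pair each <tr> with its own closing tag; this is the case where a real HTML parser becomes necessary.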

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow