Using regex in python for html tags

https://stackoverflow.com/questions/23569872

19-07-2023

Question

I am trying to read through an html doc using python and gather all of the table rows into a single list. (I am aware of specialized tools for this purpose, but I must use regex.) Here is my code so far:

import urllib
import re
URL = 'http://www.xpn.org/events/concert-calendar'
sock = urllib.urlopen( URL )
doc = sock.read()
sock.close()
patString = r'''
    <tr(.*?)>
    (.*?)
    </tr>
    '''
pattern = re.compile(patString, re.VERBOSE)
concerts = re.findall(pattern, doc)
print (concerts)

However, the print statement only outputs an empty list. I have tried a few different patterns, but all have produced the same result. I'm fairly sure that the issue is the pattern, but I'm not certain (I am still getting acquainted with Python while writing this). The table rows I am trying to find have the format <tr class="odd/even"> other data </tr>, and I would like to capture all of this data and place it into a list for use later in the script.

Any help is appreciated. Thanks

Solution

The following pattern matches your sample data just fine. If a row runs over multiple lines, pass the re.DOTALL flag so that . also matches \n.

<tr(.*?)>(.*?)</tr>
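As a minimal sketch of the pattern above in use, assuming Python 3 and a small inline sample rather than the live page (the markup and the names sample_html, rows_dotall and rows_default are made up for illustration):

```python
import re

# Hypothetical sample standing in for the downloaded page. The second row
# spans several lines, which is the case that needs re.DOTALL.
sample_html = """<table>
<tr class="odd"><td>Band A</td></tr>
<tr class="even">
  <td>Band B</td>
</tr>
</table>"""

# Group 1 captures the attributes of the opening tag, group 2 the row body.
pattern_text = r"<tr(.*?)>(.*?)</tr>"

# With re.DOTALL, "." also matches newlines, so both rows are found.
rows_dotall = re.findall(pattern_text, sample_html, re.DOTALL)

# Without the flag, the multi-line row is skipped.
rows_default = re.findall(pattern_text, sample_html)

for attrs, body in rows_dotall:
    print(repr(attrs), repr(body.strip()))
print(len(rows_default), "row(s) found without re.DOTALL")
```

Because the pattern has two groups, re.findall returns a list of (attributes, body) tuples, one per row.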

The ? qualifier on the data group in the middle makes it non-greedy, which is pretty important: otherwise a greedy .* could swallow several entire <tr></tr> blocks as the data part.
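The difference can be seen on a short one-line sample. This is only an illustrative sketch; the sample string and the names lazy and greedy are not from the original question:

```python
import re

# Hypothetical one-line sample containing two rows.
sample = '<tr class="odd">first</tr><tr class="even">second</tr>'

# Non-greedy body: stops at the first closing </tr> it can reach.
lazy = re.findall(r"<tr(.*?)>(.*?)</tr>", sample)

# Greedy body: runs on to the last closing </tr> in the string.
greedy = re.findall(r"<tr(.*?)>(.*)</tr>", sample)

print(lazy)    # one tuple per row
print(greedy)  # a single tuple whose body spans both rows
```

With the greedy version, the second row ends up inside the body of the first match instead of being reported as its own row.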

This is easy only because you are not really parsing HTML, just extracting some tags in a very specific case.

Things will get ugly if you have a <tr> in a <tr> for example.
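A small sketch of that failure mode, using made-up markup in which a row contains a nested table (the names nested and matches are illustrative only):

```python
import re

# Hypothetical markup: an outer row whose first cell holds a nested table.
nested = ("<tr><td><table><tr><td>inner</td></tr></table></td>"
          "<td>outer</td></tr>")

matches = re.findall(r"<tr(.*?)>(.*?)</tr>", nested, re.DOTALL)
print(matches)
```

The non-greedy body stops at the inner closing </tr>, so the single match is a truncated fragment of the outer row and the "outer" cell is never captured. A regular expression of this kind cannot keep track of nesting depth.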

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow