Python parsing HTML Using Regular Expressions

Question 1

Use the right tool for the right job.

Here's an analogy for why regular expressions are the wrong tool here: it's like asking a five-year-old to understand Hamlet. He doesn't yet have the vocabulary or grammar to follow Shakespeare; that only comes later, once he can process more abstract concepts.
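
To make that concrete, here is a small sketch of a regex that looks reasonable but quietly misses cells. The markup is made up for illustration (it is not the HTML from the question); it just writes the same kind of cell three slightly different ways, all of which a browser or an HTML parser treats alike:

```python
import re

# Hypothetical markup: three "odd" cells, each written a little differently.
html = (
    '<td class="odd">001</td>'
    "<td class='odd'>Linear Algebra</td>"
    '<td align="left" class="odd">30</td>'
)

# A pattern that looks fine at first glance.
pattern = re.compile(r'<td class="odd">(.*?)</td>')

matches = pattern.findall(html)
print(matches)  # only the first cell is found
```

Single quotes or an extra attribute are enough to break the pattern, and patching it for each variation gets unmanageable quickly.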

Use either lxml or BeautifulSoup to do that.

For example, to get a list of all the odd cells and all the even cells:

>>> from lxml import etree
>>> tree = etree.HTML(your_html_text)
>>> odds = tree.xpath('//td[@class="odd"]/text()')
>>> evens = tree.xpath('//td[@class="even"]/text()')
>>> odds
['001', 'Linear Algebra', 'Guang  Yang', '30']
>>> evens
['  4.00', 'University City', 'Lecture']
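
If you want to try the XPath idea without installing anything, here is a rough sketch using the standard library's xml.etree.ElementTree instead of lxml. That is a swapped-in stand-in, not what the answer uses: ElementTree only supports a small XPath subset and only accepts well-formed markup, whereas lxml's etree.HTML is far more forgiving. The sample table is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the table in the question.
sample = """
<table>
  <tr><td class="tableHeader">Section</td><td class="odd">001</td></tr>
  <tr><td class="tableHeader">Credits</td><td class="even">4.00</td></tr>
  <tr><td class="tableHeader">Max Enroll</td><td class="odd">30</td></tr>
</table>
"""

root = ET.fromstring(sample)

# ElementTree understands predicates of the form [@attr='value'].
odds = [td.text for td in root.findall(".//td[@class='odd']")]
evens = [td.text for td in root.findall(".//td[@class='even']")]

print(odds)
print(evens)
```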

edit:

I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number.

OK, now I see what you want. Here's a solution using lxml:

>>> for elt in tree.xpath('//tr'):
...     if elt.xpath('td[@class="tableHeader"]')[0].text == "Max Enroll":
...         elt.xpath('td[@class="odd"]|td[@class="even"]')[0].text
... 
'30'

There you have only the max enroll number.

Using BeautifulSoup it's a bit easier:

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(your_html_text, 'html.parser')
>>> for t in bs.find_all('td', attrs={'class': 'tableHeader'}):
...     if t.text == "Max Enroll":
...         print(t.find_next('td').text)
...
30
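
The snippets above assume your_html_text already holds the page. For a dependency-free way to try the same "find the header cell, take the next cell" idea, here is a minimal sketch built on the standard library's html.parser. It is a different technique from the lxml and BeautifulSoup code above; the class name, helper function and sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class CellAfterHeader(HTMLParser):
    """Collect the text of the <td> that follows a cell with a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_td = False
        self.want_next = False  # True once the label cell has been seen
        self.buffer = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_td:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self.in_td:
            return
        self.in_td = False
        text = "".join(self.buffer).strip()
        if self.want_next:
            self.values.append(text)
            self.want_next = False
        elif text == self.label:
            self.want_next = True


def value_after(html, label):
    """Return the text of every cell that directly follows a `label` cell."""
    parser = CellAfterHeader(label)
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical markup; note the unquoted attribute, which the parser tolerates.
sample = (
    '<table>'
    '<tr><td class="tableHeader">Section</td><td class="odd">001</td></tr>'
    '<tr><td class="tableHeader">Max Enroll</td><td class=odd>30</td></tr>'
    '</table>'
)
print(value_after(sample, "Max Enroll"))
```

For real pages lxml or BeautifulSoup is still the better choice, since they cope with broken markup and give you a proper tree to navigate.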

Question 2

Use a tool that specializes in parsing HTML, like BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

For example, here's how you can get what you want:

from bs4 import BeautifulSoup

data = """your html here"""

# Name the parser explicitly so the result doesn't depend on what is installed.
soup = BeautifulSoup(data, "html.parser")
print(soup.find('td', string="Max Enroll").find_next_sibling('td').text)

Prints:

30

Question 3

An alternative to zmo's answer, using BeautifulSoup:

from bs4 import BeautifulSoup

data = """
<snipped html>
"""

soup = BeautifulSoup(data, "html.parser")

# Find the header cell labelled "Max Enroll" and print the value cell next to it.
for header in soup.find_all('td', class_="tableHeader"):
    if header.get_text() == "Max Enroll":
        print(header.find_next_sibling('td', class_="odd").get_text())

Output:

30