Extract Multiline Data from an HTML site using Python

Question 1

Thank you all for your help! You pointed me in the right direction; here's how I got my code to work with BeautifulSoup. I noticed that all the data I wanted was in td cells with the class "value chg", and that my data is always the 3rd and 5th element in that search, so this is what I did:

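# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup
import urllib

# url is the address of the page being scraped
content = urllib.urlopen(url).read()
soup = BeautifulSoup(content)

# every cell that holds a value carries the class "value chg"
td_list = soup.findAll('td', {'class': 'value chg'})

# the 3rd and 5th matches are the two values I want
mon3 = td_list[2].text.encode('ascii', 'ignore')
yr1 = td_list[4].text.encode('ascii', 'ignore')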

Again, "content" is the HTML that I've downloaded.
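For anyone on Python 3 without BeautifulSoup installed, the same "collect every value chg cell, then pick the 3rd and 5th" idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not the code I actually ran: the class name ValueChgCollector, the helper value_chg_cells, and the sample table (including every number except +10.03%) are made up for illustration.

```python
from html.parser import HTMLParser


class ValueChgCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute is exactly "value chg"."""

    def __init__(self):
        super().__init__()
        self.cells = []          # text of each matching cell, in document order
        self._capturing = False  # True while inside a matching <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> also ends the previous one, so unclosed cells are fine.
            self._capturing = dict(attrs).get("class") == "value chg"
            if self._capturing:
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.cells[-1] += data


def value_chg_cells(html):
    """Return the stripped text of all "value chg" cells found in html."""
    parser = ValueChgCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Hypothetical stand-in for the downloaded page; row labels and values are invented.
sample_html = (
    '<table>'
    '<tr><td>1 week</td><td class="value chg">+1.10%</td></tr>'
    '<tr><td>1 month</td><td class="value chg">+4.20%</td></tr>'
    '<tr><td>3 month</td><td class="value chg">+10.03%</td></tr>'
    '<tr><td>6 month</td><td class="value chg">-2.50%</td></tr>'
    '<tr><td>1 year</td><td class="value chg">+15.75%</td></tr>'
    '</table>'
)

cells = value_chg_cells(sample_html)
mon3 = cells[2]  # 3rd match
yr1 = cells[4]   # 5th match
print(mon3, yr1)
```

Picking cells by position is as fragile here as in the BeautifulSoup version: if the page gains or loses a "value chg" cell, the indexes shift.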

Question 2

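The pattern has to span several lines, so put \s* between the tags to match the line breaks. The pattern below also carries the "multiline" regex switch (?m); note that (?m) only changes how ^ and $ match, so it is harmless here but not strictly required. You can extract the target content directly with findall, taking the first match via findall(regex, content)[0]: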

percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]

Because \s* matches any run of whitespace, including \n and \r\n, the regex is compatible with both Unix and Windows style line terminators.
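A quick way to check the line-terminator claim is to run the same pattern against a Unix-style snippet and a copy with \r\n line endings. The variable names and the shortened HTML fragment below are only for this check; the pattern itself is unchanged.

```python
import re

# Same pattern as in the answer, unchanged.
PATTERN = r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)'

# Shortened fragment of the page with Unix line endings.
unix_html = (
    '<td width=20%>3 month\n'
    '<td width=1% class=bar>\n'
    '&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n'
    '<td width=54% class=bar>'
)
# The same fragment as it would look with Windows line endings.
windows_html = unix_html.replace('\n', '\r\n')

unix_match = re.findall(PATTERN, unix_html)
windows_match = re.findall(PATTERN, windows_html)
print(unix_match, windows_match)  # ['+10.03%'] ['+10.03%']
```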

The following test code demonstrates this:

import re
content = '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'        
percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]
print(percent)

Output:

+10.03%
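One caveat with findall(regex, content)[0]: if the row is missing from the page, the list is empty and the [0] raises IndexError. A minimal sketch of a more forgiving variant, using re.search and returning None when nothing matches, might look like this. The function name extract_three_month is made up for illustration; the pattern is the same one, only split across lines.

```python
import re

# Same pattern as above; adjacent string literals are joined into one regex.
ROW_PATTERN = re.compile(
    r'(?m)<td width=20%>3 month\s*'
    r'<td width=1% class=bar>\s*&nbsp;\s*'
    r'<td width=1% nowrap class="value chg">(\S+)'
)


def extract_three_month(html):
    """Return the captured 3 month value, or None when the row is not found."""
    match = ROW_PATTERN.search(html)
    return match.group(1) if match else None


content = (
    '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n'
)
print(extract_three_month(content))             # +10.03%
print(extract_three_month('<p>no table</p>'))   # None
```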