Question

So I've had great success extracting data as long as what I'm matching doesn't cross more than one line; if it crosses more than one line, I (seemingly) get heartburn. Here's a snippet of the HTML data I get:

<tr>
<td width=20%>3 month
<td width=1% class=bar>
&nbsp;
<td width=1% nowrap class="value chg">+10.03%
<td width=54% class=bar>
<table width=100% cellpadding=0 cellspacing=0 class=barChart>
<tr>

I am interested in the "+10.03%" number and

<td width=20%>3 month

is the pattern that lets me know that the "+10.03%" is what I want.

So I've got this so far in Python:

percent = re.search('<td width=20%>3 month\r\n<td width=1% class=bar>\r\n&nbsp;\r\n<td width=1% nowrap class="value chg">(.*?)', content)

where the variable content has all the HTML code I'm searching. This doesn't seem to work for me. Any advice would be greatly appreciated! I've read a couple of other posts that talk about re.compile() and re.MULTILINE, but I haven't had any luck with them, mostly because I don't understand how they work, I guess.


Solution

Thank you all for your help! You pointed me in the right direction. Here's how I got my code to work with BeautifulSoup. I noticed that all the data I wanted was in cells with the class "value chg", and that my data is always the 3rd and 5th element in that search, so this is what I did:

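# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup
import urllib

# url is the address of the page being scraped
content = urllib.urlopen(url).read()
soup = BeautifulSoup(content)

# every <td> whose class attribute is exactly "value chg"
td_list = soup.findAll('td', {'class':'value chg'} )

# the 3rd and 5th matches hold the 3 month and 1 year values
mon3 = td_list[2].text.encode('ascii','ignore')
yr1 = td_list[4].text.encode('ascii','ignore')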

Again, "content" is the HTML that I've downloaded.
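
For readers who cannot install BeautifulSoup, the same idea (collect every cell with class "value chg", then pick by position) can be sketched with Python 3's standard html.parser module. This is not the code from the answer above, only a rough equivalent; the names ValueChgParser and extract_value_chg are made up for illustration, and it assumes, as in the sample HTML, that the cells are never closed and contain plain text only.

```python
from html.parser import HTMLParser


class ValueChgParser(HTMLParser):
    """Collect the text of every <td> whose class attribute is "value chg"."""

    def __init__(self):
        super().__init__()
        self.values = []         # text of each matching cell, in document order
        self._capturing = False  # True while inside a matching <td>

    def handle_starttag(self, tag, attrs):
        # Assumption: the sample HTML never closes its <td> tags, so any new
        # start tag is treated as the end of the current cell.
        self._capturing = tag == "td" and dict(attrs).get("class") == "value chg"
        if self._capturing:
            self.values.append("")

    def handle_endtag(self, tag):
        # Also stop at an explicit </td>, in case the page does close its cells.
        if tag == "td":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.values[-1] += data


def extract_value_chg(html):
    """Return the stripped text of all "value chg" cells found in html."""
    parser = ValueChgParser()
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]


# The snippet from the question, used here as sample input.
sample = (
    '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n'
    '<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'
)
print(extract_value_chg(sample))
```

On the full page, the 3rd and 5th entries of the returned list would correspond to td_list[2] and td_list[4] above.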

Other suggestions

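Your pattern hard-codes \r\n between the tags and ends in a lazy (.*?), which always matches the empty string. Match the whitespace between the tags with \s* and capture the value with (\S+) instead. The "multiline" regex switch (?m) used below only changes how ^ and $ match, so it is optional here. You can directly extract the target content using findall and taking the first element of the result via findall(regex, content)[0]: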

percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]

Because \s* matches any run of whitespace, including newlines, the regex is compatible with both Unix and Windows style line terminators.
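
As a quick check of the line terminator claim, here is a small sketch that runs one pattern against the same markup with \n and with \r\n line endings. The names PATTERN, unix_html and windows_html are made up for this example, and the (?m) switch is left out because the pattern contains no ^ or $ for it to act on.

```python
import re

# Same pattern as above, split across lines for readability.
PATTERN = re.compile(
    r'<td width=20%>3 month\s*'
    r'<td width=1% class=bar>\s*&nbsp;\s*'
    r'<td width=1% nowrap class="value chg">(\S+)'
)

# The same markup with Unix (\n) and Windows (\r\n) line endings.
unix_html = (
    '<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n'
)
windows_html = unix_html.replace('\n', '\r\n')

for label, html in (('unix', unix_html), ('windows', windows_html)):
    # \s* absorbs "\n" and "\r\n" alike, so both inputs give the same capture.
    print(label, PATTERN.findall(html))
```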


The following test code demonstrates this:

import re
content = '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'        
percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]
print(percent)

Output:

+10.03%
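
One caveat with findall(...)[0] is that it raises IndexError when the row is not in the page. A slightly more defensive sketch, assuming you would rather get None back in that case, uses re.compile() to build the pattern once and search() to test for a match. The names PERCENT_RE and find_three_month_change are hypothetical.

```python
import re

# re.compile() only pre-builds the pattern object so it can be reused;
# it does not change what the pattern matches.
PERCENT_RE = re.compile(
    r'<td width=20%>3 month\s*'
    r'<td width=1% class=bar>\s*&nbsp;\s*'
    r'<td width=1% nowrap class="value chg">(\S+)'
)


def find_three_month_change(html):
    """Return the captured percentage string, or None if the row is absent."""
    match = PERCENT_RE.search(html)
    return match.group(1) if match else None


content = (
    '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n'
)
print(find_three_month_change(content))
print(find_three_month_change('<tr><td>no such row</td></tr>'))
```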
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow