Python webscrape text inside of multiple tags

https://stackoverflow.com//questions/21030734

21-12-2019

Question

I am trying to extract some values from a Yahoo Finance page. They are wrapped in <td> tags. I was able to get it to return these values:

543.46
546.8
None
None
595.73
0.65

I'm having problems with the None values that I got. I should be getting "537.51 x 100" and "537.60 x 100". The numbers change because the page is live; I just need the output in that format. The relevant HTML from the page source is below. It is nested inside more tags, but BeautifulSoup doesn't care about that.

<tr>
<th scope="row" width="48%">
    Prev Close:</th>
<td class="yfnc_tabledata1">
    543.46</td>
</tr>

<tr>
<th scope="row" width="48%">
    Open:</th>
<td class="yfnc_tabledata1">
    546.80</td>
</tr>

<tr>
<th scope="row" width="48%">
    Bid:</th>
<td class="yfnc_tabledata1">
    <span id="yfs_b00_aapl">
        536.55</span>
    <small> x 
        <span id="yfs_b60_aapl">
            100</span>
    </small>
</td>
</tr>

<tr>
<th scope="row" width="48%">
    Ask:</th>
<td class="yfnc_tabledata1">
    <span id="yfs_a00_aapl">
        536.63</span>
    <small> x 
        <span id="yfs_a50_aapl">
            100</span>
    </small>
</td>
</tr>

<tr>
<th scope="row" width="48%">
    1y Target Est:</th>
<td class="yfnc_tabledata1">
    595.73</td>
</tr>

<tr>
<th scope="row" width="48%">
    Beta:</th>
<td class="yfnc_tabledata1">
    0.65</td>
</tr>

As you can see, the third and fourth values have some extra tags, such as <span> and <small>, inside the td tag, so .string returns None for them, which I don't want. My code is here:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://finance.yahoo.com/q?s=AAPL&q1=1")
soup = BeautifulSoup(html)


for data in soup.find_all('td', attrs={'class': 'yfnc_tabledata1'})[0:6]:
    # I use .string so it prints only the text, not the tags.
    # I would rather have it return strings before it needs to print.
    print(data.string)

I'm thinking I need another for loop inside the first one to account for the extra tags, or maybe some if statements. I'm not sure what the code would look like.

Solution

Personally, I'd do this the simple way:

from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://finance.yahoo.com/q?s=AAPL&q1=1")
soup = BeautifulSoup(html)

for data in soup.find_all('td', class_="yfnc_tabledata1")[0:6]:
    if data.parent.name == "tr":
        print(data.text)

Outputs:

>>>
543.46
546.80
536.50 x 100
536.60 x 100
595.73
0.65
>>>

Works well enough :)

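Note: I changed the import to urllib2 for the urlopen function because I am running Python 2. On Python 3, keep "from urllib.request import urlopen" as in the question.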

You could also use either of the following:

for data in soup.find_all('td', class_="yfnc_tabledata1")[0:6]:
    print (data.text)

for data in soup.find_all('td', attrs={'class': 'yfnc_tabledata1'})[0:6]:
    print (data.text)
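
To see why .string gave None while .text works, here is a minimal offline sketch. It assumes a trimmed copy of two rows from the question's HTML (the SAMPLE_HTML constant and the variable names are made up for illustration) and uses the built-in "html.parser" so no network access is needed. As far as I understand it, .string only returns a value when the tag has exactly one child, while get_text() (which is what .text calls) joins every text piece under the tag.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two rows from the question's HTML (illustrative only).
SAMPLE_HTML = """
<table>
<tr>
<th scope="row" width="48%">
    Prev Close:</th>
<td class="yfnc_tabledata1">
    543.46</td>
</tr>
<tr>
<th scope="row" width="48%">
    Bid:</th>
<td class="yfnc_tabledata1">
    <span id="yfs_b00_aapl">
        536.55</span>
    <small> x
        <span id="yfs_b60_aapl">
            100</span>
    </small>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cells = soup.find_all("td", class_="yfnc_tabledata1")

# .string is None for the Bid cell because it contains several children.
string_values = [cell.string for cell in cells]

# get_text(" ", strip=True) strips each text piece and joins them with spaces.
text_values = [cell.get_text(" ", strip=True) for cell in cells]

print(string_values)
print(text_values)
```

Building the list first and printing afterwards also covers the wish in the question to have strings in hand before anything is printed.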

OTHER TIPS

The shortest answer is that bs4 added .strings, a generator that yields every piece of text inside a tag, including text in nested tags.

Your code could look something like:

for data in soup.find_all('td', attrs={'class': 'yfnc_tabledata1'})[0:6]:
    print('--> ', ''.join(data.strings))

The "\n" characters are preserved, so you can strip and recombine the strings to your liking.
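
One way to do that strip-and-recombine step is plain string handling, sketched below. The helper name normalize_cell_text and the raw_bid sample are hypothetical; raw_bid is only a stand-in for what ''.join(data.strings) might return for the Bid cell, with the newlines and indentation from the page source left in.

```python
def normalize_cell_text(raw):
    """Collapse every run of whitespace (newlines, indentation) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(raw.split())


# Hypothetical raw text for the Bid cell, whitespace included.
raw_bid = "\n    \n        536.55\n     x \n        \n            100\n    \n"

print(normalize_cell_text(raw_bid))
```

bs4 also has .stripped_strings, which yields the same pieces already stripped, so ' '.join(data.stripped_strings) should be an equivalent shortcut.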

Licensed under: CC-BY-SA with attribution
