Question

I am new to BeautifulSoup and am trying to use it to grab some test data from NHL.com. Here is my code so far, but I am pretty lost...

Here is a snippet of the HTML code I want to extract data from:

<tr>
    <td rowspan="1" colspan="1"> … </td>
    <td style="text-align: left;" rowspan="1" colspan="1">
        <a href="/ice/player.htm?id=8474564">

            Steven Stamkos

        </a>
    </td>
    <td style="text-align: center;" rowspan="1" colspan="1">
        <a href="javascript:void(0);" rel="TBL" onclick="loadTeamSpotlight(jQuery(this));" style="border-bottom:1px dotted;">

            TBL

        </a>
    </td>
    <td style="text-align: center;" rowspan="1" colspan="1">

        C

    </td>
    <td style="center" rowspan="1" colspan="1">

        16

    </td>
    <td style="center" rowspan="1" colspan="1">

        14

    </td>
    <td style="center" rowspan="1" colspan="1">

        9

    </td>

I would like to extract the data from these fields for every row on the page, which is about 30 table rows. Here is my Python code so far; I'm not really sure where to go from here.

from bs4 import BeautifulSoup
import requests

r  = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")

data = r.text
t_data=[]
soup = BeautifulSoup(data)
table = soup.find('table', {'class': 'data stats'})

I know it isn't much, but I have no idea how to go about this. Thanks for the help, everyone.

EDIT: I solved the problem, and hopefully this will help anyone in the future. Here is my code:

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.nhl.com/ice/playerstats.htm?fetchKey=20142ALLSASAll&viewName=summary&sort=points&pg=1")

player = []
team = []
goals = []
assists = []
points = []
i = 0
data = r.text
soup = BeautifulSoup(data)
table = soup.find('table', {'class': 'data stats'})
for row in table.find_all('tr'):
    cells = row.find_all('td')
    # keep only rows that have the full 19 cells
    if len(cells) == 19:
        player.append(cells[1].find(text=True))
        team.append(cells[2].find(text=True))
        goals.append(cells[5].find(text=True))
        assists.append(cells[6].find(text=True))
        points.append(cells[7].find(text=True))
        print(player[i], team[i], goals[i], assists[i], points[i])
        i = i + 1
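
One caveat for anyone reusing this: in markup formatted like the snippet above, each value sits between newlines and indentation, and find(text=True) returns the first text node exactly as it appears. Below is a minimal offline sketch that uses get_text(strip=True) instead. SAMPLE_HTML is a made-up, trimmed stand-in for the real table (8 cells per row instead of 19, with a hypothetical rank cell and header row), and extract_rows is just an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the stats table: same cell order as the real page,
# but cut down to 8 cells per row and padded with stray whitespace.
SAMPLE_HTML = """
<table class="data stats">
  <thead>
    <tr><th>Rank</th><th>Player</th><th>Team</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td><a href="/ice/player.htm?id=8474564">
          Steven Stamkos
      </a></td>
      <td><a href="javascript:void(0);" rel="TBL">
          TBL
      </a></td>
      <td> C </td>
      <td> 16 </td>
      <td> 14 </td>
      <td> 9 </td>
      <td> 23 </td>
    </tr>
  </tbody>
</table>
"""


def extract_rows(html, min_cells=8):
    """Return (name, team, goals, assists, points) tuples with whitespace removed."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for tr in soup.select("table.data.stats tr"):
        # get_text(strip=True) joins the cell's text nodes and trims whitespace
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < min_cells:
            continue  # header rows contain <th> only, so they yield no <td> cells
        result.append((cells[1], cells[2], cells[5], cells[6], cells[7]))
    return result


sample_rows = extract_rows(SAMPLE_HTML)
print(sample_rows)
```

For the real page the check would stay at 19 cells, as in the code above; the lower threshold here only matches the shortened sample.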

Solution

I just wanted to post another approach, so you don't have to keep several separate lists for data that belongs together. There is also a shorter and more elegant way to select all the rows you want.

# getting data
#...
from bs4 import BeautifulSoup
from collections import namedtuple
soup = BeautifulSoup(data)
# this is where the data is collected
rows = list()
# named tuple to store the relevant data of one player
Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])
# get every row of the tbody in the specified table
for tr in soup.select('table.data.stats tbody tr'):
    # put the text content of each cell in the row into a list
    cellStrings = [cell.find(text=True) for cell in tr.find_all('td')]
    # add the player to the list of rows
    rows.append(
        Player(
            name=cellStrings[1],
            team=cellStrings[2],
            goals=cellStrings[5],
            assists=cellStrings[6],
            points=cellStrings[7]
        )
    )

rows then looks like this:

[Player(name=u'Steven Stamkos', team=u'TBL', goals=u'14', assists=u'9', points=u'23'),
 Player(name=u'Sidney Crosby', team=u'PIT', goals=u'8', assists=u'15', points=u'23'),
 Player(name=u'Ryan Getzlaf', team=u'ANA', goals=u'10', assists=u'12', points=u'22'),
 Player(name=u'Alexander Steen', team=u'STL', goals=u'14', assists=u'7', points=u'21'),
 Player(name=u'Corey Perry', team=u'ANA', goals=u'11', assists=u'10', points=u'21'),
 Player(name=u'Alex Ovechkin', team=u'WSH', goals=u'13', assists=u'7', points=u'20'),
 ....

Individual fields can be accessed by name:

>>> rows[20].name
u'Bryan Little'
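
Every field comes back as text, so the numeric columns need converting before they can be compared or summed. Here is a minimal sketch that reuses the Player tuple and the first three rows of the output shown above. The to_numbers helper and the sample_rows name are made up for illustration, and the sketch assumes each stat is a plain integer string.

```python
from collections import namedtuple

Player = namedtuple('Player', ['name', 'team', 'goals', 'assists', 'points'])

# First three rows of the output above, still as text
sample_rows = [
    Player('Steven Stamkos', 'TBL', '14', '9', '23'),
    Player('Sidney Crosby', 'PIT', '8', '15', '23'),
    Player('Ryan Getzlaf', 'ANA', '10', '12', '22'),
]


def to_numbers(player):
    """Return a copy of the tuple with the three stat fields converted to int."""
    return player._replace(
        goals=int(player.goals),
        assists=int(player.assists),
        points=int(player.points),
    )


numeric_rows = [to_numbers(p) for p in sample_rows]

# Sort by goals, highest first
by_goals = sorted(numeric_rows, key=lambda p: p.goals, reverse=True)
for p in by_goals:
    print(p.name, p.team, p.goals)
```

Named tuples are immutable, so _replace builds a new tuple rather than changing the original. int() raises ValueError on anything that is not a clean integer, so real data may need extra handling.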

OTHER TIPS

You have not said exactly which data you need, but you can proceed along these lines:

from bs4 import BeautifulSoup
...
table = soup.find('table', {'class': 'data stats'})
# find_all returns every row; find would return only the first <tr>
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col.text)
        # some cells (player, team) wrap their text in a link
        link = col.find("a")
        if link:
            print(link.get("href"), link.get("rel"), link.get("onclick"), link.text)
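
The href printed for a player link, such as /ice/player.htm?id=8474564 in the question's snippet, carries the player id as a query parameter. A small sketch for pulling it out with the standard library (Python 3); player_id_from_href is a hypothetical helper name, not part of BeautifulSoup:

```python
from urllib.parse import urlparse, parse_qs


def player_id_from_href(href):
    """Return the value of the 'id' query parameter, or None if there is none."""
    query = parse_qs(urlparse(href).query)
    values = query.get('id')
    return values[0] if values else None


player_id = player_id_from_href('/ice/player.htm?id=8474564')
team_link_id = player_id_from_href('javascript:void(0);')
print(player_id, team_link_id)
```

The team links use a javascript: href with no query string, so the helper returns None for them.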
Licensed under: CC-BY-SA with attribution