Question

I'm having a problem scraping a table from an HTML page. It is actually 3 tables inside a bigger table. I'm using BS4 and it works fine up to the point of finding all the 'td' tags, but when I try to print the info I need, the program stops at the end of the first table and shows this error message:

"IndexError: list index out of range"

import re
import urllib2
from bs4 import BeautifulSoup

url = 'http://trackinfo.com/entries-alphabetical.jsp?raceid13=GBR$20140314A'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)


for tr in soup.find_all('tr')[2:]:
  tds = tr.find_all('td')
  print tds[0].text, tds[1].text

Any ideas how to fix it?

Solution

Your loop assumes that every tr element it finds contains at least 2 td elements. If any tr element contains fewer than 2, indexing tds[0] or tds[1] raises an IndexError.
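
To see which rows would trigger the error, it can help to count the cells per row first. The sketch below is written for Python 3 and runs on a small inline HTML string; the markup and the name SAMPLE_HTML are made up for illustration and are not taken from the trackinfo.com page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a header row built from <th> cells and a one-cell row.
SAMPLE_HTML = """
<table>
  <tr><th>Runner</th><th>Race</th></tr>
  <tr><td>ALL SHOOK UP</td><td>11</td></tr>
  <tr><td>single cell</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Number of <td> cells in each row; any row with fewer than 2 would break tds[1].
cell_counts = [len(tr.find_all("td")) for tr in soup.find_all("tr")]
print(cell_counts)
```

Any row whose count is below 2 is one the original loop cannot index safely.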

Try changing the loop to something like this:

for tr in soup.find_all('tr')[2:]:
  tds = tr.find_all('td')
  if len(tds) >= 2:
    print tds[0].text, tds[1].text

The check for 2 or more td elements is specific to the page you are parsing, and I assume you want the two values printed together. A more general solution that prints every cell could be:

for tr in soup.find_all('tr')[2:]:
  for td in tr.find_all('td'):
    print td.text
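
For reference, the guarded loop could be wrapped in a small helper along these lines. This is only a Python 3 sketch on made-up markup: extract_pairs and SAMPLE_HTML are hypothetical names, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Runner</th><th>Race</th></tr>
  <tr><td> ALL SHOOK UP </td><td>11</td></tr>
  <tr><td>footer note</td></tr>
  <tr><td>ARLINGTON ADIE</td><td>9</td></tr>
</table>
"""


def extract_pairs(html):
    """Return (first cell, second cell) for every row with at least two <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 2:
            continue  # header or one-cell row: nothing safe to index
        pairs.append((tds[0].get_text(strip=True), tds[1].get_text(strip=True)))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
for runner, race in pairs:
    print(runner, race)
```

Returning a list instead of printing inside the loop makes it easier to check which rows were skipped.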

Other suggestions

The idea is to iterate over the tables inside the top-level table, then for each table iterate over its rows (skipping the first row, which holds the column titles):

import urllib2
from bs4 import BeautifulSoup


url = 'http://trackinfo.com/entries-alphabetical.jsp?raceid13=GBR$20140314A'
soup = BeautifulSoup(urllib2.urlopen(url))

for index, table in enumerate(soup.find('table').find_all('table')):
    print "Table #%d" % index
    for tr in table.find_all('tr')[1:]:
        tds = tr.find_all('td')
        print "Runner: %s, Race: %s" % (tds[0].text.strip(), tds[1].text.strip())

prints:

Table #0
Runner: ALL SHOOK UP, Race: 11
Runner: ARLINGTON ADIE, Race: 9
Runner: BARTS BIKERCHICK, Race: 10
Runner: BARTS GAME DAY, Race: 4
Runner: BARTS SIR PRIZE, Race: 7
Runner: BJ'S PIZAZZ, Race: 7
Runner: BOC'S BAMA BOY, Race: 14
Runner: BOC'S BRADBERRY, Race: 2
Runner: BOC'S CRIMSNTIDE, Race: 9
...

Also note that you can pass the result of urllib2.urlopen(url) directly to the BeautifulSoup constructor; it accepts a file-like object and calls read() on it under the hood.
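
A quick way to check the file-like behaviour without touching the network is to hand BeautifulSoup an in-memory stream. In this Python 3 sketch, io.StringIO stands in for the response object and the one-row markup is invented for the example; in Python 3 the urllib2.urlopen call itself becomes urllib.request.urlopen.

```python
import io

from bs4 import BeautifulSoup

# io.StringIO stands in for the HTTP response; it only needs a read() method.
fake_response = io.StringIO(
    "<table><tr><td>BARTS GAME DAY</td><td>4</td></tr></table>"
)

# No explicit .read() call: BeautifulSoup reads the stream itself.
soup = BeautifulSoup(fake_response, "html.parser")
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)
```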

Hope that helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow