Question

I have written my first bit of Python code to scrape a website.

import csv
import urllib2
from BeautifulSoup import BeautifulSoup

c = csv.writer(open("data.csv", "wb"))
soup = BeautifulSoup(urllib2.urlopen('http://www.kitco.com/kitco-gold-index.html').read())
table = soup.find('table', id="datatable_main")
rows = table.findAll('tr')[1:]  # skip the header row

for tr in rows:
    cols = tr.findAll('td')
    text = []
    for td in cols:
        text.append(td.find(text=True))
    c.writerow(text)

It works fine when I test it locally in my IDE (PyCharm), but when I try it on my server, which runs CentOS, I get the following error:

domainname.com [~/public_html/livegold]# python scraper.py
Traceback (most recent call last):
  File "scraper.py", line 8, in <module>
    rows = table.findAll('tr')[:]
AttributeError: 'NoneType' object has no attribute 'findAll'

I'm guessing I don't have a module installed remotely. I've been stuck on this for two days, so any help would be greatly appreciated! :)


Solution

You are ignoring any errors that could occur in urllib2.urlopen. If for some reason you get an error fetching that page on your server that you don't get when testing locally, you are effectively passing an empty string ('') or a page you don't expect (such as a 404 page) to BeautifulSoup.

That, in turn, makes your soup.find('table', id="datatable_main") call return None, since the document is not what you expect.

You should either make sure your server can actually fetch the page you are requesting, or handle the exceptions properly.
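
For example, here is a minimal sketch of the same script with explicit error handling added (the error messages and the use of sys.exit are illustrative, not part of the original code):

import csv
import sys
import urllib2
from BeautifulSoup import BeautifulSoup

URL = 'http://www.kitco.com/kitco-gold-index.html'

try:
    html = urllib2.urlopen(URL).read()
except urllib2.URLError as e:
    # Covers DNS failures, refused connections and HTTP errors alike
    sys.exit("Could not fetch %s: %s" % (URL, e))

soup = BeautifulSoup(html)
table = soup.find('table', id="datatable_main")
if table is None:
    # The request succeeded, but the document is not what we expected
    sys.exit("No table with id 'datatable_main' in the %d bytes received" % len(html))

writer = csv.writer(open("data.csv", "wb"))
for tr in table.findAll('tr')[1:]:
    writer.writerow([td.find(text=True) for td in tr.findAll('td')])

This way the script fails with a message that tells you whether the request or the parsing went wrong, instead of an AttributeError on None.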

OTHER TIPS

There is no table with id datatable_main in the page that the script read.

Try printing the returned page to the terminal; perhaps your script is failing to contact the web server. Some hosting services block outgoing HTTP connections.
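
For example, a quick diagnostic you could run on the server (using the same URL as the question):

import urllib2

# If this raises URLError or hangs, outgoing HTTP is likely being blocked;
# if it prints an unexpected page, the parsing assumptions are what's wrong.
response = urllib2.urlopen('http://www.kitco.com/kitco-gold-index.html')
print response.getcode()     # expect 200
print response.read()[:500]  # first 500 characters of whatever came back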

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow