Question

I would like to know how to start a crawler based on Scrapy. I installed the tool via apt-get and tried to run an example:

/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy list
directory.google.com

/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy crawl

I hacked the code in spiders/google_directory.py, but it doesn't seem to be executed, because I don't see any of the prints I inserted. I read their documentation but found nothing related to this. Do you have any ideas?

Also, if you think I should use other tools for crawling a website, please let me know. I'm not experienced with Python tools, and Python is a must.

Thanks!


Solution

You missed the spider name in the crawl command. Use:

$ scrapy crawl directory.google.com
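
The name that scrapy crawl (and scrapy list) matches is the spider's name attribute. As a rough sketch of what the example spider looks like (the actual google_directory.py differs; BaseSpider is the spider base class from Scrapy of that era, and the parse body here is illustrative):

from scrapy.spider import BaseSpider

class GoogleDirectorySpider(BaseSpider):
    # "scrapy crawl <name>" and "scrapy list" match this attribute
    name = 'directory.google.com'
    start_urls = ['http://directory.google.com/']

    def parse(self, response):
        # Callback invoked for each downloaded page; prints added
        # here show up in the crawl output
        print response.url

If your prints are inside a callback like parse, they only run once the crawl command is given a spider name and pages actually get downloaded.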

Also, I suggest copying the example project to your home directory instead of working in /usr/share/doc/scrapy/examples/, so you can modify and play with it:

$ cp -r /usr/share/doc/scrapy/examples/googledir ~
$ cd ~/googledir
$ scrapy crawl directory.google.com

OTHER TIPS

EveryBlock.com released some quality scraping code using lxml, urllib2 and Django as their stack.
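
A minimal sketch of that kind of stack (Python 2's urllib2 plus lxml; the URL is just a placeholder):

import urllib2
from lxml.html import fromstring

# download a page and parse it into an element tree
html = urllib2.urlopen('http://example.com/').read()
dom = fromstring(html)
# extract the target of every link on the page
links = [a.get('href') for a in dom.cssselect('a')]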

Scraperwiki.com is inspirational, full of examples of Python scrapers.

Simple example with cssselect:

from lxml.html import fromstring

# parse the HTML source into an element tree
dom = fromstring('<html... ...')
# collect the href of every <a> inside the element with id="navigation"
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]