Question

I would like to know how to start a crawler based on Scrapy. I installed the tool via apt-get and tried to run an example:

/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy list
directory.google.com

/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy crawl

I hacked the code in spiders/google_directory.py, but it doesn't seem to be executed, because I don't see any of the prints I inserted. I read their documentation but found nothing related to this; do you have any ideas?

Also, if you think I should use other tools for crawling a website, please let me know. I'm not experienced with Python tools, but Python is a must.

Thanks!


Solution

You missed the spider name in the crawl command. Use:

$ scrapy crawl directory.google.com

Also, I suggest you copy the example project to your home directory instead of working in /usr/share/doc/scrapy/examples/, so you can modify it and play with it:

$ cp -r /usr/share/doc/scrapy/examples/googledir ~
$ cd ~/googledir
$ scrapy crawl directory.google.com
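
The name that scrapy crawl expects is the name attribute defined inside the spider class, not the file name. As a rough sketch of what such a spider looks like (a hypothetical example; recent Scrapy versions use scrapy.Spider, while the old packaged releases used BaseSpider from scrapy.spider):

import scrapy

class DirectorySpider(scrapy.Spider):
    # The value you pass to "scrapy crawl" must match this attribute
    name = 'directory.google.com'
    start_urls = ['http://directory.google.com/']

    def parse(self, response):
        # Called once per downloaded page; output here shows up in the console
        self.log('Visited %s' % response.url)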

Other tips

EveryBlock.com released some quality scraping code using lxml, urllib2 and Django as their stack.

Scraperwiki.com is inspirational, full of examples of Python scrapers.

Simple example with cssselect:

from lxml.html import fromstring

dom = fromstring('<div id="navigation"><a href="/home">Home</a></div>')  # sample markup for illustration
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
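
To run the same extraction against a live page, urllib2 (part of the stack mentioned above) can fetch the HTML first; a minimal Python 2 sketch, assuming the target page actually has a #navigation block:

import urllib2
from lxml.html import fromstring

# Fetch a page, parse it, and pull hrefs out of a hypothetical #navigation block
html = urllib2.urlopen('http://example.com/').read()
dom = fromstring(html)
links = [a.get('href') for a in dom.cssselect('#navigation a')]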