How to use Scrapy
04-10-2019
Question
I would like to know how to start a crawler based on Scrapy. I installed the tool via apt-get and tried to run an example:
/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy list
directory.google.com
/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy crawl
I hacked on the code in spiders/google_directory.py, but it doesn't seem to be executed, because none of the prints I inserted show up. I read their documentation but found nothing related to this; do you have any ideas?
Also, if you think I should use other tools for crawling a website, please let me know. I'm not experienced with Python tools, and Python is a must.
Thanks!
Solution
You missed the spider name in the crawl command. Use:
$ scrapy crawl directory.google.com
Also, I suggest you copy the example project to your home directory instead of working in /usr/share/doc/scrapy/examples/ directly, so you can modify it and play with it:
$ cp -r /usr/share/doc/scrapy/examples/googledir ~
$ cd ~/googledir
$ scrapy crawl directory.google.com
Other tips
EveryBlock.com released some quality scraping code using lxml, urllib2 and Django as their stack.
Scraperwiki.com is inspirational, full of examples of Python scrapers.
Simple example with cssselect (the HTML string here is a minimal placeholder):

from lxml.html import fromstring

# Parse an HTML document from a string
dom = fromstring('<html><body><div id="navigation">'
                 '<a href="/home">Home</a></div></body></html>')
# Collect the href of every link inside the #navigation element
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]