Question

I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2

But how do I schedule all spiders in a project at once?

All help much appreciated!


Solution

My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.

YOURPROJECTNAME/commands/allcrawl.py:

# Written for Python 2 and an older Scrapy API (scrapy.command, scrapy.log,
# urllib2); on recent Scrapy versions the imports live in scrapy.commands and
# the spider list is available via self.crawler_process.spider_loader.list().
from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log

class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        # POST one schedule.json request per spider defined in the project
        for s in self.crawler.spiders.list():
            values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)
            response = urllib2.urlopen(req)
            log.msg(response.read())  # log scrapyd's JSON reply rather than the raw response object

Make sure to include the following in your settings.py (and that the commands/ directory contains an __init__.py so it can be imported as a module):

COMMANDS_MODULE = 'YOURPROJECTNAME.commands'

Then, from the command line (in your project directory), you can simply type:

scrapy allcrawl
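
If you are on Python 3, or you'd rather not wire this into a custom Scrapy command, the same idea works as a small standalone script. Below is a minimal sketch, assuming scrapyd is listening on localhost:6800 and your project is deployed as myproject (both placeholders); it only uses the standard library and scrapyd's listspiders.json and schedule.json endpoints.

#!/usr/bin/env python3
"""Schedule every spider of a deployed project on a local scrapyd."""
import json
from urllib.parse import urlencode
from urllib.request import urlopen, Request

SCRAPYD = 'http://localhost:6800'   # assumption: default scrapyd host/port
PROJECT = 'myproject'               # assumption: your deployed project name

def list_spiders(project):
    """Ask scrapyd which spiders the deployed project contains."""
    query = urlencode({'project': project})
    with urlopen(f'{SCRAPYD}/listspiders.json?{query}') as resp:
        return json.load(resp)['spiders']

def schedule(project, spider):
    """POST a schedule.json request for one spider and return scrapyd's reply."""
    data = urlencode({'project': project, 'spider': spider}).encode()
    with urlopen(Request(f'{SCRAPYD}/schedule.json', data=data)) as resp:
        return json.load(resp)

if __name__ == '__main__':
    for name in list_spiders(PROJECT):
        print(name, schedule(PROJECT, name))

Because it asks scrapyd (rather than the local project) for the spider list, this variant can also be run from a machine that only has network access to the scrapyd service.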

Other Tips

Sorry, I know this is an old topic, but I've started learning Scrapy recently and stumbled on this question; I don't have enough rep yet to post a comment, so I'm posting an answer.

The common Scrapy practices documentation notes that if you need to run many spiders in parallel, you can start multiple scrapyd service instances and distribute your spider runs among them.
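
To illustrate what "distribute your spider runs" could look like, here is a minimal sketch, assuming two scrapyd instances on ports 6800 and 6801 and a deployed project named myproject (the hosts, project name, and spider names are all placeholders); spiders are assigned to instances round-robin via each instance's schedule.json endpoint.

#!/usr/bin/env python3
"""Round-robin spider runs across several scrapyd instances (illustrative sketch)."""
from itertools import cycle
import json
from urllib.parse import urlencode
from urllib.request import urlopen, Request

# Assumptions: these scrapyd services are running and these spiders are deployed.
SCRAPYD_HOSTS = ['http://localhost:6800', 'http://localhost:6801']
PROJECT = 'myproject'
SPIDERS = ['spider1', 'spider2', 'spider3', 'spider4']

def schedule(host, project, spider):
    """Schedule one spider run on the given scrapyd instance."""
    data = urlencode({'project': project, 'spider': spider}).encode()
    with urlopen(Request(f'{host}/schedule.json', data=data)) as resp:
        return json.load(resp)

if __name__ == '__main__':
    # cycle() repeats the host list, so spiders are spread evenly across instances.
    for host, spider in zip(cycle(SCRAPYD_HOSTS), SPIDERS):
        print(host, spider, schedule(host, PROJECT, spider))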

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow