Question

I have several spiders that I need to run. I am using scrapyd with the default settings, and I managed to schedule my jobs through the scrapyd interface. Everything up to this point is fine, except that the jobs never finish. Every time I check, I find 16 jobs running (4 jobs per CPU × 4 CPUs) and all the other jobs pending, unless I shut down Scrapy.

I also checked the logs, and they say:

2013-09-22 12:20:55+0000 [spider1] INFO: Dumping Scrapy stats:
    {
     'downloader/exception_count': 1,
     'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
     'downloader/request_bytes': 244,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 7886,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 9, 22, 12, 20, 55, 635611),
     'log_count/DEBUG': 7,
     'log_count/INFO': 3,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2013, 9, 22, 12, 20, 55, 270275)}
2013-09-22 12:20:55+0000 [spider1] INFO: Spider closed (finished)

How do you scrape hundreds of spiders using scrapyd?

Edit:

scrapy.cfg:

[settings]
default = myproject.scrapers.settings

[deploy]
url = http://localhost:6800/
project = myproject
version = GIT

[scrapyd]
eggs_dir    = scrapy_dir/eggs
logs_dir    = scrapy_dir/logs
items_dir   = scrapy_dir/items
dbs_dir     = scrapy_dir/dbs
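
Side note on the 16-job ceiling: scrapyd defaults to max_proc = 0 (no absolute cap) and max_proc_per_cpu = 4, which on a 4-CPU machine gives the 16 concurrent processes seen above. If more parallel jobs are wanted once they actually finish, the cap can be raised in this same [scrapyd] section; the values below are only an illustration, not the settings actually used here:

# 0 means no absolute cap; the effective limit is CPUs * max_proc_per_cpu
max_proc         = 0
# the default is 4, which produces the 16 concurrent jobs on a 4-CPU box
max_proc_per_cpu = 8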

Scrapy settings.py:

import os
from django.conf import settings

PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

BOT_NAME = 'scrapers'

SPIDER_MODULES = ['myproject.scrapers.spiders']

DOWNLOADER_MIDDLEWARES = {
    'myproject.scrapers.middlewares.IgnoreDownloaderMiddleware': 50,
}

ITEM_PIPELINES = [
    'myproject.scrapers.pipelines.CheckPipeline',
    'myproject.scrapers.pipelines.CleanPipeline',
    'myproject.contrib.pipeline.images.ImagesPipeline',
    'myproject.scrapers.pipelines.SerializePipeline',
    'myproject.scrapers.pipelines.StatsCollectionPipeline',
]    

DOWNLOAD_DELAY = 0.25

path_to_phatomjs = '/home/user/workspace/phantomjs-1.9.1-linux-x86_64/bin/phantomjs'

IMAGES_STORE = settings.MEDIA_ROOT + '/' + settings.IMAGES_STORE
IMAGES_THUMBS = {
    'small': (70, 70),
    'big': (270, 270),
}

Solution

I tried to post this answer yesterday, as soon as I found the origin of the problem, but something went wrong with my account.

The problem came from the PhantomJS driver: it was preventing scrapyd from finishing the jobs.

At first I was quitting the driver in the __del__ method:

def __del__(self):
    # __del__ only runs when the object is garbage-collected, so the
    # PhantomJS process could stay alive and keep the job from finishing
    self.driver.quit()
    ...

Now I have created a quit_driver function and hooked it to the spider_closed signal:

@classmethod
def from_crawler(cls, crawler):
    # needs: from scrapy import signals
    temp = cls(crawler.stats)
    # quit_driver is called when the spider closes, shutting down PhantomJS
    crawler.signals.connect(temp.quit_driver, signal=signals.spider_closed)
    return temp
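
The body of quit_driver is not shown here; a minimal sketch, assuming the PhantomJS webdriver is kept on the instance as self.driver (as in the __del__ snippet above):

def quit_driver(self, spider):
    # spider_closed handlers receive the spider that was closed;
    # quitting the driver here shuts down the PhantomJS process explicitly
    self.driver.quit()

Quitting from the spider_closed handler shuts PhantomJS down deterministically when the spider ends, instead of relying on garbage collection, which is what was keeping the scrapyd jobs from finishing.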