Question

The CONCURRENT_ITEMS section at http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items defines it as:

Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).

This confuses me. Does it mean that items sent to the pipeline are processed in parallel, i.e. truly multiprocessed?

Suppose my parsing involves a lot of lxml querying and XPath evaluation. Should I do that in the spider's parse method itself, or should I send an Item containing the whole response and let custom pipeline classes populate the Item's fields by parsing the response body?

Solution 2

The Requests system also works in parallel; see http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests. Scrapy is designed to handle requesting and parsing in the spider itself. The callback methods make it asynchronous, and by default multiple Requests are indeed in flight at once.

The item pipeline does process items in parallel, but it isn't intended for heavy parsing. It is meant to check and validate the values you extracted into each item (see http://doc.scrapy.org/en/latest/topics/item-pipeline.html).

Therefore you should run your queries in the spider itself, since that is where extraction is designed to happen. From the docs on spiders:

Spiders are classes which define how a certain site (or group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
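To illustrate the split described above, here is a minimal sketch of a pipeline that only checks and tidies values the spider has already extracted. The class name and the field names (`title`, `url`) are hypothetical, and the fallback `DropItem` exists only so the snippet runs without Scrapy installed.

```python
try:
    # Real Scrapy exception used to discard an item in a pipeline.
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so this sketch runs without Scrapy installed."""


class RequiredFieldsPipeline:
    """Validate items; the heavy XPath work is assumed to happen in the spider."""

    # Hypothetical field names, chosen only for illustration.
    required_fields = ("title", "url")

    def process_item(self, item, spider):
        # Reject items where the spider failed to extract a required value.
        missing = [name for name in self.required_fields if not item.get(name)]
        if missing:
            raise DropItem(f"missing fields: {missing}")
        # Light cleanup only, no parsing of the response body here.
        item["title"] = item["title"].strip()
        return item
```

A pipeline like this stays cheap per item, so the concurrency limit on item processing rarely becomes the bottleneck.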

Other tips

The CONCURRENT_ITEMS setting limits the concurrent activity when processing items from the spider output. By concurrent activity, I mean what Twisted (the underlying framework used by Scrapy) will do concurrently, which is usually things like network requests.
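As a rough analogy for this kind of capped, single-threaded concurrency, here is a small sketch using the standard library's asyncio rather than Twisted. It is not how Scrapy implements the setting; the function name and the `limit` parameter are made up for illustration.

```python
import asyncio


async def process_items(items, limit):
    """Process items concurrently on one thread, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def process(item):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            # Stand-in for non-blocking I/O such as a database write.
            await asyncio.sleep(0.01)
            in_flight -= 1
            return item

    # gather() returns results in the same order as the inputs.
    results = await asyncio.gather(*(process(i) for i in items))
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(process_items(list(range(6)), limit=2))
    print(results, peak)
```

The overlap only happens while a task is waiting on I/O. Replace the sleep with CPU-heavy parsing and the tasks would simply run one after another.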

Scrapy does not use multithreading for this and will not use more than one core. If your spider is CPU-bound, the usual way to speed it up is to run multiple separate Scrapy processes, which avoids any bottleneck from the Python GIL.
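One way to set that up, sketched under assumptions: split the URLs into shards and start one `scrapy crawl` process per shard. The spider name and the `url_list` spider argument are hypothetical (the spider would have to split that string itself); `-a NAME=VALUE` is the standard way to pass spider arguments on the command line.

```python
import subprocess


def shard(urls, n):
    """Split urls into n round-robin shards."""
    return [urls[i::n] for i in range(n)]


def build_commands(spider_name, urls, n_processes):
    """Build one `scrapy crawl` command line per non-empty shard."""
    commands = []
    for chunk in shard(urls, n_processes):
        if not chunk:
            continue
        # `url_list` is a hypothetical argument the spider must parse itself.
        commands.append(
            ["scrapy", "crawl", spider_name, "-a", "url_list=" + ",".join(chunk)]
        )
    return commands


def launch(commands):
    """Start all processes, then wait for them and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Calling `launch(build_commands("myspider", urls, 4))` would then run four independent crawls, each free to use its own core.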

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow