Question

I am building a project where I need a web crawler which crawls a list of different webpages. This list can change at any time. How is this best implemented with scrapy? Should I create one spider for all websites or dynamically create spiders?

I have read about scrapyd, and I guess that dynamically creating spiders is the best approach. I would need a hint about how to implement it though.


Solution

If the parsing logic is the same for every site, a single spider is enough, and there are two ways to feed it URLs:

  1. For a large number of webpages, keep the URLs in a list, read that list when the spider starts (in the constructor, or in the start_requests method), and assign it to start_urls.
  2. Pass the webpage link to the spider as a command-line argument. In the constructor or in the start_requests method you can access this parameter and assign it to start_urls.

Passing parameters to a spider in Scrapy:

    scrapy crawl spider_name -a start_url=your_url

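With scrapyd, jobs are scheduled by posting to its schedule.json endpoint (for example with curl), so replace -a with -d: every extra -d key=value parameter is passed to the spider as an argument.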
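If you would rather schedule jobs from Python than from curl, something along these lines should work. It is only a sketch: the helper names `build_schedule_request` and `schedule` are made up here, and the default address assumes scrapyd is running locally on its usual port 6800.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(project, spider,
                           scrapyd_url="http://localhost:6800", **spider_args):
    # project and spider are required by schedule.json; any other
    # form field is handed to the spider as an argument.
    params = {"project": project, "spider": spider, **spider_args}
    data = urlencode(params).encode("utf-8")
    return Request(f"{scrapyd_url}/schedule.json", data=data, method="POST")


def schedule(project, spider, **spider_args):
    # Sends the request and returns scrapyd's JSON reply as a dict.
    with urlopen(build_schedule_request(project, spider, **spider_args)) as resp:
        return json.load(resp)
```

Calling `schedule("myproject", "spider_name", start_url="your_url")` for each URL in your list would then queue one job per site; `myproject` is a placeholder for your deployed project name.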

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow