Question

I am using scrapyd to run multiple spiders as jobs across the same domain. I assumed Scrapy kept a hash table of visited URLs that it shared and coordinated across spiders while crawling. When I create instances of the same spider with

curl http://localhost:6800/schedule.json -d project=projectname -d spider=spidername

the instances all crawl the same URLs and scrape duplicate data. Has anyone dealt with a similar problem before?


Solution

My advice would be to divide the site into multiple sets of start_urls. You can then pass a different set of start_urls to each spider.
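
One way to do this with scrapyd is to pass the slice of start URLs as a spider argument, since schedule.json forwards any extra POST parameters to the spider as keyword arguments. Below is a minimal sketch; the spider name, project name, and the comma-separated start_urls argument are placeholders to adapt to your project.

import scrapy


class PartitionedSpider(scrapy.Spider):
    name = "spidername"

    def __init__(self, start_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapyd passes extra POST parameters as string spider arguments,
        # so a comma-separated list of URLs is a convenient format.
        if start_urls:
            self.start_urls = start_urls.split(",")

    def parse(self, response):
        # ... extract items for this slice of the site ...
        yield {"url": response.url}

Each job is then scheduled with its own slice, for example:

curl http://localhost:6800/schedule.json -d project=projectname -d spider=spidername -d start_urls=http://example.com/section-a,http://example.com/section-b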

If you want to get particularly fancy (or if the pages you want to crawl change on a regular basis), you could create a spider that crawls the sitemap, divides the links up into n chunks, and then starts n other spiders to actually crawl the site...
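
As a rough sketch of that idea (using a plain script that calls scrapyd's schedule.json rather than a full Scrapy spider), assuming scrapyd runs on localhost:6800, the spider above accepts a comma-separated start_urls argument, and the sitemap URL and chunk count are placeholders:

import requests
from scrapy.utils.sitemap import Sitemap

SCRAPYD_URL = "http://localhost:6800/schedule.json"
N_CHUNKS = 4  # number of spider jobs to start


def schedule_chunks(sitemap_url):
    # Fetch and parse the sitemap to collect the page URLs.
    body = requests.get(sitemap_url).content
    urls = [entry["loc"] for entry in Sitemap(body)]
    # Split the URLs into N_CHUNKS roughly equal slices.
    chunks = [urls[i::N_CHUNKS] for i in range(N_CHUNKS)]
    # Schedule one scrapyd job per non-empty slice.
    for chunk in chunks:
        if chunk:
            requests.post(SCRAPYD_URL, data={
                "project": "projectname",
                "spider": "spidername",
                "start_urls": ",".join(chunk),
            })


if __name__ == "__main__":
    schedule_chunks("http://example.com/sitemap.xml")

Re-running this script on a schedule picks up sitemap changes and redistributes the work across the n jobs.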

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow