Question
I am using Scrapy to crawl thousands of websites. I have a large list of domains to crawl. Everything works fine, except that the crawler follows external links too, which is why it crawls far more domains than necessary. I already tried to use "allow_domains" in the SgmlLinkExtractor, but this does not work when I pass a huge list of domains to it.
So my question: How can I limit a broad scrapy crawl to internal links?
Any ideas are much appreciated.
UPDATE: The problem is caused by an allow_domains list that is too large for Scrapy to handle.
Solution 2
I was able to solve the problem by modifying the SgmlLinkExtractor. I added these two lines before it returns the links:
domain = response.url.replace("http://","").replace("https://","").split("/")[0]
links = [k for k in links if domain in k.url]
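For reference, the same filter can be written as a standalone helper. This is a sketch using only the standard library; the `same_domain_links` name and the `Link`-like objects (anything with a `.url` attribute, as Scrapy's extracted links have) are illustrative, not part of Scrapy:

```python
from urllib.parse import urlparse


def same_domain_links(response_url, links):
    """Keep only links whose URL contains the host of the page they
    came from.

    Mirrors the two-line filter above, but uses urlparse instead of
    string replacement, so it also copes with other schemes and with
    hosts that carry a port.
    """
    domain = urlparse(response_url).netloc
    return [link for link in links if domain in link.url]
```

Dropped into the link extractor right before the links are returned, this keeps the crawl on the same host without touching allow_domains at all.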
Other tips
OffsiteMiddleware is what you should consider using:
class scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware
Filters out Requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host names aren’t in the spider’s allowed_domains attribute.
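The check the middleware performs amounts to the following (a simplified sketch; `is_offsite` is an illustrative name, and the real implementation compiles the allowed domains into a regex rather than looping):

```python
from urllib.parse import urlparse


def is_offsite(request_url, allowed_domains):
    """Return True when the request's host is neither an allowed
    domain nor a subdomain of one, mimicking the filtering that
    OffsiteMiddleware applies to outgoing requests."""
    # Strip any port so "example.com:8080" matches "example.com".
    host = urlparse(request_url).netloc.split(":")[0]
    return not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    )
```

In practice you only need to set allowed_domains on the spider; requests to other hosts are then dropped before they are downloaded. Note that this runs into the same scaling problem mentioned in the update above when the domain list is very large.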