Question
I am using Scrapy to crawl thousands of websites. I have a large list of domains to crawl. Everything works fine, except that the crawler follows external links too, which is why it crawls far more domains than necessary. I already tried to use "allow_domains" in the SgmlLinkExtractor, but this does not work when I pass a huge list of domains to it.
So my question: How can I limit a broad scrapy crawl to internal links?
Any ideas are much appreciated.
UPDATE: The problem is caused by an allow_domains list that is too large for Scrapy to handle.
Solution 2
I was able to solve the problem by modifying the SgmlLinkExtractor. I added these two lines before it returns the links:
domain = response.url.replace("http://","").replace("https://","").split("/")[0]
links = [k for k in links if domain in k.url]
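For reference, the same filter can be written as a standalone helper. This is a sketch using only the standard library; the `same_domain_links` name and the `Link`-like objects (anything with a `.url` attribute, as Scrapy's extracted links have) are illustrative, not part of Scrapy:

```python
from urllib.parse import urlparse


def same_domain_links(response_url, links):
    """Keep only links whose URL contains the host of the page they
    came from.

    Mirrors the two-line filter above, but uses urlparse instead of
    string replacement, so it also copes with other schemes and with
    hosts that carry a port.
    """
    domain = urlparse(response_url).netloc
    return [link for link in links if domain in link.url]
```

Dropped into the link extractor right before the links are returned, this keeps the crawl on the same host without touching allow_domains at all.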
Other tips
OffsiteMiddleware is what you should consider using:
class scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware
Filters out Requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host names aren’t in the spider’s allowed_domains attribute.
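The check the middleware performs amounts to the following (a simplified sketch; `is_offsite` is an illustrative name, and the real implementation compiles the allowed domains into a regex rather than looping):

```python
from urllib.parse import urlparse


def is_offsite(request_url, allowed_domains):
    """Return True when the request's host is neither an allowed
    domain nor a subdomain of one, mimicking the filtering that
    OffsiteMiddleware applies to outgoing requests."""
    # Strip any port so "example.com:8080" matches "example.com".
    host = urlparse(request_url).netloc.split(":")[0]
    return not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    )
```

In practice you only need to set allowed_domains on the spider; requests to other hosts are then dropped before they are downloaded. Note that this runs into the same scaling problem mentioned in the update above when the domain list is very large.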