Question

I am using Scrapy to crawl a product website with over 4 million products. After crawling around 50k products, however, it starts throwing HTTP 500 errors. I have disabled AutoThrottle because with it enabled the crawl is very slow and would take around 20-25 days to finish. I suspect the server starts temporarily blocking the crawler after a while. Any suggestions on what can be done? I am using the sitemap crawler. If the server stops responding for a URL, I want to extract some information from the URL itself and proceed to the next one instead of the crawl finishing and the spider closing. For that I was looking at the errback parameter of Request. However, since I am using the sitemap crawler, I don't explicitly create Request objects. Is there a default errback function that I can override, or where can I define one?
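
For context, with a hand-built request the errback just goes next to the callback, roughly like this minimal sketch (spider name, URL, and method names are placeholders, assuming a reasonably recent Scrapy):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'  # placeholder name

    def start_requests(self):
        # errback fires on download problems and on error responses
        # (e.g. the HTTP 500s above) rejected by HttpErrorMiddleware
        yield scrapy.Request('http://example.com/product/123',  # placeholder URL
                             callback=self.parse_product,
                             errback=self.handle_error)

    def parse_product(self, response):
        pass

    def handle_error(self, failure):
        # failure is a twisted.python.failure.Failure
        self.log(repr(failure))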

One more way to do it is described here: Scrapy:In a request fails (eg 404,500), how to ask for another alternative request?

Solution

HTTP 500 typically indicates an internal server error. When getting blocked, it is much more likely you'd see a 403 or 404 (or perhaps a 302 redirect to a "you've been blocked" page). You're probably visiting links that cause something to break server-side. You should store which request caused the error and try visiting it yourself; it could be that the site is simply broken.

OK, I get it, but can you tell me where and how to define the errback function so that I can handle this error and my spider does not finish?

I took a look at SitemapSpider and unfortunately, it does not allow you to specify an errback function, so you're going to have to add support for it yourself. I'm basing this on the source for SitemapSpider.

First, you're going to want to change how sitemap_rules works by adding a function to handle errors:

sitemap_rules = [
    ('/product/', 'parse_product'),
    ('/category/', 'parse_category'),
]

will become:

sitemap_rules = [
    ('/product/', 'parse_product', 'error_handler'),
    ('/category/', 'parse_category', 'error_handler'),
]

Next, in __init__, you want to store the new errback in _cbs alongside the callback.

for r, c in self.sitemap_rules:
    if isinstance(c, basestring):
        c = getattr(self, c)
    self._cbs.append((regex(r), c))

will become:

for r, c, e in self.sitemap_rules:
    if isinstance(c, basestring):
        c = getattr(self, c)
    if isinstance(e, basestring):
        e = getattr(self, e)
    self._cbs.append((regex(r), c, e))

Finally, at the end of _parse_sitemap, you can specify your new errback function:

elif s.type == 'urlset':
    for loc in iterloc(s):
        for r, c in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c)
                break

will become:

elif s.type == 'urlset':
    for loc in iterloc(s):
        for r, c, e in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c, errback=e)
                break

From there, simply implement your errback function (keep in mind that it takes a Twisted Failure as an argument) and you should be good to go.
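
For example, an error handler along the lines of what the question asks for (salvage what it can from the failing URL and keep crawling) might look like the sketch below, assuming the SitemapSpider changes above are in place. The spider name, URLs, URL pattern, and item fields are illustrative, and yielding plain dicts as items assumes Scrapy 1.0 or later (use an Item class on older versions):

import re

from scrapy.contrib.spiders import SitemapSpider  # scrapy.spiders in newer releases


class ProductSitemapSpider(SitemapSpider):
    name = 'products'  # illustrative
    sitemap_urls = ['http://example.com/sitemap.xml']  # illustrative
    sitemap_rules = [
        ('/product/', 'parse_product', 'error_handler'),
        ('/category/', 'parse_category', 'error_handler'),
    ]

    def parse_product(self, response):
        pass  # normal parsing for pages that respond

    def parse_category(self, response):
        pass

    def error_handler(self, failure):
        # failure is a twisted.python.failure.Failure. For HTTP error
        # responses (e.g. 500) the HttpError exception carries the response;
        # for lower-level failures the original request is usually attached.
        response = getattr(failure.value, 'response', None)
        request = getattr(failure, 'request', None)
        url = response.url if response is not None else (
            request.url if request is not None else '<unknown>')
        self.log("Request failed (%r): %s" % (failure.value, url))

        # Salvage what we can from the URL itself and keep crawling,
        # e.g. a product id embedded in the path (illustrative pattern).
        match = re.search(r'/product/([^/?#]+)', url)
        if match:
            yield {
                'product_id': match.group(1),
                'url': url,
                'recovered_from_url_only': True,
            }

Because the errback only fires on failed requests, the normal parse callbacks stay untouched and the crawl simply moves on to the next sitemap URL instead of closing the spider.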

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow