HTTP 500 typically indicates an internal server error. When you're being blocked, you're much more likely to see a 403 or 404 (or perhaps a 302 redirect to a "you've been blocked" page). You're probably visiting links that cause something to break server-side. You should record which request caused the error and try visiting that URL yourself. It could be that the site is simply broken.
OK, I get it, but can you tell me where and how to define an errback function so that I can handle this error and my spider does not finish?
I took a look at SitemapSpider and unfortunately, it does not allow you to specify an errback function, so you're going to have to add support for it yourself. I'm basing this on the source for SitemapSpider.
First, you're going to want to change how sitemap_rules works by adding a function to handle errors:
sitemap_rules = [
    ('/product/', 'parse_product'),
    ('/category/', 'parse_category'),
]
will become:
sitemap_rules = [
    ('/product/', 'parse_product', 'error_handler'),
    ('/category/', 'parse_category', 'error_handler'),
]
Next, in __init__, you want to store the new callback in _cbs.
for r, c in self.sitemap_rules:
    if isinstance(c, basestring):
        c = getattr(self, c)
    self._cbs.append((regex(r), c))
will become:
# Note: basestring exists only on Python 2; on Python 3 use str instead.
for r, c, e in self.sitemap_rules:
    if isinstance(c, basestring):
        c = getattr(self, c)
    if isinstance(e, basestring):
        e = getattr(self, e)
    self._cbs.append((regex(r), c, e))
Finally, at the end of _parse_sitemap, you can specify your new errback function:
elif s.type == 'urlset':
    for loc in iterloc(s):
        for r, c in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c)
                break
will become:
elif s.type == 'urlset':
    for loc in iterloc(s):
        for r, c, e in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c, errback=e)
                break
From there, simply implement your errback function (keep in mind that it takes a Twisted Failure as an argument) and you should be good to go.
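For what it's worth, here is a rough sketch of what such an errback might look like. The names summarize_failure and ErrorHandlerMixin are made up for illustration; only error_handler matches the name used in sitemap_rules above. It assumes that for non-2xx responses the Failure wraps Scrapy's HttpError, which carries the response on failure.value.response, and that Scrapy attaches the original request as failure.request for failures such as DNS errors or timeouts. Attributes are read with getattr so the sketch does not depend on a particular Scrapy version.

```python
import logging

logger = logging.getLogger(__name__)


def summarize_failure(failure):
    """Return (url, status) for a failed request.

    Hypothetical helper: status is None when no response arrived at all
    (for example a DNS failure or a timeout).
    """
    # Assumption: for non-2xx responses, failure.value is an HttpError
    # exception that exposes the offending response as .response.
    response = getattr(failure.value, "response", None)
    if response is not None:
        return response.url, response.status
    # Assumption: otherwise the original request is available as
    # failure.request.
    request = getattr(failure, "request", None)
    return getattr(request, "url", None), None


class ErrorHandlerMixin:
    """Hypothetical mixin; combine it with your patched SitemapSpider."""

    def error_handler(self, failure):
        # Log the failure instead of raising, so the crawl moves on
        # to the next URL.
        url, status = summarize_failure(failure)
        logger.warning("Request failed: url=%s status=%s error=%r",
                       url, status, failure.value)
```

Logging the URL and status this way also gives you the list of failing requests to retry by hand, as suggested at the top.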