The problem is that a jsessionid is inserted into the links you are trying to extract, for example:
<a href="/category.sc;jsessionid=EA2CAA7A3949F4E462BBF466E03755B7.m1plqscsfapp05?categoryId=16">
Fix it by using .*? (a non-greedy match for any characters) instead of looking for /?:
rules = [
    Rule(SgmlLinkExtractor(allow=[r'category\.sc.*?categoryId=\d+']), callback='parse_item'),
    Rule(SgmlLinkExtractor(allow=[r'product\.sc.*?productId=\d+&categoryId=\d+']), callback='parse_item'),
]
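To check the category pattern outside of Scrapy, here is a minimal sketch using only Python's re module against the example link above. The old_pattern below is an assumption about what the original rule looked like (an optional slash followed by the query string); it is not taken from your spider.

```python
import re

# URL copied from the example link above
url = "/category.sc;jsessionid=EA2CAA7A3949F4E462BBF466E03755B7.m1plqscsfapp05?categoryId=16"

# Assumed form of the original pattern: only an optional "/" is allowed before "?"
old_pattern = re.compile(r'category\.sc/?\?categoryId=\d+')

# Pattern from the fixed rule: .*? skips over ";jsessionid=..." and the "?"
new_pattern = re.compile(r'category\.sc.*?categoryId=\d+')

old_match = old_pattern.search(url)
new_match = new_pattern.search(url)

print(old_match is not None)  # False: the jsessionid part blocks the match
print(new_match is not None)  # True
```

The same check works for the product pattern if you substitute a product link from the site.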
Hope that helps.