Question

I have a Scrapy crawler written to gather items from http://www.shop.ginakdesigns.com/main.sc:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from .. import items

class GinakSpider(CrawlSpider):
    name = "ginak"
    start_urls = [
        "http://www.shop.ginakdesigns.com/main.sc"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'category\.sc\?categoryId=\d+'])),
        Rule(SgmlLinkExtractor(allow=[r'product\.sc\?productId=\d+&categoryId=\d+']), callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        self.log(response.url)
        item = items.GinakItem()
        item['name'] = sel.xpath('//*[@id="wrapper2"]/div/div/div[1]/div/div/div[2]/div/div/div[1]/div[1]/h2/text()').extract()
        item['price'] = sel.xpath('//*[@id="listPrice"]/text()').extract()
        item['description'] = sel.xpath('//*[@id="wrapper2"]/div/div/div[1]/div/div/div[2]/div/div/div[1]/div[4]/div/p/text()').extract()
        item['category'] = sel.xpath('//*[@id="breadcrumbs"]/a[2]/text()').extract()

        return item

However, it doesn't go beyond the home page into any of the links. I've tried all sorts of things and checked the regular expressions for my SgmlLinkExtractor rules as well. Is anything wrong here?


Solution

The problem is that a jsessionid is inserted into the links you are trying to extract, for example:

<a href="/category.sc;jsessionid=EA2CAA7A3949F4E462BBF466E03755B7.m1plqscsfapp05?categoryId=16">
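You can confirm this from a scrapy shell session; this is a sketch, and the exact jsessionid value and link counts will vary:

$ scrapy shell http://www.shop.ginakdesigns.com/main.sc
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> # The literal \? never matches, because ';jsessionid=...' sits
>>> # between 'category.sc' and the query string:
>>> SgmlLinkExtractor(allow=[r'category\.sc\?categoryId=\d+']).extract_links(response)
[]
>>> # Relaxing the pattern makes the category links show up:
>>> links = SgmlLinkExtractor(allow=[r'category\.sc.*?categoryId=\d+']).extract_links(response)
>>> len(links) > 0
True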

Fix it by using a non-greedy .*? match for any characters in place of the literal \?. Also leave the callback off the first rule, so that CrawlSpider keeps following category links (a Rule with a callback set does not follow links by default):

rules = [Rule(SgmlLinkExtractor(allow=[r'category\.sc.*?categoryId=\d+'])),
         Rule(SgmlLinkExtractor(allow=[r'product\.sc.*?productId=\d+&categoryId=\d+']), callback='parse_item')]
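The difference is also easy to verify offline with Python's re module, using the href quoted above:

import re

href = "/category.sc;jsessionid=EA2CAA7A3949F4E462BBF466E03755B7.m1plqscsfapp05?categoryId=16"

# Original pattern: requires '?' immediately after 'category.sc', so it fails.
print(re.search(r'category\.sc\?categoryId=\d+', href))    # None

# Fixed pattern: '.*?' skips over the ';jsessionid=...' segment.
print(re.search(r'category\.sc.*?categoryId=\d+', href))   # <match object>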

Hope that helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow