Getting the same URL for every item instead of the matching one in Scrapy

https://stackoverflow.com/questions/23437729

14-07-2023

Question

I am scraping several fields from this website with the spider below. The problem is with the URLs: I get one URL that is applied to all 16 models on a page, then another URL that is again applied to all 16 models. I can't work out what is wrong with the URL XPath. Could you point out the flaw? Thanks. P.S. The other fields work fine and match their models. The missing price fields belong to out-of-stock models.

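# Import paths as used by Scrapy 0.x (the era of SgmlLinkExtractor).
from urlparse import urljoin  # Python 2; on Python 3 this is urllib.parse

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector

# ZoomerItem comes from the project's items module (import not shown in the question).


class ZoomSpider(CrawlSpider):
    name = "zoom2"
    allowed_domains = ["zoomer.ge"]
    start_urls = [
        "http://zoomer.ge/index.php?cid=35&act=search&category=1&search_type=mobile"
    ]

    # Follow the paginated search result pages and parse each one.
    rules = (
        Rule(
            SgmlLinkExtractor(allow=(r"index.php\?cid=35&act=search&category=1&search_type=mobile&page=\d*",)),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//div[@class="productContainer"]/div[5]/div[@class="productListContainer"]')
        items = []
        for t in titles:
            item = ZoomerItem()
            url = sel.xpath('//div[@class="productListImage"]/a/@href').extract()
            item["brand"] = t.xpath('div[3]/text()').re(r'^([\w\-]+)')
            item["price"] = t.xpath('div[@class="productListPrice"]/div/text()').extract()
            item["model"] = t.xpath('div[3]/text()').re(r'\s+(.*)$')[0].strip()
            item["url"] = urljoin("http://zoomer.ge", url[0])

            items.append(item)

        return items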


Solution

You need to use relative XPaths. With your XPath you always get the first link on each page. You should use:

t.xpath('.//div[@class="productListImage"]/a/@href').extract()

Note the dot at the beginning. XPaths should be relative to a specific selector; in your case that is 't' in the for loop.

This is a pretty common mistake; it is described in the Scrapy docs.
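To see the effect without running a crawl, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy. The markup, the class "title" and the product URLs are made up for illustration. ElementTree does not accept an absolute `//` path on an element, so querying the document root inside the loop stands in for the absolute XPath from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the product listing page.
PAGE = """
<div class="productContainer">
  <div class="productListContainer">
    <div class="productListImage"><a href="/phone-a">img</a></div>
    <div class="title">Acme A1</div>
  </div>
  <div class="productListContainer">
    <div class="productListImage"><a href="/phone-b">img</a></div>
    <div class="title">Acme B2</div>
  </div>
</div>
"""

root = ET.fromstring(PAGE)
products = root.findall(".//div[@class='productListContainer']")

# Mistake: the query runs against the whole document on every iteration,
# so each product picks up the first link on the page.
wrong = [
    root.findall(".//div[@class='productListImage']/a")[0].get("href")
    for _ in products
]

# Fix: the query runs against the current product node `t` only.
right = [
    t.findall(".//div[@class='productListImage']/a")[0].get("href")
    for t in products
]

print(wrong)
print(right)
```

In the spider, the same idea means calling t.xpath('.//...') rather than sel.xpath('//...') inside the loop.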

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow