Question

I have to crawl a website with Scrapy, but I need to pass a cookie to bypass the first page (which is a kind of login page where you choose your location).

I read on the web that you need a base Spider (not a CrawlSpider) to do this, but I need a CrawlSpider to do my crawling, so what should I do?

Run a base Spider first, then launch my CrawlSpider? But I don't know whether the cookie would be passed between them, or how to launch one spider from another.

How do I handle the cookie? I tried this:

from scrapy.http import Request

def start_requests(self):
    yield Request(url='http://www.auchandrive.fr/drive/St-Quentin-985/',
                  cookies={'auchanCook': '"985|"'})

But it's not working.

My answer should be here, but the guy is really evasive and I don't know what to do.


Solution

First, you need to enable cookies in your settings.py file:

COOKIES_ENABLED = True
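
Optionally, Scrapy's cookie middleware also has a debug switch that logs every Cookie header sent and every Set-Cookie header received, which is handy for checking that auchanCook really goes out:

COOKIES_DEBUG = True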

Here is my test spider code for your reference. I tested it and it passed:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy import log

class Stackoverflow23370004Spider(CrawlSpider):
    name = 'auchandrive.fr'
    allowed_domains = ["auchandrive.fr"]

    target_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"

    def start_requests(self):
        # Set the shop cookie on the very first request
        yield Request(self.target_url, cookies={'auchanCook': "985|"},
                      callback=self.parse_page)

    def parse_page(self, response):
        # With the cookie accepted we stay on the shop page instead of
        # being redirected back to the shop-selection page
        if 'St-Quentin-985' in response.url:
            self.log("Passed : %r" % response.url, log.DEBUG)
        else:
            self.log("Failed : %r" % response.url, log.DEBUG)

You can run this command to test it and watch the console output:

scrapy crawl auchandrive.fr
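
If the spider still ends up on the shop-selection page, the COOKIES_DEBUG switch mentioned above can also be flipped for a single run from the command line:

scrapy crawl auchandrive.fr -s COOKIES_DEBUG=True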

OTHER TIPS

I noticed that in your code snippet you were using cookies={'auchanCook': '"985|"'} instead of cookies={'auchanCook': "985|"}; the extra double quotes become part of the cookie value.
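
To make the difference concrete, here is a minimal sketch; the comments show the Cookie header I would expect Scrapy to send, assuming the value is passed through verbatim:

# The outer quotes only delimit the Python string literal; any inner quotes
# become part of the cookie value itself and reach the server verbatim.
cookies = {'auchanCook': '"985|"'}  # expected: Cookie: auchanCook="985|"
cookies = {'auchanCook': "985|"}    # expected: Cookie: auchanCook=985|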

This should get you started:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request


class AuchanDriveSpider(CrawlSpider):
    name = 'auchandrive'
    allowed_domains = ["auchandrive.fr"]

    # pseudo-start_url
    begin_url = "http://www.auchandrive.fr/"

    # start URL used for shop selection
    select_shop_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"

    rules = (
        # follow the category links in the header menu
        Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@class="header-menu"]',))),
        # extract product links and parse the product pages
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "vignette-content")]',)),
             callback='parse_product'),
    )

    def start_requests(self):
        yield Request(self.begin_url, callback=self.select_shop)

    def select_shop(self, response):
        # Re-request the shop page with the shop cookie set; the response
        # goes to CrawlSpider's default parse(), which applies the rules
        return Request(url=self.select_shop_url, cookies={'auchanCook': "985|"})

    def parse_product(self, response):
        self.log("parse_product: %r" % response.url)

Pagination might be tricky.
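
If you do need it, one option is an extra rule that follows the "next page" links. This is only a sketch and the XPath is an assumption; adapt it to the site's real markup, then add the rule to the rules tuple above:

# Hypothetical pagination rule: follow "next page" links and keep applying
# the other rules on the pages it reaches. The class name is made up.
Rule(SgmlLinkExtractor(restrict_xpaths=('//a[contains(@class, "pagination-next")]',)),
     follow=True),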

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow