Question

I am trying to crawl a website that requires authentication.

The problem I am facing is that the login works fine the first time and I get the "Successfully logged in" log message, but when the crawler starts crawling pages from the start_url, the CSV output does not capture the pages that require the login credentials to view their data.

Am I missing something needed to retain the login session throughout the process, or some check that verifies whether each URL requires login before continuing?

My login form is a POST form, and the output is as follows:

2014-02-28 21:16:53+0000 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-28 21:16:53+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-02-28 21:16:53+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-02-28 21:16:53+0000 [myspider] DEBUG: Crawled (200) <GET https://someurl.com/login_form> (referer: None)
2014-02-28 21:16:53+0000 [myspider] DEBUG: Crawled (200) <GET https://someurl.com/search> (referer: https://someurl.com/login_form)
2014-02-28 21:16:53+0000 [myspider] DEBUG: Successfully logged in. Start crawling!

It automatically goes to the search page on the first hit instead of the login_form page (the start_url).

Can anyone please help me out with this?

Below is my code:

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import urlparse
from scrapy import log


class MySpider(CrawlSpider):

        name = 'myspider'
        allowed_domains = ['someurl.com']
        login_page = 'https://someurl.com/login_form'
        start_urls = 'https://someurl.com/'

        rules = [Rule(SgmlLinkExtractor(), follow=True, callback='parse_item')]

        def start_requests(self):

            yield Request(
                url=self.login_page,
                callback=self.login,
                dont_filter=True
            )


        def login(self, response):
            """Generate a login request."""
            return FormRequest.from_response(response,
                    formdata={'__ac_name': 'username', '__ac_password': 'password' },
                    callback=self.check_login_response)


        def check_login_response(self, response):
            if "Sign Out" in response.body:
                self.log("Successfully logged in. Start Crawling")
                return Request(url=self.start_urls)
            else:
                self.log("Not Logged in")


        def parse_item(self, response):

            # Scrape data from page
            items = []
            failed_urls = []
            hxs = HtmlXPathSelector(response)

            urls = hxs.select('//base/@href').extract()
            urls.extend(hxs.select('//link/@href').extract())
            urls.extend(hxs.select('//a/@href').extract())
            urls = list(set(urls))

            for url in urls :

                item = DmozItem()

                if response.status == 404:
                    failed_urls.append(response.url)
                    self.log('failed_url : %s' % failed_urls)
                    item['failed_urls'] = failed_urls
                else :

                    if url.startswith('http') :
                        if url.startswith('https://someurl.com'):
                            item['internal_link'] = url
                            self.log('internal_link : %s ' % url)
                        else :
                            item['external_link'] = url
                            self.log('external_link : %s ' % url)

                items.append(item)

            items = list(set(items))
            return items

No correct solution

Other tips

You can pass authentication credentials in Scrapy using FormRequest, like this:

scrapy.FormRequest(
    self.start_urls[0],
    formdata={'LoginForm[username]': username_scrapy,
              'LoginForm[password]': password_scrapy,
              'yt0': 'Login'},
    headers=self.headers)

LoginForm[username] and LoginForm[password] are the variables passed via the login form.
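
Putting that tip together, a minimal self-contained spider might look like the sketch below, assuming a reasonably recent Scrapy version where scrapy.Spider and scrapy.Request are available at the top level. It is only a sketch: the spider name, the URLs, the form field names (copied from the snippet above), and the "Sign Out" success marker are placeholder assumptions you would replace with the real site's values.

import scrapy
from scrapy.http import FormRequest


class LoginExampleSpider(scrapy.Spider):
    # Placeholder name and URLs; substitute the real site's values.
    name = 'login_example'
    start_urls = ['https://someurl.com/login_form']

    def parse(self, response):
        # Submit the login form found on the start URL. The field names
        # below are assumptions taken from the snippet above; inspect the
        # real form's HTML to find the correct ones.
        return FormRequest.from_response(
            response,
            formdata={
                'LoginForm[username]': 'your_username',
                'LoginForm[password]': 'your_password',
                'yt0': 'Login',
            },
            callback=self.after_login)

    def after_login(self, response):
        # Scrapy's cookies middleware keeps the session cookie, so later
        # requests made by this spider stay logged in.
        if b'Sign Out' in response.body:
            self.log('Login succeeded')
            yield scrapy.Request('https://someurl.com/search',
                                 callback=self.parse_search)
        else:
            self.log('Login failed')

    def parse_search(self, response):
        # Scrape the authenticated pages here.
        pass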

You need a headless browser, not just a scraper. Try extending Scrapy with scrapyjs (https://github.com/scrapinghub/scrapyjs) or Selenium.
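
If you go the Selenium route, one common pattern is to perform the login in a real browser and then hand the session cookies over to Scrapy. The sketch below is only an illustration under assumptions: it uses a local Firefox driver, and the form field names ('username', 'password', 'submit') and URLs are placeholders to adapt to the actual login page.

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumLoginSpider(scrapy.Spider):
    name = 'selenium_login_example'
    start_urls = ['https://someurl.com/search']

    def start_requests(self):
        # Drive the (possibly JavaScript-heavy) login page in a real browser,
        # then reuse the resulting session cookies in Scrapy requests.
        driver = webdriver.Firefox()
        try:
            driver.get('https://someurl.com/login_form')
            driver.find_element(By.NAME, 'username').send_keys('your_username')
            driver.find_element(By.NAME, 'password').send_keys('your_password')
            driver.find_element(By.NAME, 'submit').click()
            cookies = {c['name']: c['value'] for c in driver.get_cookies()}
        finally:
            driver.quit()

        for url in self.start_urls:
            # Scrapy's cookies middleware keeps the session alive from here on.
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        # Scrape the authenticated pages here.
        pass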

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow