Problem

My code is included below and is really not much more than a slightly tweaked version of the example lifted from Scrapy's documentation. The code works as-is, but there is a gap in the logic I am not understanding between the login and how the request is passed through subsequent requests.

According to the documentation, a request object returns a response object. This response object is passed as the first argument to a callback function. This I get. This is the way authentication can be handled and subsequent requests made using the user credentials.
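For concreteness, here is a minimal sketch of that request-then-callback pattern, written in the same old-style Scrapy API used below; the spider name and the handle callback are made up purely for illustration:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class SketchSpider(BaseSpider):
    name = 'sketch'

    def start_requests(self):
        # the engine downloads this URL and, once the response
        # arrives, calls self.handle with the Response object
        yield Request('https://www.webpage.org/', callback=self.handle)

    def handle(self, response):
        # response corresponds to the request yielded above
        self.log('received %s' % response.url)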

What I am not understanding is how the response object makes it to the next request call following authentication. In my code below, the parse method returns a FormRequest, built with FormRequest.from_response, that performs the authentication. Since the FormRequest has the after_login method as its callback, after_login is called with the response to that FormRequest as its first parameter.

The after_login method checks to make sure there are no errors, then makes another request through a yield statement. What I do not understand is how the response passed in as an argument to the after_login method is making it to the Request following the yield. How does this happen?

The primary reason I am interested is that I need to make two requests per iterated value in the after_login method, and I cannot figure out how the responses are handled by the scraper, which I need to understand before I can modify the code. Thank you in advance for your time and explanations.

# import Scrapy modules
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy import log

# import custom item from item module
from scrapy_spage.items import ReachItem


class AwSpider(BaseSpider):
    name = 'spage'
    allowed_domains = ['webpage.org']
    start_urls = ('https://www.webpage.org/',)

    def parse(self, response):
        credentials = {'username': 'user',
                       'password': 'pass'}
        return [FormRequest.from_response(response, 
                                          formdata=credentials,
                                          callback=self.after_login)]

    def after_login(self, response):
        # check to ensure login succeeded
        if 'Login failed' in response.body:

            # log error
            self.log('Login failed', level=log.ERROR)

            # exit method
            return

        else:
            # the real range is every integer from 1 to 5000; 1100 to 1110 is used for testing
            for reach_id in xrange(1100, 1110):

                # make a request for each reach, using format to create a four-digit id string
                yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                              callback=self.scrape_page)

    def scrape_page(self, response):
        # create selector object instance to parse response
        sel = Selector(response)

        # create item object instance
        reach_item = ReachItem()

        # get attribute
        reach_item['attribute'] = sel.xpath('//body/text()').extract()

        # other selectors...

        # return the reach item
        return reach_item

Solution

"how the response passed in as an argument to the after_login method is making it to the Request following the yield"

If I understand your question, the answer is that it doesn't.

The mechanism is simple:

for x in spider.function():
    if x is a request:
        make the HTTP call for this request and wait for the response asynchronously
    if x is an item:
        send it to the item pipelines, etc.

upon getting a response:
    request.callback(response)
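To make the dispatch concrete, here is a minimal sketch of a callback, written against the ReachItem and scrape_page already defined in your spider, that yields both an item and a further request; the engine sends the item to the pipelines and schedules the request independently (the URL and field value are illustrative):

def parse(self, response):
    # an item: the engine routes it to the item pipelines
    item = ReachItem()
    item['attribute'] = 'example value'
    yield item

    # a request: the engine schedules the download and later
    # calls self.scrape_page with that request's response
    yield Request('https://www.webpage.org/content/River/detail/id/1100/',
                  callback=self.scrape_page)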

As you can see, there is no limit to the number of requests the function can yield, so you can do:

for reach_id in xrange(x, y):
    yield Request(url=url1, callback=callback1)
    yield Request(url=url2, callback=callback2)
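Applied to your spider, after_login can simply yield two requests per reach_id, each with its own callback. A sketch; the second URL path and the scrape_summary callback are hypothetical, just to show the shape:

def after_login(self, response):
    # check to ensure login succeeded
    if 'Login failed' in response.body:
        self.log('Login failed', level=log.ERROR)
        return

    for reach_id in xrange(1100, 1110):
        # first request for this reach, handled by scrape_page
        yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                      callback=self.scrape_page)

        # hypothetical second request for the same reach, handled by a
        # separate callback; it receives only its own response
        yield Request('https://www.webpage.org/content/River/summary/id/{0:0>4}/'.format(reach_id),
                      callback=self.scrape_summary)

Each callback receives only the response to the request it was attached to. The login itself stays in effect because Scrapy's cookies middleware carries the session cookies across subsequent requests, not because the login response object is passed along.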

Hope this helps.
