Passing arguments inside Scrapy spider through lambda callbacks

https://stackoverflow.com/questions/3887968

28-09-2019

Question

Hi,

I have this short spider code:

class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["google.com", "yahoo.com"]
    start_urls = [
        "http://google.com"
    ]

    def parse2(self, response, i):
        print "page2, i: ", i
        # traceback.print_stack()


    def parse(self, response):
        for i in range(5):
            print "page1 i : ", i
            link = "http://www.google.com/search?q=" + str(i)
            yield Request(link, callback=lambda r:self.parse2(r, i))

I would expect output like this:

page1 i :  0
page1 i :  1
page1 i :  2
page1 i :  3
page1 i :  4

page2 i :  0
page2 i :  1
page2 i :  2
page2 i :  3
page2 i :  4

However, the actual output is this:

page1 i :  0
page1 i :  1
page1 i :  2
page1 i :  3
page1 i :  4

page2 i :  4
page2 i :  4
page2 i :  4
page2 i :  4
page2 i :  4

So the argument I pass in callback=lambda r:self.parse2(r, i) is somehow wrong.

What's wrong with the code?

Solution

The lambdas access i through a closure, so they all reference the same variable, not a snapshot of its value. By the time the lambdas are called, the loop in your parse function has finished and i holds its final value. A simpler reconstruction of the phenomenon is:

>>> def do(x):
...     for i in range(x):
...         yield lambda: i
... 
>>> delayed = list(do(3))
>>> for d in delayed:
...     print d()
... 
2
2
2

You can see that the i in every lambda is bound to the variable i in the function do. Each lambda returns whatever value i currently has, and Python keeps that scope alive for as long as any of the lambdas are alive, so the variable stays available to them. This is what's referred to as a closure.
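The same effect can be checked in Python 3 syntax (the session above uses the Python 2 print statement). This is a minimal sketch; the name make_getters is made up for illustration. It shows that all the lambdas share a single closure cell, which is why they agree on the final value.

```python
def make_getters(n):
    """Return n lambdas that all close over the same loop variable i."""
    getters = []
    for i in range(n):
        # The lambda captures the variable i itself, not its current value.
        getters.append(lambda: i)
    return getters


getters = make_getters(3)

# Called after the loop has finished, so every lambda sees the last value.
values = [g() for g in getters]
print(values)

# Every lambda's closure refers to the very same cell object.
cells = {id(g.__closure__[0]) for g in getters}
print(len(cells))
```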

A simple but ugly workaround is:

>>> def do(x):
...     for i in range(x):
...         yield lambda i=i: i
... 
>>> delayed = list(do(3))
>>> for d in delayed:
...     print d()
... 
0
1
2

This works because, on each pass through the loop, the current value of i is bound to the parameter i of the lambda as its default value. Alternatively (and maybe a little bit clearer): lambda r, x=i: (r, x). The important part is that the default is evaluated when the lambda is defined, outside its body (which is only executed later), so you bind a variable to the current value of i instead of the value it has at the end of the loop. As a result, the lambdas no longer close over i and each can have its own value.

So all you need to do is change the line

yield Request(link, callback=lambda r:self.parse2(r, i))

to

yield Request(link, callback=lambda r, i=i:self.parse2(r, i))

and you're cherry.
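To try the fix without running a crawl, here is a rough stand-in in Python 3 syntax that mimics callbacks being invoked only after the loop has finished. The helpers parse2, schedule and run are hypothetical and are not Scrapy APIs. functools.partial from the standard library is included as another way to freeze the current value of i in plain Python; whether it suits a given spider is a separate question.

```python
from functools import partial


def parse2(response, i):
    """Stand-in for the second callback; returns the i it received."""
    return i


def schedule(style):
    """Build five callbacks inside a loop, the way parse() does."""
    callbacks = []
    for i in range(5):
        if style == "late":
            # i is looked up only when the callback is finally called.
            callbacks.append(lambda r: parse2(r, i))
        elif style == "default":
            # i is frozen as a default argument at definition time.
            callbacks.append(lambda r, i=i: parse2(r, i))
        else:
            # i is frozen inside the partial object.
            callbacks.append(partial(parse2, i=i))
    return callbacks


def run(style):
    # The callbacks fire after the loop is over, as delayed requests would.
    return [callback("response") for callback in schedule(style)]


print(run("late"))
print(run("default"))
print(run("partial"))
```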

Other tips

According to the Scrapy documentation, using a lambda as a callback will prevent the library's jobs functionality (pausing and resuming crawls) from working (http://doc.scrapy.org/en/latest/topics/jobs.html).

Request() and FormRequest() both accept a dictionary named meta, which can be used to pass arguments:

def some_callback(self, response):
    somearg = 'test'
    yield Request('http://www.example.com', 
                   meta={'somearg': somearg}, 
                   callback=self.other_callback)

def other_callback(self, response):
    somearg = response.meta['somearg']
    print "the argument passed is:", somearg
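The persistence limitation mentioned above comes down to what can be written to disk. The sketch below, in Python 3 syntax and using only the standard library, is a loose illustration and not Scrapy's own serialization code: it checks that a plain dictionary like the meta payload survives a pickle round trip while a lambda does not. survives_pickle is a hypothetical helper.

```python
import pickle


def survives_pickle(obj):
    """Return True if obj can be pickled and loaded back, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


i = 3

# Plain data, like the contents of a meta dictionary, serializes fine.
meta_ok = survives_pickle({"somearg": "test", "i": i})

# A lambda has no importable name, so pickle cannot store it.
lambda_ok = survives_pickle(lambda r, i=i: (r, i))

print(meta_ok, lambda_ok)
```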

lambda r:self.parse2(r, i) binds the variable name i, not the value of i. Later, when the lambda is evaluated, the current value of i in the closure (that is, the last value of i) is used. This can be easily demonstrated.

>>> def make_funcs():
    funcs = []
    for x in range(5):
        funcs.append(lambda: x)
    return funcs

>>> f = make_funcs()
>>> f[0]()
4
>>> f[1]()
4
>>>

Here make_funcs is a function that returns a list of functions, each bound to x. You'd expect the functions, when called, to return the values 0 to 4 respectively. And yet they all return 4 instead.

All is not lost however. There is a solution(s?).

>>> def make_f(value):
    def _func():
        return value
    return _func

>>> def make_funcs():
    funcs = []
    for x in range(5):
        funcs.append(make_f(x))
    return funcs

>>> f = make_funcs()
>>> f[0]()
0
>>> f[1]()
1
>>> f[4]()
4
>>>

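I am using an explicit, named factory function here instead of a bare lambda in the loop. Each call to make_f creates a new local variable value, so every returned function closes over its own variable rather than the shared loop variable x. Consequently the individual functions behave as expected.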
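What makes the difference is the extra function call, which gives each iteration a fresh scope; def versus lambda does not matter. A minimal sketch in Python 3 syntax, with the hypothetical name make_getter, shows a lambda-based factory behaving the same way:

```python
def make_getter(value):
    # Each call creates a new local variable `value`,
    # so each returned lambda gets its own closure cell.
    return lambda: value


funcs = [make_getter(x) for x in range(5)]
results = [f() for f in funcs]
print(results)
```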

I see that @Aaron has given you an answer for changing your lambda. Stick with that and you'll be good to go :)

class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["google.com", "yahoo.com"]
    start_urls = [
        "http://google.com"
    ]

    def parse(self, response):
        for i in range(5):
            print "page1 i : %s" % i
            yield Request("http://www.google.com/search?q=%s" % i, callback=self.next, meta={'i': i})

    def next(self, response):
        print "page2 i : %s" % response.meta['i']
        # traceback.print_stack()

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow