I have a Spider that crawls several start_urls, but the problem is that I only receive a limited amount of output. When I crawl one start_url, however, it does return all the results before the infinite scrolling of the page. This is my Spider's code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem
class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = [
        "http://www.pinterest.com/jetsetterphoto/pins/",
        "http://www.pinterest.com/llbean/pins/",
        "http://www.pinterest.com/nordstrom/pins/",
    ]

    def parse(self, response):
        hxs = Selector(response)
        pin_links = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()
        repin_counts = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()
        like_counts = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()
        comment_counts = hxs.xpath("//em[@class='socialMetaCount commentCountSmall']/text()").extract()
        board_names = hxs.xpath("//div[@class='creditTitle']/text()").extract()
        pin_descriptions = hxs.xpath("//p[@class='pinDescription']/text()").extract()
        items = []
        for pin_link, repin_count, like_count, comment_count, board_name, pin_description in zip(
                pin_links, repin_counts, like_counts, comment_counts, board_names, pin_descriptions):
            item = PinterestItem()
            item["pin_link"] = pin_link.strip()
            item["repin_count"] = repin_count.strip()
            item["like_count"] = like_count.strip()
            item["comment_count"] = comment_count.strip()
            item["board_name"] = board_name.strip()
            item["pin_description"] = pin_description.strip()
            items.append(item)
        return items
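One thing I noticed while debugging: zip() stops at its shortest input, so if any of the six selector lists matches fewer nodes than the others, items are silently dropped. A small illustration with made-up values:

```python
# zip() truncates to the shortest input list, so a selector that
# matches fewer nodes silently drops the extra items
links = ["/pin/1/", "/pin/2/", "/pin/3/"]
repins = ["5", "9"]  # one fewer match than links

paired = list(zip(links, repins))
print(len(paired))  # 2 -- the third link is dropped
```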
I can see that the crawler does crawl all of the start_urls, but it only returns 16 lines of output in the JSON file. When I use a single start_url, it gives far more output (everything up to the infinite scrolling of the page). Is there perhaps a limit on the number of requests set somewhere in the settings? I tried looking for similar questions, but could not find any like mine. Any ideas?
EDIT: Could it have something to do with the CONCURRENT_REQUESTS_PER_DOMAIN setting? http://doc.scrapy.org/en/latest/topics/settings.html says its default is 8, and I get exactly 8 lines of output per domain. The default total of concurrent requests is 16, which would explain why I only get results from 2 start_urls. I will test whether changing the defaults fixes it (I have no idea if this makes sense to anyone else).
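If that is the cause, the overrides I plan to try in settings.py would look like this (the values are arbitrary, just to see whether the output changes):

```python
# settings.py -- raise the default concurrency limits for the test
CONCURRENT_REQUESTS = 32             # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # Scrapy's default is 8
```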
EDIT: I want to add this to my spider to extract basic info:
for BasicInfo in selector.css('div.userProfilePage'):
    item["company_pins"] = get(BasicInfo.css('div.PinCount::text'))
    item["company_likes"] = get(BasicInfo.css('ul.userStats li~ li+ li a::text'))
    item["company_name"] = get(BasicInfo.css('h1.userProfileHeaderName::text'))
    item["company_followers"] = get(BasicInfo.css('a.FollowerCount .buttonText::text'))
Then the code would be something like this:
def parse(self, response):
    selector = Selector(response)
    items = []
    for pin in selector.css('div.pinWrapper'):
        item = PinterestItem()
        item["pin_link"] = get(pin.css('div.pinHolder a::attr(href)'))
        item["repin_count"] = get(pin.css('em.repinCountSmall::text'))
        item["like_count"] = get(pin.css('em.likeCountSmall::text'))
        item["comment_count"] = get(pin.css('em.commentCountSmall::text'))
        item["board_name"] = get(pin.css('div.creditTitle::text'))
        item["pin_description"] = get(pin.css('p.pinDescription::text'))
        items.append(item)
    self.log("extracted %d item(s) from %s" % (len(items), response.url))
    return items
def parse(self, response):
    selector = Selector(response)
    items = []
    for BasicInfo in selector.css('div.userProfilePage'):
        item = PinterestItem()
        item["company_pins"] = get(BasicInfo.css('div.PinCount::text'))
        item["company_likes"] = get(BasicInfo.css('ul.userStats li~ li+ li a::text'))
        item["company_name"] = get(BasicInfo.css('h1.userProfileHeaderName::text'))
        item["company_followers"] = get(BasicInfo.css('a.FollowerCount .buttonText::text'))
        items.append(item)
    self.log("extracted %d item(s) from %s" % (len(items), response.url))
    return items
I know this is wrong (two methods named parse cannot coexist), but I don't know where or how to put this code. Should I issue a Request with a callback?
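Structurally, what I am after is attaching the once-per-page profile fields to every pin item from the same response. A plain-Python sketch of that merge, with made-up data (the field names match my items, the values are hypothetical):

```python
# page-level info, extracted once per response (hypothetical values)
profile = {"company_name": "Nordstrom", "company_followers": "12345"}

# per-pin items scraped from the same response (hypothetical values)
pins = [
    {"pin_link": "/pin/1/", "repin_count": "3"},
    {"pin_link": "/pin/2/", "repin_count": "7"},
]

# each final item carries both its own pin fields and the shared profile fields
items = [dict(pin, **profile) for pin in pins]
```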