Question

Okay so I'm new to programming in general and using Scrapy for this purpose in specific. I wrote a crawler to get data from pins on pinterest.com. The problem is that I used to get data from all the pins on the page I am crawling, but now I get only the data of the first pin.

I think the problem lies with the pipeline or in the spider itself. Something changed after I added the "strip" to the spider to get rid of the whitespace, but when I changed it back I got the same output but then with the whitespace. This is the spider:

from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = ["http://www.pinterest.com/llbean/pins/"]

    def parse(self, response):
        hxs = Selector(response)
        item = PinterestItem()
        items = []
        item ["pin_link"] = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()[0].strip()
        item ["repin_count"] = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()[0].strip()
        item ["like_count"] = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()[0].strip()
        item ["board_name"] = hxs.xpath("//div[@class='creditTitle']/text()").extract()[0].strip()
        items.append(item)
        return items

And this is my pipeline:

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonLinesItemExporter

class JsonLinesExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonLinesItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

When I use the command "scrapy crawl pinterest" this is the output i get in a JSON file:

"pin_link": "/pin/94716398388365841/", "board_name": "Outdoor Fun", "like_count": "14", "repin_count": "94"}

This is exactly the output I want, but I get it only from one pin, not from all pins on the page. I spent a lot of time reading through similar questions but I couldnt find any with the similar problem. Any ideas on what is wrong?? Thanks in advance!

EDIT: Oh I guess its because of the [0] before the strip function? Sorry I just realized this could be the problem...

EDIT: Mmm, that was not the problem. I am pretty sure it has to do something with the strip function, but I cant seem to use it correctly to get multiple pins as output. Could the solution be part of this question?: Scrapy: Why extracted strings are in this format? I see some overlap but I have no idea how to use it.

EDIT: Okay so when I modified the spider like this:

from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
name = "pinterest"
allowed_domains = ["pinterest.com"]
start_urls = ["http://www.pinterest.com/llbean/pins/"]

def parse(self, response):
    hxs = Selector(response)
    sites = hxs.xpath("//div[@class='pinWrapper']")
    items = []
    for site in sites:
        item = PinterestItem()        
        item ["pin_link"] = site.select("//div[@class='pinHolder']/a/@href").extract()[0].strip()
        item ["repin_count"] = site.select("//em[@class='socialMetaCount repinCountSmall']/text()").extract()[0].strip()
        item ["like_count"] = site.select("//em[@class='socialMetaCount likeCountSmall']/text()").extract()[0].strip()
        item ["board_name"] = site.select("//div[@class='creditTitle']/text()").extract()[0].strip()
        items.append(item)
    return items

It did give me several lines of output, but apparently all with the same information, so it crawled the items of the number of pins on the page, but all with the same output:

{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}

etc.

Was it helpful?

Solution

I haven't used Scrapy, so this is a wild guess.

Your selectors are pulling back multiple results. You're then selecting the first value out of each list (with the slice [0]), creating a single PinterestItem called item, which you append to the items list before returning that. Nothing appears to be looping over all the possible results returned by the selectors.

So pull out all of the results, then iterate over them to create your items list:

def parse(self, response):
    hxs = Selector(response)
    pin_links = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()
    repin_counts = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()
    like_counts = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()
    board_names = hxs.xpath("//div[@class='creditTitle']/text()").extract()

    items = []
    for pin_link, repin_count, like_count, board_name in zip(pin_links, repin_counts, like_counts, board_names):
        item = PinterestItem()
        item["pin_link"] = pin_link.strip()
        item["repin_count"] = repin_count.strip()
        item["like_count"] = like_count.strip()
        item["board_name"] = board_name.strip()
        items.append(item)
    return items
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top