I have a Spider that crawls several start_urls, but the problem is that I only receive a limited amount of output. When I crawl one start_url, however, it does return all the results before the infinite scrolling of the page. This is my Spider's code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem
class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = [
        "http://www.pinterest.com/jetsetterphoto/pins/",
        "http://www.pinterest.com/llbean/pins/",
        "http://www.pinterest.com/nordstrom/pins/",
    ]

    def parse(self, response):
        hxs = Selector(response)
        pin_links = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()
        repin_counts = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()
        like_counts = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()
        comment_counts = hxs.xpath("//em[@class='socialMetaCount commentCountSmall']/text()").extract()
        board_names = hxs.xpath("//div[@class='creditTitle']/text()").extract()
        pin_descriptions = hxs.xpath("//p[@class='pinDescription']/text()").extract()
        items = []
        for pin_link, repin_count, like_count, comment_count, board_name, pin_description in zip(
                pin_links, repin_counts, like_counts, comment_counts, board_names, pin_descriptions):
            item = PinterestItem()
            item["pin_link"] = pin_link.strip()
            item["repin_count"] = repin_count.strip()
            item["like_count"] = like_count.strip()
            item["comment_count"] = comment_count.strip()
            item["board_name"] = board_name.strip()
            item["pin_description"] = pin_description.strip()
            items.append(item)
        return items
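One thing I noticed while debugging: zip() stops at its shortest input, so if any of the six selector lists matches fewer nodes than the others, items are silently dropped. A small illustration with made-up values:

```python
# zip() truncates to the shortest input list, so a selector that
# matches fewer nodes silently drops the extra items
links = ["/pin/1/", "/pin/2/", "/pin/3/"]
repins = ["5", "9"]  # one fewer match than links

paired = list(zip(links, repins))
print(len(paired))  # 2 -- the third link is dropped
```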
I can see that the crawler does crawl all of the start_urls, but it only returns 16 lines of output in the JSON file. When I use a single start_url, it gives far more output (everything up to the infinite scrolling of the page). Is there perhaps a limit on the number of requests set somewhere in the settings? I tried looking for similar questions, but could not find any like mine. Any ideas?
EDIT: Could it have something to do with the CONCURRENT_REQUESTS_PER_DOMAIN setting? http://doc.scrapy.org/en/latest/topics/settings.html says its default is 8, and I get exactly 8 lines of output per domain. The default total of concurrent requests is 16, which would explain why I only get results from 2 start_urls. I will test whether changing the defaults fixes it (I have no idea if this makes sense to anyone else).
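If that is the cause, the overrides I plan to try in settings.py would look like this (the values are arbitrary, just to see whether the output changes):

```python
# settings.py -- raise the default concurrency limits for the test
CONCURRENT_REQUESTS = 32             # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # Scrapy's default is 8
```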
EDIT: I want to add this to my spider to extract basic info:
for BasicInfo in selector.css('div.userProfilePage'):
    item["company_pins"] = get(BasicInfo.css('div.PinCount::text'))
    item["company_likes"] = get(BasicInfo.css('ul.userStats li~ li+ li a::text'))
    item["company_name"] = get(BasicInfo.css('h1.userProfileHeaderName::text'))
    item["company_followers"] = get(BasicInfo.css('a.FollowerCount .buttonText::text'))
Then the code would be something like this:
def parse(self, response):
    selector = Selector(response)
    items = []
    for pin in selector.css('div.pinWrapper'):
        item = PinterestItem()
        item["pin_link"] = get(pin.css('div.pinHolder a::attr(href)'))
        item["repin_count"] = get(pin.css('em.repinCountSmall::text'))
        item["like_count"] = get(pin.css('em.likeCountSmall::text'))
        item["comment_count"] = get(pin.css('em.commentCountSmall::text'))
        item["board_name"] = get(pin.css('div.creditTitle::text'))
        item["pin_description"] = get(pin.css('p.pinDescription::text'))
        items.append(item)
    self.log("extracted %d item(s) from %s" % (len(items), response.url))
    return items
def parse(self, response):
    selector = Selector(response)
    items = []
    for BasicInfo in selector.css('div.userProfilePage'):
        item = PinterestItem()
        item["company_pins"] = get(BasicInfo.css('div.PinCount::text'))
        item["company_likes"] = get(BasicInfo.css('ul.userStats li~ li+ li a::text'))
        item["company_name"] = get(BasicInfo.css('h1.userProfileHeaderName::text'))
        item["company_followers"] = get(BasicInfo.css('a.FollowerCount .buttonText::text'))
        items.append(item)
    self.log("extracted %d item(s) from %s" % (len(items), response.url))
    return items
I know this is wrong (two methods named parse cannot coexist), but I don't know where or how to put this code. Should I issue a Request with a callback?
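Structurally, what I am after is attaching the once-per-page profile fields to every pin item from the same response. A plain-Python sketch of that merge, with made-up data (the field names match my items, the values are hypothetical):

```python
# page-level info, extracted once per response (hypothetical values)
profile = {"company_name": "Nordstrom", "company_followers": "12345"}

# per-pin items scraped from the same response (hypothetical values)
pins = [
    {"pin_link": "/pin/1/", "repin_count": "3"},
    {"pin_link": "/pin/2/", "repin_count": "7"},
]

# each final item carries both its own pin fields and the shared profile fields
items = [dict(pin, **profile) for pin in pins]
```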