Question

As some of you may have gathered, I'm learning Scrapy to scrape some data from Google Scholar for a research project I am running. I have a file that contains many article titles for which I am scraping citations. I read in the file using pandas, generate the URLs that need scraping, and start scraping.

One problem I face is 503 errors: Google shuts me off fairly quickly, and many entries remain unscraped. I am working on that side of things with some middleware provided by Crawlera.
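In the meantime, simply slowing the crawl down seems to help a little. Here is a sketch of the throttling settings I have been trying in settings.py (the values are just guesses):

# settings.py -- be gentler with the server (values are guesses)
DOWNLOAD_DELAY = 5                  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time
AUTOTHROTTLE_ENABLED = True         # back off automatically when replies slow down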

Another problem I face is that when I export my scraped data, I have a hard time matching each scraped record back to the entry I was searching for. My input data is a CSV file with three fields -- 'Authors', 'Title', 'pid' -- where 'pid' is a unique identifier.
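For illustration, the file looks something like this (the rows are invented, apart from the Brooks paper discussed below):

Authors,Title,pid
"Brooks, R.","Elephants Don't Play Chess",5067
"Doe, J.","Some Other Article Title",5068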

I use pandas to read in the file and generate URLs for Scholar based on the title. Each time a given URL is scraped, my spider goes through the Scholar results page and picks up the title, publication information and cites for each article listed on that page.

Here is how I generate the links for scraping:

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]

    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(urllib.quote)
    # generate a var to store links
    links = []
    # create the URLs to crawl
    for entry in queries:
        links.append("http://scholar.google.com/scholar?q=allintitle%3A"+entry)
    # give the URLs to scrapy
    start_urls = links

For example, one title from my data file could be the paper 'Elephants Don't Play Chess' by Rodney Brooks with 'pid' 5067. The spider goes to

http://scholar.google.com/scholar?q=allintitle%3Aelephants+don%27t+play+chess

Now on this page, there are six hits. The spider gets all six hits, but they need to be assigned the same 'pid'. I know I need to insert a line somewhere that reads something like item['pid'] = data.pid.apply("something") but I can't figure out exactly how I would do that.

Below is the rest of the code for my spider. I am sure the way to do this is pretty straightforward, but I can't think of how to get the spider to know which entry of data.pid it should look up, if that makes sense.

def parse(self, response):
    # initialize a list to hold the scraped items
    items = []
    sel = Selector(response)
    # get each 'entry' on the page
    # an entry is a self-contained div
    # that has the title, publication info
    # and cites
    entries = sel.xpath('//div[@class="gs_ri"]')
    # a counter for the entry that is being scraped
    count = 1
    for entry in entries:
        item = ScholarscrapeItem()
        # the title comes back as fragments
        # join them into a single string
        title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
        item['title'] = "".join(title)
        # same clean-up for the publication info
        author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
        item['authors'] = "".join(author)
        # get the portion that contains citations
        cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
        # find the part that says "Cited by" and capture the count
        match = re.search(r"Cited by (\d+)", " ".join(cite_string))
        # if it exists, note the number
        if match:
            cites = match.group(1)
        # if not, there is no citation info
        else:
            cites = None
        item['cites'] = cites
        item['entry'] = count
        # increment the counter
        count += 1
        # append this item to the list
        items.append(item)
    return items
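To sanity-check the "Cited by" extraction interactively (the strings below are made up, standing in for what extract() returns):

>>> import re
>>> cite_string = [u'Cited by 3281', u' ', u'Related articles', u' ', u'All 42 versions']
>>> re.search(r"Cited by (\d+)", " ".join(cite_string)).group(1)
'3281'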

I hope this question is well-defined, but please let me know if I can be more clear. There is really not much else in my scraper except some lines at the top importing things.

Edit 1: Based on suggestions below, I have modified my code as follows:

# test-case: http://scholar.google.com/scholar?q=intitle%3Amigratory+birds
import re
import urllib

from pandas import read_csv

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]

    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(urllib.quote)
    pid = data.pid
    # generate a var to store links
    urls = []
    # create the URLs to crawl
    for entry in queries:
        urls.append("http://scholar.google.com/scholar?q=allintitle%3A"+entry)
    # give scrapy (url, pid) pairs so each request carries its pid
    start_urls = zip(urls, pid)

    def make_requests_from_url(self, (url, pid)):
        return Request(url, meta={'pid': pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # initialize a list to hold the scraped items
        items = []
        sel = Selector(response)
        # get each 'entry' on the page
        # an entry is a self-contained div
        # that has the title, publication info
        # and cites
        entries = sel.xpath('//div[@class="gs_ri"]')
        # a counter for the entry that is being scraped
        count = 1
        for entry in entries:
            item = ScholarscrapeItem()
            # the title comes back as fragments
            # join them into a single string
            title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
            item['title'] = "".join(title)
            # same clean-up for the publication info
            author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
            item['authors'] = "".join(author)
            # get the portion that contains citations
            cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
            # find the part that says "Cited by" and capture the count
            match = re.search(r"Cited by (\d+)", " ".join(cite_string))
            # if it exists, note the number
            if match:
                cites = match.group(1)
            # if not, there is no citation info
            else:
                cites = None
            item['cites'] = cites
            item['entry'] = count
            item['pid'] = response.meta['pid']
            # increment the counter
            count += 1
            # append this item to the list
            items.append(item)
        return items


Answer

You need to populate your list start_urls with tuples (url, pid). Then redefine the method make_requests_from_url(url) so that it unpacks each tuple:

from scrapy.spider import Spider
from scrapy.http import Request

class ScholarSpider(Spider):
    name = "ScholarSpider"
    allowed_domains = ["scholar.google.com"]
    start_urls = (
        ('http://www.scholar.google.com/', 100),
        )

    def make_requests_from_url(self, (url, pid)):
        return Request(url, meta={'pid': pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        pid = response.meta['pid']
        print '!!!!!!!!!!!', pid, '!!!!!!!!!!!!'
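Applied to your spider, building those (url, pid) tuples is just a matter of zipping the generated URLs with the pid column. A sketch using the names from your Edit 1:

# pair each generated URL with its pid from the data frame,
# so make_requests_from_url() can stash the pid in request.meta
start_urls = zip(urls, data.pid)

Every item parsed from a given results page can then read the identifier back out of response.meta, which is what item['pid'] = response.meta['pid'] in your parse() does.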