Question

What I'm trying to do is scrape the names and prices of items from multiple vendors using Nokogiri. I'm passing the CSS selectors (used to find the names and prices) to Nokogiri as method arguments.

Any guidance on how to pass multiple URLs to the "scrape" method while also passing the other arguments (e.g. vendor, item_path)? Or am I going about this the completely wrong way?

Here is the code:

require 'rubygems' # Load Ruby Gems
require 'nokogiri' # Load Nokogiri
require 'open-uri' # Load Open-URI

@@collection = Array.new # Array to hold meta hash

def scrape(url, vendor, item_path, name_path, price_path)
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
        @@collection << meta = Hash.new # Creates a new hash then add to global array
        meta[:vendor] = vendor
        meta[:name] = item.css(name_path).text.strip
        meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join 
    end
end

scrape( "page_a.html", "Sample Vendor A", "#products", ".title", ".prices")
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

Solution

You can pass multiple URLs the same way you're already doing it in your second example:

scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

Your scrape method then has to iterate over those URLs, for instance:

def scrape(urls, vendor, item_path, name_path, price_path)
  urls.each do |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      @@collection << meta = Hash.new # Creates a new hash, then adds it to the global array
      meta[:vendor] = vendor
      meta[:name] = item.css(name_path).text.strip
      meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
    end
  end
end

This also means that the first example must be passed as an array too:

scrape( ["page_a.html"], "Sample Vendor A", "#products", ".title", ".prices")
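If wrapping a single page in an array at every call site feels awkward, one option is to normalize the argument inside the method with Kernel#Array, which leaves an array unchanged and wraps a lone string. A minimal sketch; normalize_urls is a hypothetical helper name used only for illustration, not part of the original code:

```ruby
# Hypothetical helper: accepts either one URL (String) or many (Array)
# and always hands back an Array that is safe to iterate over.
def normalize_urls(urls)
  Array(urls)
end

single = normalize_urls("page_a.html")                  # String is wrapped in a one-element Array
many   = normalize_urls(["page_a.html", "page_b.html"]) # Array is returned as-is
```

Inside scrape, the same idea would amount to iterating over Array(urls) instead of urls, so both call styles from the question keep working.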

OTHER TIPS

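FYI, using a class variable (@@collection) for this is inappropriate: it is global state that every call mutates, and Ruby 3.0 and later raise a RuntimeError when a class variable is accessed from the top level. Instead, write your method to return a value: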

def scrape(urls, vendor, item_path, name_path, price_path)
  collection = []
  urls.each do |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      collection << {
        :vendor => vendor,
        :name   => item.css(name_path).text.strip,
        :price  => item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      }
    end 
  end   

  collection
end

Which can be reduced to the following. Note that the outer loop uses flat_map rather than map, so the result stays a single flat array of hashes instead of one array per URL:

def scrape(urls, vendor, item_path, name_path, price_path)
  urls.flat_map { |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.map { |item| # Builds one hash per item on grid
      {
        :vendor => vendor,
        :name   => item.css(name_path).text.strip,
        :price  => item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      }
    }
  }
end
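
One thing to watch with the block-based version: an outer map that wraps an inner map returns one array per URL, whereas the collection-based version returns a single flat array. Using flat_map on the outer loop removes that extra level. A sketch with hypothetical in-memory data (the pages hash and its item names are made up for illustration and stand in for the parsed documents):

```ruby
# Hypothetical stand-in for parsed pages: each URL maps to the item names found on it.
pages = {
  "page_a.html" => ["Widget", "Gadget"],
  "page_b.html" => ["Gizmo"]
}

# Outer map: the result is nested, one inner array per URL.
nested = pages.keys.map { |url|
  pages[url].map { |name| { :vendor => "Sample Vendor B", :name => name } }
}

# Outer flat_map: the inner arrays are concatenated into one flat array of hashes.
flat = pages.keys.flat_map { |url|
  pages[url].map { |name| { :vendor => "Sample Vendor B", :name => name } }
}

nested.length # one entry per page
flat.length   # one entry per item
```

Calling flatten(1) on the nested result gives the same array as the flat_map version, so either approach works if the caller expects a flat list.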
Licensed under: CC-BY-SA with attribution