Question

The following program does almost everything I want, but it won't write the scraped image files to disk. The latest error reports no such file or directory for the basename of one of the images I'm trying to download: Error: No such file or directory - h3130gy1-3-7ec5.jpg. It should be creating the new file, so I guess I'm doing something wrong. Ideally the program would write each image to disk, named after the basename of the absolute URL it was fetched from. I would also like the spreadsheet to record the basename of each scraped image in the output file that is being compiled.

require "capybara/dsl"
require "spreadsheet"
require "fileutils"
require "open-uri"

 LOCAL_DIR = 'data-hold/images'

 FileUtils.makedirs(LOCAL_DIR) unless File.exists?LOCAL_DIR
 Capybara.run_server = false
 Capybara.default_driver = :selenium
 Capybara.default_selector = :xpath
 Spreadsheet.client_encoding = 'UTF-8'

 class Tomtop
   include Capybara::DSL

   def initialize
     @excel = Spreadsheet::Workbook.new
     @work_list = @excel.create_worksheet
     @row = 0
   end

   def go
     visit_main_link
   end

   def visit_main_link
     visit "http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position"
     results = all("//h5/a[contains(@onclick, 'analyticsLog')]")
     item = []

     results.each do |a|
       item << a[:href]
     end
     item.each do |link|
          visit link
          save_item
      end
     @excel.write "inventory.csv"

   end

   def save_item

     data = all("//*[@id='content-wrapper']/div[2]/div/div")
     data.each do |info|
       @work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
       price = info.first("//div[contains(@class, 'price font left')]")
       @work_list[@row, 1] = (price.text.to_f * 1.33).round(2) if price
       @work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
       @work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
       color = info.all("//dd[1]//select[contains(@name, 'options')]//*[@price='0']")
       @work_list[@row, 4] = color.collect(&:text).join(', ')
       size = info.all("//dd[2]//select[contains(@name, 'options')]//*[@price='0']")
       @work_list[@row, 5] = size.collect(&:text).join(', ')
       imagelink = info.all("//*[@rel='lightbox[rotation]']")
       @work_list[@row, 6] = imagelink.map { |link| link['href'] }.join(', ')  
       image = imagelink.map { |link| link['href'] }
       File.open (File.basename("#{LOCAL_DIR}/#{image}", 'w')) do |f|
         f.write(open(image).read)
       end
       @row = @row + 1
     end

   end

 end


 tomtop = Tomtop.new
 tomtop.go

Solution

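You appear to have a misplaced parenthesis. As written, 'w' is passed to File.basename as its suffix argument, so File.open receives no mode and tries to open the file for reading, which fails because the file does not exist yet. This line: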

File.open (File.basename("#{LOCAL_DIR}/#{image}", 'w')) do |f|

Should be this:

File.open(File.basename("#{LOCAL_DIR}/#{image}"), 'w') do |f|
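
To see why the original line raises "No such file or directory", here is a minimal sketch that reproduces it in isolation. The path is a made-up example built from the filename in the error message, and a temporary directory stands in for the working directory.

```ruby
require "tmpdir"

# Hypothetical path, built from the filename in the question's error message.
path = "data-hold/images/h3130gy1-3-7ec5.jpg"

# The second argument to File.basename is a suffix to strip, not a file mode.
# The name does not end in 'w', so nothing is stripped.
name = File.basename(path, 'w')

error = nil
Dir.mktmpdir do |dir|
  begin
    # No mode is given, so File.open defaults to read mode and
    # fails because the file has not been created yet.
    File.open(File.join(dir, name)) { |f| f.read }
  rescue Errno::ENOENT => e
    error = e.message
  end
end

puts name
puts error
```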

On closer inspection, though, File.basename is being applied to the wrong string. Once I got your code to run, it filled the folder containing scraper.rb with images, because File.basename strips the LOCAL_DIR prefix along with the rest of the path. What I think you really want for that line is this:

# Take only the basename of the image URL, then append it to LOCAL_DIR.
filename = File.join(LOCAL_DIR, File.basename(image))
# 'wb' writes the image as raw bytes, which matters on Windows.
File.open(filename, 'wb') do |f|
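
The difference between the two orderings can be checked without touching the network. This is a small sketch; the URL is a hypothetical one that reuses the filename from the question's error message.

```ruby
LOCAL_DIR = 'data-hold/images'

# Hypothetical image URL, for illustration only.
url = 'http://www.example.com/media/h3130gy1-3-7ec5.jpg'

# basename applied to the whole string throws away LOCAL_DIR as well,
# so the file would land in the current working directory.
wrong = File.basename("#{LOCAL_DIR}/#{url}")

# basename applied to the URL first, then joined onto LOCAL_DIR.
right = File.join(LOCAL_DIR, File.basename(url))

puts wrong
puts right
```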

After running this, I hit the next problem: 'image' is an array containing many URLs, not a single URL.

Depending on what you are trying to achieve, you may need to filter it down to a single image, or rename it to 'images' and loop over it with the following code:

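images = imagelink.map { |link| link['href'] }
images.each do |image|
  # Save under LOCAL_DIR, named after the basename of the image URL.
  filename = File.join(LOCAL_DIR, File.basename(image))
  # 'wb' writes raw bytes. URI.open needs Ruby 2.5+; on older versions use open(image).
  File.open(filename, 'wb') do |f|
    f.write(URI.open(image).read)
  end
end
@row = @row + 1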
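
The question also asks for the basenames to be recorded in the spreadsheet. One way to get both results from a single pass is sketched below. The helper name save_images and its fetch: parameter are assumptions made for this illustration, not part of the original code, and the default fetcher assumes Ruby 2.5 or later for URI.open.

```ruby
require "fileutils"
require "open-uri"
require "uri"

# Hypothetical helper: saves each URL into dir and returns the basenames.
# fetch: is injectable so the logic can be tried without network access.
def save_images(urls, dir, fetch: ->(url) { URI.open(url, &:read) })
  FileUtils.mkdir_p(dir)
  urls.map do |url|
    # Use only the path component so a query string does not end up in the name.
    name = File.basename(URI.parse(url).path)
    File.open(File.join(dir, name), "wb") { |f| f.write(fetch.call(url)) }
    name
  end
end
```

Inside save_item this could be called as names = save_images(images, LOCAL_DIR), followed by @work_list[@row, 6] = names.join(', ') so that column 6 holds the basenames rather than the full URLs.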
Licensed under: CC-BY-SA with attribution