Question

I am working on a JavaScript-capable screen scraper using capybara/dsl, Selenium WebDriver, and the spreadsheet gem. It is very close to the desired output, but two major problems remain:

  1. I have not been able to figure out an exact XPath selector that matches only the elements I'm looking for. To make sure none are missing, I am using a broad selector that I know produces duplicate elements. I planned to call .uniq on the result, but that raises a NoMethodError (undefined method 'uniq'). Maybe I'm not using it properly: results = all("//a[contains(@onclick, 'analyticsLog')]").uniq. What is the proper way to get the desired filtering? I know that the XPath I chose to extract hrefs, //a[contains(@onclick, 'analyticsLog')], matches more nodes than I intended, because using find to inspect the page elements shows 144 matches rather than the 72 that make up the page results. I have looked for a more specific selector, but I haven't found one that doesn't also filter out some desired links, due to the business logic used on the site.

  2. My save_item method has two selectors that are not always found within the info results. I would like the script to skip the ones that aren't found and save only the ones that are, but my current version raises Capybara::ElementNotFound and exits. How can I make this work as intended?

Code below:

require "capybara/dsl"
require "spreadsheet"

 Capybara.run_server = false
 Capybara.default_driver = :selenium
 Capybara.default_selector = :xpath
 Spreadsheet.client_encoding = 'UTF-8'

 class Tomtop
   include Capybara::DSL

   def initialize
     @excel = Spreadsheet::Workbook.new
     @work_list = @excel.create_worksheet
     @row = 0
   end

   def go
     visit_main_link
   end

   def visit_main_link
     visit "http://www.some.com/clothing-accessories?dir=asc&limit=72&order=position"
     results = all("//a[contains(@onclick, 'analyticsLog')]")# I would like to use .uniq here to filter out the duplicates that I know will be delivered by this selector
     item = []

     results.each do |a|
       item << a[:href]
     end
     item.each do |link|
       visit link
       save_item
     end
     @excel.write "inventory.csv"

   end

   def save_item

     data = all("//*[@id='content-wrapper']/div[2]/div/div")
     data.each do |info|
       @work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
       @work_list[@row, 1] = info.find("//div[contains(@class, 'price font left')]").text
       @work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
       @work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
       @work_list[@row, 4] = info.find("//select[contains(@name, 'options[747]')]//*[@price='0']").text #I'm aware that this will not always be found depending on the item in question but how do I ensure that it doesn't crash the program
       @work_list[@row, 5] = info.find("//select[contains(@name, 'options[748]')]//*[@price='0']").text #I'm aware that this will not always be found depending on the item in question but how do I ensure that it doesn't crash the program
       @row = @row + 1
     end

   end

 end


 tomtop = Tomtop.new
 tomtop.go

Solution

For Question 1: Get unique elements

Every element returned by all is a distinct node, so I assume that by "unique" you mean elements whose onclick attribute is unique.

The collection of elements returned by Capybara is enumerable. Therefore, you can convert it to an array with to_a and then keep one element per distinct onclick value:

results = all("//a[contains(@onclick, 'analyticsLog')]")
            .to_a.uniq{ |e| e[:onclick] }

Note that the duplicate links appear to come from each product having two links: one on the image and one on the text below the image. You could scope your search to just one or the other, and then you would not need the uniq check. To scope to just the text link, use the fact that the link is a child of an h5:

results = all("//h5/a[contains(@onclick, 'analyticsLog')]")
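Since the script ultimately only needs the href values, another option is to deduplicate the strings themselves. The following is a minimal sketch, not the only way to do it. The plain hashes are hypothetical stand-ins for Capybara nodes (both respond to [:href] and [:onclick]) so the snippet runs without a browser, and unique_hrefs is a helper name made up for this example. In the real script you would pass in the result of all(...).

```ruby
# Hypothetical stand-ins for Capybara nodes: a Hash answers [:href] and
# [:onclick] the same way a node answers attribute lookups.
links = [
  { href: "/item-1", onclick: "analyticsLog(1)" }, # image link
  { href: "/item-1", onclick: "analyticsLog(1)" }, # text link, same product
  { href: "/item-2", onclick: "analyticsLog(2)" }
]

# Collect each link's href and drop repeats, keeping first-seen order.
def unique_hrefs(links)
  links.map { |a| a[:href] }.uniq
end

hrefs = unique_hrefs(links)

# The block form shown above keeps one whole element per onclick value.
by_onclick = links.to_a.uniq { |e| e[:onclick] }
```

Deduplicating the hrefs directly avoids depending on the onclick values being distinct per product, which is an assumption about the page's markup.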

For Question 2: Capture text if element present

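To solve your second problem, you could use first to locate the element. It returns the matching element if one exists and nil if one does not, so you can save the text only when the element is found. (This is the Capybara 2.x behavior. In Capybara 3 and later, first raises Capybara::ElementNotFound when nothing matches unless you pass minimum: 0.)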

For example:

# The leading "." keeps the XPath relative to info instead of searching the whole page
el = info.first(".//select[contains(@name, 'options[747]')]//*[@price='0']")
@work_list[@row, 4] = el.text if el

If you want the text of all matching elements, then use all:

options = info.all(".//select[contains(@name, 'options[747]')]//*[@price='0']")
@work_list[@row, 4] = options.collect(&:text).join(', ')

When there are multiple matching options, you will get something like "Green, Pink". If there are no matching options, you will get "".
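Because save_item needs the same "text if present" logic for more than one column, it may be convenient to wrap it in a small helper. This is only a sketch: optional_text is a hypothetical helper name, and FakeNode and FakeElement are stand-ins so the snippet runs without a browser. It relies on all returning an empty collection, rather than raising, when nothing matches.

```ruby
# Hypothetical stand-ins for Capybara objects, used only to make this runnable.
FakeElement = Struct.new(:text)

class FakeNode
  def initialize(matches)
    @matches = matches # maps an XPath string to an array of elements
  end

  # Like Capybara's all: an empty collection when nothing matches.
  def all(xpath)
    @matches.fetch(xpath, [])
  end
end

# Returns the text of the first match, or nil when there is no match.
def optional_text(node, xpath)
  element = node.all(xpath).first
  element && element.text
end

OPTION_747 = ".//select[contains(@name, 'options[747]')]//*[@price='0']"
OPTION_748 = ".//select[contains(@name, 'options[748]')]//*[@price='0']"

# This fake product page has a match for option 747 but none for 748.
info = FakeNode.new(OPTION_747 => [FakeElement.new("Green")])
row = [optional_text(info, OPTION_747), optional_text(info, OPTION_748)]
```

In save_item, the two optional columns could then be filled with @work_list[@row, 4] = optional_text(info, ...) without risking an exception for products that lack the option.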

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow