axlsx - how can I check if an array element exists and, if so, alter its output?

StackOverflow https://stackoverflow.com/questions/11755755

24-06-2021

Question

I have an XPath query which accepts array elements for output using Axlsx. I need to tidy up my output for certain conditions, one of which is 'Software included'.

My XPath scrapes the following URL: http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1
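For context, the selector assumes the spec page lays its data out as label/value pairs of table cells. A minimal sketch of how the XPath behaves against such markup (the fragment below is hypothetical, not copied from the HP page):

require 'nokogiri'

# Hypothetical fragment mirroring the label/value table layout the XPath expects.
html = <<-HTML
  <table>
    <tr><td>Software included</td><td>HP Setup Manager; HP Support Assistant; HP Recovery Manager</td></tr>
  </table>
HTML

doc = Nokogiri::HTML(html)
puts doc.at("//td[text()='Software included']/following-sibling::td").text
# => "HP Setup Manager; HP Support Assistant; HP Recovery Manager"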

A sample of my code is below:

clues = ['Optical drive', 'Pointing device', 'Software included']

selector = "//td[text()='%s']/following-sibling::td"

data = clues.map do |clue| 
         xpath = selector % clue
         [clue, doc.at(xpath).text.strip]
       end

Axlsx::Package.new do |p|
  p.workbook.add_worksheet do |sheet|
    data.each { |datum| sheet.add_row datum }
  end
  p.serialize 'output.xlsx'
end

My current output formatting:

(screenshot omitted)

My desired output formatting:

(screenshot omitted)


Solution

If you can rely on the data always using ';' for separators, have a go at this:

data = []
clues.each do |clue|
  xpath = selector % clue
  details = doc.at(xpath).text.split(';').map(&:strip)
  # shift (not pop) keeps the first detail on the same row as its label
  data << [clue, details.shift]
  details.each { |detail| data << ['', detail] }
end

Generate the data this way before the Axlsx::Package.new block.
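With the three sample clues above, data comes out as one row per detail, shaped roughly like this (the device details shown are hypothetical):

[['Optical drive',     'DVD+/-RW SuperMulti DL drive'],
 ['Pointing device',   'Touchpad with multi-touch gesture support'],
 ['Software included', 'HP Setup Manager'],
 ['',                  'HP Support Assistant'],
 ['',                  'HP Recovery Manager']]

Each inner array becomes one sheet.add_row call, so the continuation rows get a blank first cell.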

In answer to your comment/question: you do it with something like this ;)

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'axlsx'

class Scraper

   def initialize(url, selector)
     @url = url
     @selector = selector
   end

   def hooks
     @hooks ||= {}
   end

   def add_hook(clue, p_roc)
     hooks[clue] = p_roc
   end

   def export(file_name)
     Scraper.clues.each do |clue|
       if detail = parse_clue(clue)
         # shift (not pop) keeps the first detail on the label's row
         output << [clue, detail.shift]
         detail.each { |datum| output << ['', datum] }
       end
     end
     serialize(file_name)
   end

   private

   def self.clues
     @clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
                 'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Wireless',
                 'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
                 'Warranty', 'Software included', 'Product color']
   end

   def doc
     @doc ||= begin 
                Nokogiri::HTML(URI.open(@url)) # Kernel#open no longer opens URLs in Ruby 3+
              rescue
                raise ArgumentError, 'Invalid URL - Nothing to parse'
              end
   end

   def output
     @output ||= []
   end

   def selector_for_clue(clue)
     @selector % clue
   end

   def parse_clue(clue)
     if element = doc.at(selector_for_clue(clue))
       # map(&:strip) so the stripped copies are what gets returned; each(&:strip) would discard them
       call_hook(clue, element) || element.inner_html.split('<br>').map(&:strip)
     end
   end

   def call_hook(clue, element)
     if hooks[clue].is_a? Proc
        value = hooks[clue].call(element)
        value.is_a?(Array) ? value : [value]
     end
   end

   def package
     @package ||= Axlsx::Package.new
   end

   def serialize(file_name)
     package.workbook.add_worksheet do |sheet|
       output.each { |datum| sheet.add_row datum }
     end
     package.serialize(file_name)
   end
end

scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td")

# define a custom action to take against any elements found.
os_parse = Proc.new do |element|
  element.inner_html.split('<br>').map { |s| s.strip.upcase }
end

scraper.add_hook('Operating system', os_parse)
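And since parse_clue only splits on '<br>', the ';'-separated 'Software included' cell from the original question can be handled the same way with its own hook (a sketch, assuming the semicolon format holds; software_parse is a name invented here):

# Hypothetical hook: break the ';'-separated software list into one entry per row.
software_parse = Proc.new do |element|
  element.text.split(';').map(&:strip)
end

scraper.add_hook('Software included', software_parse)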

scraper.export('foo.xlsx')

And the FINAL answer is... a gem.

http://rubydoc.info/gems/ninja2k/0.0.2/frames

Licensed under: CC-BY-SA with attribution. Not affiliated with StackOverflow.