axlsx - how can I check if an array element exists and, if so, alter its output?

StackOverflow https://stackoverflow.com/questions/11755755

24-06-2021

Question

I have an XPath query which accepts array elements for output using Axlsx. I need to tidy up my output for certain conditions, one of which is 'Software included'.

My XPath scrapes the following URL: http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1
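For context, the selector assumes the spec page lays its data out as label/value pairs of table cells. A minimal sketch of how the XPath behaves against such markup (the fragment below is hypothetical, not copied from the HP page):

require 'nokogiri'

# Hypothetical fragment mirroring the label/value table layout the XPath expects.
html = <<-HTML
  <table>
    <tr><td>Software included</td><td>HP Setup Manager; HP Support Assistant; HP Recovery Manager</td></tr>
  </table>
HTML

doc = Nokogiri::HTML(html)
puts doc.at("//td[text()='Software included']/following-sibling::td").text
# => "HP Setup Manager; HP Support Assistant; HP Recovery Manager"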

A sample of my code is below:

clues = ['Optical drive', 'Pointing device', 'Software included']

selector = "//td[text()='%s']/following-sibling::td"

data = clues.map do |clue| 
         xpath = selector % clue
         [clue, doc.at(xpath).text.strip]
       end

Axlsx::Package.new do |p|
  p.workbook.add_worksheet do |sheet|
    data.each { |datum| sheet.add_row datum }
  end
  p.serialize 'output.xlsx'
end

My current output formatting:

(screenshot omitted)

My desired output formatting:

(screenshot omitted)


Solution

If you can rely on the data always using ';' for separators, have a go at this:

data = []
clues.each do |clue|
  xpath = selector % clue
  details = doc.at(xpath).text.split(';').map(&:strip)
  # shift (not pop) keeps the first detail on the same row as its label
  data << [clue, details.shift]
  details.each { |detail| data << ['', detail] }
end

Generate the data this way before the Axlsx::Package.new block.
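With the three sample clues above, data comes out as one row per detail, shaped roughly like this (the device details shown are hypothetical):

[['Optical drive',     'DVD+/-RW SuperMulti DL drive'],
 ['Pointing device',   'Touchpad with multi-touch gesture support'],
 ['Software included', 'HP Setup Manager'],
 ['',                  'HP Support Assistant'],
 ['',                  'HP Recovery Manager']]

Each inner array becomes one sheet.add_row call, so the continuation rows get a blank first cell.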

In answer to your comment/question: you do it with something like this ;)

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'axlsx'

class Scraper

   def initialize(url, selector)
     @url = url
     @selector = selector
   end

   def hooks
     @hooks ||= {}
   end

   def add_hook(clue, p_roc)
     hooks[clue] = p_roc
   end

   def export(file_name)
     Scraper.clues.each do |clue|
       if detail = parse_clue(clue)
         # shift (not pop) keeps the first detail on the label's row
         output << [clue, detail.shift]
         detail.each { |datum| output << ['', datum] }
       end
     end
     serialize(file_name)
   end

   private

   def self.clues
     @clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
                 'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Wireless',
                 'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
                 'Warranty', 'Software included', 'Product color']
   end

   def doc
     @doc ||= begin 
                Nokogiri::HTML(URI.open(@url)) # Kernel#open no longer opens URLs in Ruby 3+
              rescue
                raise ArgumentError, 'Invalid URL - Nothing to parse'
              end
   end

   def output
     @output ||= []
   end

   def selector_for_clue(clue)
     @selector % clue
   end

   def parse_clue(clue)
     if element = doc.at(selector_for_clue(clue))
       # map(&:strip) so the stripped copies are what gets returned; each(&:strip) would discard them
       call_hook(clue, element) || element.inner_html.split('<br>').map(&:strip)
     end
   end

   def call_hook(clue, element)
     if hooks[clue].is_a? Proc
        value = hooks[clue].call(element)
        value.is_a?(Array) ? value : [value]
     end
   end

   def package
     @package ||= Axlsx::Package.new
   end

   def serialize(file_name)
     package.workbook.add_worksheet do |sheet|
       output.each { |datum| sheet.add_row datum }
     end
     package.serialize(file_name)
   end
end

scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td")

# define a custom action to take against any elements found.
os_parse = Proc.new do |element|
  element.inner_html.split('<br>').map { |s| s.strip.upcase }
end

scraper.add_hook('Operating system', os_parse)
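And since parse_clue only splits on '<br>', the ';'-separated 'Software included' cell from the original question can be handled the same way with its own hook (a sketch, assuming the semicolon format holds; software_parse is a name invented here):

# Hypothetical hook: break the ';'-separated software list into one entry per row.
software_parse = Proc.new do |element|
  element.text.split(';').map(&:strip)
end

scraper.add_hook('Software included', software_parse)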

scraper.export('foo.xlsx')

And the FINAL answer is... a gem.

http://rubydoc.info/gems/ninja2k/0.0.2/frames

Licensed under: CC-BY-SA with attribution. Not affiliated with StackOverflow.