Question

I have the following Ruby script:

require "rubygems"
require "rest-client" #although not required in the program
require "open-uri"
require "nokogiri"


puts "Opening file"
page = File.open("file.html", "r") { |file| file.read }
puts page
page = Nokogiri::HTML(page)
puts page.class
# Filter the page to select all references to the document's filing date
td_rows = page.css('td i.blue')
puts td_rows

I can run this script from CodeRunner or TextWrangler, or invoke it from the terminal with ruby 'filename'. However, I want the script to run at a set time, so I have tried launching it from Keyboard Maestro and from Platypus. It runs in both, but it does not seem to complete this line:

td_rows = page.css('td i.blue')

The variable td_rows ends up empty. Does anyone have any idea why this does not work?

Many thanks

Solution 2

I managed to find out why the Nokogiri parse was not working.

For some reason, the script worked if the page was opened from the web, but not if the page was first saved to disk and then opened. When the page was opened from disk, Nokogiri hit an error and only read and parsed the first few lines of the file. The error was caused by an HTML comment that was closed on a later line rather than on the line where it was opened.

I got around the problem by reading the file with the mode "rb" instead of just "r". That is, if I replace the File.open line with:

page=File.open("file.html","rb"){|file| file.read}

Nokogiri correctly parses the file.
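
As a minimal sketch of what the "rb" mode changes on the Ruby side: the file is read as raw bytes, the resulting string is tagged as ASCII-8BIT (binary), and no newline conversion is applied on platforms that perform it. The helper name read_raw is hypothetical and not part of the original answer. The answer does not establish why this changes Nokogiri's behaviour, so treat the encoding difference as one plausible factor rather than a confirmed cause.

```ruby
# Hypothetical helper (name is an assumption, not from the original answer):
# read a file exactly as File.open(path, "rb") does.
def read_raw(path)
  # "rb" = read-only, binary mode: bytes are returned as-is and the
  # string is tagged with the ASCII-8BIT (binary) encoding.
  File.open(path, "rb") { |file| file.read }
end

# Example usage, assuming file.html sits in the current working directory:
#   page = read_raw("file.html")
#   page.encoding  # the binary encoding, rather than the default external one
```

The returned string can then be passed to Nokogiri::HTML as before. Inspecting the parsed document's errors list afterwards is one way to see whether the parser complained about the input.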

OTHER TIPS

If your code can't read the file, Nokogiri will still create an empty HTML document when attempting to parse an empty string:

[2] (pry) main: 0> Nokogiri::HTML('')
=> #(Document:0x245962c {
  name = "document",
  children = [ #(DTD:0x24ab210 { name = "html" })]
  })
[3] (pry) main: 0> Nokogiri::HTML('').to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n\n"

At that point, checking its class still reports a Nokogiri::HTML document:

[4] (pry) main: 0> Nokogiri::HTML('').class
=> Nokogiri::HTML::Document

So checking the class name with puts page.class doesn't tell you anything, and searching for the cells returns an empty result:

[3] (pry) main: 0> Nokogiri::HTML('').css('td i.blue')
=> []

If you want to know whether you actually read the document, I'd check whether you got any characters:

abort("Got nothing") if page.empty?

instead of printing the contents or looking at the document's class.

Also, I'd use page = File.read('file.html') instead of the File.open, but that's just me.

This all points to the file not being found or being empty. You could use File.exist?('file.html') to check that it exists (the older spelling File.exists? is deprecated and was removed in Ruby 3.2), and File.size('file.html') to check that it has contents before continuing.
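
The existence and size checks described above could be combined into one guard, sketched here under some assumptions: the helper name read_page! and its error messages are hypothetical, and it uses File.exist?, the spelling that still works in current Ruby versions.

```ruby
# Hypothetical helper (not from the original answer): read a file,
# failing loudly if it is missing or empty instead of silently
# handing an empty string to the parser.
def read_page!(path)
  raise ArgumentError, "No such file: #{path}" unless File.exist?(path)
  raise ArgumentError, "File is empty: #{path}" if File.size(path).zero?

  File.read(path)
end

# Example usage:
#   page = read_page!("file.html")
```

A relative path such as 'file.html' is resolved against the working directory of whatever process launches the script, so when the script is started by another tool it may be worth passing an absolute path.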

Licensed under: CC-BY-SA with attribution