Question

I am trying to run a local Ruby script that uses Mechanize to log into a website, visit about 1500 of its pages, and parse information from each of them. The parsing works, but only for a while: the script runs for about 45 seconds and then stops completely and reports:

/Users/myname/.rvm/gems/ruby-1.9.3-p374/gems/mechanize-2.7.1/lib/mechanize/http/agent.rb:306:in `fetch': 503 => Net::HTTPServiceUnavailable for http://example.com/page;53 -- unhandled response (Mechanize::ResponseCodeError)

I can't tell for sure, but my guess is that this is due to a connection timeout. I tried to address that by setting very long timeouts in the script (it can take up to 15 minutes to run), but that didn't change anything. Let me know if you have any ideas.

This is my script:

require 'mechanize'
require 'open-uri'
require 'rubygems'

agent = Mechanize.new 
agent.open_timeout   = 1000
agent.read_timeout   = 1000
agent.max_history = 1

page = agent.get('http://examplesite.com') # Mechanize needs a full, absolute URL here

myform = page.form_with(:action => '/maint')

myuserid_field = myform.field_with(:id => "username")
myuserid_field.value = 'myusername'  
mypass_field = myform.field_with(:id => "password")
mypass_field.value = 'mypassword' 

page = agent.submit(myform, myform.buttons.first)

urlArray = [giant array of webpages here]

urlArray.each do |term|
    page = agent.get(term)
    page.encoding = 'windows-1252'
    puts agent.page.parser.xpath("//tr[4]/td[2]/textarea/text()").text + 'NEWLINEHERE'
end

Solution

Try calling sleep(1) inside your each loop. A 503 means the server itself is refusing to respond, and it is very likely being overwhelmed (or rate-limiting you) because you are firing off requests without any pause.
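For example, your loop with a pause added (same variables as in your script; one second is just a starting point):

urlArray.each do |term|
    page = agent.get(term)
    page.encoding = 'windows-1252'
    puts page.parser.xpath("//tr[4]/td[2]/textarea/text()").text + 'NEWLINEHERE'
    sleep(1) # pause between requests so the server is not flooded
end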

OTHER TIPS

My first suspicion is that you are violating the site's terms of service (TOS) and/or their robots.txt file, and their system is temporarily banning you.

Running a spider or crawler at full speed isn't being a good network citizen, so find their TOS and learn how to load and parse a robots.txt file so you can play by their rules. Mechanize knows how to honor robots.txt, but you have to enable it with robots=.
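A small sketch of what that looks like (once robots is enabled, Mechanize raises Mechanize::RobotsDisallowedError for URLs that robots.txt forbids, so rescue that if you want to skip them rather than abort):

agent = Mechanize.new
agent.robots = true # honor the site's robots.txt

begin
  page = agent.get(term)
rescue Mechanize::RobotsDisallowedError
  # robots.txt forbids this URL, so skip it
end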

Trying to read 1500 pages in one go, without an agreement that it's OK, looks like an obvious sack-and-pillage run, so don't hit them so hard. Remember, it's their bandwidth and CPU you're consuming too. Keep hammering them and they might ban you permanently, which is not what you want.
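If you still hit the occasional 503 even at a slower pace, you can rescue it and back off before retrying. A rough sketch wrapping the agent.get call inside your loop (response_code is the HTTP status as a string; the 30-second pause and the unlimited retry are arbitrary choices, so you may want to cap them):

begin
  page = agent.get(term)
rescue Mechanize::ResponseCodeError => e
  raise unless e.response_code == '503'
  sleep(30) # the server says it is unavailable, so back off
  retry
end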

It could also be that the server is responding slowly or not at all, in which case catching the error lets the script carry on with the remaining pages. I had a similar sort of problem before and solved it with a timeout. You might implement it like this:

require 'timeout'

begin
  Timeout.timeout(5) do
    # interrupts if the request takes more than 5 seconds
    page = agent.get(term)
  end
rescue Timeout::Error
  # note which URL timed out and carry on where it left off
  next
end

If you need to persist what you have scraped between runs (and you are in a Rails environment), you might use Rails.cache.write and Rails.cache.read to store and read the data.
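A minimal sketch, assuming you are inside a Rails app with a cache store configured (the cache key is made up for illustration):

Rails.cache.write("scraped/#{term}", page.parser.xpath("//tr[4]/td[2]/textarea/text()").text)
text = Rails.cache.read("scraped/#{term}")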

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow