Question

I'm having issues getting around websites that use http authentication, I have a list of sites which I do some scrapping on but some of these have http authentication on them. I'm not looking to get the content of those sites I want to be able to be able to determine if they are guarded by http auth and then move on. For example in the snippet below agent.get never return so it's impossible for me to handle it. How can I handle a case like this?

require 'mechanize'
agent = Mechanize.new
page = agent.get('http://freyalovesmusic.co.uk')
Was it helpful?

Solution

You could assume that if a page takes too long to load, it is using http authentication. Obviously not 100% accurate, but perhaps good enough for your situation?

You can use the Timeout module to move on after a certain amount of time, even if agent.get never returns:

require 'mechanize'
require 'timeout'

agent = Mechanize.new
begin
    Timeout::timeout(5) do
        page = agent.get('http://freyalovesmusic.co.uk')
    end
rescue Timeout::Error
    puts 'Page likely using http authentication'
end

OTHER TIPS

It should be raising a Mechanize::UnauthorizedError but it's misbehaving for some reason. Maybe you should report it on the mechanize github issues form.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top