Question

I had been using the code below to crawl a website, but I think I crawled too much and got myself banned from the site entirely. I can still access the site in my browser, but any code that uses open-uri against this site raises a 503 Service Unavailable error. I think this is site-specific, because open-uri still works fine with, say, Google and Facebook. Is there a workaround for this?

require 'rubygems'
require 'hpricot'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.quora.com/What-is-the-best-way-to-get-ove$

topic = doc.at('span a.topic_name span').content
puts topic

Solution

There are workarounds, but the best idea is to be a good citizen according to their terms. You might want to confirm that you are following their Terms of Service:

If you operate a search engine or robot, or you republish a significant fraction of all Quora Content (as we may determine in our reasonable discretion), you must additionally follow these rules:

  • You must use a descriptive user agent header.
  • You must follow robots.txt at all times.
  • You must make it clear how to contact you, either in your user agent string, or on your website if you have one.

You can set your user-agent header easily using OpenURI:

Additional header fields can be specified by an optional hash argument.

  require 'open-uri'

  # On Ruby 3.0 and later, call URI.open; a bare open no longer accepts URLs.
  URI.open("http://www.ruby-lang.org/en/",
    "User-Agent" => "Ruby/#{RUBY_VERSION}",
    "From" => "foo@bar.invalid",
    "Referer" => "http://www.ruby-lang.org/") {|f|
    # ...
  }

Their robots.txt can be retrieved from http://www.quora.com/robots.txt. You'll need to parse it and honor its rules, or they'll ban you again.
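
As a rough illustration of what "parse it and honor it" can look like, here is a minimal sketch. The helper names disallowed_prefixes and allowed? are made up for this example, and it only understands User-agent and Disallow lines with plain prefix matching (no wildcards, no Allow rules), so treat it as a starting point rather than a complete robots.txt implementation.

```ruby
# Minimal robots.txt check (hypothetical helpers, not a full parser):
# handles only User-agent and Disallow lines with plain prefix matching.
def disallowed_prefixes(robots_txt, agent)
  rules = {}
  current_agents = []
  in_rules = false
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip          # drop comments and whitespace
    next if line.empty?
    field, value = line.split(':', 2).map(&:strip)
    next if value.nil?                        # not a "field: value" line
    case field.downcase
    when 'user-agent'
      # A User-agent line that follows rule lines starts a new group.
      current_agents = [] if in_rules
      in_rules = false
      current_agents << value.downcase
      rules[value.downcase] ||= []
    when 'disallow'
      in_rules = true
      # An empty Disallow value means "nothing is disallowed".
      current_agents.each { |a| rules[a] << value } unless value.empty?
    end
  end
  # Prefer a group that names our agent; otherwise fall back to "*".
  key = rules.keys.find { |a| a != '*' && agent.downcase.include?(a) }
  key ? rules[key] : rules.fetch('*', [])
end

def allowed?(robots_txt, agent, path)
  disallowed_prefixes(robots_txt, agent).none? { |prefix| path.start_with?(prefix) }
end

# Example with an inline robots.txt; "ExampleBot" is a placeholder agent name.
sample = "User-agent: *\nDisallow: /private/\n"
puts allowed?(sample, "ExampleBot/0.1", "/private/page")
puts allowed?(sample, "ExampleBot/0.1", "/public/page")
```

In a real crawler you would fetch the robots.txt body once per run, keep it in memory, and call allowed? with the path of every URL before requesting it.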

Also, you might want to throttle your crawler by sleeping between requests, so that you don't hit their servers in a tight loop.
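
One way to space requests out is sketched below. The each_throttled helper is hypothetical, and the interval you pass is your own choice; nothing here reflects a limit the site has published.

```ruby
# Runs the block once per item, pausing so that successive calls start
# at least min_interval seconds apart. (Hypothetical helper for illustration.)
def each_throttled(items, min_interval)
  last_start = nil
  items.each do |item|
    if last_start
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - last_start
      sleep(min_interval - elapsed) if elapsed < min_interval
    end
    last_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield item
  end
end

# Example: in a crawler, the block would fetch and parse each URL instead.
each_throttled(%w[/page-1 /page-2], 0.1) { |path| puts "would fetch #{path}" }
```

Measuring from the start of the previous request, rather than always sleeping a fixed amount, means slow responses don't add extra delay on top of the pause.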

Also, if you are spidering their site for content, you might want to look into caching pages locally, or using one of the existing spidering packages. It's easy to write a spider. It's more work to write one that plays nicely with a site, but that's better than not being able to spider the site at all.
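
A local cache can be as simple as one file per URL. The sketch below assumes a hypothetical cached_fetch helper that takes the actual download step as a block, names each file after a SHA1 digest of the URL, and never expires entries; a real crawler would probably want some expiry policy.

```ruby
require 'digest'
require 'fileutils'

# Returns the cached body for url if one exists on disk; otherwise calls
# the block to fetch it, stores the result, and returns it.
# (Hypothetical helper; entries never expire in this sketch.)
def cached_fetch(url, cache_dir)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::SHA1.hexdigest(url))
  return File.read(path) if File.exist?(path)
  body = yield(url)
  File.write(path, body)
  body
end

# Intended use with open-uri (not executed here; the contact address is a placeholder):
#   html = cached_fetch(url, "cache") do |u|
#     URI.open(u, "User-Agent" => "MyCrawler/0.1 (me@example.invalid)").read
#   end
```

Because the download is passed in as a block, repeated runs of the script only touch the network for pages that are not already on disk.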

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow