Question

I am scraping a small number of sites with the ruby anemone gem.

Anemone.crawl("http://www.somesite.com") do |anemone|
  anemone.on_every_page do |page|
    # ...
  end
end

Depending on the site, some require 'www' to be present in the URL while others require that it be omitted. How can I configure the crawler, or write my code, so that it knows which form of the URL to use?


Solution

You can't know in advance, so do something similar to what you'd do while sitting in front of a browser.

Try one form: check that you get a connection, that the response is a 200, and that the page title doesn't contain "error". If none of those checks fail, consider it good.

If not, try the other.
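A minimal sketch of that probing step in Ruby, assuming Net::HTTP; the helper name pick_reachable_url is made up for this example, and the title check is left out for brevity:

require 'net/http'
require 'uri'

# Hypothetical helper: try the bare host, then the "www." form, and
# return whichever answers with a 2xx or 3xx response.
def pick_reachable_url(host)
  ["http://#{host}", "http://www.#{host}"].each do |candidate|
    begin
      response = Net::HTTP.get_response(URI.parse(candidate))
      return candidate if response.is_a?(Net::HTTPSuccess) ||
                          response.is_a?(Net::HTTPRedirection)
    rescue SocketError, Errno::ECONNREFUSED, Net::OpenTimeout
      next  # no connection at all, so try the other form
    end
  end
  nil
end

start_url = pick_reachable_url("somesite.com")
Anemone.crawl(start_url) do |anemone|
  anemone.on_every_page do |page|
    # ...
  end
end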

The problem with using a canned spider/crawler is that you have to work around its code whenever your situation differs from what its authors expected when they wrote the software.

OTHER TIPS

Most sites automatically redirect the www form to the bare domain, or the other way around, so you usually should not have to worry about that.

I would think Anemone can handle redirects, but if it can't, I suggest you pre-check the URLs for redirects before you hand them over to Anemone. You can see how to do that here:

How can I get the final URL after redirects using Ruby?

For example:

final_url = check_base_url_for_redirect('www.somesite.com')
Anemone.crawl(final_url) ...
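check_base_url_for_redirect is not a standard method; a minimal sketch of what it might look like with Net::HTTP, following redirects up to a small limit (along the lines of the linked question):

require 'net/http'
require 'uri'

# Sketch of the helper used above: follow redirects (up to a small
# limit) and return the final URL as a string. Assumes the Location
# header holds an absolute URL, which is typical for a site's base URL.
def check_base_url_for_redirect(url, limit = 5)
  raise 'too many redirects' if limit.zero?
  url = "http://#{url}" unless url.start_with?('http')
  response = Net::HTTP.get_response(URI.parse(url))
  if response.is_a?(Net::HTTPRedirection)
    check_base_url_for_redirect(response['location'], limit - 1)
  else
    url
  end
end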
Licensed under: CC-BY-SA with attribution