Question

I need to fetch an HTML page together with all the objects on it (stylesheets, JavaScript files, images) and store the data in a database. I could implement this by simply fetching the files listed in the src attributes, but maybe someone can suggest a helper gem for this.

Also, is there a way to package all these files into one (like a web archive) that can be opened by most browsers?

Thanks

Solution

You could use mechanize to do this job:

require "mechanize"

url = "http://stackoverflow.com/"
# Current Mechanize releases expose the class as Mechanize;
# very old releases used WWW::Mechanize instead.
agent = Mechanize.new
page = agent.get(url)


# Images: select only <img> tags that have a src attribute
page.search('img[src]').each do |image|
  src = image["src"]
  image_file = agent.get(src) if src
  # Store the image_file data in the database ...
end

# Stylesheets: <link> tags carry their URL in href, not src
page.search('link[rel="stylesheet"]').each do |css|
  href = css["href"]
  css_file = agent.get(href) if href
  # Store the css_file data in the database ...
end

# External scripts: match on src so tags without a type attribute are included
page.search('script[src]').each do |script|
  src = script["src"]
  script_file = agent.get(src) if src
  # Store the script_file data in the database ...
end

You still have to handle exceptions and resolve resources that are referenced by relative URLs, but this should do the job. Note, however, that this solution will not fetch images that are referenced from within the stylesheets.
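The two gaps mentioned above (relative URLs and images referenced inside stylesheets) could be approached roughly as follows. This is only a sketch using Ruby's standard uri library; the helper names absolute_url and css_urls are made up for illustration and are not part of Mechanize, and the regular expression covers only simple url(...) references, not every form CSS allows.

```ruby
require "uri"

# Hypothetical helper (not part of Mechanize): turn a possibly relative
# reference into an absolute http(s) URL, or return nil if it cannot be fetched.
def absolute_url(base_url, ref)
  return nil if ref.nil? || ref.strip.empty?
  uri = URI.join(base_url, ref.strip)
  uri.is_a?(URI::HTTP) ? uri.to_s : nil   # URI::HTTPS is a subclass of URI::HTTP
rescue URI::Error
  nil   # malformed reference, e.g. one containing unescaped spaces
end

# Hypothetical helper: collect the url(...) references from stylesheet text,
# resolved against the stylesheet's own URL (not the page URL).
def css_urls(css_text, css_url)
  refs = css_text.scan(/url\(\s*["']?([^"')\s]+)["']?\s*\)/i).flatten
  refs.map { |ref| absolute_url(css_url, ref) }.compact.uniq
end
```

Inside the loops, each src or href value could be passed through absolute_url(url, src) before calling agent.get, and references that come back as nil (data: URIs, malformed values) skipped. Wrapping agent.get in a begin/rescue for Mechanize::ResponseCodeError is one way to keep a single missing resource from aborting the whole run.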

OTHER TIPS

Check out Mechanize

Licensed under: CC-BY-SA with attribution