How to fetch an HTML page with all its objects in Ruby
Question
I need to fetch an HTML page together with all the objects on it (stylesheets, JavaScript files, images) and store the data in a database. This could be implemented by simply fetching the files listed in src attributes, but maybe someone can suggest a helper gem for this.
Also, is there a way to package all these files into one (like a web archive) that can be opened by most browsers?
Thanks
Solution
You could use mechanize to do this job:
require "rubygems"
require "mechanize"
url = "http://stackoverflow.com/"
agent = Mechanize.new # older releases exposed this class as WWW::Mechanize
page = agent.get(url)
# Fetch every image that has a src attribute
page.search('img[src]').each do |image|
  src = image["src"]
  image_file = agent.get(src) if src
  # Store image_file.body in the database ...
end
# Stylesheets are referenced through href, not src
page.search('link[rel="stylesheet"]').each do |css|
  href = css["href"]
  css_file = agent.get(href) if href
  # Store css_file.body in the database ...
end
# Match on src so that external scripts without a type attribute are included
page.search('script[src]').each do |script|
  src = script["src"]
  script_file = agent.get(src) if src
  # Store script_file.body in the database ...
end
You still have to handle exceptions (failed requests, missing files) and resolve relative URLs in the resource attributes against the page URL. Apart from that, this should do the job. Note, however, that this solution does not fetch images that are referenced from within the stylesheets.
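The two gaps mentioned above could be closed roughly as follows. This is only a sketch using Ruby's standard uri library; the helper names absolutize and css_urls are made up for illustration and are not part of Mechanize, and the regular expression is a simplification rather than a full CSS parser.

```ruby
require "uri"

# Hypothetical helper: turn a possibly relative reference into an absolute URL.
# Returns nil when the reference is missing, empty, or cannot be parsed.
def absolutize(base_url, ref)
  return nil if ref.nil? || ref.strip.empty?
  URI.join(base_url, ref.strip).to_s
rescue URI::Error
  nil
end

# Hypothetical helper: collect the targets of url(...) references in CSS text,
# skipping inline data: URIs and duplicates.
def css_urls(css_text)
  css_text.scan(/url\(\s*["']?([^"')]+?)["']?\s*\)/i)
          .flatten
          .reject { |u| u.start_with?("data:") }
          .uniq
end

base = "http://example.com/blog/post.html"
puts absolutize(base, "../img/logo.png")

css = 'body { background: url("/bg.png"); } .a { background: url(icons/a.gif) }'
puts css_urls(css).inspect
```

Keep in mind that references found inside a stylesheet are relative to the stylesheet's own URL, so that URL (not the page URL) should be passed as base_url when resolving them before calling agent.get.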
OTHER TIPS
Check out Mechanize
Licensed under: CC-BY-SA with attribution