Loading a webpage for parsing in Rails
16-09-2019
Question
Suppose I want to fetch a page from the web in my application and parse it. How do I do that? Where should I start? Are any plugins or gems required? What is your usual approach to this kind of task?
Solution
You should try gems like Hpricot or Nokogiri.
Hpricot example:
require 'open-uri'
require 'rubygems'
require 'hpricot'
html = Hpricot(open(an_url).read)
# This would search for any images inside a paragraph (XPath)
html.search('/html/body//p//img')
# This would search for any images with the class "test" (CSS selector)
html.search('img.test')
Nokogiri example:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
# URI.open is the open-uri entry point on Ruby 2.5+; older versions use plain open(an_url)
html = Nokogiri::HTML(URI.open(an_url).read)
# This would search for any images inside a paragraph (XPath)
html.xpath('/html/body//p//img')
# This would search for any images with the class "test" (CSS selector)
html.css('img.test')
Nokogiri is generally faster. Both libraries let you search a document by XPath or by CSS selector, as the examples above show.
OTHER TIPS
What you want to do is called "scraping".
Ryan Bates made two excellent screencasts on this topic.
I personally like Nokogiri more. You can also check out the following answer: Best Rails HTML Parser
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow