Question

I have this class

class Scrapper
    require 'rubygems'
    require 'mechanize'

    def initialize(url)
        @url = url
        agent = Mechanize.new
        @page = agent.get(url)
    end

    def perform(type)
        if type == 'title'
            get_title
        else
            get_content
        end
    end

    def get_title
        @page.title
    end

    def get_content
        @page
    end
end

Right now I can get the title of the page, but how do I get the relevant content?
E.g. http://thenextweb.com/facebook/2014/03/06/facebook-launches-improved-version-major-news-feed-redesign-teased-last-year/#!yJE5N

  • I would like to get a cover/any relevant image if any.
  • The content of the page.
    Thanks.

Solution

This will return the featured image as a Nokogiri::XML::Element:

def get_article_image_tag
  @page.at(".article-featured-image > img")
end
#=> #<Nokogiri::XML::Element:0x19ac280 name="img" attributes=[#<Nokogiri::XML::Attr:0x19ac238 name="width" value="786">, #<Nokogiri::XML::Attr:0x19ac22c name="height" value="305">, #<Nokogiri::XML::Attr:0x19ac220 name="src" value="http://cdn0.tnwcdn.com/wp-content/blogs.dir/1/files/2014/03/187265573-786x305.jpg">, #<Nokogiri::XML::Attr:0x19ac214 name="class" value="attachment-featured_post wp-post-image">, #<Nokogiri::XML::Attr:0x19ac208 name="alt" value="SWEDEN-FACEBOOK-DATA-CENTER-SERVERS">, #<Nokogiri::XML::Attr:0x19ac1fc name="title" value="Facebook launches an improved version of the News Feed redesign teased last year">]>

This will return the image's source URL:

def get_article_image_src
  @page.at(".article-featured-image > img").attributes["src"].value
end
#=>"http://cdn0.tnwcdn.com/wp-content/blogs.dir/1/files/2014/03/187265573-786x305.jpg"

To get the article text

def get_article_text
  @page.at("div.article").text
end

This returns the article text with no formatting: just plain text, including non-visible characters such as \n and \t. It also appears to pick up any HTML/JavaScript code that sits inside the selected element.
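If the stray \n and \t characters get in the way, the raw string can be tidied up afterwards. The following is a minimal sketch using only core Ruby; `normalize_whitespace` is a hypothetical helper name, not part of Mechanize or Nokogiri, and it only handles whitespace (it does not strip embedded script code).

```ruby
# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# newlines) into a single space and trim both ends of the string.
def normalize_whitespace(raw_text)
  raw_text.gsub(/\s+/, " ").strip
end

# Sample string standing in for the output of @page.at("div.article").text
sample = "\n\t  Facebook launches \n\n an improved\tNews Feed  \n"
puts normalize_whitespace(sample)
```

You could call it as normalize_whitespace(get_article_text) if you want single-spaced text.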

For more dynamic behavior, you could change your perform method to:

def perform(type)
  # String interpolation already calls to_s, so symbols work too
  send("get_#{type}")
end

It can then be called with any of "content", "title", "article_image_tag", "article_image_src", or any other get_xxx method you define.
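One caveat with dynamic dispatch is that an unexpected type raises a NoMethodError, and send will also reach private methods. A possible guard, sketched below, is to check the type against an explicit list first. `StubScrapper`, `ALLOWED_TYPES`, and the stub method bodies are hypothetical stand-ins so the example runs without Mechanize.

```ruby
# Hypothetical stand-in for the Scrapper class, with stubbed getters so the
# dispatch logic can be run without fetching a page.
class StubScrapper
  # Only these types may be dispatched (an assumption for this sketch).
  ALLOWED_TYPES = %w[title content].freeze

  def perform(type)
    type = type.to_s
    unless ALLOWED_TYPES.include?(type)
      raise ArgumentError, "unknown type: #{type}"
    end
    # public_send refuses to call private methods, unlike send
    public_send("get_#{type}")
  end

  def get_title
    "stub title"
  end

  def get_content
    "stub content"
  end
end

puts StubScrapper.new.perform(:title)
```

The list has to be kept in sync with the get_xxx methods you add, which is the trade-off for failing early with a clear error.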

Edit: to show your user all the images, this will work in a Rails view:

<% @page.images.each do |image| %>
  <%= image_tag(image.url) %>
<% end %>

This will iterate through all the images and display each one in an image tag on your page. Obviously this may need tinkering depending on whether the URLs are relative or absolute.
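If you do end up with relative src values, Ruby's standard library can resolve them against the page URL. This is a minimal sketch; `absolute_image_url` is a hypothetical helper name and the example.com URLs are placeholders.

```ruby
require 'uri'

# Hypothetical helper: resolve an image src against the URL of the page it
# came from. Absolute srcs are returned unchanged.
def absolute_image_url(page_url, image_src)
  URI.join(page_url, image_src).to_s
end

page_url = "http://example.com/articles/post.html"

puts absolute_image_url(page_url, "/images/cover.jpg")
puts absolute_image_url(page_url, "thumbs/a.png")
puts absolute_image_url(page_url, "http://cdn.example.com/b.jpg")
```

The first call resolves against the site root, the second against the page's directory, and the third is already absolute, so it passes through untouched.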

Honestly, unless you need Mechanize to set cookies or something similar, I would take a look at Nokogiri. I'm not 100% sure how to do this with Mechanize, but with Nokogiri you could estimate the "relevance" of a picture by its overall size, like so:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://thenextweb.com/facebook/2014/03/06/facebook-launches-improved-version-major-news-feed-redesign-teased-last-year/#!yJ6uM"))
# Rank by pixel area; a missing width/height attribute is nil, and nil.to_i is 0
largest_image = doc.search("img").max_by { |image| image["height"].to_i * image["width"].to_i }
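Many img tags carry no width or height attributes at all, so the size ranking needs to cope with missing values. The sketch below isolates that ranking logic using plain hashes as stand-ins for img nodes (an assumption made so it runs without Nokogiri or a network call); `image_area` is a hypothetical helper name. Hash#[] returns nil for a missing key, and nil.to_i is 0, so images without dimensions simply rank last.

```ruby
# Plain hashes standing in for <img> nodes (sample data, not scraped).
images = [
  { "src" => "logo.png",  "width" => "120", "height" => "40" },
  { "src" => "cover.jpg", "width" => "786", "height" => "305" },
  { "src" => "pixel.gif" } # no width/height attributes
]

# Hypothetical helper: pixel area, treating missing dimensions as 0.
def image_area(image)
  image["width"].to_i * image["height"].to_i
end

largest = images.max_by { |image| image_area(image) }
puts largest["src"]
```

Keep in mind that attribute sizes are only a heuristic: they describe how the image is displayed, not necessarily how relevant it is to the article.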
Licensed under: CC-BY-SA with attribution