Question

I'm looking for a Ruby gem for my Ruby on Rails project that extracts content from web pages. I found the ruby-readability gem, but it does not support articles that are split across multiple pages. Can you recommend a gem that also supports multi-page article extraction?

Or how can I code the ability to recognise articles that span multiple pages?

Thanks

Solution

You can use a high-level gem like Pismo in combination with Mechanize to iteratively go through each page and concatenate the body of the article. For that, you need to know which link takes you to the next page. Google is pushing for the adoption of a convention based on the rel attribute:

<a href="blog-post?page=2" rel='next'>next</a>
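If you want to detect that link yourself, here is a minimal sketch using only Ruby's standard library. The helper name next_page_url is made up for illustration, and the regex scan is a simplification that assumes quoted attribute values; in a real project an HTML parser such as Nokogiri would be more robust.

```ruby
require 'uri'

# Hypothetical helper: returns the absolute URL of the first <a> or <link>
# tag whose rel attribute contains "next", or nil if there is none.
# The regex scan is a simplification and assumes quoted attribute values.
def next_page_url(html, base_url)
  html.scan(/<(?:a|link)\b[^>]*>/i) do |tag|
    rel  = tag[/\brel\s*=\s*["']([^"']*)["']/i, 1]
    href = tag[/\bhref\s*=\s*["']([^"']*)["']/i, 1]
    next unless rel && href
    # rel may hold several space-separated values, e.g. rel="next nofollow"
    return URI.join(base_url, href).to_s if rel.downcase.split.include?('next')
  end
  nil
end

html = %q(<p>Page 1</p><a href="blog-post?page=2" rel='next'>next</a>)
puts next_page_url(html, 'http://www.awesomeblog.com/blog-post')
# prints http://www.awesomeblog.com/blog-post?page=2
```

Resolving the href against the current page's URL matters because next links are often relative, as in the example above.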

Here's a very rough draft of the Ruby code:

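require 'mechanize'
require 'pismo'

agent = Mechanize.new
agent.get("http://www.awesomeblog.com/amazing-article")

# MyScraper stands in for your own model; start with the first page's body
scraper = MyScraper.new(:text => Pismo::Document.new(agent.page.uri.to_s).body)

# Follow rel="next" links until the current page no longer has one
while (next_link = agent.page.links.find { |link| link.rel?('next') })
  next_link.click
  scraper.text << Pismo::Document.new(agent.page.uri.to_s).body
end

scraper.save!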

This is pseudocode and a wild guess (I don't know the Mechanize API), but you get the general idea.
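One thing a loop like this may need is a stopping guard: if the last page links back to an earlier one, following rel="next" never ends. Here is a sketch of the same idea with fetching and extraction injected as a lambda so it can run on its own. The names collect_article, fetch_page and max_pages are made up for illustration, and the fake site stands in for real Mechanize and Pismo calls.

```ruby
require 'set'

# Hypothetical helper: collects text from an article split across pages.
# fetch_page is any callable that takes a URL and returns [text, next_url],
# where next_url is nil on the last page. In a real project it would wrap
# Mechanize and Pismo; here it is injected so the loop can be run on its own.
def collect_article(start_url, fetch_page, max_pages = 20)
  visited = Set.new
  parts = []
  url = start_url
  # Stop on the last page, on a repeated URL, or after max_pages pages
  while url && !visited.include?(url) && visited.size < max_pages
    visited << url
    text, url = fetch_page.call(url)
    parts << text
  end
  parts.join("\n\n")
end

# Stand-in for real HTTP requests: page 3 wrongly links back to page 1
fake_site = {
  'http://example.com/a?page=1' => ['Part one.',   'http://example.com/a?page=2'],
  'http://example.com/a?page=2' => ['Part two.',   'http://example.com/a?page=3'],
  'http://example.com/a?page=3' => ['Part three.', 'http://example.com/a?page=1']
}
fetch = ->(url) { fake_site.fetch(url) }

puts collect_article('http://example.com/a?page=1', fetch)
```

Each page is visited at most once, so the three parts are joined in order and the loop stops when page 3 points back to page 1.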

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow