Question

I am trying to parse Open Graph (og) meta tags from pages fetched with the HTTParty gem, using this code:

require 'httparty'

# URLs must be quoted string literals to be valid Ruby
link = 'http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/'
# link = 'http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html'
resp = HTTParty.get(link)
ret_body = resp.body

# title
og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"\/\>/)
og_title = og_title[1].to_s

The problem is that it works on some sites (Yahoo!) but not on others (USA Today).


Solution

Don't parse HTML with regular expressions, because they're too fragile for anything but the simplest problems. A tiny change to the HTML can break the pattern, and you end up in a slow battle of maintaining an ever-expanding pattern. It's a war you won't win.
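To make the fragility concrete, here is a minimal sketch that runs the pattern from the question against three made-up snippets of the same tag. The snippets, the `SAMPLES` hash, and the `og_title_via_regex` helper are illustrative assumptions, not markup taken from either site.

```ruby
# The pattern from the question, unchanged.
OG_TITLE = /\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"\/\>/

# Hypothetical snippets: the same tag written three equally valid ways.
SAMPLES = {
  compact:   '<meta property="og:title" content="Example"/>',
  spaced:    '<meta property="og:title" content="Example" />',
  reordered: '<meta content="Example" property="og:title"/>'
}

# Returns the captured title, or nil when the pattern does not match.
def og_title_via_regex(html)
  m = html.match(OG_TITLE)
  m && m[1]
end

SAMPLES.each do |name, html|
  puts "#{name}: #{og_title_via_regex(html).inspect}"
end
```

Only the `compact` form matches. A single space before `/>` or a different attribute order is enough to make the pattern return nothing, even though a browser treats all three tags the same.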

Instead, use an HTML parser. Ruby has Nokogiri, which is excellent. Here's how I'd do what you want:

require 'nokogiri'
require 'httparty'

%w[
  http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/
  http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html
].each do |link|
  resp = HTTParty.get(link)

  doc = Nokogiri::HTML(resp.body)
  puts doc.at('meta[property="og:title"]')['content']
end

Which outputs:

Jets fire offensive coordinator Tony Sparano
Chicago lottery winner's death ruled a homicide

Other tips

Perhaps I can offer an easier solution? Check out the OpenGraph gem.

It's a simple library for parsing Open Graph protocol information from web sites and should solve your problem.

Solution:

og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"\s*\/\>/)
og_title = og_title[1].to_s

Whitespace before the closing /> broke the match, so make sure to allow for it. I added \s* to the regex so that it accepts the tag both with and without whitespace before the />. (Square brackets would not work as an OR clause here: they form a character class that matches a single character.)
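If you stay with a regex, a somewhat more tolerant variant might look like the sketch below. The pattern, the `TOLERANT_OG_TITLE` constant, and the `og_title_tolerant` helper are assumptions for illustration and are tested only against made-up snippets.

```ruby
# \s+ and \s* allow flexible whitespace, \/? makes the slash optional,
# and the i flag ignores the case of the tag and attribute names.
TOLERANT_OG_TITLE = /<meta\s+property="og:title"\s+content="(.*?)"\s*\/?>/i

# Returns the captured title, or nil when the pattern does not match.
def og_title_tolerant(html)
  m = html.match(TOLERANT_OG_TITLE)
  m && m[1]
end

p og_title_tolerant('<meta property="og:title" content="Example"/>')
p og_title_tolerant('<META property="og:title" content="Example" />')
p og_title_tolerant('<meta property="og:title" content="Example">')

# Attribute order is still hard-coded, so this tag is not matched.
p og_title_tolerant('<meta content="Example" property="og:title"/>')
```

The first three calls return the title and the last returns `nil`, which shows the limit of this approach: each new markup variation needs another change to the pattern.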

License: CC-BY-SA with attribution
Not affiliated with StackOverflow