Question

I am trying to get the data from Wikipedia's infoboxes into a hash or something so that I can use it in my Ruby on Rails program. Specifically, I'm interested in the Infobox company and Infobox person templates. The example I have been using is "Ford Motor Company": I want the company info for that article and the person info for the people linked from Ford's company box.

I've tried figuring out how to do this with the Wikipedia API or DBpedia, but I haven't had much luck. I know Wikipedia can return some things as JSON, which I could parse with Ruby, but I haven't been able to figure out how to get the infobox. In the case of DBpedia I am kind of lost on how to even query it to get the info for Ford Motor Company.


Solution

I vote for DBpedia.

A simple explanation is:

The DBpedia naming scheme is http://dbpedia.org/resource/WikipediaArticleName (the unique identifier), with spaces replaced by _.

http://dbpedia.org/page/ArticleName is the HTML preview, and http://dbpedia.org/data/ArticleName.json (or .jsod) is the JSON representation of the information about the article you want. (.rdf etc. might be confusing for you right now.)

For Ford Motor Company you should ask for:

http://dbpedia.org/data/Ford_Motor_Company.json

or:

http://dbpedia.org/data/Ford_Motor_Company.jsod

(Whichever is simpler for you)

Now, depending on the article type (person or company), different properties describe it; these come from the DBpedia ontology (http://wiki.dbpedia.org/Ontology).
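
For example, here is a minimal Ruby sketch that fetches the Ford JSON and prints the ontology properties attached to it. It assumes DBpedia's usual RDF/JSON layout (subject URI => property URI => array of type/value hashes), so check it against the response you actually get:

require 'open-uri'
require 'json'

resource = 'http://dbpedia.org/resource/Ford_Motor_Company'
data     = JSON.parse(URI.open('http://dbpedia.org/data/Ford_Motor_Company.json').read)

# Walk the properties recorded for the Ford resource itself and keep only
# the ones from the DBpedia ontology namespace.
(data[resource] || {}).each do |property, values|
  next unless property.start_with?('http://dbpedia.org/ontology/')
  puts "#{property.split('/').last}: #{values.map { |v| v['value'] }.join(', ')}"
end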

A more advanced step could be to use SPARQL queries to get your data.
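
As a taste of that, here is a rough sketch that sends a generic "list every property" SPARQL query to the public endpoint at http://dbpedia.org/sparql and parses the standard SPARQL JSON results format; once you know which ontology properties you need, narrow the query down:

require 'open-uri'
require 'json'
require 'cgi'

query = <<~SPARQL
  SELECT ?property ?value WHERE {
    <http://dbpedia.org/resource/Ford_Motor_Company> ?property ?value .
  } LIMIT 50
SPARQL

url = 'http://dbpedia.org/sparql?query=' + CGI.escape(query) +
      '&format=' + CGI.escape('application/sparql-results+json')
results = JSON.parse(URI.open(url).read)

# Each binding holds one property/value pair for the Ford resource.
results['results']['bindings'].each do |row|
  puts "#{row['property']['value']} => #{row['value']['value']}"
end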

OTHER TIPS

Don't try to parse HTML with RegExp.

See: RegEx match open tags except XHTML self-contained tags

Use XPath or something similar.
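
For instance, a minimal Nokogiri sketch that grabs an infobox via XPath; the assumption that the table's class list contains "infobox" may need adjusting to Wikipedia's current markup:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open('http://en.wikipedia.org/wiki/Ford_Motor_Company'))
# Match any table whose class list contains "infobox".
infobox = doc.at_xpath('//table[contains(@class, "infobox")]')
caption = infobox && infobox.at_xpath('.//caption')
puts caption.text if caption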

I looked at their API, and it looks like there's a lot of detail, but the complexity is a hurdle. For long-term use it'd be best to figure it out, but for something quick and dirty, here's a way to get at the data.

I'm using Nokogiri, which is an XML/HTML parser and very flexible. For ease of use I'm using CSS accessors.

#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'
require 'uri'

URL = 'http://en.wikipedia.org/wiki/Ford_Motor_Company'
doc = Nokogiri::HTML(URI.open(URL))

# The company infobox is the table whose class attribute is "infobox vcard".
infobox = doc.at('table[class="infobox vcard"]')
infobox_caption = infobox.at('caption').text

# Build a hash of link text => absolute URL for every link inside the
# infobox's "agent" cells (founder, key people, and their titles).
uri = URI.parse(URL)
infobox_agents = Hash[ *infobox.search('td.agent a').map{ |a| [ a.text, uri.merge(a['href']).to_s ] }.flatten ]

require 'ap' # awesome_print, for readable output
ap infobox_caption
ap infobox_agents

The output looks like:

"Ford Motor Company"
{
               "Henry Ford" => "http://en.wikipedia.org/wiki/Henry_Ford",
    "William C. Ford, Jr." => "http://en.wikipedia.org/wiki/William_Clay_Ford,_Jr.",
       "Executive Chairman" => "http://en.wikipedia.org/wiki/Chairman",
          "Alan R. Mulally" => "http://en.wikipedia.org/wiki/Alan_Mulally",
                "President" => "http://en.wikipedia.org/wiki/President",
                      "CEO" => "http://en.wikipedia.org/wiki/Chief_executive_officer"
}

So it pulls the text of the caption and returns a hash whose keys are the link texts from the infobox (people and their titles) and whose values are the corresponding Wikipedia URLs.
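
To also get the person data the question asks about, a rough follow-up sketch (building on the variables and requires in the script above) could walk those URLs and scrape each linked page's own infobox the same way; the table.infobox selector and the th label / td value row layout are assumptions about Wikipedia's markup:

# Follow each link from the company infobox and collect the rows of that
# page's infobox, assuming <th> holds the label and <td> the value.
people = {}
infobox_agents.each do |name, link|
  page = Nokogiri::HTML(URI.open(link))
  box  = page.at('table.infobox')
  next unless box

  fields = {}
  box.search('tr').each do |row|
    label, value = row.at('th'), row.at('td')
    fields[label.text.strip] = value.text.strip if label && value
  end
  people[name] = fields
end

ap people

Note that some of those links point at role articles (Chairman, CEO) rather than people, so you may want to filter the hash by name first.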

You can use open-uri to download the HTML of a wiki page and then extract values with a regexp. Look:

require 'open-uri'

infobox = {}
URI.open('http://en.wikipedia.org/wiki/Wikipedia') do |page|
  # This regexp is tied to the exact infobox markup at the time of writing,
  # so expect to tweak it when Wikipedia's HTML changes.
  page.read.scan(/<th scope="row" style="text-align:left;">(.*?)<\/th>.<td class="" style="">(.*?)<\/td>/m) do |key, value|
    infobox[key.gsub(/<.*?>/, '').strip] = value.gsub(/<.*?>/, '').strip # Strip HTML tags (such as hyperlinks)
  end
end

infobox["Slogan"]                #=> "The free encyclopedia that anyone can edit."
infobox["Available language(s)"] #=> "257 active editions (276 in total)"

There's probably a better method, but this works.

Licensed under: CC-BY-SA with attribution