Question

I am trying to extract the parent company information (in the infobox pane) for a page such as "KFC".

If you access the URL

http://en.wikipedia.org/wiki/KFC

the infobox contains the property (Parent = Yum! Brands).

However, when I access the page through the PHP API, the parent info is not included:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=KFC&rvsection=0

How do I ensure that the Wikipedia API returns the "Parent = " information as well (for a brand term like "KFC")? Essentially, I want to extract the info that Yum! Brands is the parent of KFC through the Wikipedia API.

Thanks!


Solution

Take a look at Wikipedia's official ways of getting information.
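For the API route, the request in the question returns the raw wikitext of section 0, where infobox fields are written as `| name = value` lines. Here is a minimal sketch of pulling one such field out with a regular expression. The function name `extractInfoboxField` and the sample wikitext are made up for illustration, and the sketch assumes a `parent` line is actually present in the returned text, which the question reports not seeing.

```php
<?php
// Hypothetical helper: pull one "| field = value" line out of raw infobox wikitext.
function extractInfoboxField(string $wikitext, string $field): ?string
{
    // Match a template parameter line such as "| parent = [[Yum! Brands]]".
    // [ \t]* is used instead of \s* so an empty value cannot swallow the next line.
    $pattern = '/^[ \t]*\|[ \t]*' . preg_quote($field, '/') . '[ \t]*=[ \t]*(.+)$/mi';
    if (!preg_match($pattern, $wikitext, $m)) {
        return null;
    }
    $value = trim($m[1]);
    // Reduce wiki links: [[Target|Label]] becomes Label, [[Target]] becomes Target.
    return preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]+)\]\]/', '$1', $value);
}

// Made-up sample wikitext, standing in for the content field of the API response.
$sample = "{{Infobox company\n| name = KFC\n| parent = [[Yum! Brands]]\n}}";
echo extractInfoboxField($sample, 'parent'), "\n";
```

A regular expression like this only handles simple one-line values; nested templates or multi-line fields would need a real wikitext parser.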

My suggestion would be to use screen scraping through PHP Simple HTML DOM Parser, which in my experience is the most dependable option, even if it's deprecated. The only downside is that if Wikipedia changes its page layout, you will have to update your code.

A guide to PHP Simple HTML DOM Parser.

Edit:

At least I'm doing something instead of linking to non-working resources and downvoting right answers...

Here's the code I made to get the Parent company information from the Infobox pane with the PHP Simple HTML DOM Parser.

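<?php

// The folder where you uploaded simple_html_dom.php
require_once('/homepages/../htdocs/simple_html_dom.php');

// Wikipedia page to parse
$html = file_get_html('http://en.wikipedia.org/wiki/KFC');
if (!$html) {
    die('Could not load the page');
}

// In the infobox, the "Parent" header cell links to the "Holding company" article
foreach ($html->find('tr th a[title=Holding company]') as $element) {
    // Climb from <a> to <th>, then to the enclosing <tr>
    $row = $element->parent->parent;

    // The first <td> of that row holds the value
    $tabella = $row->find('td', 0);
    if ($tabella === null) {
        continue;
    }

    // $parent now contains "Yum! Brands"
    $parent = trim($tabella->plaintext);

    echo $parent;
}

?>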
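The same row lookup can also be sketched with PHP's built-in DOMDocument and DOMXPath classes, which avoids the external library. This is only an illustration under assumptions: the helper name `findInfoboxValue` and the sample HTML are made up, and it relies on the same markup as the code above (a `th` holding a link titled "Holding company", followed by a `td` in the same row).

```php
<?php
// Hypothetical helper: find the infobox cell next to a header link with the given title.
function findInfoboxValue(string $html, string $linkTitle): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select the row whose <th> contains <a title="...">, then the first <td> of that row.
    // Note: $linkTitle must not contain double quotes, since it is pasted into the query.
    $query = sprintf('//tr[th//a[@title="%s"]]/td[1]', $linkTitle);
    $cells = $xpath->query($query);
    if ($cells === false || $cells->length === 0) {
        return null;
    }
    return trim($cells->item(0)->textContent);
}

// Made-up sample markup imitating one infobox row.
$sample = '<table class="infobox"><tr><th><a title="Holding company">Parent</a></th>'
        . '<td><a href="/wiki/Yum!_Brands">Yum! Brands</a></td></tr></table>';
echo findInfoboxValue($sample, 'Holding company'), "\n";
```

To use it on the live page, the HTML string could come from `file_get_contents('http://en.wikipedia.org/wiki/KFC')`; as with the parser above, the query breaks if Wikipedia changes the infobox markup.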

If this answer suits your needs, please choose it as the best answer and upvote it, because it took me a lot of effort, about 1 hour =/

Thanks ;)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow