Content of infobox of Wikipedia

https://stackoverflow.com/questions/8088226

26-02-2021

Question

I need to get the content of the infobox for any movie, given the movie's name. One way is to fetch the complete wikitext of the Wikipedia page, scan it until I find {{Infobox, and then extract the infobox content.

Is there another way to do this, using an API or a parser?

I am using Python and the pywikipediabot API.

I am also familiar with the wikitools API, so if someone has a solution based on wikitools instead of pywikipedia, please mention that as well.

Solution

Instead of reinventing the wheel, check out DBPedia, which has already extracted all Wikipedia infoboxes into an easily parsable database format.
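
As a rough sketch of what such a lookup might look like, the snippet below builds a request for DBpedia's public SPARQL endpoint that lists every property stored for one film. The endpoint URL, the `format` parameter, and the helper names `build_dbpedia_url` and `fetch_properties` are assumptions made for illustration, not something the answer specifies; check the DBpedia documentation before relying on them.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed public endpoint; DBpedia may move or rate-limit it.
SPARQL_ENDPOINT = "https://dbpedia.org/sparql"


def build_dbpedia_url(article_title):
    """Build a SPARQL request URL listing all properties of one article."""
    # DBpedia resource names follow Wikipedia titles, with underscores for spaces.
    name = quote(article_title.replace(" ", "_"))
    resource = "http://dbpedia.org/resource/" + name
    query = "SELECT ?property ?value WHERE { <%s> ?property ?value }" % resource
    params = {"query": query, "format": "application/sparql-results+json"}
    return SPARQL_ENDPOINT + "?" + urlencode(params)


def fetch_properties(article_title):
    """Run the query (needs network access) and return (property, value) pairs."""
    with urlopen(build_dbpedia_url(article_title)) as response:
        data = json.load(response)
    return [(row["property"]["value"], row["value"]["value"])
            for row in data["results"]["bindings"]]
```

Calling `fetch_properties("Waking Life")` should then yield pairs of property URI and value, provided the endpoint is reachable and the film has a DBpedia entry.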

Other tips

Another great MediaWiki parser is mwparserfromhell.

In [1]: import mwparserfromhell

In [2]: import pywikibot

In [3]: enwp = pywikibot.Site('en','wikipedia')

In [4]: page = pywikibot.Page(enwp, 'Waking Life')            

In [5]: wikitext = page.get()               

In [6]: wikicode = mwparserfromhell.parse(wikitext)

In [7]: templates = wikicode.filter_templates()

In [8]: templates?
Type:       list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name           = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length:     31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items

In [10]: templates[:2]
Out[10]: 
[u'{{Use mdy dates|date=September 2012}}',
 u"{{Infobox film\n| name           = Waking Life\n| image          = Waking-Life-Poster.jpg\n| image_size     = 220px\n| alt            =\n| caption        = Theatrical release poster\n| director       = [[Richard Linklater]]\n| producer       = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer         = Richard Linklater\n| starring       = [[Wiley Wiggins]]\n| music          = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing        = Sandra Adair\n| studio         = [[Thousand Words]]\n| distributor    = [[Fox Searchlight Pictures]]\n| released       = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime        = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country        = United States\n| language       = English\n| budget         =\n| gross          = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]

In [11]: infobox_film = templates[1]

In [12]: for param in infobox_film.params:
             print(param.name, param.value)

 name             Waking Life

 image            Waking-Life-Poster.jpg

 image_size       220px

 alt             

 caption          Theatrical release poster

 director         [[Richard Linklater]]

 producer         [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West

 writer           Richard Linklater

 starring         [[Wiley Wiggins]]

 music            Glover Gill

 cinematography   Richard Linklater<br />[[Tommy Pallotta]]

 editing          Sandra Adair

 studio           [[Thousand Words]]

 distributor      [[Fox Searchlight Pictures]]

 released         {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}

 runtime          101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>

 country          United States

 language         English

 budget          

 gross            $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>

Don't forget that params are mwparserfromhell objects too!

You can get the wiki page content with pywikipediabot and then search for the infobox with a regex or a parser like mwlib [0], or stick with pywikipediabot and use one of its template tools. For example, textlib has some functions for dealing with templates (hint: search for "# Functions dealing with templates"). [1]

[0] - http://pypi.python.org/pypi/mwlib

[1] - http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/pywikibot/textlib.py?view=markup

Any infobox is a template transcluded with double curly brackets. Let's look at a template and how it is transcluded in wikitext:

Infobox film

{{Infobox film
| name           = Actresses
| image          = Actrius film poster.jpg
| alt            = 
| caption        = Catalan language film poster
| native_name      = ([[Catalan language|Catalan]]: '''''Actrius''''')
| director       = [[Ventura Pons]]
| producer       = Ventura Pons
| writer         = [[Josep Maria Benet i Jornet]]
| screenplay     = Ventura Pons
| story          = 
| based_on       = {{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}
| starring       = {{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna Lizaran]]|[[Mercè Pons]]}}
| narrator       = <!-- or: |narrators = -->
| music          = Carles Cases
| cinematography = Tomàs Pladevall
| editing        = Pere Abadal
| production_companies = {{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de Cultura]]|[[Televisión Española]]}}
| distributor    = [[Buena Vista International]]
| released       = {{film date|df=yes|1997|1|17|[[Spain]]}}
| runtime        = 100 minutes
| country        = Spain
| language       = Catalan
| budget         = 
| gross          = <!--(please use condensed and rounded values, e.g. "£11.6 million" not "£11,586,221")-->
}}
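
Since the infobox is delimited by balanced `{{` and `}}`, it can be located by counting brace depth, and its fields can be split on every `|` that is not nested inside another template or link. The following is only a minimal standard-library sketch of that idea; the helper names `extract_template` and `split_params` are hypothetical, and it deliberately ignores edge cases such as `<nowiki>` sections, pipes inside comments, and `{{{triple braces}}}`.

```python
def extract_template(wikitext, name="Infobox"):
    """Return the first template whose name starts with `name`, braces included."""
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # unbalanced braces


def split_params(template):
    """Split a template into {name: value}, skipping pipes nested in {{ }} or [[ ]]."""
    body = template[2:-2]
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    params = {}
    for part in parts[1:]:  # parts[0] is the template name
        key, sep, value = part.partition("=")
        if sep:  # unnamed (positional) parameters are skipped in this sketch
            params[key.strip()] = value.strip()
    return params
```

Applied to the wikitext above, `split_params(extract_template(text))` would keep nested values such as the `{{ubl|...}}` list intact as a single string. A real parser remains the safer choice for production use.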

Pywikibot's Page class offers two high-level ways to parse the content of any template inside a page's wikitext. Both use mwparserfromhell if it is installed; otherwise they fall back to a regex, which may fail for templates nested more than three levels deep:

raw_extracted_templates

raw_extracted_templates is a Page property that returns a list of tuples with two items each. The first item is the template identifier as a str, 'Infobox film' for example. The second item is an OrderedDict with the template parameter identifiers as keys and their assignments as values. For example, the template fields

| name = FILM TITLE
| image = FILM TITLE poster.jpg
| caption = Theatrical release poster

results in an OrderedDict as

OrderedDict([('name', 'FILM TITLE'), ('image', 'FILM TITLE poster.jpg'), ('caption', 'Theatrical release poster')])

Here is how to get it with Pywikibot:

from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.raw_extracted_templates
for tmpl, params in all_templates:
    if tmpl == 'Infobox film':
        pprint(params)

This will print

 OrderedDict([('name', 'Actresses'),
              ('image', 'Actrius film poster.jpg'),
              ('alt', ''),
              ('caption', 'Catalan language film poster'),
              ('native_name',
               "([[Catalan language|Catalan]]: '''''Actrius''''')"),
              ('director', '[[Ventura Pons]]'),
              ('producer', 'Ventura Pons'),
              ('writer', '[[Josep Maria Benet i Jornet]]'),
              ('screenplay', 'Ventura Pons'),
              ('story', ''),
              ('based_on',
               "{{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}"),
              ('starring',
               '{{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
               'Lizaran]]|[[Mercè Pons]]}}'),
              ('narrator', ''),
              ('music', 'Carles Cases'),
              ('cinematography', 'Tomàs Pladevall'),
              ('editing', 'Pere Abadal'),
              ('production_companies',
               '{{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
               'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - '
               'Departament de Cultura]]|[[Televisión Española]]}}'),
              ('distributor', '[[Buena Vista International]]'),
              ('released', '{{film date|df=yes|1997|1|17|[[Spain]]}}'),
              ('runtime', '100 minutes'),
              ('country', 'Spain'),
              ('language', 'Catalan'),
              ('budget', ''),
              ('gross', '')])
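
The question asks about any movie, and not every page uses exactly 'Infobox film', so it may help to match on the name prefix and to drop fields that were left blank. The sketch below works on plain data in the same (name, params) shape shown above, so it runs without Pywikibot; `find_infobox`, `non_empty`, and the sample list are hypothetical and only illustrate the idea.

```python
from collections import OrderedDict

# Hand-written data in the shape described for raw_extracted_templates.
SAMPLE_TEMPLATES = [
    ("Use dmy dates", OrderedDict([("date", "March 2020")])),
    ("Infobox film", OrderedDict([
        ("name", "Actresses"),
        ("alt", ""),
        ("director", "[[Ventura Pons]]"),
    ])),
]


def find_infobox(templates):
    """Return (name, params) for the first template whose name starts with 'Infobox'."""
    for name, params in templates:
        if name.strip().lower().startswith("infobox"):
            return name, params
    return None, {}


def non_empty(params):
    """Drop parameters that were left blank in the wikitext."""
    return {key: value for key, value in params.items() if value.strip()}
```

With a real page, `find_infobox(page.raw_extracted_templates)` would be the corresponding call, assuming the property returns tuples as described.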

templatesWithParams()

This is similar to the raw_extracted_templates property, but the method returns a list of tuples whose two items differ: the first item is the template as a Page object, and the second item is a list of template parameters as 'name=value' strings. Have a look at the sample:

Sample code

from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.templatesWithParams()
for tmpl, params in all_templates:
    if tmpl.title(with_ns=False) == 'Infobox film':
        pprint(params)  # print the parameter list, not the template Page

This will print the list:

['alt=',
 "based_on={{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}",
 'budget=',
 'caption=Catalan language film poster',
 'cinematography=Tomàs Pladevall',
 'country=Spain',
 'director=[[Ventura Pons]]',
 'distributor=[[Buena Vista International]]',
 'editing=Pere Abadal',
 'gross=',
 'image=Actrius film poster.jpg',
 'language=Catalan',
 'music=Carles Cases',
 'name=Actresses',
 'narrator=',
 "native_name=([[Catalan language|Catalan]]: '''''Actrius''''')",
 'producer=Ventura Pons',
 'production_companies={{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
 'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de '
 'Cultura]]|[[Televisión Española]]}}',
 'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
 'runtime=100 minutes',
 'screenplay=Ventura Pons',
 'starring={{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
 'Lizaran]]|[[Mercè Pons]]}}',
 'story=',
 'writer=[[Josep Maria Benet i Jornet]]']
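
Since each entry in that list is a single 'name=value' string, a small helper can turn it back into a dictionary for lookups by field name. This is a minimal sketch with a hypothetical helper name, `params_to_dict`; it splits on the first `=` only, and it assumes that any entry without an `=` is an unnamed positional parameter.

```python
def params_to_dict(params):
    """Convert ['name=Actresses', ...] into {'name': 'Actresses', ...}."""
    result = {}
    positional = 0
    for item in params:
        # Split on the first '=' only, so values that contain '=' stay intact.
        key, sep, value = item.partition("=")
        if sep:
            result[key.strip()] = value.strip()
        else:
            # Assumption: no '=' means an unnamed parameter; number it 1, 2, ...
            positional += 1
            result[str(positional)] = item.strip()
    return result
```

One known limitation: an unnamed parameter whose value contains `=` (for example a nested template with named arguments) would be mistaken for a named one.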

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow