Question

How can I get the list of depicted people from a Wikipedia file?

Example: I have a file with article ID 5457009. The Wikipedia link is http://commons.wikimedia.org/wiki/File:Bundesarchiv_B_145_Bild-F048807-0025,_Bonn,_Neubau_Kanzleramt,_Schmidt_im_Arbeitszimmer.jpg

What would the API request look like to extract the people metadata from this file (Schmidt, Helmut: Bundeskanzler, Verteidigungsminister, SPD, Bundesrepublik Deutschland)?

Here is another example with 3 depicted people: http://commons.wikimedia.org/wiki/File:Bundesarchiv_B_145_Bild-F009740-0002,_Presseclub_Bonn,_Bildungspolitiker_aus_Finnland.jpg


Solution

Unfortunately, this information is not stored in any structured manner — the table you see on the image description page is just a MediaWiki template that renders to an HTML table.

To extract the information from the template, you basically have three options:

  1. Fetch the raw wiki markup of the image description page using prop=revisions and rvprop=content and parse it yourself. Unfortunately, parsing wikitext reliably can be a bit tricky, but several MediaWiki bot frameworks come with pretty good parsers built in.

  2. Fetch the parsed HTML version of the page using action=parse and use a standard HTML parser to extract the text from the table.

  3. Since MediaWiki 1.20, you also have the option to tell MediaWiki to parse the template markup for you and return an XML parse tree by passing the parameter generatexml=1 to action=parse (or, with the usual rv prefix, rvgeneratexml=1 to prop=revisions). The relevant part will look something like this (reformatted for readability):

<template>
  <title>BArch-image</title>
  ...
  <part>
    <name>depicted people</name> =
    <value>
      * Schmidt, Helmut: Bundeskanzler, Verteidigungsminister, SPD, Bundesrepublik Deutschland
    </value>
  </part>
  ...
</template>

This is not a perfectly clean representation of the data — it still contains some unparsed wikitext elements, like the * denoting a bulleted list item — but it should be much easier to parse than the completely raw MediaWiki template markup.
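As a rough illustration of how these options could be used from a script, here is a minimal Python sketch. It assumes the Commons API endpoint is https://commons.wikimedia.org/w/api.php and that the parse tree has the shape shown above; the helper names build_wikitext_query and depicted_people are made up for this example and are not part of any MediaWiki library. The first helper only builds the request URL for option 1, and the second extracts the bulleted entries from the "depicted people" template part of an XML parse tree (option 3).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint for Wikimedia Commons.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_wikitext_query(title):
    """Build the URL for option 1: raw wikitext via prop=revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def depicted_people(parse_tree_xml, field="depicted people"):
    """Return the bulleted entries of the given template field.

    Hypothetical helper: it assumes <part> elements containing
    <name> and <value> children, as in the excerpt above.
    """
    root = ET.fromstring(parse_tree_xml)
    people = []
    for part in root.iter("part"):
        name = part.findtext("name", default="").strip()
        if name != field:
            continue
        value = part.find("value")
        text = "".join(value.itertext()) if value is not None else ""
        for line in text.splitlines():
            line = line.strip()
            # The value still contains wikitext list markers ("*").
            if line.startswith("*"):
                people.append(line.lstrip("*").strip())
    return people


# Parse tree excerpt taken from the answer above (elided parts omitted).
SAMPLE = """<template>
  <title>BArch-image</title>
  <part>
    <name>depicted people</name> =
    <value>
      * Schmidt, Helmut: Bundeskanzler, Verteidigungsminister, SPD, Bundesrepublik Deutschland
    </value>
  </part>
</template>"""

if __name__ == "__main__":
    print(build_wikitext_query("File:Example.jpg"))
    for person in depicted_people(SAMPLE):
        print(person)
```

Using iter("part") rather than a fixed path means the helper also works when the template is nested inside a larger parse tree. Values that contain nested templates or links would need extra handling that this sketch does not attempt.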

Licensed under: CC-BY-SA with attribution