Question

I want to export the MediaWiki markup for a number of articles (but not all articles) from a local MediaWiki installation. I want just the current article markup, not the history or anything else, with an individual text file for each article. I want to perform this export programmatically, ideally on the MediaWiki server itself rather than remotely.

For example, if I am interested in the Apple, Banana and Cupcake articles, I want to be able to do something like:

article_list = ["Apple", "Banana", "Cupcake"]
for a in article_list:
    get_article(a, a + ".txt")

My intention is to:

  • extract required articles
  • store MediaWiki markup in individual text files
  • parse and process in a separate program

Is this already possible with MediaWiki? It doesn't look like it. It also doesn't look like Pywikipediabot has such a script.

A fallback would be to do this manually (using the Export special page) and then parse the output into text files. Are there existing tools to do this? Is there a description of the MediaWiki XML dump format? (I couldn't find one.)
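Assuming the export XML wraps each article in a <page> element with a <title> and puts the wikitext in a <revision>/<text> element, a rough sketch of the splitting step could look like this:

import xml.etree.ElementTree as ET

def split_dump(dump_path):
    root = ET.parse(dump_path).getroot()
    # the export schema uses a versioned namespace (e.g.
    # http://www.mediawiki.org/xml/export-0.10/), so read it off the root tag
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    def q(tag):
        return "{%s}%s" % (ns, tag) if ns else tag
    for page in root.findall(q("page")):
        title = page.findtext(q("title"))
        # a "current revision only" export has exactly one <revision> per page
        text = page.find(q("revision")).findtext(q("text")) or ""
        with open(title.replace("/", "_") + ".txt", "w", encoding="utf-8") as f:
            f.write(text)

split_dump("export.xml")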


Solution 2

It looks like getText.php is a built-in server-side maintenance script for exporting the wikitext of a specific article. (Easier than querying the database.)

Found it via Publishing from MediaWiki which covers all angles on exporting from MediaWiki.
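A rough sketch of driving getText.php from Python (the maintenance path below is a placeholder, adjust it to your installation):

import subprocess

MAINTENANCE = "/path/to/mediawiki/maintenance/getText.php"  # placeholder path

article_list = ["Apple", "Banana", "Cupcake"]
for title in article_list:
    # getText.php prints the current wikitext of the given page to stdout
    result = subprocess.run(["php", MAINTENANCE, title],
                            capture_output=True, text=True, check=True)
    with open(title + ".txt", "w", encoding="utf-8") as f:
        f.write(result.stdout)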

Other tips

On the server side, you can just export from the database. Remotely, Pywikipediabot has a script called get.py which fetches the wikicode of a given article. It is also pretty simple to do yourself, something like this (written from memory, so errors might occur):

import codecs
import wikipedia as pywikibot  # the old Pywikipediabot "compat" module

site = pywikibot.getSite()  # assumes you have a user-config.py with a default site/user
article_list = ["Apple", "Banana", "Cupcake"]
for title in article_list:
    page = pywikibot.Page(site, title)  # note: the site argument comes first
    text = page.get()  # handling of NoPage/IsRedirectPage etc. exceptions omitted
    with codecs.open(title + ".txt", "w", "utf-8") as f:
        f.write(text)

Since MediaWiki's markup language is not well-defined, the only reliable way to parse or process it is through MediaWiki itself; Pywikipediabot has no support for that, and the few tools which try fail on complex templates.
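If the later processing needs rendered output rather than raw markup, one option is to let MediaWiki itself do the parsing through its web API; a minimal sketch, assuming api.php is reachable on the local wiki (the URL below is a placeholder):

import requests

API_URL = "http://localhost/wiki/api.php"  # placeholder; point this at your wiki's api.php

def render(title):
    # action=parse makes MediaWiki expand templates and parser functions
    # and return the rendered HTML of the page
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    return response.json()["parse"]["text"]["*"]

for title in ["Apple", "Banana", "Cupcake"]:
    with open(title + ".html", "w", encoding="utf-8") as f:
        f.write(render(title))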
