How do I use the Perl Text::MediawikiFormat module to convert MediaWiki markup to XHTML?
Question
On an Ubuntu platform, I installed the nice little Perl library
libtext-mediawikiformat-perl - Convert Mediawiki markup into other text formats
which is also available on CPAN. I'm not familiar with Perl and have no idea how to use this library to write a script that converts a MediaWiki file to an HTML file. E.g. I'd just like to have a script I can run such as
./my_convert_script input.wiki > output.html
(perhaps also specifying the base URL, etc.), but I have no idea where to start. Any suggestions?
The solution
The Perl library Text::MediawikiFormat isn't really intended for stand-alone use, but rather as a formatting engine inside a larger application.
The documentation on CPAN does show how to use this library, and it notes that other modules might provide better support for one-off conversions.
You could try this (untested) one-liner
perl -MText::MediawikiFormat -e'$/=undef; print Text::MediawikiFormat::format(<>)' input.wiki >output.html
although that defies the whole point (and customization abilities) of this module.
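That said, if you want the exact invocation from the question, the one-liner could be fleshed out into a small script along these lines. This is an untested sketch; the prefix option for setting the base URL of wiki links is described in the module's documentation, and the URL here is a placeholder:

#!/usr/bin/perl
# my_convert_script: read a MediaWiki file, print XHTML to stdout.
# Usage: ./my_convert_script input.wiki > output.html
use strict;
use warnings;
use Text::MediawikiFormat;

local $/;            # slurp mode: read the whole input at once
my $wikitext = <>;   # from the file named on the command line

# format() takes the raw markup, a hashref of tag overrides, and a
# hashref of options; 'prefix' is prepended to internal wiki links.
print Text::MediawikiFormat::format(
    $wikitext,
    {},
    { prefix => 'https://example.org/wiki/' },
);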
I am sure that someone has already come up with a better way to convert single MediaWiki files, so here is a list of alternative MediaWiki processors on the MediaWiki site. This SO question could also be of help.
Other markup languages, such as Markdown, provide better support for single-file conversions. Markdown is especially well suited for technical documents and mirrors email conventions. (It is also the markup used on this site.)
The libfoo-bar-perl packages in the Ubuntu repositories are precompiled Perl modules that would otherwise be installed from CPAN via cpan or cpanm. While some of these libraries do include scripts, most don't, and they aren't meant as stand-alone applications.
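For reference, both installation routes look like this for the module in question (package name as in the question, module name as on CPAN):

sudo apt-get install libtext-mediawikiformat-perl   # from the Ubuntu repositories
cpanm Text::MediawikiFormat                         # straight from CPAN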
Other tips
I believe @amon is correct that the Perl library I referenced in the question is not the right tool for the task I proposed.
I ended up using the MediaWiki API with action=parse to convert to HTML using the MediaWiki engine, which turned out to be much more reliable than any of the alternative parsers from the proposed list that I tried. (I then used pandoc to convert my HTML to Markdown.) The MediaWiki API handles extraction of categories and other metadata too, and I just had to append the base URL to internal image and page links.
Given the page title and base URL, I ended up writing this as an R function.
wiki_parse <- function(page, baseurl, format = "json", ...) {
  require(httr)
  # Let the MediaWiki engine itself render the page
  action <- "parse"
  addr <- paste(baseurl, "/api.php?format=", format,
                "&action=", action, "&page=", page, sep = "")
  # Identify the client; further headers can be passed through '...'
  config <- c(add_headers("User-Agent" = "rwiki"), ...)
  out <- GET(addr, config = config)
  # parsed_content() is deprecated in current httr; content() parses by type
  content(out, "parsed")
}
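For example, a hypothetical call might look like this (the base URL and page title are placeholders; with format = "json", the rendered HTML sits under $parse$text in the response, per the MediaWiki API's default JSON layout):

# Hypothetical usage: render a page and extract the HTML
res  <- wiki_parse(page = "Main_Page", baseurl = "https://example.org/w")
html <- res$parse$text[["*"]]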