Question

I have a huge wiki dump (~50 GB after extracting the tar.bz file) from which I want to extract the individual articles. I am using the wikixmlj library to extract the contents, and it does give the title, the text, the categories mentioned at the end, and a few other attributes. But I am more interested in the external links/references associated with each article, for which this library doesn't provide any API.
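
For context, the extraction loop is roughly along these lines (a minimal sketch assuming wikixmlj's SAX-style callback API; the class and method names may differ slightly between versions, and the dump path is just a placeholder):

```java
import edu.jhu.nlp.wikipedia.PageCallbackHandler;
import edu.jhu.nlp.wikipedia.WikiPage;
import edu.jhu.nlp.wikipedia.WikiXMLParser;
import edu.jhu.nlp.wikipedia.WikiXMLParserFactory;

public class DumpReader {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the pages-articles dump (compressed or extracted).
        WikiXMLParser parser = WikiXMLParserFactory.getSAXParser("enwiki-pages-articles.xml.bz2");
        parser.setPageCallback(new PageCallbackHandler() {
            public void process(WikiPage page) {
                String title = page.getTitle();
                String text = page.getWikiText();   // raw wikitext of the article
                // page.getCategories() gives the categories listed at the end,
                // but there is no accessor for external links / references.
            }
        });
        parser.parse();
    }
}
```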

Is there an elegant and efficient way to extract those, other than parsing the wikitext that we get from the getWikiText() API?
Or is there another Java library for this dump file that gives me the title, content, categories, and the references/external links?


Solution

The XML dump contains exactly what the library is offering you: the page text along with some basic metadata. It doesn't contain any structured metadata about categories or external links; those only appear embedded in the wikitext itself.

The way I see it, you have three options:

  1. Use the specific SQL dumps for the data you need, e.g. categorylinks.sql for categories or externallinks.sql for external links (a JDBC sketch follows this list). But there is no dump for references, because MediaWiki doesn't track those.
  2. Parse the wikitext from the XML dump yourself (see the regex sketch after the list). This would have problems with templates, since the dump contains them unexpanded.
  3. Use your own instance of MediaWiki to parse the wikitext into HTML and then parse that. This could potentially handle templates too.
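
For option 1, once externallinks.sql and page.sql have been imported into a local MySQL database, a plain JDBC query can pull the links for a given article. This is only a sketch: the database name, credentials, and the classic el_from/el_to column layout of the externallinks table are assumptions (newer dumps use a different schema for that table).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ExternalLinksFromSqlDump {
    public static void main(String[] args) throws Exception {
        // Assumes externallinks.sql and page.sql were imported into a local MySQL
        // database called "wikidump", and mysql-connector-java is on the classpath.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/wikidump", "user", "password")) {
            String sql = "SELECT el.el_to FROM externallinks el "
                       + "JOIN page p ON p.page_id = el.el_from "
                       + "WHERE p.page_namespace = 0 AND p.page_title = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "Albert_Einstein");   // page titles are stored with underscores
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("el_to"));
                    }
                }
            }
        }
    }
}
```

For option 2, a regex pass over the wikitext returned by getWikiText() is the simplest route, but it only catches bracketed external links and bare URLs (e.g. inside <ref> tags); links produced by template expansion are missed, which is exactly the template problem mentioned above. A minimal sketch:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class ExternalLinkExtractor {

    // Bracketed external links: [http://example.org optional label]
    private static final Pattern BRACKETED =
            Pattern.compile("\\[(https?://[^\\s\\]]+)(?:\\s[^\\]]*)?\\]");
    // Bare URLs, e.g. inside <ref>...</ref> blocks or template parameters like |url=
    private static final Pattern BARE =
            Pattern.compile("https?://[^\\s<>\\]\"|}]+");

    /** Returns the external-link targets found in a page's wikitext, in order of appearance. */
    public static Set<String> extract(String wikiText) {
        Set<String> links = new LinkedHashSet<>();   // de-duplicates while keeping order
        Matcher bracketed = BRACKETED.matcher(wikiText);
        while (bracketed.find()) {
            links.add(bracketed.group(1));
        }
        Matcher bare = BARE.matcher(wikiText);
        while (bare.find()) {
            links.add(bare.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "See [http://example.org Example] and <ref>http://example.com/ref</ref>.";
        System.out.println(extract(sample));
        // -> [http://example.org, http://example.com/ref]
    }
}
```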

Licensed under: CC-BY-SA with attribution