How can I read Wikipedia dump files similarly to how I can get information through the Mediawiki API?

StackOverflow https://stackoverflow.com/questions/22701254

Question

I've been trying to create a local MediaWiki instance of the English Wikipedia so that I can make lots of heavy, time-consuming calls to the MediaWiki API (e.g. iterate over all pages and get their categories and internal links).

So far it hasn't worked out; there is always some problem either mid-process or after the SQL import of the dumps finishes. I'm giving up on that for now and am looking into another solution.

So say I want to go through all pages within a specific category and get each page's internal links (which other Wikipedia pages it links to). In mwclient it's easy:

import mwclient
site = mwclient.Site('en.wikipedia.org')

def getPages(c, p):
    # Walk a category: collect articles, recurse into subcategories.
    for page in c:
        if page.namespace == 0:       # namespace 0 = article
            p.append(page)
        elif page.namespace == 14:    # namespace 14 = category, recurse into it
            getPages(page, p)
        else:
            pass                      # ignore talk pages, files, etc.

pages = []
c1 = site.Pages["Category:Mathematics"]
getPages(c1, pages)

But it requires a running copy of Wikipedia to query, so I'm wondering if there is a similarly simple solution that can be run against a dump (XML, SQL, DBpedia, or some other dump form), rather than communicating with a MediaWiki instance?

Solution

To use the API you need a running MediaWiki instance with an imported database.

There are tools for parsing the raw dumps directly, but they require writing separate code. It's not very complicated, but it is a different approach from using the API.

For Python see the question Parsing a Wikipedia dump.
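For illustration, here is a minimal sketch of what dump-based processing can look like in Python, using only the standard library. It assumes a locally downloaded pages-articles XML dump (the file name below is a placeholder) and uses a deliberately simplified regex to pull [[internal link]] targets out of the wikitext; dedicated libraries such as mwxml and mwparserfromhell handle the many edge cases this glosses over.

# Sketch: stream over a pages-articles XML dump and extract internal links.
# Assumptions: DUMP_PATH is a placeholder, and LINK_RE is a simplification
# of wikitext link syntax ([[Target]] or [[Target|label]]).
import bz2
import re
import xml.etree.ElementTree as ET

DUMP_PATH = 'enwiki-latest-pages-articles.xml.bz2'  # placeholder path

# Capture the link target up to a pipe, section anchor, or closing brackets.
LINK_RE = re.compile(r'\[\[([^\]|#]+)')

def local(tag):
    """Strip the XML namespace: '{http://...export-0.10/}page' -> 'page'."""
    return tag.rsplit('}', 1)[-1]

def iter_pages(path):
    """Yield (title, namespace, wikitext) for every page in the dump."""
    with bz2.open(path, 'rb') as f:
        context = ET.iterparse(f, events=('start', 'end'))
        _, root = next(context)          # the <mediawiki> root element
        for event, elem in context:
            if event == 'end' and local(elem.tag) == 'page':
                title, ns, text = '', 0, ''
                for child in elem.iter():
                    name = local(child.tag)
                    if name == 'title':
                        title = child.text or ''
                    elif name == 'ns':
                        ns = int(child.text or 0)
                    elif name == 'text':
                        text = child.text or ''
                yield title, ns, text
                root.clear()             # drop finished pages to bound memory

for title, ns, wikitext in iter_pages(DUMP_PATH):
    if ns != 0:                          # keep only articles, as in the mwclient code above
        continue
    links = LINK_RE.findall(wikitext)
    # ... use title and links here ...

Note also that the SQL dumps include pagelinks.sql.gz and categorylinks.sql.gz, which already contain the link and category-membership tables, so depending on your use case you may not need to parse wikitext at all.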

There's also a good library in Perl: https://metacpan.org/pod/MediaWiki::DumpFile

Wikimedians care about the dumps' usability, so if you have trouble importing the dump, you should ask about it on the MediaWiki mailing list: https://lists.wikimedia.org/mailman/listinfo/mediawiki-l .

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow