Question

I am parsing a Wikipedia dump in Java. In my module I want to know the page IDs of the internal wiki pages that the current page links to. Getting the internal links, and thus their URLs, is easy. But how do I get the page ID from a URL?

Do I have to use something from MediaWiki for this? If so, how? Is there any other alternative?

For example, for http://en.wikipedia.org/wiki/United_States I want to get its page ID, i.e. 3434750.

Solution

You can use the API for that. Specifically, the query would look something like:

http://en.wikipedia.org/w/api.php?action=query&titles=United_States

(You can also specify more than one page title in the titles parameter, separated by |.)
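As a rough illustration of the query above, here is a minimal Java sketch (Java 11+). It assumes you add format=json to the request, which is not part of the URL shown above, and it pulls the IDs out of the response with a simple regular expression instead of a JSON library. The class name, method names and User-Agent string are made up for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class PageIdLookup {
    // Matches entries such as "pageid":3434750 in the JSON response.
    private static final Pattern PAGE_ID = Pattern.compile("\"pageid\"\\s*:\\s*(\\d+)");

    // Builds the query URL; several titles are joined with "|".
    // format=json is an assumption added for easier parsing.
    static String buildQueryUrl(String... titles) {
        String joined = String.join("|", titles);
        return "https://en.wikipedia.org/w/api.php?action=query&format=json&titles="
                + URLEncoder.encode(joined, StandardCharsets.UTF_8);
    }

    // Collects every pageid value found in the response body.
    // Pages that do not exist carry no "pageid" field and are skipped.
    static List<Long> extractPageIds(String json) {
        List<Long> ids = new ArrayList<>();
        Matcher m = PAGE_ID.matcher(json);
        while (m.find()) {
            ids.add(Long.parseLong(m.group(1)));
        }
        return ids;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(buildQueryUrl("United_States")))
                // Placeholder User-Agent; replace with something identifying your tool.
                .header("User-Agent", "PageIdLookupExample/0.1")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(extractPageIds(response.body()));
    }
}
```

When you ask for several titles at once, do not rely on the IDs coming back in the order you requested them; for real use, a proper JSON parser that pairs each pageid with its title would be safer than the regular expression.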

As an alternative, you could download the page.sql dump (1 GB compressed for the English Wikipedia), which also contains this information. To actually query it, you could either import it into a MySQL database and then query that, or you could directly parse the SQL.
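If you go the "parse the SQL directly" route, something like the following Java sketch might be a starting point. It assumes each row tuple in the dump's INSERT statements begins with page_id, page_namespace and page_title in that order, and that titles are single-quoted with backslash escapes; check this against the CREATE TABLE statement at the top of your dump. The class and method names are invented for the example, and the regular expression is a simplification rather than a full SQL parser.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class; the names are illustrative only.
class PageSqlParser {
    // Start of a row tuple: (page_id,page_namespace,'page_title',
    // Assumes these are the first three columns of the page table.
    private static final Pattern ROW_START =
            Pattern.compile("\\((\\d+),(\\d+),'((?:[^'\\\\]|\\\\.)*)',");

    // Adds title -> page ID for every main-namespace (0) row on one dump line.
    static void collectArticleIds(String line, Map<String, Long> out) {
        if (!line.startsWith("INSERT INTO")) {
            return; // skip schema statements and comments
        }
        Matcher m = ROW_START.matcher(line);
        while (m.find()) {
            if (!"0".equals(m.group(2))) {
                continue; // not an article page
            }
            // Undo backslash escaping, e.g. \' becomes '
            String title = m.group(3).replaceAll("\\\\(.)", "$1");
            out.put(title, Long.parseLong(m.group(1)));
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the gzip-compressed page.sql dump
        Map<String, Long> ids = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                collectArticleIds(line, ids);
            }
        }
        System.out.println(ids.get("United_States"));
    }
}
```

Titles in the dump use underscores instead of spaces, the same as in the URL, so the last path segment of an internal link (after URL-decoding) can be used as the lookup key. Keeping every article title of the English Wikipedia in a HashMap needs a fair amount of memory, so for a large dump the database import may be the more practical option.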

Other tips

If you can't use the API, you can always get the page ID from the info page, reached by appending ?action=info to the URL. That should make a better starting point for a parser.

For your example above: https://en.wikipedia.org/wiki/United_States?action=info

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow