I am parsing a Wikipedia dump in Java. In my module, I want to know the page IDs of the internal wiki pages that the current page links to. Getting the internal links, and thus their URLs, is easy. But how do I get the page ID from a URL?

Do I have to use MediaWiki for this? If yes, how? Is there any other alternative?

For example, for http://en.wikipedia.org/wiki/United_States I want to get its page ID, i.e. 3434750.

Solution

You can use the API for that. Specifically, the query would look something like:

http://en.wikipedia.org/w/api.php?action=query&titles=United_States

(You can also specify more than one page title in the titles parameter, separated by |.)
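A minimal Java sketch of that query, assuming Java 11+ (for `java.net.http`) and the JSON output format (`format=json`). The class name `PageIdLookup`, its method names, and the User-Agent string are made up for illustration. The regex relies on the `pageid`, `ns`, `title` key order of the JSON response, which is an assumption; a real JSON library would be more robust.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class PageIdLookup {
    private static final String API = "https://en.wikipedia.org/w/api.php";

    // Assumes each page object is serialized as "pageid":N,"ns":N,"title":"...".
    private static final Pattern PAGE = Pattern.compile(
            "\"pageid\":(\\d+),\"ns\":-?\\d+,\"title\":\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Builds the query URL; several titles are joined with '|' and URL-encoded.
    static String buildQueryUrl(String... titles) {
        String joined = String.join("|", titles);
        return API + "?action=query&format=json&titles="
                + URLEncoder.encode(joined, StandardCharsets.UTF_8);
    }

    // Extracts title -> page ID pairs from the response body.
    // Missing pages carry no "pageid" field, so they are simply absent.
    // JSON escapes inside titles (e.g. \u00e9) are left undecoded here.
    static Map<String, Long> parsePageIds(String json) {
        Map<String, Long> ids = new LinkedHashMap<>();
        Matcher m = PAGE.matcher(json);
        while (m.find()) {
            ids.put(m.group(2), Long.parseLong(m.group(1)));
        }
        return ids;
    }

    // Performs the HTTP request and parses the result.
    static Map<String, Long> lookup(String... titles)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(buildQueryUrl(titles)))
                .header("User-Agent", "PageIdLookupExample/0.1 (placeholder)")
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return parsePageIds(response.body());
    }
}
```

Calling `PageIdLookup.lookup("United_States")` should then give a map keyed by the normalized title (with a space instead of the underscore), so look results up by the title the API returns rather than by the URL fragment.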

As an alternative, you could download the page.sql dump (1 GB compressed for the English Wikipedia), which also contains this information. To actually query it, you could either import it into a MySQL database and then query that, or you could directly parse the SQL.
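A rough sketch of the "directly parse the SQL" route, assuming the dump consists of long `INSERT INTO` lines whose rows start with `page_id`, `page_namespace`, `page_title`. The class `PageSqlParser` and its method are hypothetical, and the regex only reads those first three columns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class PageSqlParser {
    // Matches the start of one row: (page_id,page_namespace,'page_title',
    // The title may contain backslash-escaped characters such as \'.
    private static final Pattern ROW = Pattern.compile(
            "\\((\\d+),(-?\\d+),'((?:[^'\\\\]|\\\\.)*)',");

    // Returns title -> page ID for all rows of one INSERT line that are in
    // the given namespace (0 is the article namespace).
    static Map<String, Long> parseInsertLine(String line, int namespace) {
        Map<String, Long> ids = new LinkedHashMap<>();
        if (!line.startsWith("INSERT INTO")) {
            return ids; // schema and comment lines carry no rows
        }
        Matcher m = ROW.matcher(line);
        while (m.find()) {
            if (Integer.parseInt(m.group(2)) != namespace) {
                continue;
            }
            // Undo SQL backslash escaping in the title.
            String title = m.group(3).replaceAll("\\\\(.)", "$1");
            ids.put(title, Long.parseLong(m.group(1)));
        }
        return ids;
    }
}
```

To run this over the compressed dump, one option is to wrap a `java.util.zip.GZIPInputStream` in a `BufferedReader` and feed each line to `parseInsertLine`. Titles in the dump use underscores instead of spaces, which matches the form used in page URLs.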

Other tips

If you can't use the API, you can always get the page ID from the info page, reached by appending ?action=info to the URL. It should make a better starting point for a parser.

For your example above: https://en.wikipedia.org/wiki/United_States?action=info

Licensed under: CC-BY-SA with attribution