Question

I'm a bit lost among all the options the Wikipedia API has. My goal is to get the word count of a Wikipedia page. I have the URL of the page.

The search option does return this value:

http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srsearch=camera&srlimit=1

This returns:

<api>
<query-continue>
<search sroffset="1"/>
</query-continue>
<query>
<searchinfo totalhits="68658"/>
<search>
<p ns="0" title="Camera" snippet="A <span class='searchmatch'>camera</span> is an optical instrument that records image s that can be stored directly, transmitted to another location, or both. <b>...</b> " size="43246" wordcount="6348" timestamp="2014-04-29T15:48:07Z"/>
</search>
</query>
</api>

(Scroll a bit to the right and you will find the wordcount attribute.)

But this query performs a search and shows only the top result. When I search for the page name taken from the URL, it doesn't always return that page as the first result.

So is there a way to get this wordcount for a specific Wikipedia page?

Solution

No other API module provides this information, so the kludge with list=search is the only way. If you know the exact title, you can get better results by appending &srwhat=nearmatch to the query (it will always return only 1 result, though). See the docs and try the sandbox to learn more.

Note that word counts are not stored in the database, so the API has to fetch them from Lucene/Elasticsearch, which is not exactly fast. If you need this information en masse, you should download a dump instead.
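
The nearmatch query described above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the helper names (title_from_url, build_wordcount_url, extract_wordcount, fetch_wordcount) are made up for this example, the response is requested as JSON instead of XML, srprop=wordcount is used to trim the result to the one field of interest, and the page URL is assumed to have the usual /wiki/Title form. Whether srwhat=nearmatch is still accepted may depend on the search backend the wiki runs.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the English Wikipedia API, as used in the question.
API_URL = "https://en.wikipedia.org/w/api.php"


def title_from_url(page_url):
    """Hypothetical helper: turn .../wiki/Some_Title into 'Some Title'."""
    path = urllib.parse.urlparse(page_url).path
    raw_title = path.split("/wiki/", 1)[-1]
    return urllib.parse.unquote(raw_title).replace("_", " ")


def build_wordcount_url(title):
    """Build the list=search query with srwhat=nearmatch for an exact title."""
    params = {
        "format": "json",
        "action": "query",
        "list": "search",
        "srsearch": title,
        "srwhat": "nearmatch",   # match the title instead of ranking full-text hits
        "srlimit": 1,
        "srprop": "wordcount",   # only return the word count for each hit
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wordcount(response):
    """Return the wordcount of the first hit, or None if nothing matched."""
    hits = response.get("query", {}).get("search", [])
    if not hits:
        return None
    return hits[0].get("wordcount")


def fetch_wordcount(page_url):
    """Perform the HTTP request (needs network access)."""
    url = build_wordcount_url(title_from_url(page_url))
    # The User-Agent string is a placeholder chosen for this sketch.
    request = urllib.request.Request(url, headers={"User-Agent": "wordcount-example/0.1"})
    with urllib.request.urlopen(request) as resp:
        return extract_wordcount(json.load(resp))
```

Calling fetch_wordcount("http://en.wikipedia.org/wiki/Camera") should then return the same number that appears in the wordcount attribute of the search result, or None when the title does not match any page.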

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow