Question

I want to build a Hadoop job that basically takes the Wikipedia pagecount statistics as input and creates a list like

en-Articlename: en:count de:count fr:count

For that I need the different article names for each language, e.g. Bruges (en, fr) and Brügge (de), which the MediaWiki API can be queried for article by article (http://en.wikipedia.org/w/api.php?action=query&titles=Bruges&prop=langlinks&lllimit=500).
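
For illustration, a one-off probe of that langlinks query could look roughly like this in Java (format=json added; parsing of the returned langlinks entries is left out, and the class name is just a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class LanglinksProbe {
        public static void main(String[] args) throws Exception {
            // Fetch the interlanguage links for a single title and print the raw JSON.
            String title = "Bruges";
            String url = "http://en.wikipedia.org/w/api.php?action=query&format=json"
                       + "&prop=langlinks&lllimit=500&titles="
                       + URLEncoder.encode(title, "UTF-8");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON; the langlinks entries would be parsed from here
            }
            in.close();
        }
    }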

My question is about finding the right approach to solve this problem.

My sketched approach would be:

  • Process the pagecount file line by line (example line: 'de Brugge 2 48824')
  • Query the MediaWiki API and write something like 'en-Articlename: process-language-key:count' (a rough mapper sketch follows after this list)
  • Aggregate all en-Articlename values into one line (maybe in a second job?)
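
A minimal sketch of the first two steps as a Hadoop mapper might look like this; the helper lookupEnglishTitle() is hypothetical and would wrap the langlinks query (or, better, a pre-built lookup table):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Rough sketch of steps 1 and 2: parse one pagecount line and emit
    // "en-Articlename" -> "language:count".
    public class PagecountMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Example line: "de Brugge 2 48824" -> language, title, count, bytes
            String[] fields = value.toString().split(" ");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            String language = fields[0];
            String title = fields[1];
            String count = fields[2];

            String englishTitle = lookupEnglishTitle(language, title);
            if (englishTitle != null) {
                context.write(new Text("en-" + englishTitle),
                              new Text(language + ":" + count));
            }
        }

        // Hypothetical lookup; could call the MediaWiki API or read a
        // pre-computed correspondence table.
        private String lookupEnglishTitle(String language, String title) {
            return title; // placeholder
        }
    }

The reducer for the third step would then just concatenate all values of a key into one output line.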

Now it seems rather clumsy to query the MediaWiki API for every line, but I currently can't get my head around a better solution.

Do you think the current approach is feasible, or can you think of a different one?

On a side note: the resulting job chain will be used to do some time measurements on my (small) Hadoop cluster, so altering the task is still okay.

Edit: Here is a quite similar discussion which I just found.


Solution

I think it isn't a good idea to query the MediaWiki API during your batch processing, due to:

  • network latency (your processing will be considerably slowed down)
  • single point of failure (if the API or your internet connection goes down, your calculation will be aborted)
  • external dependency (it's hard to repeat the calculation and get the same result)
  • legal issues and the possibility of being banned

The possible solution to your problem is to download the whole Wikipedia dump. Each article contains links to the corresponding article in other languages in a predefined format, so you can easily write a map/reduce job that collects that information and builds a correspondence between the English article name and the rest.
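
A rough sketch of that extraction step, assuming each map input pair is (English article title, wikitext of the article) and that the interlanguage links appear as [[de:Brügge]]-style markers in the text (how the dump is split into such pairs depends on your input format and isn't shown):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: pull [[xx:Title]]-style interlanguage links out of an article's
    // wikitext and emit "EnglishTitle" -> "xx:Title".
    public class LanglinkExtractMapper extends Mapper<Text, Text, Text, Text> {

        private static final Pattern LANGLINK =
                Pattern.compile("\\[\\[([a-z]{2,3}(?:-[a-z]+)?):([^\\]|]+)\\]\\]");

        @Override
        protected void map(Text englishTitle, Text wikitext, Context context)
                throws IOException, InterruptedException {
            Matcher m = LANGLINK.matcher(wikitext.toString());
            while (m.find()) {
                String language = m.group(1);
                String foreignTitle = m.group(2);
                context.write(englishTitle, new Text(language + ":" + foreignTitle));
            }
        }
    }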

Then you can use that correspondence in the map/reduce job that processes the pagecount statistics. If you do that, you'll become independent of MediaWiki's API, speed up your data processing and improve debugging.
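
One way to use the correspondence is a map-side join: ship the lookup table with the job and load it in setup(). A sketch, assuming a tab-separated table 'language<TAB>foreignTitle<TAB>englishTitle' whose local path is passed under a made-up configuration key 'langlinks.path':

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of the pagecount mapper using the pre-built correspondence
    // instead of the MediaWiki API. The table file format and the
    // configuration key are assumptions.
    public class JoinedPagecountMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> toEnglish = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException {
            String path = context.getConfiguration().get("langlinks.path");
            BufferedReader reader = new BufferedReader(new FileReader(path));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length == 3) {
                    toEnglish.put(parts[0] + ":" + parts[1], parts[2]);
                }
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Example line: "de Brugge 2 48824"
            String[] fields = value.toString().split(" ");
            if (fields.length < 3) {
                return;
            }
            String language = fields[0];
            String title = fields[1];
            String count = fields[2];
            String englishTitle = language.equals("en")
                    ? title
                    : toEnglish.get(language + ":" + title);
            if (englishTitle != null) {
                context.write(new Text("en-" + englishTitle),
                              new Text(language + ":" + count));
            }
        }
    }

A reducer that concatenates all values per key then produces the 'en-Articlename: en:count de:count fr:count' lines.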
