Question

I want to build a Hadoop job that basically takes the Wikipedia pagecount statistics as input and creates a list like

en-Articlename: en:count de:count fr:count

For that I need the different article names for each language, e.g. Bruges (en, fr) and Brügge (de), which the MediaWiki API can be queried for article by article (http://en.wikipedia.org/w/api.php?action=query&titles=Bruges&prop=langlinks&lllimit=500).
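
For illustration, a one-off probe of that langlinks query could look roughly like this in Java (format=json added; parsing of the returned langlinks entries is left out, and the class name is just a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class LanglinksProbe {
        public static void main(String[] args) throws Exception {
            // Fetch the interlanguage links for a single title and print the raw JSON.
            String title = "Bruges";
            String url = "http://en.wikipedia.org/w/api.php?action=query&format=json"
                       + "&prop=langlinks&lllimit=500&titles="
                       + URLEncoder.encode(title, "UTF-8");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON; the langlinks entries would be parsed from here
            }
            in.close();
        }
    }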

My question is about finding the right approach to solve this problem.

My sketched approach would be:

  • Process the pagecount file line by line (example line: 'de Brugge 2 48824')
  • Query the MediaWiki API and write something like 'en-Articlename: process-language-key:count' (a rough mapper sketch follows after this list)
  • Aggregate all en-Articlename values into one line (maybe in a second job?)
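
A minimal sketch of the first two steps as a Hadoop mapper might look like this; the helper lookupEnglishTitle() is hypothetical and would wrap the langlinks query (or, better, a pre-built lookup table):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Rough sketch of steps 1 and 2: parse one pagecount line and emit
    // "en-Articlename" -> "language:count".
    public class PagecountMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Example line: "de Brugge 2 48824" -> language, title, count, bytes
            String[] fields = value.toString().split(" ");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            String language = fields[0];
            String title = fields[1];
            String count = fields[2];

            String englishTitle = lookupEnglishTitle(language, title);
            if (englishTitle != null) {
                context.write(new Text("en-" + englishTitle),
                              new Text(language + ":" + count));
            }
        }

        // Hypothetical lookup; could call the MediaWiki API or read a
        // pre-computed correspondence table.
        private String lookupEnglishTitle(String language, String title) {
            return title; // placeholder
        }
    }

The reducer for the third step would then just concatenate all values of a key into one output line.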

Now it seems rather clumsy to query the MediaWiki API for every line, but I currently can't get my head around a better solution.

Do you think the current approach is feasible, or can you think of a different one?

On a side note: the resulting job chain will be used to do some time measurements on my (small) Hadoop cluster, so altering the task is still okay.

Edit: Here is a quite similar discussion which I just found.


Solution

I think it isn't a good idea to query the MediaWiki API during your batch processing, due to:

  • network latency (your processing will be considerably slowed down)
  • single point of failure (if the API or your internet connection goes down, your calculation will be aborted)
  • external dependency (it's hard to repeat the calculation and get the same result)
  • legal issues and the possibility of being banned

The possible solution to your problem is to download the whole Wikipedia dump. Each article contains links to the corresponding article in other languages in a predefined format, so you can easily write a map/reduce job that collects that information and builds a correspondence between the English article name and the rest.
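
A rough sketch of that extraction step, assuming each map input pair is (English article title, wikitext of the article) and that the interlanguage links appear as [[de:Brügge]]-style markers in the text (how the dump is split into such pairs depends on your input format and isn't shown):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: pull [[xx:Title]]-style interlanguage links out of an article's
    // wikitext and emit "EnglishTitle" -> "xx:Title".
    public class LanglinkExtractMapper extends Mapper<Text, Text, Text, Text> {

        private static final Pattern LANGLINK =
                Pattern.compile("\\[\\[([a-z]{2,3}(?:-[a-z]+)?):([^\\]|]+)\\]\\]");

        @Override
        protected void map(Text englishTitle, Text wikitext, Context context)
                throws IOException, InterruptedException {
            Matcher m = LANGLINK.matcher(wikitext.toString());
            while (m.find()) {
                String language = m.group(1);
                String foreignTitle = m.group(2);
                context.write(englishTitle, new Text(language + ":" + foreignTitle));
            }
        }
    }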

Then you can use that correspondence in the map/reduce job that processes the pagecount statistics. If you do that, you'll become independent of MediaWiki's API, speed up your data processing and improve debugging.
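
One way to use the correspondence is a map-side join: ship the lookup table with the job and load it in setup(). A sketch, assuming a tab-separated table 'language<TAB>foreignTitle<TAB>englishTitle' whose local path is passed under a made-up configuration key 'langlinks.path':

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of the pagecount mapper using the pre-built correspondence
    // instead of the MediaWiki API. The table file format and the
    // configuration key are assumptions.
    public class JoinedPagecountMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> toEnglish = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException {
            String path = context.getConfiguration().get("langlinks.path");
            BufferedReader reader = new BufferedReader(new FileReader(path));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length == 3) {
                    toEnglish.put(parts[0] + ":" + parts[1], parts[2]);
                }
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Example line: "de Brugge 2 48824"
            String[] fields = value.toString().split(" ");
            if (fields.length < 3) {
                return;
            }
            String language = fields[0];
            String title = fields[1];
            String count = fields[2];
            String englishTitle = language.equals("en")
                    ? title
                    : toEnglish.get(language + ":" + title);
            if (englishTitle != null) {
                context.write(new Text("en-" + englishTitle),
                              new Text(language + ":" + count));
            }
        }
    }

A reducer that concatenates all values per key then produces the 'en-Articlename: en:count de:count fr:count' lines.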
