Question

I have a task: speed up my current implementation of an inverted index. In my opinion the best approach is to run it in the cloud:

  1. Divide the input text into a few parts (or just grab a few different text files)
  2. Send texts to nodes
  3. Run the algorithm on each node for different input data
  4. Collect the results and merge them
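
The four steps above can be sketched locally before any cloud service is involved. This is only an illustration under assumptions: the names `build_partial_index` and `merge_indexes` are hypothetical, the tokenizer is a simple regex, and the list comprehension stands in for work that would run on separate nodes.

```python
import re
from collections import defaultdict

def build_partial_index(doc_id, text):
    """Step 3: index one chunk of input; returns {term: {doc_id}}."""
    index = defaultdict(set)
    for term in re.findall(r"\w+", text.lower()):
        index[term].add(doc_id)
    return index

def merge_indexes(partials):
    """Step 4: union the posting sets from every partial index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    # Sort each posting list so the result is deterministic.
    return {term: sorted(ids) for term, ids in merged.items()}

# Steps 1 and 2: the input is already split into separate documents.
documents = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}

# Each call below is independent, so on a cluster it could run on its own node.
partials = [build_partial_index(doc_id, text) for doc_id, text in documents.items()]
index = merge_indexes(partials)
```

Because each partial index depends only on its own chunk, the per-node work needs no coordination; only the merge step has to see all the results.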

My question is: what is the easiest way to implement it?

My current ideas are:

  • Windows Azure with worker roles - is it possible to send different data to each node and merge the results later?
  • Windows Azure and HPC Scheduler - isn't it too powerful for a task like this? I am worried about configuration and costs (new node = new worker role?)
  • Use another cloud, like Amazon or Google - I'd like to code in C# and I am familiar with Microsoft technologies, so I am a little wary of them

Please give me any advice on how you would achieve this goal. I am new to cloud computing (although I know some basics of MPI, SOA, CUDA and Azure).

Solution

This is a case for MapReduce.

In fact, Hadoop grew out of the needs of Nutch, a web search engine that builds an inverted index.
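
To show how an inverted index maps onto MapReduce, here is a minimal in-memory sketch. It is not Hadoop code: `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical names, and the shuffle is simulated with a sort and `groupby`, whereas a real framework would do that grouping across machines.

```python
import re
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per word occurrence."""
    for term in re.findall(r"\w+", text.lower()):
        yield term, doc_id

def shuffle(pairs):
    """Simulated shuffle: group all emitted doc ids by term."""
    ordered = sorted(pairs, key=itemgetter(0))
    for term, group in groupby(ordered, key=itemgetter(0)):
        yield term, [doc_id for _, doc_id in group]

def reduce_phase(term, doc_ids):
    """Reducer: deduplicate and sort the posting list for one term."""
    return term, sorted(set(doc_ids))

def run_job(documents):
    """Run map, shuffle and reduce in one process (illustration only)."""
    pairs = [kv for doc_id, text in documents.items()
             for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs))

result = run_job({"d1": "to be or not to be", "d2": "to do"})
```

The mapper and reducer are the only parts you would write yourself on a hosted service; splitting the input, moving the pairs between nodes, and collecting the output are what the framework provides.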

You could either use:

a) Amazon's Elastic MapReduce

or

b) Sign up for HDInsight on Azure

There are other providers as well (PiCloud is one that comes to mind).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow