Question

I have a few million to a few billion (10^9) input data sets that need to be processed. Each is quite small (< 1 kB) and takes about 1 second to process.

I have read a lot about Apache Hadoop, MapReduce, and StarCluster, but I am not sure what the most efficient and fastest way to process them is.

I am thinking of using Amazon EC2 or a similar cloud service.


Solution

You might consider something like Amazon EMR, which takes care of a lot of the plumbing around Hadoop. If you're just looking to code something quickly, Hadoop Streaming, Hive, and Pig are all good tools for getting started with Hadoop without requiring you to know all of the ins and outs of MapReduce.
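
For illustration, here is a minimal sketch of the Hadoop Streaming approach: the mapper is just an executable that reads one record per line from stdin and writes results to stdout, so existing per-record logic can be wrapped with almost no MapReduce knowledge. The process_record function below is a placeholder for the actual ~1-second computation, and the file name mapper.py is an assumption used in the launch command afterward.

#!/usr/bin/env python3
# Minimal Hadoop Streaming mapper sketch. Hadoop feeds input records to
# this script on stdin, one per line, and collects whatever it writes
# to stdout as the job's output.
import sys

def process_record(record):
    # Placeholder for the real ~1 second of work per input set
    # (assumption: the actual processing logic goes here).
    return record.strip().upper()

def main():
    for line in sys.stdin:
        result = process_record(line)
        # Streaming treats each output line as one record; a tab would
        # separate key from value if a reduce step were needed.
        sys.stdout.write(result + "\n")

if __name__ == "__main__":
    main()

Since this scenario has no aggregation step, the job can run map-only. On EMR or a plain Hadoop cluster the launch would look roughly like the following (the streaming jar path varies by distribution, and the input/output paths are placeholders):

hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=0 -input /data/in -output /data/out -mapper mapper.py -file mapper.py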

Licensed under: CC-BY-SA with attribution