Question

I have been studying the MapReduce implementation of k-means in Mahout 0.9, and I would like to know where I can find the code that runs inside the map function and the reducer.

I tried stepping through it with the NetBeans debugger, but I was not able to find what exactly is implemented in the Map and Reduce functions...

The reason I am doing this is that I would like to know exactly what is implemented in Mahout 0.9, in order to see which parts of the K-Means MapReduce algorithm were optimized.

If somebody knows which research paper the Mahout K-means implementation was based on, that would also help me a lot.

Thank you so much!

Best regards!


Solution

Download the source code for mahout-core and search for the Java file org.apache.mahout.clustering.kmeans.KMeansDriver.

In this file, search for the line ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);

The iterateMR method in class org.apache.mahout.clustering.iterator.ClusterIterator sets up all the configuration required for the MapReduce job.

org.apache.mahout.clustering.iterator.CIMapper and org.apache.mahout.clustering.iterator.CIReducer are the Map and Reduce classes you are looking for.

Hope this helps!! :)

However, I do not know which research paper the implementation is based on.

Other suggestions

K-means (more precisely, Lloyd's algorithm) is trivially parallel. I doubt there is a paper discussing the implementation used by Mahout, because it's the obvious way to do it. There is absolutely no trick involved: each iteration of Lloyd's algorithm consists mostly of a sum (the new center of a cluster is the sum of its points divided by their count), and sums are trivial to parallelize.
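To make the "mostly a sum" point concrete, here is a minimal sketch of one Lloyd iteration in plain Java, with no Hadoop or Mahout dependency. It is not Mahout's code: the class LloydSketch and its methods map, reduce, and iterate are hypothetical names chosen for illustration, and squared Euclidean distance is an assumption. Each input split produces per-cluster partial sums, and the reduce step only has to add them up and divide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: NOT Mahout's CIMapper/CIReducer code. */
final class LloydSketch {

    /** Per-cluster partial result: coordinate-wise sum and point count. */
    static final class Partial {
        final double[] sum;
        long count;
        Partial(int dim) { sum = new double[dim]; }
    }

    /** Index of the centroid closest to p (squared Euclidean distance, an assumption). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centroids.length; j++) {
            double d = 0.0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids[j][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }

    /** "Map" step: one input split yields one partial sum per cluster. */
    static Partial[] map(List<double[]> split, double[][] centroids) {
        int dim = centroids[0].length;
        Partial[] partials = new Partial[centroids.length];
        for (int j = 0; j < partials.length; j++) partials[j] = new Partial(dim);
        for (double[] p : split) {
            Partial target = partials[nearest(p, centroids)];
            for (int i = 0; i < dim; i++) target.sum[i] += p[i];
            target.count++;
        }
        return partials;
    }

    /** "Reduce" step: add the partials of all splits, then divide by the count. */
    static double[][] reduce(List<Partial[]> perSplit, double[][] old) {
        int k = old.length, dim = old[0].length;
        double[][] next = new double[k][];
        for (int j = 0; j < k; j++) {
            double[] sum = new double[dim];
            long count = 0;
            for (Partial[] partials : perSplit) {
                for (int i = 0; i < dim; i++) sum[i] += partials[j].sum[i];
                count += partials[j].count;
            }
            if (count == 0) { next[j] = old[j].clone(); continue; } // empty cluster: keep old centroid
            for (int i = 0; i < dim; i++) sum[i] /= count;
            next[j] = sum;
        }
        return next;
    }

    /** One Lloyd iteration; the loop over splits is the part a cluster would run in parallel. */
    static double[][] iterate(List<List<double[]>> splits, double[][] centroids) {
        List<Partial[]> partials = new ArrayList<>();
        for (List<double[]> split : splits) partials.add(map(split, centroids));
        return reduce(partials, centroids);
    }

    public static void main(String[] args) {
        List<List<double[]>> splits = Arrays.asList(
            Arrays.asList(new double[]{0, 0}, new double[]{10, 10}),
            Arrays.asList(new double[]{0, 2}, new double[]{10, 12}));
        double[][] centroids = {{1, 1}, {9, 9}};
        System.out.println(Arrays.deepToString(iterate(splits, centroids)));
    }
}
```

With the toy data in main, the two centroids move to (0, 1) and (10, 11). Because addition is associative, the result does not depend on how the points are divided into splits, which is why no special trick is needed to distribute the computation.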

Unfortunately (like much of Hadoop), Mahout is ten layers of abstraction thick. That does not yield the best performance, and in particular it makes it really hard to dig through all the code and meta-code to reach the actual implementation. See the other answer here for pointers to the source code fragments, which are scattered across a dozen classes.

When playing around with Mahout, make sure to also include non-Hadoop implementations of k-means in your experiments. You will be surprised how often they A) outperform Mahout, and B) provide better results.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow