Question

  1. After each iteration of K-Means, does Hadoop store the output (the set of clusters) to HDFS and then fetch it back into memory for the next iteration?
  2. The mappers assign observations to clusters. Does that mean every node has to see all the data, i.e. Hadoop distributes only the computation rather than the data, and each node returns the cluster assignments for its own set of observations?

Thank you


Solution

  1. Yes, if data has to be passed from one MR job to another, HDFS (or, to be precise, the DFS) is the only option. It is not that problematic, since we benefit from the aggregate bandwidth of the cluster.
  2. K-Means clustering does not require sending all the data to all nodes; it has a very efficient parallel implementation, described here: http://blog.data-miners.com/2008/02/mapreduce-and-k-means-clustering.html In a nutshell, the idea is to aggregate the distances from the locally available group of rows to all the centers, and then send only this small amount of information for centralized processing (see the sketch below).
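To make that concrete, here is a minimal sketch of such a mapper/reducer pair in the style of Hadoop Streaming scripts written in Python. The file names, the text format of the points, and the way the current centroids reach each node (a small local file, e.g. shipped via the distributed cache) are assumptions for illustration, not the blog post's exact code.

```python
#!/usr/bin/env python
# kmeans_mapper.py -- sketch of the "local aggregation" mapper described above.
# Assumptions: points arrive on stdin one per line as comma-separated floats, and
# the current centroids are available in a small local file "centroids.txt".
import sys

def load_centroids(path="centroids.txt"):
    with open(path) as f:
        return [[float(x) for x in line.split(",")] for line in f if line.strip()]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

centroids = load_centroids()
# Aggregate locally: only k small (sum, count) records leave this mapper,
# never the raw rows themselves.
sums = [[0.0] * len(centroids[0]) for _ in centroids]
counts = [0] * len(centroids)

for line in sys.stdin:
    if not line.strip():
        continue
    point = [float(x) for x in line.strip().split(",")]
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    counts[nearest] += 1
    sums[nearest] = [s + x for s, x in zip(sums[nearest], point)]

for i in range(len(centroids)):
    if counts[i]:
        print("%d\t%s\t%d" % (i, ",".join(map(str, sums[i])), counts[i]))
```

```python
#!/usr/bin/env python
# kmeans_reducer.py -- merges the partial sums per cluster id and emits new centroids.
import sys
from collections import defaultdict

totals, counts = {}, defaultdict(int)

for line in sys.stdin:
    cluster, vec, count = line.rstrip("\n").split("\t")
    vec = [float(x) for x in vec.split(",")]
    totals[cluster] = vec if cluster not in totals else \
        [a + b for a, b in zip(totals[cluster], vec)]
    counts[cluster] += int(count)

for cluster in sorted(totals):
    centroid = [s / counts[cluster] for s in totals[cluster]]
    print("%s\t%s" % (cluster, ",".join(map(str, centroid))))
```

The point of the mapper-side aggregation is that the data shipped to the reducer is proportional to the number of clusters, not the number of rows.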

Other tips

For such iterative processing, Hadoop/MR carries an overhead because the same job has to be run again and again until the cluster centers converge. Hadoop ends up around 10x slower compared to the other frameworks mentioned below.

Iterative processing like K-Means can be done efficiently and easily using BSP (Bulk Synchronous Parallel). Apache Hama and Apache Giraph both implement BSP: Apache Hama exposes the BSP primitives, while Apache Giraph uses BSP internally and is mainly used for graph processing, but does not expose the BSP primitives.

Google has published a paper on Pregel for large-scale iterative processing, and it uses BSP as the underlying model.
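For intuition, here is a rough, framework-agnostic sketch of K-Means expressed as BSP supersteps. It simulates the model in a single process and is not the Apache Hama or Giraph API; the key contrast with plain MR is that each peer keeps its partition of points in memory across supersteps, and only the tiny per-cluster partial sums cross the synchronization barrier.

```python
# Framework-agnostic BSP-style K-Means sketch (in-process simulation, not a real BSP API).

def assign_and_sum(local_points, centroids):
    """One peer's work within a superstep: partial sums for its own rows only."""
    k, dim = len(centroids), len(centroids[0])
    sums, counts = [[0.0] * dim for _ in range(k)], [0] * k
    for p in local_points:
        i = min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
        counts[i] += 1
        sums[i] = [s + x for s, x in zip(sums[i], p)]
    return sums, counts

def bsp_kmeans(partitions, centroids, supersteps=10):
    for _ in range(supersteps):
        # Compute phase: every peer works on its resident partition of points.
        messages = [assign_and_sum(part, centroids) for part in partitions]
        # Barrier / communication phase: merge the small messages, update centroids,
        # and carry them straight into the next superstep -- no job relaunch, no HDFS pass.
        k, dim = len(centroids), len(centroids[0])
        totals, counts = [[0.0] * dim for _ in range(k)], [0] * k
        for sums, cnts in messages:
            for i in range(k):
                counts[i] += cnts[i]
                totals[i] = [a + b for a, b in zip(totals[i], sums[i])]
        centroids = [[t / c for t in tot] if c else centroids[i]
                     for i, (tot, c) in enumerate(zip(totals, counts))]
    return centroids
```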

Hadoop and MR are not really a good choice for iterative algorithms such as K-Means, even though it is still workable. I once had to implement a Markov Decision Process on Hadoop, which caused me huge pain, because each iteration involved disk I/O for both input and output. Besides that, launching an iteration (an MR job) costs tens of seconds on a Hadoop cluster.
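To see where that cost comes from, here is a sketch of what an iterative driver typically looks like with Hadoop Streaming: every pass is a fresh job launch, the whole dataset is re-read from HDFS, and the small result has to be pulled back out before the next pass. The jar location, script names, HDFS paths, and iteration count are all placeholders.

```python
# Sketch of an iterative K-Means driver over Hadoop Streaming. Every iteration pays
# for a full job launch plus HDFS reads/writes; paths and names are placeholders.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # assumed location

def run_iteration(i):
    subprocess.run([
        "hadoop", "jar", STREAMING_JAR,
        "-files", "kmeans_mapper.py,kmeans_reducer.py,centroids.txt",
        "-mapper", "kmeans_mapper.py",
        "-reducer", "kmeans_reducer.py",
        "-input", "/data/points",                    # full dataset re-read from HDFS every pass
        "-output", "/data/centroids_iter_%d" % i,    # new centroids written back to HDFS
    ], check=True)

for i in range(20):                                  # or: until the centroids stop moving
    run_iteration(i)
    # Pull the new centroids back to the local file shipped with the next job.
    subprocess.run(["hdfs", "dfs", "-getmerge",
                    "/data/centroids_iter_%d" % i, "centroids.txt"], check=True)
```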

Later I tried out Spark, which is an MR-like framework that works perfectly on top of Hadoop. It uses the memory of all the commodity machines in a cluster to cache iteration invariants instead of repeatedly reading them from and writing them back to disk. You may want to check it out :-)
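Here is a minimal PySpark sketch of the same loop, assuming the points live in a text file of comma-separated floats; the input path, k, and the iteration count are placeholders. The point is `.cache()`: the parsed points stay in cluster memory, and each iteration only moves the small list of centroids around instead of relaunching a job and rereading HDFS.

```python
# Minimal PySpark K-Means sketch: cache the points once, then iterate in memory.
from pyspark import SparkContext

sc = SparkContext(appName="kmeans-sketch")

def parse(line):
    return [float(x) for x in line.split(",")]

def closest(point, centers):
    return min(range(len(centers)),
               key=lambda i: sum((x - y) ** 2 for x, y in zip(point, centers[i])))

points = sc.textFile("hdfs:///data/points").map(parse).cache()  # iteration invariant, kept in RAM
centers = points.takeSample(False, 3)                           # k = 3 initial centroids

for _ in range(20):
    # Each pass is a couple of in-memory RDD operations, not a freshly launched Hadoop job.
    partials = (points
                .map(lambda p: (closest(p, centers), (p, 1)))
                .reduceByKey(lambda a, b: ([x + y for x, y in zip(a[0], b[0])], a[1] + b[1]))
                .collectAsMap())
    new_centers = list(centers)
    for idx, (total, n) in partials.items():
        new_centers[idx] = [s / n for s in total]
    centers = new_centers

print(centers)
sc.stop()
```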
