Question

I want to implement the K-means algorithm in Spark. I am looking for a starting point and I found Berkeley's naive implementation. However, is that implementation distributed?

I mean, I see no map/reduce operations. Or maybe, when the code is submitted to Spark, the framework does the necessary work under the hood to distribute the algorithm?

I have also read that Spark is showing MapReduce the exit, and I am using Spark 1.6.


EDIT: This code produces a runtime error, check here.


Solution

In the link you posted, you can look at the full Python solution at the end and go through it to see what is distributed. In short, some parts are distributed, such as reading the data from the file, but the very important parts, such as the distance computation, are not.

Running down the code, we see:

sc = SparkContext("local[6]", "PythonKMeans")

This instantiates the context and creates a local cluster to which the jobs will be submitted.
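As an aside, here is a minimal sketch of what that master string controls; the cluster URL below is a made-up placeholder, not something from the original code:

from pyspark import SparkContext

# "local[6]" runs Spark inside this single machine, using 6 worker threads
sc = SparkContext("local[6]", "PythonKMeans")

# Pointing at a standalone cluster master instead (placeholder host and port),
# the same job would run on that cluster's executors:
# sc = SparkContext("spark://some-master:7077", "PythonKMeans")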

lines = sc.textFile(..)

This is still just setup; no operations have taken place yet. You can verify this by putting timing statements in the code, as in the sketch further down.

data = lines.map(lambda x: (x.split("#")[0], parseVector(x.split("#")[1])))

The lambda here will be applied to lines, so this operation will split the file in parallel. Note that the actual line also has a cache() at the end (see cache()). data is just a reference to the Spark RDD in memory. (I may be wrong here, but I think the operation still doesn't happen at this point, since transformations are lazy.)

count = data.count()

This forces the parallel computation to start and the count to be stored. Afterwards, the reference data is still valid, and we'll use it for further computations. I'll stop the detailed explanation here, but wherever data is used there is a possible parallel computation. The Python code itself is single-threaded and only interfaces with the Spark cluster.
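To illustrate the laziness point with timing statements, here is a minimal self-contained sketch; the input file name and the parsing logic are placeholders, not the original code:

import time
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[6]", "PythonKMeans")

t0 = time.time()
lines = sc.textFile("points.txt")  # hypothetical input file
data = lines.map(lambda x: np.array([float(v) for v in x.split()])).cache()
print("defining the pipeline took %.3fs" % (time.time() - t0))  # near-instant

t0 = time.time()
count = data.count()  # action: the file is actually read and parsed here
print("count=%d took %.3fs" % (count, time.time() - t0))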

An interesting line is:

tempDist = sum(np.sum((centroids[x] - y) ** 2) for (x, y) in newCentroids.iteritems())

centroids is an object in Python memory, as is newCentroids. So, at this point, this computation is being done entirely in the client's memory. Clients are typically slim, i.e. they have limited resources, or the client is an SSH shell whose resources are shared with others, so you should ideally never do any heavy computation there. In other words, no parallelization is being used here. You could optimize this method further by doing the computation in parallel; ideally you want the Python program to never directly handle individual points' $x$ and $y$ values.
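To make that concrete, here is a hedged sketch of how one k-means iteration can be kept on the cluster, so the driver only ever sees the k centroids rather than individual points; the variable names and the closestPoint helper are assumptions modelled on the Berkeley example, not the original code:

import numpy as np

def closestPoint(p, centers):
    # index of the nearest centroid to point p
    return int(np.argmin([np.sum((p - c) ** 2) for c in centers]))

# data is an RDD of numpy vectors; centroids is a small list held on the driver
closest = data.map(lambda p: (closestPoint(p, centroids), (p, 1)))
stats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
newCentroids = stats.mapValues(lambda s: s[0] / s[1]).collectAsMap()

# only the k new centroids reach the driver, so this sum stays cheap
tempDist = sum(np.sum((centroids[i] - c) ** 2) for (i, c) in newCentroids.items())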

OTHER TIPS

I don't know about that specific implementation, but we use the MLlib k-means here at my work, with some degree of success. It is distributed and runs on Spark!
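For reference, a minimal sketch of how MLlib's k-means can be called from PySpark 1.6; the file name and parameter values are placeholders:

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
import numpy as np

sc = SparkContext("local[6]", "MLlibKMeans")

# parse a hypothetical text file of whitespace-separated coordinates
points = sc.textFile("points.txt").map(
    lambda line: np.array([float(v) for v in line.split()]))

model = KMeans.train(points, k=5, maxIterations=20,
                     initializationMode="k-means||")
print(model.clusterCenters)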

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange