Question

I am doing a project using k-means, and my professor suggested a k-d tree. I found this implementation of a k-d tree in Python (I know there is also one in SciPy, but I couldn't find any sample usage). My question is the same as the title: can a k-d tree be used to speed up k-means, or am I wrong?

data = [(2,2),(1,0),(2,3),(10,5),(59,8),(4,2)]

tree = KDTree.construct_from_data(data)
nearest = tree.query(query_point=(5,4), t=3)
print(nearest)

output:

[(4, 2), (2, 3), (2, 2)]
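For reference, since the question mentions SciPy: a minimal sketch of the same 3-nearest-neighbour query using `scipy.spatial.KDTree`, which should reproduce the output above.

```python
# Same query as above, but with SciPy's k-d tree instead of the
# standalone implementation.
from scipy.spatial import KDTree

data = [(2, 2), (1, 0), (2, 3), (10, 5), (59, 8), (4, 2)]
tree = KDTree(data)

# query() returns distances and indices of the k nearest neighbours,
# sorted by increasing distance.
dist, idx = tree.query((5, 4), k=3)
nearest = [data[i] for i in idx]
print(nearest)  # → [(4, 2), (2, 3), (2, 2)]
```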

Solution

As the paper "Making k-means even faster" (p. 137) shows, a k-d tree can be used to speed up the k-means algorithm on low-dimensional data, while the straightforward Lloyd's algorithm is more efficient in higher dimensions:

"With high-dimensional data, indexing schemes such as the k-d tree do not work well."

See the paper for the detailed explanation.

I suggest you use an established k-means implementation and worry about speed improvements only if you run into severe performance issues. For example, as far as I know, sklearn's KMeans is based on Lloyd's original algorithm.
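To illustrate, a minimal sketch of clustering the points from the question with scikit-learn's `KMeans` (which, as far as I know, uses Lloyd's algorithm by default); `n_clusters=2` and the random seed are arbitrary choices for the example:

```python
# Sketch: clustering the question's points with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

data = np.array([(2, 2), (1, 0), (2, 3), (10, 5), (59, 8), (4, 2)], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # final centroids
```

With these points, the outlier (59, 8) ends up in its own cluster and the remaining five points share the other one.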

OTHER TIPS

It can be used, but it's nontrivial. Most people only implement the straightforward un-accelerated solution.

The problem is that most kd-tree implementations only support nearest-neighbor queries.

This only pays off if you have a large number of clusters k and build the index on the cluster centers.
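A hedged sketch of that idea: the assignment step of Lloyd's algorithm done by indexing the cluster centers with a k-d tree and querying each point's nearest center (the centers here are made up for the example):

```python
# Sketch: accelerating the k-means assignment step by building a k-d
# tree over the cluster centers, not over the points.
import numpy as np
from scipy.spatial import cKDTree

points = np.array([(2, 2), (1, 0), (2, 3), (10, 5), (59, 8), (4, 2)], dtype=float)
centers = np.array([(2.0, 2.0), (30.0, 6.0)])  # hypothetical current centers

center_tree = cKDTree(centers)
_, labels = center_tree.query(points, k=1)  # index of nearest center per point

# Lloyd's update step: each center becomes the mean of its assigned points.
new_centers = np.array([points[labels == j].mean(axis=0)
                        for j in range(len(centers))])
```

Building the tree costs O(k log k) per iteration, so this only helps when k is large; with a handful of centers, brute-force distance computation is just as fast.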

For full k-d-tree acceleration of k-means, you would need to implement a bipartite NN-join, where you have an index both on the points and on the cluster centers. I'm not aware of any k-d tree implementation that supports this.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow