Question

Below is an illustration of hierarchical clustering.

First, I have several items as below:

[image: the items a, b, c, d, e, f laid out according to their pairwise distances]

I want to cluster the above items hierarchically according to their distances. In the illustration, the clustering proceeds in the following steps:

Step 1. b/c and d/e are clustered into (bc) and (de) because each pair is close together.

Step 2. (de) and f are clustered into (def) because (de) is close to f.

Step 3. (bc) and (def) are clustered into (bcdef) because the clusters (bc) and (def) are close.

Step 4. a and (bcdef) are clustered into (abcdef) because they are the only two clusters left.

So the process can be illustrated like this:

[image: dendrogram of the merges: (bc) and (de), then (def), then (bcdef), then (abcdef)]

I can think of an implementation of the algorithm as below (a runnable sketch follows the list):

  1. Calculate the distance between every pair of items, e.g. D(a,b), D(a,c), D(a,d), and so on. The complexity is O(n^2).

  2. Sort all the distances in ascending order. The complexity is O(n^2 log(n^2)) = O(n^2 log n).

  3. Iterate over the sorted distances and merge. Once two clusters are merged, any later distance that involves either of them is ignored. Continue until there is no distance left to merge. The complexity is O(n^2).

  4. Go back to step 1 with the merged clusters. Stop when only one cluster remains.
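
To make the steps concrete, here is a minimal Python sketch of the procedure exactly as described (the function name, the `dist` callback, and the frozenset bookkeeping are my own choices, not part of the question). On the a..f layout it reproduces the flaw noted in the update below: a and f merge in the first round.

```python
from itertools import combinations

def naive_hierarchical(items, dist):
    """items: iterable of labels; dist(a, b) -> distance between two items."""
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > 1:
        # Step 1: single-linkage distance between every pair of clusters.
        pairs = [
            (min(dist(a, b) for a in c1 for b in c2), c1, c2)
            for c1, c2 in combinations(clusters, 2)
        ]
        # Step 2: sort all distances in ascending order.
        pairs.sort(key=lambda t: t[0])
        # Step 3: scan and merge; skip pairs touching an already-merged cluster.
        used, merged = set(), []
        for d, c1, c2 in pairs:
            if c1 in used or c2 in used:
                continue
            used.update((c1, c2))
            merged.append(c1 | c2)
            print(f"merge {set(c1)} and {set(c2)} at distance {d:.3g}")
        # Step 4: carry over the untouched clusters and repeat.
        clusters = merged + [c for c in clusters if c not in used]
    return clusters[0]
```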

But this seems quite inefficient. How can I improve it?

Update

I suddenly realized that my algorithm will cluster a and f too soon: the first round of iteration produces (bc), (de) and (af), which is incorrect. It seems I need a way to make the merging proceed progressively.
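
One way to make it progressive, sketched below with my own naming, is to merge only the single closest pair of clusters in each round instead of scanning the whole sorted list. This is the standard agglomerative scheme: it avoids the premature (af) merge because f is absorbed into (de) before a is ever touched, at the cost of running n - 1 rounds.

```python
from itertools import combinations

def agglomerative_single_linkage(items, dist):
    """Merge only the single closest pair of clusters per round."""
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > 1:
        # Find the one globally closest pair under single linkage.
        d, c1, c2 = min(
            ((min(dist(a, b) for a in x for b in y), x, y)
             for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(c1 | c2)
        print(f"merge {set(c1)} and {set(c2)} at distance {d:.3g}")
    return clusters[0]
```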


Solution

What you describe looks very similar to single-linkage clustering: http://en.wikipedia.org/wiki/Single-linkage_clustering. That article includes a pointer to SLINK, which is optimal in the sense that it takes the N x N matrix of pairwise distances as input and runs in O(N^2) time. Another attraction of the method is that, if you can compute distances on demand, you only need O(N) storage.
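
As a point of comparison, the same O(N^2) time and O(N) storage bounds can be reached with a heap-free Prim pass that builds the minimum spanning tree from on-demand distances; the sorted MST edge weights are exactly the single-linkage merge distances. The sketch below is that MST variant under my own naming, not Sibson's SLINK pointer representation itself.

```python
def single_linkage_mst(n, dist):
    """dist(i, j) -> distance for 0 <= i, j < n. Returns the MST edges
    sorted ascending; these weights are the single-linkage merge distances."""
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known distance into the tree
    best_from = [0] * n         # tree vertex realizing that distance
    edges = []
    u = 0
    in_tree[0] = True
    for _ in range(n - 1):
        # Relax distances from the newly added vertex u: O(N).
        for v in range(n):
            if not in_tree[v]:
                d = dist(u, v)
                if d < best[v]:
                    best[v], best_from[v] = d, u
        # Pick the cheapest fringe vertex: O(N).
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        edges.append((best[u], best_from[u], u))
        in_tree[u] = True
    edges.sort()
    return edges
```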

However, if your distances have some structure behind them, so that not all patterns of distances are possible, you can do better. This problem can also be stated as finding a minimum spanning tree on N points. If the points live in a Euclidean space, so that the distance between X and Y is sqrt((X0-Y0)^2 + (X1-Y1)^2 + ...), you can exploit this to reduce the cost further; see http://en.wikipedia.org/wiki/Euclidean_minimum_spanning_tree.
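
To spell the spanning-tree connection out: replaying the MST edges in ascending order through a union-find reproduces the single-linkage merge sequence. Below is a small sketch reusing single_linkage_mst from above, with made-up coordinates that mimic the a..f layout.

```python
import math

def merges_from_mst(n, edges):
    """Kruskal replay: each MST edge, in ascending order, merges the
    two clusters containing its endpoints at that edge's distance."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for d, a, b in edges:       # edges are already sorted ascending
        ra, rb = find(a), find(b)
        parent[ra] = rb         # union: the two clusters merge at height d
        merges.append((d, ra, rb))
    return merges

# Made-up 1-D coordinates for a..f (indices 0..5) matching the example.
pts = [(0.0, 0.0), (3.0, 0.0), (3.3, 0.0), (5.0, 0.0), (5.2, 0.0), (6.0, 0.0)]
edges = single_linkage_mst(len(pts), lambda i, j: math.dist(pts[i], pts[j]))
print(merges_from_mst(len(pts), edges))  # (de), (bc), (def), (bcdef), (abcdef)
```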

You should be aware that there are a large number of different sorts of clustering, with different statistical strengths and weaknesses, and different construction costs. If you are not sure which one you want, you will find an introduction at http://en.wikipedia.org/wiki/Category:Data_clustering_algorithms

Licensed under: CC-BY-SA with attribution