Question

I'm writing an agglomerative clustering algorithm in java and having trouble with a remove operation. It seems to always fail when the number of clusters reaches half the initial number.

In the sample code below, clusters is a Collection<Collection<Integer>>.

      while(clusters.size() > K){
           // determine smallest distance between clusters
           Collection<Integer> minclust1 = null;
           Collection<Integer> minclust2 = null;
           double mindist = Double.POSITIVE_INFINITY;

           for(Collection<Integer> cluster1 : clusters){
                for(Collection<Integer> cluster2 : clusters){
                     if( cluster1 != cluster2 && getDistance(cluster1, cluster2) < mindist){
                          minclust1 = cluster1;
                          minclust2 = cluster2;
                          mindist = getDistance(cluster1, cluster2);
                     }
                }
           }

           // merge the two clusters
           minclust1.addAll(minclust2);
           clusters.remove(minclust2);
      }

After a few runs through the loop, clusters.remove(minclust2) eventually returns false, but I don't understand why.

I tested this code by first creating 10 clusters, each with one integer from 1 to 10. Distances are random numbers between 0 and 1. Here's the output after adding a few println statements. After the number of clusters, I print out the actual clusters, the merge operation, and the result of clusters.remove(minclust2).

Clustering: 10 clusters
[[3], [1], [10], [5], [9], [7], [2], [4], [6], [8]]
[5] <- [6]
true
Clustering: 9 clusters
[[3], [1], [10], [5, 6], [9], [7], [2], [4], [8]]
[7] <- [8]
true
Clustering: 8 clusters
[[3], [1], [10], [5, 6], [9], [7, 8], [2], [4]]
[10] <- [9]
true
Clustering: 7 clusters
[[3], [1], [10, 9], [5, 6], [7, 8], [2], [4]]
[5, 6] <- [4]
true
Clustering: 6 clusters
[[3], [1], [10, 9], [5, 6, 4], [7, 8], [2]]
[3] <- [2]
true
Clustering: 5 clusters
[[3, 2], [1], [10, 9], [5, 6, 4], [7, 8]]
[10, 9] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4, 5, 6, 4] <- [5, 6, 4]
false

The the [10, 9, 5, 6, 4, 5, 6, 4, ...] set just grows infinitely from there.

Edit: to clarify, I'm using a HashSet<Integer> for each cluster in clusters (a HashSet<HashSet<Integer>>).

Was it helpful?

Solution

Ah. When you alter a value that is already in a Set (or a Map key), then it is not necessarily in the right position and hash codes will be cached. You need to remove it, alter it and then re-insert it.

OTHER TIPS

In the test shown, the remove fails the first time you try to remove a Collection containing more than one Integer. Is this always the case?

What is the concrete type of the Collection used?

The obvious problem there is that clusters.remove is probably using equals to find the element to remove. Unfortunately equals on collections generally compares whether the elements are the same, rather than if it is the same collection (I believe C# makes a better choice in this regard).

AN easy fix is to create clusters as Collections.newSetFromMap(new IdentityHashMap<Collection<Integer>, Boolean>()) (I think).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top