Remove from a HashSet failing after iterating over it
-
09-09-2019 - |
Question
I'm writing an agglomerative clustering algorithm in java and having trouble with a remove operation. It seems to always fail when the number of clusters reaches half the initial number.
In the sample code below, clusters
is a Collection<Collection<Integer>>
.
while(clusters.size() > K){
// determine smallest distance between clusters
Collection<Integer> minclust1 = null;
Collection<Integer> minclust2 = null;
double mindist = Double.POSITIVE_INFINITY;
for(Collection<Integer> cluster1 : clusters){
for(Collection<Integer> cluster2 : clusters){
if( cluster1 != cluster2 && getDistance(cluster1, cluster2) < mindist){
minclust1 = cluster1;
minclust2 = cluster2;
mindist = getDistance(cluster1, cluster2);
}
}
}
// merge the two clusters
minclust1.addAll(minclust2);
clusters.remove(minclust2);
}
After a few runs through the loop, clusters.remove(minclust2)
eventually returns false, but I don't understand why.
I tested this code by first creating 10 clusters, each with one integer from 1 to 10. Distances are random numbers between 0 and 1. Here's the output after adding a few println statements. After the number of clusters, I print out the actual clusters, the merge operation, and the result of clusters.remove(minclust2).
Clustering: 10 clusters
[[3], [1], [10], [5], [9], [7], [2], [4], [6], [8]]
[5] <- [6]
true
Clustering: 9 clusters
[[3], [1], [10], [5, 6], [9], [7], [2], [4], [8]]
[7] <- [8]
true
Clustering: 8 clusters
[[3], [1], [10], [5, 6], [9], [7, 8], [2], [4]]
[10] <- [9]
true
Clustering: 7 clusters
[[3], [1], [10, 9], [5, 6], [7, 8], [2], [4]]
[5, 6] <- [4]
true
Clustering: 6 clusters
[[3], [1], [10, 9], [5, 6, 4], [7, 8], [2]]
[3] <- [2]
true
Clustering: 5 clusters
[[3, 2], [1], [10, 9], [5, 6, 4], [7, 8]]
[10, 9] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4, 5, 6, 4] <- [5, 6, 4]
false
The the [10, 9, 5, 6, 4, 5, 6, 4, ...] set just grows infinitely from there.
Edit: to clarify, I'm using a HashSet<Integer>
for each cluster in clusters (a HashSet<HashSet<Integer>>)
.
Solution
Ah. When you alter a value that is already in a Set
(or a Map
key), then it is not necessarily in the right position and hash codes will be cached. You need to remove it, alter it and then re-insert it.
OTHER TIPS
In the test shown, the remove
fails the first time you try to remove a Collection containing more than one Integer. Is this always the case?
What is the concrete type of the Collection used?
The obvious problem there is that clusters.remove
is probably using equals
to find the element to remove. Unfortunately equals
on collections generally compares whether the elements are the same, rather than if it is the same collection (I believe C# makes a better choice in this regard).
AN easy fix is to create clusters
as Collections.newSetFromMap(new IdentityHashMap<Collection<Integer>, Boolean>())
(I think).