Question

I'm storing a large number of objects (~2.8 million) in a hashmap, each holding a unique combination of values in a byte array. When checking whether their 32-bit hash codes collide, I'm very surprised to see there are no collisions at all, while statistically I should have a near-100% chance of at least one (cf. http://preshing.com/20110504/hash-collision-probabilities/).

I am thus wondering whether my approach to detecting collisions is buggy, or whether I'm just extremely lucky...

Here is how I try to detect collisions among the 2.8 million keys stored in the map:

HashMap<ShowdownFreqKeysVO, Double> values;
// (...fill with 2.8 million unique keys...)
HashSet<Integer> hashes = new HashSet<>();
for (ShowdownFreqKeysVO key : values.keySet()) {
    if (hashes.contains(key.hashCode())) throw new RuntimeException("Duplicate hash for: " + key);
    hashes.add(key.hashCode());
}

And here is how the object computes its hash value:

public class ShowdownFreqKeysVO {
    //Values for the different parameters
    public byte[] values = new byte[12];

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + Arrays.hashCode(values);
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        ShowdownFreqKeysVO other = (ShowdownFreqKeysVO) obj;
        if (!Arrays.equals(values, other.values))
            return false;
        return true;
    }
}

Any idea/hint on what I'm doing wrong would be greatly appreciated!

Thanks, Thomas


Solution

I don't believe in luck

This is the implementation of Arrays.hashCode that you use (the byte[] overload; the overloads for the other primitive types all have the same structure):

public static int hashCode(byte a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (byte element : a)
        result = 31 * result + element;

    return result;
}

If your values happen to be smaller than 31, they are treated like the digits of a number written in base 31, so each distinct array results in a different number (if we ignore overflow for now). Let's call those "pure hashes".
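To see this concretely, here is a minimal check of my own (not part of the original answer): for a short array with small values, the hash is literally the number you get by reading the values as base-31 digits, behind a leading 1 digit.

import java.util.Arrays;

public class Base31Demo {
    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        // ((1*31 + 1)*31 + 2)*31 + 3  =  1*31^3 + 1*31^2 + 2*31 + 3
        System.out.println(Arrays.hashCode(a));       // 30817
        System.out.println(29791 + 961 + 2 * 31 + 3); // 30817 -- the base-31 reading
    }
}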

Now of course 31^11 is way larger than the number of 32-bit integers in Java, so we get tons of overflows. But since the powers of 31 and the maximum integer are "very different", you don't get an almost-random distribution, but a very regular, evenly spread one.
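You can check that this regularity survives overflow at a scale close to yours. The following sketch is my own (the array length 7 and the digit range 0..7 are arbitrary choices that keep the enumeration exhaustive): every pure hash here is at least 31^7 ≈ 2.75 * 10^10 and therefore overflows, yet, if my arithmetic is right, no two of the ~2.1 million arrays share a hash.

import java.util.Arrays;
import java.util.HashSet;

public class OverflowDemo {
    public static void main(String[] args) {
        HashSet<Integer> hashes = new HashSet<>();
        byte[] a = new byte[7];
        // enumerate all 8^7 = 2,097,152 arrays with values 0..7
        for (int n = 0; n < (1 << 21); n++) {
            for (int i = 0; i < 7; i++) {
                a[i] = (byte) ((n >> (3 * i)) & 7);
            }
            hashes.add(Arrays.hashCode(a));
        }
        System.out.println(hashes.size()); // expected 2097152 -- no collisions despite overflow
    }
}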

Let's consider a smaller example. Assume you have only 2 elements in your array, each ranging from 0 to 4. I try to create a "hashCode" between 0 and 37 by taking the "pure hash" modulo 38. The result is that I get streaks of 5 consecutive integers with small gaps in between, and not a single collision:

// all 25 "pure hashes" for two digits in 0..4, folded into the range 0..37
val hashes = for {
  i <- 0 to 4
  j <- 0 to 4
} yield (i * 31 + j) % 38

println(hashes.size)       // prints 25
println(hashes.toSet.size) // prints 25 -- no collisions

To verify that this is what happens to your numbers, you could create a graph as follows: for each hash, take the first 16 bits as x and the second 16 bits as y, and color that dot black. I bet you will see an extremely regular pattern.
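Here is a minimal sketch of such a plot (my own illustration; the 256x256 size, the down-scaling, and the PNG output are arbitrary choices): each hash is split into its two 16-bit halves, each half is shrunk to 8 bits so the image stays small, and the corresponding pixel is colored black.

import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Collection;
import javax.imageio.ImageIO;

public class HashScatter {
    // plots hashCode() of each key; a regular hash function shows up as a regular pattern
    static void plot(Collection<?> keys, File out) throws IOException {
        BufferedImage img = new BufferedImage(256, 256, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 256, 256);
        g.dispose();
        for (Object key : keys) {
            int h = key.hashCode();
            int x = (h >>> 16) >>> 8;   // top 16 bits of the hash, scaled to 0..255
            int y = (h & 0xFFFF) >>> 8; // bottom 16 bits, scaled to 0..255
            img.setRGB(x, y, 0);        // black dot
        }
        ImageIO.write(img, "png", out);
    }
}

Calling plot(values.keySet(), new File("hashes.png")) on your map would then show the distribution at a glance.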

OTHER TIPS

I see nothing wrong with your code, but the analysis you link to assumes that hashCodes are uniformly distributed, and that the hashCodes of different objects are independent random variables.

The latter may not be true: You know that the objects are unique (and therefore not independent). Perhaps that particular brand of uniqueness is preserved by the hashCode function.
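A familiar case of uniqueness being preserved (my own illustration, not from the answer): Integer.hashCode() returns the value itself, so any collection of distinct Integers has zero hash collisions, even though the same number of random 32-bit hashes would almost surely collide.

import java.util.HashSet;

public class UniquenessDemo {
    public static void main(String[] args) {
        HashSet<Integer> hashes = new HashSet<>();
        for (int i = 0; i < 2_800_000; i++) {
            hashes.add(Integer.valueOf(i).hashCode()); // hashCode() of an Integer is the value itself
        }
        System.out.println(hashes.size()); // 2800000 -- distinct inputs, distinct hashes
    }
}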

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow