Question

The JDK documentation for java.lang.String.hashCode() famously says:

The hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the *i*th character of the string, n is the length of the string, and ^ indicates exponentiation.

The standard implementation of this expression is:

int hash = 0;
for (int i = 0; i < length; i++)
{
    hash = 31*hash + value[i];
}
return hash;

Looking at this makes me feel like I was sleeping through my algorithms course. How does that mathematical expression translate into the code above?

Was it helpful?

Solution

I'm not sure if you missed where it says "^ indicates exponentiation" (not xor) in that documentation.

Each time through the loop, the previous value of hash is multipled by 31 again before being added to the next element of value.

One could prove these things are equal by induction, but I think an example might be more clear:

Say we're dealing with a 4-char string. Let's unroll the loop:

hash = 0;
hash = 31 * hash + value[0];
hash = 31 * hash + value[1];
hash = 31 * hash + value[2];
hash = 31 * hash + value[3];

Now combine these into one statement by substituting each value of hash into the following statement:

hash = 31 * (31 * (31 * (31 * 0 + value[0]) + value[1]) + value[2])
     + value[3];

31 * 0 is 0, so simplify:

hash = 31 * (31 * (31 * value[0] + value[1]) + value[2])
     + value[3];

Now multiply the two inner terms by that second 31:

hash = 31 * (31 * 31 * value[0] + 31 * value[1] + value[2])
     + value[3];

Now multiply the three inner terms by that first 31:

hash = 31 * 31 * 31 * value[0] + 31 * 31 * value[1] + 31 * value[2]
     + value[3];

and convert to exponents (not really Java anymore):

hash = 31^3 * value[0] + 31^2 * value[1] + 31^1 * value[2] + value[3];

OTHER TIPS

unroll the loop. Then you get:

int hash = 0;

hash = 31*hash + value[0];
hash = 31*hash + value[1];
hash = 31*hash + value[2];
hash = 31*hash + value[3];
...
return hash;

Now you can do some mathematical manipulation, plug in 0 for the initial hash value:

hash = 31*(31*(31*(31*0 + value[0]) + value[1]) + value[2]) + value[3])...

Simplify it some more:

hash = 31^3*value[0] + 31^2*value[1] + 31^1*value[2] + 31^0*value[3]...

And that is essentially the original algorithm given.

Proof by induction:

T1(s) = 0 if |s| == 0, else s[|s|-1] + 31*T(s[0..|s|-1])
T2(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
P(n) = for all strings s s.t. |s| = n, T1(s) = T2(s)

Let s be an arbitrary string, and n=|s|
Base case: n = 0
    0 (additive identity, T2(s)) = 0 (T1(s))
    P(0)
Suppose n > 0
    T1(s) = s[n-1] + 31*T1(s[0:n-1])
    T2(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] = s[n-1] + 31*(s[0]*31^(n-2) + s[1]*31^(n-3) + ... + s[n-2]) = s[n-1] + 31*T2(s[0:n-1])
    By the induction hypothesis, (P(n-1)), T1(s[0:n-1]) = T2(s[0:n-1]) so
        s[n-1] + 31*T1(s[0..n-1]) = s[n-1] + T2(s[0:n-1])
    P(n)

I think I have it, and a proof was requested.

Take a look at the first few iterations and you'll see the pattern start to emerge:

hash0 = 0 + s0 = s0
hash1 = 31(hash0) + s1 = 31(s0) + s1
hash2 = 31(hash1) + s2 = 31(31(s0) + s1) + s2 = 312(s0) + 31(s1) + s2
...

Isn't it useless at all to count the hashcode of the String out of all characters? Imagine filenames or classnames with their full path put into HashSet. Or someone who uses HashSets of String documents instead of Lists because "HashSet always beats Lists".

I would do something like:

int off = offset;
char val[] = value;
int len = count;

int step = len <= 10 ? 1 : len / 10;

for (int i = 0; i < len; i+=step) {
   h = 31*h + val[off+i];
}
hash = h

At the end hashcode is nothing more than a hint.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top