Why does Java's hashCode() in String use 31 as a multiplier?

https://stackoverflow.com/questions/299304

08-07-2019
|

Question

Per the Java documentation, the hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation.

Why is 31 used as a multiplier?

I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?

Solution

According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on stackoverflow):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

(from Chapter 3, Item 9: Always override hashcode when you override equals, page 48)

OTHER TIPS

As Goodrich and Tamassia point out, If you take over 50,000 English words (formed as the union of the word lists provided in two variants of Unix), using the constants 31, 33, 37, 39, and 41 will produce less than 7 collisions in each case. Knowing this, it should come as no surprise that many Java implementations choose one of these constants.

Coincidentally, I was in the middle of reading the section "polynomial hash codes" when I saw this question.

EDIT: here is link to the ~10mb PDF book i'm referring to above. See section 10.2 Hash Tables (page 413) of Data Structures and Algorithms in Java

On (mostly) old processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:

RSB       r1, r0, r0, ASL #5    ; r1 := - r0 + (r0<<5)

Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow this is still a win. Modern processors tend to have fast multipliers so it doesn't make much difference, so long as 32 goes on the correct side.

It's not a great hash algorithm, but it's good enough and better than the 1.0 code (and very much better than the 1.0 spec!).

By multiplying, bits are shifted to the left. This uses more of the available space of hash codes, reducing collisions.

By not using a power of two, the lower-order, rightmost bits are populated as well, to be mixed with the next piece of data going into the hash.

The expression n * 31 is equivalent to (n << 5) - n.

You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:

Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.

So the reasoning was not as rational as many of the answers here seem to imply. But we're all good in coming up with rational reasons after gut decisions (and even Bloch might be prone to that).

Actually, 37 would work pretty well! z := 37 * x can be computed as y := x + 8 * x; z := x + 4 * y. Both steps correspond to one LEA x86 instructions, so this is extremely fast.

In fact, multiplication with the even-larger prime 73 could be done at the same speed by setting y := x + 8 * x; z := x + 8 * y.

Using 73 or 37 (instead of 31) might be better, because it leads to denser code: The two LEA instructions only take 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy bridge architecture, with an increased latency of 3 cycles.

Moreover, 73 is Sheldon Cooper's favorite number.

Neil Coffey explains why 31 is used under Ironing out the bias.

Basically using 31 gives you a more even set-bit probability distribution for the hash function.

From JDK-4045622, where Joshua Bloch describes the reasons why that particular (new) String.hashCode() implementation was chosen

The table below summarizes the performance of the various hash functions described above, for three data sets:

1) All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).

2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/ and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).

3) A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters).

The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element).
                          Webster's   Code Strings    URLs
                          ---------   ------------    ----
Current Java Fn.          1.2509      1.2738          13.2560
P(37)    [Java]           1.2508      1.2481          1.2454
P(65599) [Aho et al]      1.2490      1.2510          1.2450
P(31)    [K+R]            1.2500      1.2488          1.2425
P(33)    [Torek]          1.2500      1.2500          1.2453
Vo's Fn                   1.2487      1.2471          1.2462
WAIS Fn                   1.2497      1.2519          1.2452
Weinberger's Fn(MatPak)   6.5169      7.2142          30.6864
Weinberger's Fn(24)       1.3222      1.2791          1.9732
Weinberger's Fn(28)       1.2530      1.2506          1.2439
Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function.

I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.

Josh

I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.

Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.

The numbers that make up using a hash are typically:

modulus of the data type you put it into (2^32 or 2^64)
modulus of the bucket count in your hashtable (varies. In java used to be prime, now 2^n)
multiply or shift by a magic number in your mixing function
The input value

You really only get to control a couple of these values, so a little extra care is due.

In latest version of JDK, 31 is still used. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#hashCode()

The purpose of hash string is

unique (Let see operator ^ in hashcode calculation document, it help unique)
cheap cost for calculating

31 is max value can put in 8 bit (= 1 byte) register. is largest prime number can put in 1 byte register, is odd number.

Multiply 31 is <<5 then subtract itself, therefore need cheap resources.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow