Can't understand Poisson part of Hash tables from Sun documentation

Question 1

A HashMap is organized as an array of "buckets" based on the hashCode of the elements being inserted. Each bucket is (by default) a linked list of elements. Each bucket would have very few elements (ideally, at most one) so that finding a particular element requires very little searching down a linked list.

To take a simple example, let's say we have a HashMap of capacity 4 and a load factor of 0.75 (the default) which means that it can hold up to 3 elements before being resized. An ideal distribution of elements into buckets would look something like this:

bucket | elements
-------+---------
     0 | Z
     1 | X
     2 |
     3 | Y

so any element can be found immediately without any searching within a bucket. On the other hand, a very poor distribution of elements would look like this:

bucket | elements
-------+---------
     0 | 
     1 | Z -> X -> Y
     2 |
     3 |

This will occur if all of the elements happen to hash into the same bucket, so searching for element Y will require traversing down the linked list.

This might not seem like a big deal, but if you have a HashMap with a capacity of 10,000 elements and there are 7,500 elements in a single bucket on a linked list, searching for a particular element will degrade to linear search time -- which is what using a HashMap is trying to avoid.

One issue is that the hashCode for distributing elements into buckets is determined by the objects themselves, and objects' hashCode implementations aren't always very good. If the hashCode isn't very good, then elements can bunch up in certain buckets, and the HashMap will begin to perform poorly.

The comment from the code is talking about the likelihood of different lengths of linked lists appearing in each bucket. First, it assumes the hashCodes are randomly distributed -- which isn't always the case! -- and I think it also assumes that the number of elements in the HashMap is 50% of the number of buckets. Under these assumptions, according to that Poisson distribution, 60.6% of the buckets will be empty, 30.3% will have one element, 7.5% will have two elements, 1.2% will have three elements, and so forth.

In other words, given those (ideal) assumptions, the linked lists within each bucket will usually be very short.

In JDK 8 there is an optimization to turn a linked list into a tree above a certain threshold size, so that at least performance degrades to O(log n) instead of O(n) in the worst case. The question is, what value should be chosen as the threshold? That's what this discussion is all about. The current threshold value TREEIFY_THRESHOLD is 8. Again, under these ideal assumptions, a bucket with a linked list of length 8 will occur only 0.000006% of the time. So if we get a linked list that long, something is clearly not ideal!! It may mean, for instance, that the objects being stored have exceptionally bad hashCodes, so the HashMap has to switch from a linked list to a tree in order to avoid excessive performance degradation.

The link to the source file with the comment in question is here:

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/jdk8-b119/src/share/classes/java/util/HashMap.java

Question 2

The accepted answer is great but I just wanted to fill in why it's reasonable to use a Poisson distribution in particular since I had the exact same question when reading that piece of code.

In the case that we have a fixed number of items k being inserted into a fixed number of buckets n then the number of items in a fixed bucket should follow a Binomial Distribution with k trials and probality of success 1 / n. This is pretty easy to see; if the hash is random then each item is put into our bucket with probability 1 / n and there are k items.

When k is large and the mean of the Binomial Distribution is small then a good approximation is a Poisson Distribution with the same mean. In this case the mean is k / n, the the load factor of the hash table. Taking 0.5 for the mean is reasonable because the table tolerates a load factor of at most 0.75 before resizing so the table will be used a great deal with a load factor of around 0.5.