Question

I am trying to write a hash table in Java based on a Princeton article about hashing.

The most commonly used method for hashing integers is called modular hashing: we choose the array size M to be prime, and, for any positive integer key k, compute the remainder when dividing k by M. This function is very easy to compute (k % M, in Java), and is effective in dispersing the keys evenly between 0 and M-1.

This part makes sense - whatever the number, the modulo of the Hash Table size will give some array index within that range.

Strings. Modular hashing works for long keys such as strings, too: we simply treat them as huge integers. For example, the code below computes a modular hash function for a String s, where R is a small prime integer (Java uses 31).

Then a code example is provided, which I don't get.

int hash = 0;
for (int i = 0; i < s.length(); i++)
    hash = (R * hash + s.charAt(i)) % M;

I refactored it as:

int someSmallPrimeInteger = 31;
int hash = 0;
for (int i = 0; i < key.length(); i++) {
    // A char widens to its UTF-16 code unit; Character.getNumericValue returns digit values ('a' -> 10, '!' -> -1) instead.
    int unicodeCharAsInt = key.charAt(i);
    hash = (someSmallPrimeInteger * hash + unicodeCharAsInt) % hashTableCapacity;
}

I don't understand a LOT about this:

  1. Why the loop? Why not just convert each char to its unicode value and add it up?
  2. How was this "small prime integer" chosen, why?
  3. Why does it need to be prime?
  4. (someSmallPrimeInteger * hash + unicodeCharAsInt) Why this at all? What's the significance of this function?

I understand this so poorly, I can't even phrase questions intelligently, even though it's so little code.


Solution

The article is not well written. Prime-sized arrays in hash tables have been considered obsolete since the early 1990s. In this particular case, taking the hash modulo a prime number is a poor attempt to turn a bad hash value into a better one. It relies on wishful thinking: the computed hash value is rarely a multiple of the prime, so the modulo is supposed to improve the distribution.
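To see what the loop in the question actually does, here is a minimal sketch comparing it with the "just add up the chars" idea. The class name, the method names, and the table size 97 are made up for illustration; only the loop body comes from the quoted article. Plain summation ignores character order, so anagrams always collide. Multiplying the running hash by R before adding each character makes the position of each character matter.

```java
public class StringHashDemo {
    // Naive idea from the question: add up the character values.
    // Character order is lost, so "abc" and "cba" get the same hash.
    static int sumHash(String s, int m) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash + s.charAt(i)) % m;
        }
        return hash;
    }

    // The article's loop: evaluates s[0]*R^(n-1) + ... + s[n-1]
    // step by step, reducing modulo m after every character.
    static int modularHash(String s, int r, int m) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (r * hash + s.charAt(i)) % m;
        }
        return hash;
    }

    public static void main(String[] args) {
        int m = 97; // hypothetical table size, chosen only for this demo
        int r = 31; // the multiplier mentioned in the article
        for (String s : new String[] {"abc", "cba", "bca"}) {
            System.out.println(s + ": sum=" + sumHash(s, m)
                    + " modular=" + modularHash(s, r, m));
        }
    }
}
```

With these assumed values, all three anagrams get the same sum hash but three different modular hashes.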

A good hash function consists of an initial value, a state (the larger the state, the better the hash), and a finalization operation. The final value is such that a good distribution is produced even if you truncate the hash value by reducing the number of bits. Modern, fast hash tables typically use power-of-two-sized arrays and have no need for prime-sized arrays.
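As a rough illustration of the power-of-two approach, here is a small sketch. The class and method names are hypothetical, and the mixing step is only one simple choice of finalization (to my knowledge it matches the spreading step OpenJDK's HashMap applies to hashCode(), but treat that as an assumption). The idea is to fold the high bits into the low bits first, then truncate with a bit mask instead of a modulo.

```java
public class PowerOfTwoIndexDemo {
    // Finalization step: XOR the high 16 bits into the low 16 bits,
    // so the low bits kept by the mask depend on the whole hash.
    static int mix(int h) {
        return h ^ (h >>> 16);
    }

    // capacity is assumed to be a power of two; the mask keeps the
    // low log2(capacity) bits and also works for negative hashes.
    static int indexFor(int hash, int capacity) {
        return mix(hash) & (capacity - 1);
    }

    public static void main(String[] args) {
        int capacity = 16; // hypothetical table size
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> " + indexFor(key.hashCode(), capacity));
        }
    }
}
```

Without the mixing step, hashes that differ only in their high bits (for example 0x10000 and 0x20000) would all land in bucket 0 of a 16-slot table. With it, they are spread over different buckets.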

Licensed under: CC-BY-SA with attribution