Question

I am trying to get a better understanding of how the internals of hash sets, e.g. HashSet<T>, work and why they are performant. I discovered the following article, which implements a simple example with a bucket list: http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/.

As far as I understand this article (and I also thought that way before), the bucket list groups a certain number of elements into each bucket. A bucket is identified by the hash code, namely the value GetHashCode returns when called on the element. I thought the better performance was based on the fact that there are fewer buckets than elements.

Now I have written the following naive test code:

    public class CustomHashCode
    {
        public int Id { get; set; }

        public override int GetHashCode()
        {
            //return Id.GetHashCode(); // Way better performance
            return Id % 40; // Bad performance! But why?
        }
        public override bool Equals(object obj)
        {
            return ((CustomHashCode) obj).Id == Id;
        }

    }

And here the profiler:

    public static void TestCustomHashCode(int iterations)
    {
        var hashSet = new HashSet<CustomHashCode>();
        for (int j = 0; j < iterations; j++)
        {
            hashSet.Add(new CustomHashCode() { Id = j });
        }

        var chc = hashSet.First();
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        for (int j = 0; j < iterations; j++)
        {
            hashSet.Contains(chc);
        }
        stopwatch.Stop();

        Console.WriteLine(string.Format("Elapsed time (ms): {0}", stopwatch.ElapsedMilliseconds));
    }

My naive thought was: let's reduce the number of buckets (with a simple modulo); that should increase performance. But it is terrible (on my system it takes about 4 seconds with 50000 iterations). I also thought that if I simply returned the Id as the hash code, performance would be poor, since I would end up with 50000 buckets. But the opposite is the case; I guess I simply produced tons of so-called collisions instead of improving anything. But then again, how do the bucket lists work?
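Grouping the ids by the value the modulo hash returns makes the collision pattern visible (a standalone check, separate from the test above):

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        // Group the 50,000 ids by the value the Id % 40 hash would return.
        var groups = Enumerable.Range(0, 50000)
            .GroupBy(id => id % 40)
            .ToList();

        Console.WriteLine($"distinct hash codes: {groups.Count}");               // 40
        Console.WriteLine($"items per hash code: {groups.Max(g => g.Count())}"); // 1250
    }
}
```

Only 40 distinct hash codes exist, so 1250 items share each one.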


Solution

A Contains check basically:

  1. Gets the hashcode of the item.
  2. Finds the corresponding bucket - this is a direct array lookup based on the hashcode of the item.
  3. If the bucket exists, tries to find the item in the bucket - this iterates over all the items in the bucket.

By restricting the number of buckets, you've increased the number of items in each bucket, and thus the number of items that the hashset must iterate through, checking for equality, in order to see if an item exists or not. Thus it takes longer to see if a given item exists.

You've probably decreased the memory footprint of the hashset; you may even have decreased the insertion time, although I doubt it. You haven't decreased the existence-check time.
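The difference is easy to reproduce with a side-by-side timing. The Item type below is a hypothetical stand-in for the question's class, with the hash strategy switchable per instance:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for the question's class; the hash strategy is
// switchable so both variants can be timed side by side.
class Item
{
    public int Id;
    public bool UseBadHash;

    public override int GetHashCode() => UseBadHash ? Id % 40 : Id;

    public override bool Equals(object obj) => obj is Item other && other.Id == Id;
}

class Program
{
    static long Measure(bool badHash, int n)
    {
        var set = new HashSet<Item>();
        for (int i = 0; i < n; i++)
            set.Add(new Item { Id = i, UseBadHash = badHash });

        var probe = new Item { Id = 0, UseBadHash = badHash };
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            set.Contains(probe);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int n = 50000;
        Console.WriteLine($"Id as hash:   {Measure(false, n)} ms");
        Console.WriteLine($"Id % 40 hash: {Measure(true, n)} ms"); // far slower: long bucket chains
    }
}
```

With the modulo hash, each Contains has to walk a chain of roughly n / 40 colliding items.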

Other tips

Reducing the number of buckets will not increase performance. Actually, the GetHashCode method of Int32 returns the integer value itself, which is ideal for performance because it spreads the values over as many distinct hash codes as possible.

What gives a hash table its performance is the conversion from the key to the hash code, which means that it can quickly eliminate most of the items in the collection. The only items it has to consider are the ones in the same bucket. If you have few buckets, it can eliminate far fewer items.

The worst possible implementation of GetHashCode will cause all items to go in the same bucket:

    public override int GetHashCode() {
      return 0;
    }

This is still a valid implementation, but it means that the hash table gets the same performance as a regular list, i.e. it has to loop through all items in the collection to find a match.
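The degradation can be observed directly by counting Equals calls. CollidingItem below is a hypothetical type whose instances all hash to 0, so every item lands in the same bucket:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: every instance hashes to 0, forcing all items into one bucket.
class CollidingItem
{
    public static int EqualsCalls; // counts the comparisons the set performs
    public int Id;

    public override int GetHashCode() => 0;

    public override bool Equals(object obj)
    {
        EqualsCalls++;
        return obj is CollidingItem other && other.Id == Id;
    }
}

class Program
{
    static void Main()
    {
        var set = new HashSet<CollidingItem>();
        for (int i = 0; i < 1000; i++)
            set.Add(new CollidingItem { Id = i });

        CollidingItem.EqualsCalls = 0;
        // Id = 0 was inserted first, so in a chained implementation it sits
        // at the far end of the single collision chain.
        set.Contains(new CollidingItem { Id = 0 });

        // One lookup needs on the order of the collection size in Equals calls.
        Console.WriteLine(CollidingItem.EqualsCalls);
    }
}
```

A single Contains triggers roughly as many Equals calls as there are items in the set, exactly what a linear scan over a list would do.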

A simple HashSet<T> could be implemented like this (a simplified sketch; fixed capacity, no resizing):

    class HashSet<T>
    {
        const int Capacity = 31;

        struct Element
        {
            public int Hash;  // cached hash code of the item
            public int Next;  // index of the next element in the same bucket, or -1
            public T Item;
        }

        // Each bucket holds the index of its first element, or -1 if empty.
        int[] buckets = new int[Capacity];
        Element[] data = new Element[Capacity];

        bool Contains(T item)
        {
            int hash = item.GetHashCode();
            // Bucket lookup is a simple array lookup => cheap
            int index = buckets[(uint)hash % Capacity];
            // Search for the actual item is linear in the number of items in the bucket
            while (index >= 0)
            {
                if (data[index].Hash == hash && Equals(data[index].Item, item))
                    return true;
                index = data[index].Next;
            }
            return false;
        }
    }

If you look at this, the cost of searching in Contains is proportional to the number of items in the bucket. So having more buckets makes the search cheaper, but once the number of buckets exceeds the number of items, the gain of additional buckets quickly diminishes.
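To make the sketch concrete, here is a minimal runnable variant that also includes an Add method. TinyHashSet is a hypothetical name; it is an illustration of chained buckets, not the real BCL implementation (fixed capacity, no resizing):

```csharp
using System;

// Minimal illustration of chained bucket storage; fixed capacity, no resizing.
class TinyHashSet<T>
{
    const int Capacity = 31;

    struct Element
    {
        public int Hash;  // cached hash code of the item
        public int Next;  // index of the next element in the same bucket, or -1
        public T Item;
    }

    readonly int[] buckets = new int[Capacity]; // head index per bucket, -1 = empty
    readonly Element[] data = new Element[Capacity];
    int count;

    public TinyHashSet()
    {
        for (int i = 0; i < Capacity; i++) buckets[i] = -1;
    }

    public bool Add(T item)
    {
        if (Contains(item)) return false;
        if (count == Capacity) throw new InvalidOperationException("set is full");

        int hash = item.GetHashCode();
        int bucket = (int)((uint)hash % Capacity);
        // Prepend the new element to the bucket's chain.
        data[count] = new Element { Hash = hash, Next = buckets[bucket], Item = item };
        buckets[bucket] = count;
        count++;
        return true;
    }

    public bool Contains(T item)
    {
        int hash = item.GetHashCode();
        // Bucket lookup is a cheap array index; the chain walk is linear
        // in the number of items sharing the bucket.
        int index = buckets[(uint)hash % Capacity];
        while (index >= 0)
        {
            if (data[index].Hash == hash && Equals(data[index].Item, item))
                return true;
            index = data[index].Next;
        }
        return false;
    }
}

class Program
{
    static void Main()
    {
        var set = new TinyHashSet<int>();
        Console.WriteLine(set.Add(5));      // True
        Console.WriteLine(set.Add(5));      // False: duplicate
        Console.WriteLine(set.Contains(5)); // True
        Console.WriteLine(set.Contains(7)); // False
    }
}
```

Note that comparing the cached Hash field before calling Equals is the early-out mentioned below: items in the same bucket with different full hash codes are skipped without an Equals call.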

Having diverse hash codes also serves as an early out when comparing objects within a bucket, avoiding potentially costly Equals calls.

In short, GetHashCode should be as diverse as possible. It's the job of HashSet<T> to reduce that large space to an appropriate number of buckets, which is approximately the number of items in the collection (typically within a factor of two).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow