Question

I am trying to get a better understanding of how the internals of hash sets, e.g. HashSet<T>, work and why they are performant. I discovered the following article, which implements a simple example with a bucket list: http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/.

As far as I understand this article (and I also thought that way before), the bucket list groups a certain number of elements into each bucket. A bucket is identified by the hash code, namely the value GetHashCode returns when called on the element. I thought the better performance was based on the fact that there are fewer buckets than elements.

Now I have written the following naive test code:

    public class CustomHashCode
    {
        public int Id { get; set; }

        public override int GetHashCode()
        {
            //return Id.GetHashCode(); // Way better performance
            return Id % 40; // Bad performance! But why?
        }
        public override bool Equals(object obj)
        {
            return ((CustomHashCode) obj).Id == Id;
        }

    }

And here the profiler:

    public static void TestCustomHashCode(int iterations)
    {
        var hashSet = new HashSet<CustomHashCode>();
        for (int j = 0; j < iterations; j++)
        {
            hashSet.Add(new CustomHashCode() { Id = j });
        }

        var chc = hashSet.First();
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        for (int j = 0; j < iterations; j++)
        {
            hashSet.Contains(chc);
        }
        stopwatch.Stop();

        Console.WriteLine(string.Format("Elapsed time (ms): {0}", stopwatch.ElapsedMilliseconds));
    }

My naive thought was: let's reduce the number of buckets (with a simple modulo); that should increase performance. But it is terrible (on my system it takes about 4 seconds with 50000 iterations). I also thought that if I simply returned the Id as the hash code, performance would be poor, since I would end up with 50000 buckets. But the opposite is the case; I guess I simply produced tons of so-called collisions instead of improving anything. But then again, how do the bucket lists work?
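Grouping the ids by the value the modulo hash returns makes the collision pattern visible (a standalone check, separate from the test above):

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        // Group the 50,000 ids by the value the Id % 40 hash would return.
        var groups = Enumerable.Range(0, 50000)
            .GroupBy(id => id % 40)
            .ToList();

        Console.WriteLine($"distinct hash codes: {groups.Count}");               // 40
        Console.WriteLine($"items per hash code: {groups.Max(g => g.Count())}"); // 1250
    }
}
```

Only 40 distinct hash codes exist, so 1250 items share each one.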


Solution

A Contains check basically:

  1. Gets the hashcode of the item.
  2. Finds the corresponding bucket - this is a direct array lookup based on the hashcode of the item.
  3. If the bucket exists, tries to find the item in the bucket - this iterates over all the items in the bucket.

By restricting the number of buckets, you've increased the number of items in each bucket, and thus the number of items that the hashset must iterate through, checking for equality, in order to see if an item exists or not. Thus it takes longer to see if a given item exists.

You've probably decreased the memory footprint of the hashset; you may even have decreased the insertion time, although I doubt it. You haven't decreased the existence-check time.
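The difference is easy to reproduce with a side-by-side timing. The Item type below is a hypothetical stand-in for the question's class, with the hash strategy switchable per instance:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for the question's class; the hash strategy is
// switchable so both variants can be timed side by side.
class Item
{
    public int Id;
    public bool UseBadHash;

    public override int GetHashCode() => UseBadHash ? Id % 40 : Id;

    public override bool Equals(object obj) => obj is Item other && other.Id == Id;
}

class Program
{
    static long Measure(bool badHash, int n)
    {
        var set = new HashSet<Item>();
        for (int i = 0; i < n; i++)
            set.Add(new Item { Id = i, UseBadHash = badHash });

        var probe = new Item { Id = 0, UseBadHash = badHash };
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            set.Contains(probe);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int n = 50000;
        Console.WriteLine($"Id as hash:   {Measure(false, n)} ms");
        Console.WriteLine($"Id % 40 hash: {Measure(true, n)} ms"); // far slower: long bucket chains
    }
}
```

With the modulo hash, each Contains has to walk a chain of roughly n / 40 colliding items.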

Other tips

Reducing the number of buckets will not increase performance. Actually, the GetHashCode method of Int32 returns the integer value itself, which is ideal for performance because it spreads the values over as many distinct hash codes as possible.

What gives a hash table its performance is the conversion from the key to the hash code, which means that it can quickly eliminate most of the items in the collection. The only items it has to consider are the ones in the same bucket. If you have few buckets, it can eliminate far fewer items.

The worst possible implementation of GetHashCode will cause all items to go in the same bucket:

    public override int GetHashCode() {
      return 0;
    }

This is still a valid implementation, but it means that the hash table gets the same performance as a regular list, i.e. it has to loop through all items in the collection to find a match.
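The degradation can be observed directly by counting Equals calls. CollidingItem below is a hypothetical type whose instances all hash to 0, so every item lands in the same bucket:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: every instance hashes to 0, forcing all items into one bucket.
class CollidingItem
{
    public static int EqualsCalls; // counts the comparisons the set performs
    public int Id;

    public override int GetHashCode() => 0;

    public override bool Equals(object obj)
    {
        EqualsCalls++;
        return obj is CollidingItem other && other.Id == Id;
    }
}

class Program
{
    static void Main()
    {
        var set = new HashSet<CollidingItem>();
        for (int i = 0; i < 1000; i++)
            set.Add(new CollidingItem { Id = i });

        CollidingItem.EqualsCalls = 0;
        // Id = 0 was inserted first, so in a chained implementation it sits
        // at the far end of the single collision chain.
        set.Contains(new CollidingItem { Id = 0 });

        // One lookup needs on the order of the collection size in Equals calls.
        Console.WriteLine(CollidingItem.EqualsCalls);
    }
}
```

A single Contains triggers roughly as many Equals calls as there are items in the set, exactly what a linear scan over a list would do.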

A simple HashSet<T> could be implemented like this (a simplified sketch; fixed capacity, no resizing):

    class HashSet<T>
    {
        const int Capacity = 31;

        struct Element
        {
            public int Hash;  // cached hash code of the item
            public int Next;  // index of the next element in the same bucket, or -1
            public T Item;
        }

        // Each bucket holds the index of its first element, or -1 if empty.
        int[] buckets = new int[Capacity];
        Element[] data = new Element[Capacity];

        bool Contains(T item)
        {
            int hash = item.GetHashCode();
            // Bucket lookup is a simple array lookup => cheap
            int index = buckets[(uint)hash % Capacity];
            // Search for the actual item is linear in the number of items in the bucket
            while (index >= 0)
            {
                if (data[index].Hash == hash && Equals(data[index].Item, item))
                    return true;
                index = data[index].Next;
            }
            return false;
        }
    }

If you look at this, the cost of searching in Contains is proportional to the number of items in the bucket. So having more buckets makes the search cheaper, but once the number of buckets exceeds the number of items, the gain of additional buckets quickly diminishes.
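To make the sketch concrete, here is a minimal runnable variant that also includes an Add method. TinyHashSet is a hypothetical name; it is an illustration of chained buckets, not the real BCL implementation (fixed capacity, no resizing):

```csharp
using System;

// Minimal illustration of chained bucket storage; fixed capacity, no resizing.
class TinyHashSet<T>
{
    const int Capacity = 31;

    struct Element
    {
        public int Hash;  // cached hash code of the item
        public int Next;  // index of the next element in the same bucket, or -1
        public T Item;
    }

    readonly int[] buckets = new int[Capacity]; // head index per bucket, -1 = empty
    readonly Element[] data = new Element[Capacity];
    int count;

    public TinyHashSet()
    {
        for (int i = 0; i < Capacity; i++) buckets[i] = -1;
    }

    public bool Add(T item)
    {
        if (Contains(item)) return false;
        if (count == Capacity) throw new InvalidOperationException("set is full");

        int hash = item.GetHashCode();
        int bucket = (int)((uint)hash % Capacity);
        // Prepend the new element to the bucket's chain.
        data[count] = new Element { Hash = hash, Next = buckets[bucket], Item = item };
        buckets[bucket] = count;
        count++;
        return true;
    }

    public bool Contains(T item)
    {
        int hash = item.GetHashCode();
        // Bucket lookup is a cheap array index; the chain walk is linear
        // in the number of items sharing the bucket.
        int index = buckets[(uint)hash % Capacity];
        while (index >= 0)
        {
            if (data[index].Hash == hash && Equals(data[index].Item, item))
                return true;
            index = data[index].Next;
        }
        return false;
    }
}

class Program
{
    static void Main()
    {
        var set = new TinyHashSet<int>();
        Console.WriteLine(set.Add(5));      // True
        Console.WriteLine(set.Add(5));      // False: duplicate
        Console.WriteLine(set.Contains(5)); // True
        Console.WriteLine(set.Contains(7)); // False
    }
}
```

Note that comparing the cached Hash field before calling Equals is the early-out mentioned below: items in the same bucket with different full hash codes are skipped without an Equals call.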

Having diverse hash codes also serves as an early out when comparing objects within a bucket, avoiding potentially costly Equals calls.

In short, GetHashCode should be as diverse as possible. It's the job of HashSet<T> to reduce that large space to an appropriate number of buckets, which is approximately the number of items in the collection (typically within a factor of two).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow