Question

I have a problem comparing a lot of string data (CSV files). The files have a unique ID per row, but they are not sorted and they are quite big.

So I tried to create two dictionaries where the key is the unique ID from the file and the value is the int returned by GetHashCode() on the string I want to watch for changes.

But, short example:

if ("30000100153:135933:Wuchterlova:335:2:Praha:16000".GetHashCode() == 
    "30000263338:158364:Radošovická:1323:10:Praha:10000".GetHashCode())
{
    Console.WriteLine("Hmm that's strange");
}

So is there any other way to do this?

I need as small a memory footprint as possible (because of the memory needed for two dictionaries built from two CSV files of about 3M rows each). Thank you.

Solution

First of all, the documentation for string.GetHashCode specifically says to not use string hash codes for any application where they need to be stable over time, because they are not. You should be using string hash codes for one purpose only, and that is to put strings in a dictionary.
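
If you do need a fingerprint that is stable across runs and machines, one option is to hash the string's UTF-8 bytes with a cryptographic hash such as SHA-256 and keep a prefix of the result. This is only a sketch of the idea; the helper name and the choice of an 8-byte prefix are mine, not anything the framework prescribes:

using System;
using System.Security.Cryptography;
using System.Text;

static class StableHash
{
    // A 64-bit fingerprint derived from SHA-256. Unlike string.GetHashCode(),
    // the result does not change between runs, processes or .NET versions.
    public static long Fingerprint(string s)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(s));
            return BitConverter.ToInt64(hash, 0); // keep the first 8 of the 32 bytes
        }
    }
}

Even a 64-bit fingerprint is not collision-proof; it only makes an accidental collision vastly less likely than a 32-bit hash code does.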

Second, hash codes are not unique. There are only about four billion possible hash codes (the hash code is a 32-bit integer), but there are far more than four billion possible strings, so many different strings must share a hash code. Because of the birthday problem, a collection of some tens of thousands of random strings already has a significant chance of containing two strings with the same hash code (around 77,000 strings puts the odds at roughly fifty-fifty), and with millions of strings a collision is practically certain. A graph of the probability is here:

http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx
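
As a back-of-the-envelope check (my own estimate, which assumes hash codes behave as if they were uniformly distributed over all 2^32 values), the standard birthday approximation 1 - e^(-n(n-1)/(2*2^32)) gives:

using System;

class BirthdayEstimate
{
    static void Main()
    {
        foreach (int n in new[] { 3000, 77000, 3000000 })
        {
            // P(at least one collision) ~= 1 - e^(-n(n-1) / (2 * 2^32))
            double p = 1 - Math.Exp(-(double)n * (n - 1) / (2.0 * 4294967296.0));
            Console.WriteLine("{0,9:N0} strings -> collision probability ~ {1:P1}", n, p);
        }
    }
}

which prints roughly 0.1% for 3,000 strings, 50% for 77,000 strings and essentially 100% for 3,000,000 strings.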

So you might wonder how the dictionary works at all then, if it is using GetHashCode but there can be collisions. The answer is: when you put two things X and Y in a dictionary that have the same hash code, they go in the same "bucket". When you search for X the dictionary goes to the right bucket using the hash code, and then does the expensive equality check on each element in the bucket until it finds the right one. Since each bucket is small, this check is still fast enough most of the time.
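
For illustration only, here is a toy version of that lookup logic. It is not the real Dictionary<TKey,TValue> implementation (the class name and the fixed bucket count of 1024 are mine); it just shows the bucket-then-equality idea:

using System;
using System.Collections.Generic;

class ToyStringSet
{
    // Each bucket holds every string whose hash code maps to that slot.
    private readonly List<string>[] buckets = new List<string>[1024];

    public void Add(string s)
    {
        int slot = (s.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[slot] == null)
            buckets[slot] = new List<string>();
        buckets[slot].Add(s);
    }

    public bool Contains(string s)
    {
        int slot = (s.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        List<string> bucket = buckets[slot];
        if (bucket == null)
            return false;

        // The hash code only chooses the bucket; the equality check below is
        // what actually decides whether the string is present.
        foreach (string candidate in bucket)
            if (candidate == s)
                return true;
        return false;
    }
}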

I don't know how to solve your problem, but using a 32-bit hash is clearly not the right way to do it, so try something else. My suggestion would be to start using a database rather than CSV files if you have a lot of data to manage. That's what a database is for.

I have written many articles on string hashing that might interest you:

http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/

http://blogs.msdn.com/b/ericlippert/archive/2011/07/12/what-curious-property-does-this-string-have.aspx

http://blogs.msdn.com/b/ericlippert/archive/2005/10/24/do-not-use-string-hashes-for-security-purposes.aspx

http://blogs.msdn.com/b/ericlippert/archive/tags/hashing/

OTHER TIPS

You don't actually want to use GetHashCode. You should just compare the strings directly. However, comparing each of 3M strings against each of another 3M strings is going to be difficult to do in any reasonable time without sorting the lists first.

My approach would be to sort each list first (how to do that depends on a number of things), read the first sorted record from each (let's call them A and B) and:

  1. if A = B then do whatever and read next A and next B and repeat
  2. if A < B do whatever and read next A and repeat
  3. if A > B do whatever and read next B and repeat

... where 'do whatever' means do whatever is required in that situation, and 'repeat' means go back to step 1. A sketch of this merge loop is shown below.

(This process is how mainframe computers used to merge stacks of cards and has a particular name, but I can't for the life of me remember it!)
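
For concreteness, here is a minimal sketch of that merge loop (my own illustration, not code from the answer). It assumes both files have already been sorted by the unique ID and that the ID is the first colon-separated field, as in the strings in the question; the file names are made up:

using System;
using System.IO;

class SortedCompare
{
    // Assumes the unique ID is the first colon-separated field of each line.
    static string KeyOf(string line) { return line.Split(':')[0]; }

    static void Main()
    {
        // Hypothetical file names; both files must already be sorted by ID.
        using (StreamReader left = new StreamReader("old_sorted.csv"))
        using (StreamReader right = new StreamReader("new_sorted.csv"))
        {
            string a = left.ReadLine();
            string b = right.ReadLine();

            while (a != null && b != null)
            {
                int cmp = string.CompareOrdinal(KeyOf(a), KeyOf(b));
                if (cmp == 0)
                {
                    // Same ID in both files: compare the whole lines directly.
                    if (a != b) Console.WriteLine("Changed: " + KeyOf(a));
                    a = left.ReadLine();
                    b = right.ReadLine();
                }
                else if (cmp < 0)
                {
                    Console.WriteLine("Only in old file: " + KeyOf(a));
                    a = left.ReadLine();
                }
                else
                {
                    Console.WriteLine("Only in new file: " + KeyOf(b));
                    b = right.ReadLine();
                }
            }

            // Whatever is left over exists in only one of the files.
            for (; a != null; a = left.ReadLine()) Console.WriteLine("Only in old file: " + KeyOf(a));
            for (; b != null; b = right.ReadLine()) Console.WriteLine("Only in new file: " + KeyOf(b));
        }
    }
}

Each line is read at most once and only the two current lines are held in memory, which keeps the footprint small.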

Cheers -

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow