Question

I have a problem comparing a lot of string data (CSV files). The files have a unique ID per row, but they are not sorted and they are quite big.

So I tried to create two dictionaries where the key is the unique ID from the file and the value is the int returned by GetHashCode() on the string I want to watch for changes.

But, short example:

if ("30000100153:135933:Wuchterlova:335:2:Praha:16000".GetHashCode() == 
    "30000263338:158364:Radošovická:1323:10:Praha:10000".GetHashCode())
{
    Console.WriteLine("Hmm that's strange");
}

So is there any other way to do this?

I need as small a memory footprint as possible (because of the memory needed for two dictionaries built from two CSV files of about 3M rows each). Thank you.

Solution

First of all, the documentation for string.GetHashCode specifically says to not use string hash codes for any application where they need to be stable over time, because they are not. You should be using string hash codes for one purpose only, and that is to put strings in a dictionary.
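
If you do need a fingerprint that is stable across runs and machines, one option is to hash the string's UTF-8 bytes with a cryptographic hash such as SHA-256 and keep a prefix of the result. This is only a sketch of the idea; the helper name and the choice of an 8-byte prefix are mine, not anything the framework prescribes:

using System;
using System.Security.Cryptography;
using System.Text;

static class StableHash
{
    // A 64-bit fingerprint derived from SHA-256. Unlike string.GetHashCode(),
    // the result does not change between runs, processes or .NET versions.
    public static long Fingerprint(string s)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(s));
            return BitConverter.ToInt64(hash, 0); // keep the first 8 of the 32 bytes
        }
    }
}

Even a 64-bit fingerprint is not collision-proof; it only makes an accidental collision vastly less likely than a 32-bit hash code does.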

Second, hash codes are not unique. There are only about four billion possible hash codes (the hash code is a 32-bit integer), but there are far more than four billion possible strings, so many different strings must share a hash code. Because of the birthday problem, a collection of some tens of thousands of random strings already has a significant chance of containing two strings with the same hash code (around 77,000 strings puts the odds at roughly fifty-fifty), and with millions of strings a collision is practically certain. A graph of the probability is here:

http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx
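
As a back-of-the-envelope check (my own estimate, which assumes hash codes behave as if they were uniformly distributed over all 2^32 values), the standard birthday approximation 1 - e^(-n(n-1)/(2*2^32)) gives:

using System;

class BirthdayEstimate
{
    static void Main()
    {
        foreach (int n in new[] { 3000, 77000, 3000000 })
        {
            // P(at least one collision) ~= 1 - e^(-n(n-1) / (2 * 2^32))
            double p = 1 - Math.Exp(-(double)n * (n - 1) / (2.0 * 4294967296.0));
            Console.WriteLine("{0,9:N0} strings -> collision probability ~ {1:P1}", n, p);
        }
    }
}

which prints roughly 0.1% for 3,000 strings, 50% for 77,000 strings and essentially 100% for 3,000,000 strings.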

So you might wonder how the dictionary works at all then, if it is using GetHashCode but there can be collisions. The answer is: when you put two things X and Y in a dictionary that have the same hash code, they go in the same "bucket". When you search for X the dictionary goes to the right bucket using the hash code, and then does the expensive equality check on each element in the bucket until it finds the right one. Since each bucket is small, this check is still fast enough most of the time.
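
For illustration only, here is a toy version of that lookup logic. It is not the real Dictionary<TKey,TValue> implementation (the class name and the fixed bucket count of 1024 are mine); it just shows the bucket-then-equality idea:

using System;
using System.Collections.Generic;

class ToyStringSet
{
    // Each bucket holds every string whose hash code maps to that slot.
    private readonly List<string>[] buckets = new List<string>[1024];

    public void Add(string s)
    {
        int slot = (s.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[slot] == null)
            buckets[slot] = new List<string>();
        buckets[slot].Add(s);
    }

    public bool Contains(string s)
    {
        int slot = (s.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        List<string> bucket = buckets[slot];
        if (bucket == null)
            return false;

        // The hash code only chooses the bucket; the equality check below is
        // what actually decides whether the string is present.
        foreach (string candidate in bucket)
            if (candidate == s)
                return true;
        return false;
    }
}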

I don't know how to solve your problem, but using a 32-bit hash is clearly not the right way to do it, so try something else. My suggestion would be to start using a database rather than CSV files if you have a lot of data to manage. That's what a database is for.

I have written many articles on string hashing that might interest you:

http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/

http://blogs.msdn.com/b/ericlippert/archive/2011/07/12/what-curious-property-does-this-string-have.aspx

http://blogs.msdn.com/b/ericlippert/archive/2005/10/24/do-not-use-string-hashes-for-security-purposes.aspx

http://blogs.msdn.com/b/ericlippert/archive/tags/hashing/

OTHER TIPS

You don't actually want to use GetHashCode. You should just compare the strings directly. However, comparing each of 3M strings against each of another 3M strings is going to be difficult to do in any reasonable time without sorting the lists first.

My approach would be to sort each list first (how to do that depends on a number of things), read the first sorted record from each (let's call them A and B) and:

  1. if A = B then do whatever and read next A and next B and repeat
  2. if A < B do whatever and read next A and repeat
  3. if A > B do whatever and read next B and repeat

... where 'do whatever' means do whatever is required in that situation, and 'repeat' means go back to step 1. A sketch of this merge loop is shown below.

(This process is how mainframe computers used to merge stacks of cards and has a particular name, but I can't for the life of me remember it!)
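
For concreteness, here is a minimal sketch of that merge loop (my own illustration, not code from the answer). It assumes both files have already been sorted by the unique ID and that the ID is the first colon-separated field, as in the strings in the question; the file names are made up:

using System;
using System.IO;

class SortedCompare
{
    // Assumes the unique ID is the first colon-separated field of each line.
    static string KeyOf(string line) { return line.Split(':')[0]; }

    static void Main()
    {
        // Hypothetical file names; both files must already be sorted by ID.
        using (StreamReader left = new StreamReader("old_sorted.csv"))
        using (StreamReader right = new StreamReader("new_sorted.csv"))
        {
            string a = left.ReadLine();
            string b = right.ReadLine();

            while (a != null && b != null)
            {
                int cmp = string.CompareOrdinal(KeyOf(a), KeyOf(b));
                if (cmp == 0)
                {
                    // Same ID in both files: compare the whole lines directly.
                    if (a != b) Console.WriteLine("Changed: " + KeyOf(a));
                    a = left.ReadLine();
                    b = right.ReadLine();
                }
                else if (cmp < 0)
                {
                    Console.WriteLine("Only in old file: " + KeyOf(a));
                    a = left.ReadLine();
                }
                else
                {
                    Console.WriteLine("Only in new file: " + KeyOf(b));
                    b = right.ReadLine();
                }
            }

            // Whatever is left over exists in only one of the files.
            for (; a != null; a = left.ReadLine()) Console.WriteLine("Only in old file: " + KeyOf(a));
            for (; b != null; b = right.ReadLine()) Console.WriteLine("Only in new file: " + KeyOf(b));
        }
    }
}

Each line is read at most once and only the two current lines are held in memory, which keeps the footprint small.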

Cheers -

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow