Question

In the "Notes to Implementers" section in the documentation for the GetHashCode method of the IEqualityComparer<T> interface, it states:

Implementations are required to ensure that if the Equals method returns true for two objects x and y, then the value returned by the GetHashCode method for x must equal the value returned for y.

It's well understood why you want two instances of T to return the same hash code when they are equal: different hash codes would mean the objects are not equal, while identical hash codes mean only that they are potentially equal.

I read the quote as leaving the return value undefined when two instances are not equal (even if their values might suggest they should be).

Take the following example. I have a sequence of int? values that I want to use for statistical classification, where each non-null int? represents an attribute of a class (think enum values). When these values are null, you don't want them to be considered equal, as that would bias the training set towards missing values. If anything, you would want a null value compared to another null value to return false.

The thing is, in the GetHashCode method, when given a null, I'd probably want to return 0 (or some other number, maybe Int32.MinValue). Now, I know that when anything is keyed using this IEqualityComparer<T> implementation, checking for the existence of a key in the dictionary will not perform optimally in those cases.
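A minimal sketch of such a comparer (the class name is mine, not from the question): non-null attributes compare by value, two nulls never compare equal, and every null shares hash code 0.

```csharp
using System;
using System.Collections.Generic;

var comparer = new AttributeComparer();
Console.WriteLine(comparer.Equals(3, 3));        // True
Console.WriteLine(comparer.Equals(null, null));  // False: nulls never match
Console.WriteLine(comparer.GetHashCode(null));   // 0: all nulls share a bucket

// Hypothetical comparer for the scenario in the question: non-null
// attributes compare by value, but a null never equals anything.
sealed class AttributeComparer : IEqualityComparer<int?>
{
    public bool Equals(int? x, int? y) =>
        x.HasValue && y.HasValue && x.Value == y.Value;

    public int GetHashCode(int? obj) =>
        obj.HasValue ? obj.Value.GetHashCode() : 0;
}
```

Note that Equals here is deliberately not reflexive for null (Equals(null, null) is false), which is exactly the design decision the question is about.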

That said, is it valid for GetHashCode to return a value known to collide with the hash codes of other values for which Equals returns false? I'm leaning towards yes, since the quote above leaves this case undefined.


Solution

It is absolutely necessary for almost all types for there to be two values v1 and v2 such that

v1.Equals(v2) == false
v1.GetHashCode() == v2.GetHashCode()

... or the equivalent with an IEqualityComparer<T>. The only cases where that isn't so are types with at most 2^32 distinct (non-equal) values. As soon as there are more values than that, the pigeonhole principle forces hash codes to be reused - there just aren't enough hash codes to go round!
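To make the pigeonhole principle concrete: long has 2^64 distinct values, but GetHashCode returns a 32-bit int. On the current .NET implementation, which XORs the two 32-bit halves of the value, a collision is easy to exhibit:

```csharp
using System;

long a = 1L;
long b = 1L << 32;  // 4294967296: clearly a different value

// Not equal, yet (on current .NET, which XORs the two 32-bit halves)
// both hash to 1 - collisions are unavoidable with 2^64 inputs.
Console.WriteLine(a == b);                              // False
Console.WriteLine(a.GetHashCode() == b.GetHashCode());  // True
```

The exact colliding pair is an implementation detail of long.GetHashCode, but *some* colliding pair must exist for any 64-bit type, by counting alone.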

Eric Lippert had a great blog post on hash codes which is well worth a read. Basically I think you've got the right ideas, but it's worth reinforcing them.

The issue of nulls is an interesting one, by the way. IEqualityComparer<T> allows GetHashCode to throw an exception, but I believe the built-in EqualityComparer<T> implementations never do. It sounds like you do have one problem, though: Equals should be reflexive, so a null value should be equal to itself. You may need to think about that one carefully... can you represent "different" null values?
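To see why broken reflexivity matters in practice: hash-based collections assume Equals(x, x) is true. A sketch (assuming a comparer where null never equals null, as in the question) shows a HashSet happily storing two nulls:

```csharp
using System;
using System.Collections.Generic;

var set = new HashSet<int?>(new NullsNeverEqual());
set.Add(null);
set.Add(null);  // not "equal" to the first null, so it is stored as well
Console.WriteLine(set.Count);  // 2: the set now holds duplicate nulls

// Hypothetical comparer with non-reflexive Equals: null != null.
sealed class NullsNeverEqual : IEqualityComparer<int?>
{
    public bool Equals(int? x, int? y) =>
        x.HasValue && y.HasValue && x.Value == y.Value;

    public int GetHashCode(int? obj) => obj ?? 0;
}
```

Whether that duplicate is a bug or the intended "each missing value is distinct" semantics depends on the classification scenario, but it shows the comparer no longer behaves like a set-membership test for nulls.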

OTHER TIPS

IMO: Since Equals is always the final arbiter of object equality, GetHashCode is only ever a shortcut for ruling out non-equal values. When GetHashCode returns identical values (regardless of whether the objects are actually equal), Equals will always be called to compare them. GetHashCode is expected to collide between non-equal values; I don't see anything ambiguous or undefined about this behavior.
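A quick way to see Equals acting as the final arbiter: give every key the same hash code and count how often the dictionary falls back on Equals (the comparer name is illustrative):

```csharp
using System;
using System.Collections.Generic;

var comparer = new ConstantHashComparer();
var dict = new Dictionary<string, int>(comparer)
{
    ["apple"] = 1,
    ["banana"] = 2,
};

Console.WriteLine(dict["banana"]);            // 2: still correct, just slower
Console.WriteLine(comparer.EqualsCalls > 0);  // True: Equals resolved the collisions

// Worst-case comparer: every key collides, so lookups degrade to a
// linear scan resolved entirely by Equals.
sealed class ConstantHashComparer : IEqualityComparer<string>
{
    public int EqualsCalls;  // how many times the dictionary consulted Equals

    public bool Equals(string x, string y)
    {
        EqualsCalls++;
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    public int GetHashCode(string obj) => 0;
}
```

Lookups still return the right answers because Equals breaks every tie; only performance suffers, which is the asker's expected trade-off for hashing all nulls to one value.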

You can return any value you want, as long as the condition you quote is satisfied. Otherwise, classes that depend on this condition will not work correctly.

For example, take a dictionary that is indexed by a case-insensitive key, and say your implementation of GetHashCode returns the character code of the first character. Then "A" and "a" are equal but have different hash values (65 and 97); in other words, you violate the rule. If you then do something like:

dict["A"] = "something";
Console.WriteLine(dict["a"]);

then the second line will likely fail with a KeyNotFoundException even though the keys "A" and "a" are equal.
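Filling in that sketch end to end (the class name is mine), the failure can be reproduced directly:

```csharp
using System;
using System.Collections.Generic;

var dict = new Dictionary<string, string>(new BrokenCaseInsensitiveComparer());
dict["A"] = "something";

// "a" hashes to 97 while "A" was stored under 65, so the lookup searches
// the wrong bucket and never even calls Equals against the stored entry.
Console.WriteLine(dict.TryGetValue("a", out _));  // False
// dict["a"] would throw KeyNotFoundException here, even though
// comparer.Equals("A", "a") returns true.

// The rule-violating comparer: case-insensitive Equals, case-SENSITIVE hash.
sealed class BrokenCaseInsensitiveComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        string.Equals(x, y, StringComparison.OrdinalIgnoreCase);

    public int GetHashCode(string obj) =>
        obj.Length == 0 ? 0 : obj[0];  // 'A' -> 65, 'a' -> 97
}
```

The fix is to make the hash match the equality semantics, e.g. hash char.ToUpperInvariant(obj[0]) (or better, use StringComparer.OrdinalIgnoreCase throughout).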

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow