Question

I have the following class

public class DateRange
{
    private DateTime startDate;
    private DateTime endDate;
    public override bool Equals(object obj)
    {
        DateRange other = (DateRange)obj;
        if (startDate != other.startDate)
            return false;
        if (endDate != other.endDate)
            return false;
        return true;
    }
    ...
}

I need to store some values in a dictionary keyed with a DateRange like:

Dictionary<DateRange, double> tddList;

How should I override the GetHashCode() method of DateRange class?

Solution

It depends on the values I'd expect it to be used with.

If it was most often going to have different day values, rather than different times on the same day, and they were within a century of now, I would use:

public override int GetHashCode()
{
    unchecked
    {
        // 4007 keeps the year sum small for dates near the present;
        // 367 is a prime just above the largest possible DayOfYear (366).
        int hash = startDate.Year + endDate.Year - 4007;
        hash *= 367 + startDate.DayOfYear;
        return hash * 367 + endDate.DayOfYear;
    }
}

This distributes the bits well for the expected values while reducing the number of bits lost in the shifting. Note that while there are cases where dependency on primes can be surprisingly bad at collisions (especially when the hash is fed into something that uses a modulo of the same prime in trying to avoid collisions when producing a yet-smaller hash to distribute among its buckets), I've opted for primes just above the more obvious choices, as they're still pretty "tight" for bit-distribution. I don't worry much about using the same prime twice, as they're so "tight" in this way, but it does hurt if you have a hash-based collection with 367 buckets. This deals well (though not as well) with dates far into the past or future, but it is dreadful if the assumption that there will be few or no ranges within the same day (differing only in time) is wrong, as that information is entirely lost.

If I was expecting lots of ranges within the same day (or writing for general use by other parties, and not able to assume otherwise), I'd go for:

int startHash = startDate.GetHashCode();
// Reverse the byte order of startHash before XORing, so equal start
// and end dates don't cancel each other out to zero.
return (((startHash >> 24) & 0x000000FF)
      | ((startHash >> 8)  & 0x0000FF00)
      | ((startHash << 8)  & 0x00FF0000)
      | unchecked((int)((startHash << 24) & 0xFF000000)))
      ^ endDate.GetHashCode();

Where the first method works on the assumption that the general-purpose GetHashCode in DateTime isn't as good as we want, this one depends on it being good, but mixes around the bits of one value.

It's good at dealing with the more obvious tricky cases, such as the two values being the same or a common distance from each other (e.g. lots of 1-day or 1-hour ranges). It's not as good in the cases where the first example works best, but the first one totally sucks if there are lots of ranges using the same day but different times.


Edit: To give a more detailed response to Dour's concern:

Dour points out, correctly, that some of the answers on this page lose data. The fact is, all of them lose data.

The class defined in the question has 8.96077483×10^37 different valid states (or 9.95641648×10^36 if we don't care about the DateTimeKind of each date), and the output of GetHashCode has 4294967296 possible states (one of which - zero - is also going to be used as the hashcode of a null value, which may be commonly compared with in real code). Whatever we do, we reduce information by a scale of 2.31815886×10^27. That's a lot of information we lost!
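
For the curious, those figures can be checked with a quick sketch (assuming DateTime's valid range of 0 to 3155378975999999999 ticks and three possible DateTimeKind values per date):

using System;

// Distinct tick values a DateTime can hold: DateTime.MaxValue.Ticks + 1.
double ticksPerDate = 3_155_378_976_000_000_000d;
double ignoringKind = ticksPerDate * ticksPerDate;   // ≈ 9.95641648×10^36 states
double withKind     = ignoringKind * 3 * 3;          // ≈ 8.96077483×10^37 states
double reduction    = ignoringKind / 4_294_967_296d; // ≈ 2.31815886×10^27
Console.WriteLine($"{ignoringKind:E8} {withKind:E8} {reduction:E8}");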

It's likely true that we lose more with some solutions than with others. Certainly, it's easy to prove some solutions can lose more than others by writing a valid, but really poor, answer.

(The worst possible valid solution is return 0;, which is valid as it never errors or mismatches on equal objects, but is as poor as possible since it collides for all values. The performance of a hash-based collection becomes O(n), and slow as O(n) goes, as the constants involved are higher than in such O(n) operations as searching an unordered list.)

It's difficult to measure just how much is lost. How much more does shifting some bits before XORing lose than swapping bits, considering that XOR halves the amount of information left? Even the naïve x ^ y doesn't lose more than a swap-and-xor; it just collides more on common values, while swap-and-xor will collide on values where plain-xor does not.

Once we've got a choice between solutions that are not losing much more information than the minimum possible, but returning 4294967296 (or close to 4294967296) possible values with a good distribution between those values, then the question is no longer how much information is lost (the answer being that only 4.31376821×10^-28 of the original information remains) but which information is lost.

This is why my first suggestion above ignores time components. There are 864000000000 "ticks" (the 100-nanosecond units DateTime has a resolution of) in a day, and I throw away two chunks of those ticks (7.46496×10^23 possible values between the two) on purpose, because I'm thinking of a scenario where that information is not used anyway. In this case I've deliberately structured the mechanism so as to pick which information gets lost; that improves the hash for a given situation, but makes it absolutely worthless if we had different values all with start and end dates falling on the same days but at different times.

Likewise x ^ y doesn't lose any more information than any of the others, but the information that it does lose is more likely to be significant than with other choices.

In the absence of any way to predict which information is likely to be of importance (esp. if your class will be public and its hash code used by external code), then we are more restricted in the assumptions we can safely make.

As a whole, prime-multiply or prime-modulo methods are better than shift-based methods in which information they lose, except when the same prime is used in a further hashing that may take place inside a hash-based collection, ironically with the same goal in mind (no number greater than 1 is relatively prime to itself, even a prime), in which case they are much worse. On the other hand, shift-based methods really fall down if fed into a shift-based further hash. There is no perfect hash for arbitrary data and arbitrary use (except when a class has few valid values and we match them all, in which case it's more strictly an encoding than a hash that we produce).
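
To illustrate that parenthetical with a hypothetical sketch: if a hash of the form x * 367 + y lands in a table with 367 buckets, the bucket chosen depends only on y, so everything x contributed is thrown away:

using System;

const int buckets = 367; // the same prime as used in the multiply

for (int x = 0; x < 5; x++)
{
    int hash = x * 367 + 42;           // varying x, fixed remainder
    Console.WriteLine(hash % buckets); // prints 42 every time
}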

In short, you're going to lose information whatever you do; it's which information you lose that's important.

OTHER TIPS

I use this approach from Effective Java for combining hashes:

unchecked
{
    int hash = 17;
    hash = hash * 31 + field1.GetHashCode();
    hash = hash * 31 + field2.GetHashCode();
    ...
    return hash;
}

There's no reason that shouldn't work fine in this situation.
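
Applied to the DateRange from the question, a minimal sketch of that pattern might look like this (field names taken from the question):

public override int GetHashCode()
{
    unchecked // overflow on the multiply is fine; we only want the bits
    {
        int hash = 17;
        hash = hash * 31 + startDate.GetHashCode();
        hash = hash * 31 + endDate.GetHashCode();
        return hash;
    }
}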

Well, consider what characteristics a good hash function should have. It must:

  • be in agreement with Equals - that is, if Equals is true for two objects then the two hash codes have to also be the same.
  • never crash

And it should:

  • be very fast
  • give different results for similar inputs

What I would do is come up with a very simple algorithm; say, taking 16 bits from the hash code of the first and 16 bits from the hash code of the second, and combining them together. Make yourself a test case of representative samples; date ranges that are likely to be actually used, and see if this algorithm does give a good distribution.
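
For example, a minimal sketch of that idea (which 16 bits to keep is an arbitrary choice here; test it against your representative samples):

public override int GetHashCode()
{
    // Keep the low 16 bits of each component's hash and pack them
    // into a single 32-bit value.
    int startBits = startDate.GetHashCode() & 0xFFFF;
    int endBits = endDate.GetHashCode() & 0xFFFF;
    return (startBits << 16) | endBits;
}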

A common choice is to xor the two hashes together. This is not necessarily a good idea for this type because it seems likely that someone will want to represent the zero-length range that goes from X to X. If you xor the hashes of two equal DateTimes you always get zero, which seems like a recipe for a lot of hash collisions.
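
A quick demonstration of that pitfall (with made-up values):

using System;

var start = new DateTime(2024, 1, 1, 9, 30, 0);
var end   = start; // a zero-length range from X to X

// Equal DateTimes have equal hash codes, so XOR cancels them to zero.
Console.WriteLine(start.GetHashCode() ^ end.GetHashCode()); // 0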

return startDate.GetHashCode() ^ endDate.GetHashCode();

might be a good start, but you have to shift one end of the range, otherwise two equal dates will hash to zero, a pretty common scenario I imagine:

return startDate.GetHashCode() ^ (endDate.GetHashCode() << 4);

You have to check that you get a good distribution when there is an equal distance between startDate and endDate, but different dates.
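
One hypothetical way to run that check (not from the original answer): generate many ranges with the same start-to-end distance and count how many distinct hashes come out:

using System;
using System.Linq;

var start = new DateTime(2020, 1, 1);

// 10,000 one-day ranges: equal distance, different dates.
int distinct = Enumerable.Range(0, 10_000)
    .Select(i =>
    {
        DateTime s = start.AddDays(i);
        DateTime e = s.AddDays(1);
        return s.GetHashCode() ^ (e.GetHashCode() << 4);
    })
    .Distinct()
    .Count();

Console.WriteLine($"{distinct} distinct hashes out of 10000");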

Licensed under: CC-BY-SA with attribution