Generate identical Hashcodes for approximately-similar numbers

https://stackoverflow.com/questions/2250659

20-09-2019
|

Question

I'm creating an application in C# 3.5 that uses the AutoCAD API to read a 2D AutoCAD drawing, make changes to the drawing using defined business logic, then adjust it back in AutoCAD. Due to the nature of the logic, the shape of the drawing has to be re-constructed - e.g. a rectangle is made up of 4 connecting straight lines.

I'm creating these shapes using the start and end co-ordinates of each line from AutoCAD, but some of the co-ordinates don't exactly match up. For example, one point could at 0.69912839 (on one axis), but a line starting from the same point could be 0.69990821. These are in mm so the distance is minute (0.00078mm!)

I've created my own class (call it MyPoint, similar to PointF) because I've needed to add some additional logic to it. In that class I've created a method that takes two doubles and returns true or false depending on if the two points are within 0.001mm of each other. I've then overridden the Equals method, == and != operators so I can do (point1 == point2 or point1.Equals(point2)) which checks if all axis are within 0.001mm of each other - if they are, I class it as being the same point.

That's fine and working brilliantly. Now, I need to check a collection of these point classes to get rid of all duplicates, so I'm using LINQ's Distinct() method on my collection. However this method uses GetHashcode(), not Equals() to determine if the instances are equal. So, I've overriden GetHashcode() which uses the GetHashcode of the double class.

But, the above example fails because obviously they're different values and therefore generate different hashcodes. Is there any way that two numbers that are within 0.001 of each other can generate the same hashcode? (Note the numbers don't know about each other as GetHashcode is called separately on different class instances.) I've tried numerous ways which work for some examples but not for others.

One example is truncating the number to 3dp (multiply it by 10^3, then truncate it) and creating the hashcode on the result - which works for the above example (699 == 699.) But this doesn't work for 0.69990821 and 0.70000120 (699 != 700.) I've tried rounding, which works for the second set of numbers (0.700 == 0.700) but not for the first (0.699 != 0.700.) I've even tried truncating the number to 3dp then adjusting it up to the next even number, which works for both the previous examples, but not for 12.9809 and 12.9818 (12980 != 12982.)

Is there another way, or should I scrap the Equals, ==, != and GetHashcode overrides, and create my own MyPoint.IsEqualTo() and MyPointCollection.Distinct() methods?

Solution

It would be easier just to remove the dependency on the Distinct method. Implement a System.Collections.IComparer (or the generic equivalent) and use a simple collection like a list. Then determine if the item is in the list with the comparer, and not add it if it already is contained.

OTHER TIPS

It's unable to write correct hashcode. let proof it: we have 2 points. var a = point1.GetHashCode(); var b = point2.GetHashCode();

if a!= b let create point between point1 and point2. and so on.

After such operations we will create line, where each point is near to some other point and their hashcodes will be the same. So hash code for point1 and point2 should be equals.

So overrtide like this:

public override int GetHashCode()
{
    return 0;
}

and implement you equals.

I think you should not override Equals(), ==, != or GetHashCode()

If you override any of these you should make sure that their semantics do not change. In your example they do.

For instance your == cannot be transitive for it it is then if P1 is 0.001 mm from P2, P2 is 0.001 mm from P3 and P1 is 0.002 mm from P3 then P1 == P2, P2 == P3 and P1 == P3, which is not what you want. In general you end up with all points being equal to all other points.

I would just stick to using a separate method for determining if points are close enough.

EDIT

With your override of == you can now write code like this:

if(P1 == P2 && P2 == P3 && P1 != P3)
{
    // Code here gets executed
}

I guess that if you return always the same hash (say 0), LinQ will try to compare all elements with equals. After all, a hash is useful to prove that two elements are distinct, not equal.

But anyway, I would recommend you to use better suited structures and algorithms for this domain, like Binary Splitting Partition (BSP) trees (for example).

This should be a clearer explanation of what Steck and Paolo said.

Suppose you did manage to write the GetHashCode method in the way you want.

Then for any points a and b, whatever the distance between them, a.GetHashCode() == b.GetHashCode().

Proof: suppose a < b. Split the distance between a and b into segments smaller than 0.001. I.e. a0 = a, a1 = a0 + 0.0005, a2 = a1 + 0.0005, etc. until you get to b.

Then a.GetHashCode() == a1.GetHashCode() == a2.GetHashCode() == ... == b.GetHashCode().

should I scrap the Equals, ==, != and GetHashcode overrides, and create my own MyPoint.IsEqualTo() and MyPointCollection.Distinct() methods?

Yes.

It needn't necessarily be some completely different data structure, though. In checking for a duplicate, you need to check the neighboring hashcodes, e.g. the hashes for (x+0.001,y), (x,y-0.001), and so on. This makes for just a constant-factor slowdown compared to the usual deduplication lookup, and it's not complicated, so it might be the way to go. (An obvious point, but I don't see it made explicitly here yet.)

Update: To clarify, let's look at a 1-dimensional version of the problem. The 'points' are single numbers, x. We consider x1 and x2 to match when abs(x1 - x2) < .001. Now we want to find out if x matches any of {x_0, ..., x_n}. The x_i are kept in a hashtable, where hash(x) = h(floor(1000*x)) for some function h() that spreads things around. To see if x is already in the table, we compute hash(x-.001), hash(x), and hash(x+.001), then we test if x matches any of the x_i in any of the three buckets. Any matching x_i cannot be in any other bucket.

In the 2-d variant, there are 9 neighboring buckets to check (counting the middle); in 3-d, 27.

Here's some code to show what I'm doing. Each pair of numbers in "original" should return the same value.

int tolerance = 3;
double[] original = new double[] {
0.69912839,
0.69990821,

0.69990821,
0.70000120,

12.980984087,
12.981808908
};
double[] modified = new double[original.Length];

for (int i = 0; i < original.Length; i++)
{
modified[i] = original[i];

/* Begin number adjustment logic */
modified[i] *= Math.Pow(10, tolerance);
modified[i] = Math.Truncate(modified[i]);

if (modified[i] % 2 != 0)
{
modified[i]++;
}
/* End number adjustment logic */

Console.WriteLine(modified[i]);

if (i % 2 != 0)
{
Console.WriteLine(string.Empty);
}
}

That above method is the "truncate to 3dp then adjust to the nearest even" method. Next is the truncate method (replace code between the begin/end comments):

/* Begin number adjustment logic */
modified[i] *= Math.Pow(10, tolerance);
modified[i] = Math.Truncate(modified[i]);
/* End number adjustment logic */

This is the round method:

/* Begin number adjustment logic */
modified[i] = Math.Round(modified[i], tolerance);
/* End number adjustment logic */

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow