Question

We're using the EuclideanDistanceSimilarity class to calculate the similarity of a bunch of items using Hadoop.

Unfortunately some items are getting zero or very few resulting similar items despite being highly similar to items.

I think I've tracked it down to this line in the EuclideanDistanceSimilarity class:

double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);

The value passed to sqrt is sometimes negative, in which case NaN is returned. I figure perhaps there should be a Math.abs in there somewhere but my maths aren't strong enough to understand how the Euclidean calculation has been rearranged so not sure what the effect would be.

Can anyone explain the maths any better and confirm whether

double euclideanDistance = Math.sqrt(Math.abs(normA - 2 * dots + normB));

would be an acceptable fix?

Was it helpful?

Solution

The code is in org.apache.mahout.math.hadoop.similarity.cooccurrence.measures. EuclideanDistanceSimilarity.

Yes it's written in this way because at that point in the computation it has the norms of vectors A and B, and their dot product, so it's much faster to compute the distance that way.

The identity is pretty easy. Let C = A - B and let a, b and c be the lengths of the corresponding vectors. We need c. From the law of cosines, c2 = a2 + b2 - 2ab·cos(θ), and ab·cos(θ) is just the value of the dot product. Note that normA in the code is actually the square of the norm (length) -- really should have been better named.

Back to the question: you are right there's a bug here, in that rounding can make the argument negative. The fix isn't abs(), but:

double euclideanDistance = Math.sqrt(Math.max(0.0, normA - 2 * dots + normB));

It just needs to be capped to 0. I can commit that.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top