Question

I am new to Mahout, and have been lately transforming a lot of my previous machine learning code to this framework. In many places, I am using cosine similarity between vectors for clustering, classification, etc. Investigating Mahout's distance method, however, gave me quite a surprise. In the following code snippet, the dimension and the float values are taken from an actual output of one of my programs (not that it matters here):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.common.distance.CosineDistanceMeasure;

public class CosineDistanceDemo {
    public static void main(String[] args) {
        // Two sparse vectors with no non-zero index in common
        RandomAccessSparseVector u = new RandomAccessSparseVector(373);
        RandomAccessSparseVector v = new RandomAccessSparseVector(373);
        u.set(24, 0.4526985183337534);
        u.set(55, 0.5333219834564495);
        u.set(54, 0.5333219834564495);
        u.set(53, 0.4756042214095471);

        v.set(57, 0.6653016370845252);
        v.set(56, 0.6653016370845252);
        v.set(11, 0.3387439495921685);

        CosineDistanceMeasure cosineDistanceMeasure = new CosineDistanceMeasure();
        System.out.println(cosineDistanceMeasure.distance(u, v));
    }
}

The output is 1.0. Shouldn't it be 0.0?

Comparing this with the output of cosineDistanceMeasure.distance(u, u), I realize that what I am looking for is 1 - cosineDistanceMeasure.distance(u, v). But this reversal just doesn't make sense to me. Any idea why it was implemented this way? Or am I missing something very obvious?


Solution

When two points are "close", the angle they form when viewed as vectors from the origin is small, near zero. The cosine of angles near zero is near 1, and the cosine decreases as the angle goes towards 90 and then 180 degrees.

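So the cosine decreases as the distance increases. This is why the cosine of the angle between two vectors can't itself serve as a distance: a distance should be 0 for vectors pointing the same way and grow as they diverge. The 'canonical' way to turn it into a distance is 1 - cosine, which is what CosineDistanceMeasure returns. In your example, u and v have no non-zero index in common, so their dot product and therefore their cosine is 0, and the distance is 1.0. (Strictly speaking, 1 - cosine does not satisfy the triangle inequality, so it is not a metric in the mathematical sense; the angle between the vectors does.)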
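To make the relationship concrete, here is a minimal sketch in plain Java, without Mahout, that computes both the cosine similarity and 1 - cosine for the vectors from the question. The class CosineSketch and its methods are hypothetical names made up for this illustration, and the sparse vectors are modelled as index-to-value maps; this is not Mahout's implementation, only the arithmetic it is assumed to correspond to.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout.
final class CosineSketch {

    // Dot product of two sparse vectors stored as index -> value maps.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Cosine of the angle between a and b: near 1 when they point the same way.
    static double cosineSimilarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        double denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        if (denom == 0.0) {
            // Assumption of this sketch: reject zero vectors instead of picking a value.
            throw new IllegalArgumentException("cosine is undefined for a zero vector");
        }
        return dot(a, b) / denom;
    }

    // 1 - cosine: near 0 for vectors pointing the same way, 1 for orthogonal ones.
    static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        Map<Integer, Double> u = new HashMap<>();
        u.put(24, 0.4526985183337534);
        u.put(55, 0.5333219834564495);
        u.put(54, 0.5333219834564495);
        u.put(53, 0.4756042214095471);

        Map<Integer, Double> v = new HashMap<>();
        v.put(57, 0.6653016370845252);
        v.put(56, 0.6653016370845252);
        v.put(11, 0.3387439495921685);

        // u and v share no index, so the dot product is 0: similarity 0, distance 1.
        System.out.println(cosineSimilarity(u, v)); // prints 0.0
        System.out.println(cosineDistance(u, v));   // prints 1.0
        // A vector compared with itself: distance is 0 up to floating-point rounding.
        System.out.println(cosineDistance(u, u));
    }
}
```

If you want the similarity rather than the distance, 1 - distance(u, v) recovers it, which is the reversal you noticed.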

Licensed under: CC-BY-SA with attribution