Cosine similarity result above one

https://stackoverflow.com/questions/16903048

30-05-2022

Question

I am implementing cosine similarity in PHP. Sometimes the formula gives a result above 1. To derive an angle in degrees from this number using the inverse cosine, it needs to be between 0 and 1.

I know that I don't need the angle, since the closer the value is to 1, the more similar the vectors are, and the closer to 0, the less similar.

However, I don't know what to make of a number above 1. Does it just mean it is totally dissimilar? Is 2 less similar than 0?

Could you say that the order of similarity goes roughly like this?

Between 0 and 1: the closer to 1, the more similar. Above 1: the further from 1, the less similar.

Thank you!

My code, as requested, is:

$norm1 = 0;
foreach ($dict1 as $value) {
    $valuesq = $value * $value;
    $norm1 = $norm1 + $valuesq;
}
$norm1 = sqrt($norm1);
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product)/($norm1*$norm2);

To give you an idea of the kinds of values I'm getting:

0.9076645291077

2.0680991116095

1.4015600717928

1.0377360186767

1.8563586243689

1.0349674872379

1.2083865384822

2.3000034036913

0.84280491429133

Solution

Your formula is correct, but the code you posted only shows $norm1 being computed, so I suspect something is going wrong in how the norms are calculated. It works if you move the norm calculation into its own function and call it for both vectors:

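<?php
// Euclidean (L2) norm: square root of the sum of squared components.
function calc_norm($arr) {
    $norm = 0;
    foreach ($arr as $value) {
        $norm += $value * $value;
    }
    return sqrt($norm);
}

$dict1 = array(5, 0, 97);
$dict2 = array(300, 2, 124);

// bcmul() needs the BCMath extension and, at the default scale of 0, drops
// fractional digits; it is exact here only because the inputs are integers.
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
// Divide by the norms of BOTH vectors so the result stays within [-1, 1].
$cospheta = $dot_product / (calc_norm($dict1) * calc_norm($dict2));

print_r($cospheta);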
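
If the vectors can hold non-integer weights, bcmul at its default scale would truncate the products, so a variant with ordinary float arithmetic may be safer. The following is only a sketch under that assumption: cosine_similarity is a hypothetical helper name, not part of the original answer, and it also assumes both arrays list their components in the same order.

```php
<?php
// Hypothetical helper (not from the original answer): cosine similarity of
// two equal-length numeric arrays using plain float multiplication.
function cosine_similarity(array $a, array $b): float
{
    // Ignore keys and compare components by position.
    $a = array_values($a);
    $b = array_values($b);
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same length.');
    }

    $dot = 0.0;
    $sumSqA = 0.0;
    $sumSqB = 0.0;
    for ($i = 0, $n = count($a); $i < $n; $i++) {
        $dot    += $a[$i] * $b[$i];
        $sumSqA += $a[$i] * $a[$i];
        $sumSqB += $b[$i] * $b[$i];
    }

    // The cosine is undefined when either vector has zero length.
    if ($sumSqA == 0.0 || $sumSqB == 0.0) {
        throw new InvalidArgumentException('Cosine similarity is undefined for a zero vector.');
    }

    // Both norms appear in the denominator.
    return $dot / (sqrt($sumSqA) * sqrt($sumSqB));
}

$dict1 = array(5, 0, 97);
$dict2 = array(300, 2, 124);
echo cosine_similarity($dict1, $dict2), "\n";
```

Because both norms are computed inside the same function, it is harder to forget one of them.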

OTHER TIPS

I may be missing something, but I think you are not applying the sum of squares and the square root to the values in $dict2 (the query, I assume).

If you do not normalise by the query's norm, you can get results greater than one. However, this is sometimes done deliberately: for a fixed query the scores are proportional to the true cosine, so the ranking is the same, and it is quicker to compute.
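
To illustrate the ranking point, here is a small sketch with made-up vectors. The names vector_norm, dot_product, $query and $docs are hypothetical and chosen only for this example. It scores three documents against one query twice, once dividing by both norms and once skipping the query norm, then compares the resulting order.

```php
<?php
// Hypothetical helpers for the illustration below.
function vector_norm(array $v): float
{
    $sum = 0.0;
    foreach ($v as $x) {
        $sum += $x * $x;
    }
    return sqrt($sum);
}

function dot_product(array $a, array $b): float
{
    return (float) array_sum(array_map(function ($x, $y) {
        return $x * $y;
    }, $a, $b));
}

// Made-up data: one query and three documents.
$query = array(3, 4);
$docs = array(
    'a' => array(3, 4),
    'b' => array(4, 3),
    'c' => array(1, 0),
);

$full = array();     // divided by both norms: the true cosine
$partial = array();  // query norm skipped: can exceed 1
foreach ($docs as $name => $doc) {
    $partial[$name] = dot_product($query, $doc) / vector_norm($doc);
    $full[$name] = $partial[$name] / vector_norm($query);
}

// Sort both score lists from highest to lowest, keeping the document names.
arsort($full);
arsort($partial);

print_r(array_keys($full));
print_r(array_keys($partial));
```

The partially normalised scores differ from the true cosines only by the constant factor of the query norm, so both lists come out in the same order. This only holds while the query stays the same; scores from different queries are not comparable without the full normalisation.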

I hope this helps.

Floating-point arithmetic is not exact: values are stored in binary, so a result that should be exactly 1 can come out marginally above it. In that case you can probably just round it down to 1. The same goes for results slightly below zero, which can be rounded up to 0.
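
One way to do that rounding before taking the inverse cosine is to clamp the value, as in the sketch below. The function name angle_in_degrees and the tolerance of 1e-9 are assumptions made for this example. It clamps to [-1, 1], the full range of a cosine; if your vectors have no negative components, you could clamp the lower end to 0 instead. Values that are out of range by more than the tolerance are rejected rather than clamped, since an error that large points to a bug in the calculation and not to rounding.

```php
<?php
// Hypothetical helper: convert a cosine value to an angle in degrees,
// tolerating tiny floating-point overshoot beyond [-1, 1].
function angle_in_degrees(float $cosine, float $tolerance = 1e-9): float
{
    // An error larger than the tolerance is not rounding noise.
    if ($cosine > 1.0 + $tolerance || $cosine < -1.0 - $tolerance) {
        throw new RangeException('Cosine value is outside [-1, 1].');
    }

    // Clamp so that acos() never receives an out-of-range argument.
    $clamped = max(-1.0, min(1.0, $cosine));

    return rad2deg(acos($clamped));
}

echo angle_in_degrees(1.0000000000000002), "\n"; // tiny overshoot, treated as 1
echo angle_in_degrees(0.0), "\n";                // perpendicular vectors
```

A value such as 2.0680991116095 from the question would throw here instead of being silently turned into 1.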

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow