Question

While computing the similarity between words, cosine similarity (or cosine distance) is computed on the word vectors. Why aren't other distance metrics, such as Euclidean distance, suitable for this task?

Let us consider two vectors a = [-1, 2, -3] and b = [-3, 6, -9]. Here b = 3*a, i.e., both vectors point in the same direction but have different magnitudes. The cosine similarity between a and b is 1, indicating that they point in the same direction, while the Euclidean distance between them is about 7.48.

Does this mean the magnitude of the vectors is irrelevant for computing similarity between word vectors?

Solution

You're asking two questions here.

  1. Does this mean the magnitude of the vectors is irrelevant?

Yes. Cosine similarity is $ S_{\cos} = \frac{A \cdot B}{\|A\|\|B\|} = \cos\theta $, which comes directly from the definition of the inner product, $A \cdot B = \|A\|\|B\|\cos\theta$. The norms in the denominator cancel out the magnitudes, so only the angle between the vectors matters. (The corresponding cosine distance is $1 - S_{\cos}$.)
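To make that concrete, here is a minimal NumPy sketch (the vectors and the helper name are just for illustration) showing that scaling a vector leaves its cosine similarity unchanged while its Euclidean distance does change:

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of norms: the magnitudes cancel.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([-1.0, 2.0, -3.0])
b = 3 * a  # same direction, three times the magnitude

print(cosine_similarity(a, b))        # 1.0 (up to floating-point error)
print(np.linalg.norm(a - b))          # ~7.483, the Euclidean distance
print(cosine_similarity(a, 100 * a))  # still 1.0: scaling never changes it
```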

  2. Why is cosine similarity used? Or, to think of it another way, why is the answer to (1) a desirable property for a similarity measure to have?

In a word embedding, we choose a dimensionality $d$ for the embedding. This is the number of components in our embedding space. The components (or linear combinations of them) are meant to encode some kind of semantic meaning. The classic example is that the vector for "king" minus the vector for "man" plus the vector for "woman" should be near the vector for "queen". That sort of thing. There's a direction that roughly corresponds to "royalty" and a direction for gender.
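Here is a toy sketch of that idea with invented 2-dimensional embeddings (one axis for "royalty", one for gender; the numbers are made up purely for illustration, real embeddings have hundreds of learned dimensions):

```python
import numpy as np

# Invented 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
emb = {
    "king":  np.array([0.9,  0.7]),
    "queen": np.array([0.9, -0.7]),
    "man":   np.array([0.1,  0.7]),
    "woman": np.array([0.1, -0.7]),
}

analogy = emb["king"] - emb["man"] + emb["woman"]
print(analogy)  # [0.9, -0.7], which matches emb["queen"]
```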

Look at your example, where $b = 3a$, with $a = [-1, 2, -3]$ and $b = [-3, 6, -9]$. This is a perfect illustration of why we use cosine similarity. The two vectors have very different magnitudes but point in the same direction. Their cosine similarity is 1, and we want that, because it means they have the same relative proportion of each component.

If we use Euclidean distance, $a$ and $b$ are $\sim 7.48$ units apart. It would be easy to find another vector $c$ that is about the same distance from $a$ as $b$ is, but in a completely different direction. If our space has been learned properly, $c$ should have a completely different semantic meaning from $b$, yet both are the same distance from $a$. Euclidean distance simply doesn't measure the kind of similarity we want.
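As a quick check of that claim, here is a small sketch (the construction of $c$ is arbitrary, chosen only for illustration) that builds a vector $c$ exactly as far from $a$ as $b$ is, but pointing in a very different direction:

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([-1.0, 2.0, -3.0])
b = 3 * a

# Step the same Euclidean distance away from a, but along a direction
# orthogonal to a instead of along a itself.
step = np.linalg.norm(b - a)          # ~7.483
ortho = np.array([2.0, 1.0, 0.0])     # orthogonal to a: (-1)*2 + 2*1 + (-3)*0 = 0
c = a + step * ortho / np.linalg.norm(ortho)

print(np.linalg.norm(c - a))          # ~7.483, same distance as ||b - a||
print(cosine_similarity(a, b))        # 1.0: same direction as a
print(cosine_similarity(a, c))        # ~0.447: a very different direction
```

So $b$ and $c$ are indistinguishable by Euclidean distance from $a$, while cosine similarity cleanly separates them.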

Licensed under: CC-BY-SA with attribution