Question

I'm clustering text documents using tf-idf and cosine similarity. However, there's something I don't really understand even though I'm using these measures: do the tf-idf weights affect the similarity calculation between two documents?

Suppose I have these two documents:

1- High trees.

2- High trees High trees High trees High trees.

Then the similarity between the two documents will be 1, even though the tf-idf vectors of the two documents are different: the second document should normally have higher weights for the terms than the first.

Suppose the weights for the two vectors are (just suppose):

v1(1.0, 1.0)

v2(5.0, 8.0)

calculating the cosine similarity gives 1.0.

Here is a sketch of two vectors that share the same terms but have different weights.

There's an obvious angle between the vectors, so the weights should play a role!

This raises the question: where do the tf-idf weights play a role in the similarity calculation? Because what I understood so far is that the similarity only cares about the presence and absence of terms.

Solution 2

I think you are mixing two different concepts here.

  1. Cosine similarity measures the angle between two different vectors in a Euclidean space, independently of how the weights have been calculated.

  2. TF-IDF determines, for each term in a document and a given collection, the weight of the corresponding component of that document's vector; the resulting vectors can then be used for cosine similarity (among other things).

I hope this helps.

OTHER TIPS

First off, your calculations are flawed. The cosine similarity between (1, 1) and (5, 8) is

(1*5 + 1*8) / (||(1, 1)|| * ||(5, 8)||)
= 13 / (1.4142 * 9.434)
≈ 0.97

where ||x|| is the Euclidean norm of x.
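As a sanity check, this computation can be reproduced in a few lines of Python (the helper function name here is my own):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The question's (1, 1) vs (5, 8) example: not 1.0.
print(round(cosine_similarity([1, 1], [5, 8]), 2))    # 0.97

# The "High trees" example: (1, 1) vs (4, 4) really is 1.0,
# because the two vectors point in the same direction.
print(round(cosine_similarity([1, 1], [4, 4]), 10))   # 1.0
```

This also explains the question's first example: repeating "High trees" four times scales the vector from (1, 1) to (4, 4) without changing its direction, so the similarity stays exactly 1.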

Because what I understood so far is that the similarity only cares about the presence and absence of terms.

That's not true. Consider

d1 = "hello world"
d2 = "hello world hello"

with tf vectors (no idf here)

v1 = [1, 1]
v2 = [2, 1]

The cosine similarity is 0.95, not 1.

Idf can have a further effect. Suppose we add

d3 = "hello"

then df("hello") = 3 and df("world") = 2, and the tf-idf vectors for d1, d2 become

v1' = [ 1.        ,  1.28768207]
v2' = [ 2.        ,  1.28768207]

with a slightly smaller cosine similarity of 0.94.

(Tf-idf and cosine similarities computed with scikit-learn; other packages may give different numbers due to the different varieties of tf-idf in use.)
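For reference, these numbers can be reproduced without scikit-learn; the sketch below hand-codes scikit-learn's default smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, without the l2 normalization TfidfVectorizer applies by default (function names are my own):

```python
import math

def tfidf_smooth(docs):
    """Tf-idf weights using scikit-learn's default smoothed idf:
    weight(t, d) = tf(t, d) * (ln((1 + n_docs) / (1 + df(t))) + 1),
    with no l2 normalization (i.e. TfidfVectorizer(norm=None))."""
    vocab = sorted({t for d in docs for t in d.split()})
    n = len(docs)
    df = {t: sum(t in d.split() for d in docs) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    return [[d.split().count(t) * idf[t] for t in vocab] for d in docs]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

vecs = tfidf_smooth(["hello world", "hello world hello", "hello"])
print([round(x, 4) for x in vecs[0]])      # [1.0, 1.2877]
print([round(x, 4) for x in vecs[1]])      # [2.0, 1.2877]
print(round(cosine(vecs[0], vecs[1]), 2))  # 0.94
```

The rare term "world" gets a higher idf (ln(4/3) + 1 ≈ 1.2877) than the ubiquitous "hello" (ln(4/4) + 1 = 1), which is exactly why the similarity drops slightly from 0.95 to 0.94.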

See my reply to this question, and also the question

Python: tf-idf-cosine: to find document similarity

Basically, if you want to use both tf-idf and cosine similarity, you compute the tf-idf vectors and then apply cosine similarity to them to get the final result. In other words, you apply cosine similarity (here, the dot product of the tf-idf vectors, divided by their norms) to the tf-idf scores.
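A minimal sketch of that pipeline, assuming scikit-learn is installed, using the same three "hello world" documents as above:

```python
# Assumes scikit-learn is installed (pip install scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["hello world", "hello world hello", "hello"]

# norm=None keeps the raw tf-idf weights; cosine similarity is
# unaffected by the (default) l2 normalization anyway.
tfidf = TfidfVectorizer(norm=None).fit_transform(docs)

sims = cosine_similarity(tfidf)  # pairwise similarity matrix
print(round(sims[0, 1], 2))      # 0.94
```

Since cosine similarity normalizes by vector length itself, leaving TfidfVectorizer's default norm="l2" in place gives the same similarity values.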

The answer also includes three tutorials you can refer to; they explain how this works.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow