First off, your calculations are flawed. The cosine similarity between (1, 1) and (5, 8) is

(1*5 + 1*8) / (||(1, 1)|| * ||(5, 8)||)
= 13 / (1.4142 * 9.434)
≈ 0.97

where ||x|| is the Euclidean norm of x.
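This is easy to check numerically; here is a quick NumPy sketch (the `cosine` helper is just for illustration, not a library function):

```python
import numpy as np

def cosine(a, b):
    # dot product divided by the product of the Euclidean norms
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine([1, 1], [5, 8]), 2))  # 0.97
```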
As for "the similarity here only cares about the presence and absence of the terms": that is not true. Consider
d1 = "hello world"
d2 = "hello world hello"
with tf vectors over the vocabulary (hello, world), no idf applied:
v1 = [1, 1]
v2 = [2, 1]
The cosine similarity is 3 / (sqrt(2) * sqrt(5)) ≈ 0.95, not 1.
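Verifying that with NumPy (just the raw tf vectors, no weighting):

```python
import numpy as np

v1 = np.array([1.0, 1.0])  # tf vector for d1 = "hello world"
v2 = np.array([2.0, 1.0])  # tf vector for d2 = "hello world hello"

cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(cos, 2))  # 0.95
```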
Idf can have a further effect. Suppose we add
d3 = "hello"
then df("hello") = 3 and df("world") = 2, and the tf-idf vectors for d1 and d2 become
v1' = [1.0, 1.28768207]
v2' = [2.0, 1.28768207]
with a slightly smaller cosine similarity of 0.94.
(Tf-idf and cosine similarities computed with scikit-learn; other packages may give different numbers, since several variants of tf-idf are in use.)
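The numbers above can be reproduced in plain NumPy by applying the smoothed idf that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents:

```python
import numpy as np

n = 3                  # documents: d1, d2, d3
df = np.array([3, 2])  # df("hello") = 3, df("world") = 2

# smoothed idf, the scikit-learn default (smooth_idf=True)
idf = np.log((1 + n) / (1 + df)) + 1

tf1 = np.array([1, 1])  # d1 = "hello world"
tf2 = np.array([2, 1])  # d2 = "hello world hello"
v1, v2 = tf1 * idf, tf2 * idf  # [1.0, 1.28768207] and [2.0, 1.28768207]

cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(cos, 2))  # 0.94
```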