Question

A common way of calculating the cosine similarity between text-based documents is to compute TF-IDF and then take the linear kernel of the TF-IDF matrix.

The TF-IDF matrix is calculated using TfidfVectorizer():

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix_content = tfidf.fit_transform(article_master['stemmed_content'])

Here article_master is a dataframe containing the text content of all the documents.
As explained by Chris Clark here, TfidfVectorizer produces normalised vectors; hence the linear_kernel results can be used directly as cosine similarity.

from sklearn.metrics.pairwise import linear_kernel
cosine_sim_content = linear_kernel(tfidf_matrix_content, tfidf_matrix_content)
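
As a sanity check (a minimal sketch, reusing tfidf_matrix_content from above), the linear kernel result can be compared against sklearn's own cosine_similarity; since TfidfVectorizer L2-normalises rows by default (norm='l2'), the two should agree:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# linear kernel = plain inner products between the TF-IDF rows
cos_via_kernel = linear_kernel(tfidf_matrix_content, tfidf_matrix_content)
# explicit cosine similarity for comparison
cos_direct = cosine_similarity(tfidf_matrix_content, tfidf_matrix_content)
print(np.allclose(cos_via_kernel, cos_direct))  # expected: True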


This is where my confusion lies.

Effectively, the cosine similarity between two vectors is:

InnerProduct(vec1,vec2) / (VectorSize(vec1) * VectorSize(vec2))

The linear kernel calculates only the InnerProduct, as stated here:

Linear kernel: k(x, y) = x^T · y
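
To make both quantities concrete, here is a small NumPy sketch (vec1 and vec2 are arbitrary example vectors, not taken from the documents above):

import numpy as np

vec1 = np.array([1.0, 2.0, 0.0])
vec2 = np.array([2.0, 1.0, 1.0])

inner = np.dot(vec1, vec2)                                       # linear kernel: just the inner product
cosine = inner / (np.linalg.norm(vec1) * np.linalg.norm(vec2))   # cosine similarity: inner product / product of magnitudes
print(inner, cosine)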

So the questions are:

  1. Why am I not dividing the inner product by the product of the magnitudes of the vectors?

  2. Why does the normalisation exempt me from this requirement?

  3. Now, if I wanted to calculate TS-SS similarity, could I still use the normalised TF-IDF matrix and the cosine values (calculated by the linear kernel only)?

No correct solution

Licensed under: CC-BY-SA with attribution