Question

A common way of calculating the cosine similarity between text-based documents is to compute TF-IDF and then take the linear kernel of the TF-IDF matrix.

The TF-IDF matrix is calculated using TfidfVectorizer():

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix_content = tfidf.fit_transform(article_master['stemmed_content'])

Here article_master is a dataframe containing the text content of all the documents.
As explained by Chris Clark here, TfidfVectorizer produces normalised vectors; hence the linear_kernel results can be used directly as cosine similarity.

from sklearn.metrics.pairwise import linear_kernel
cosine_sim_content = linear_kernel(tfidf_matrix_content, tfidf_matrix_content)
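
As a sanity check (a minimal sketch, reusing tfidf_matrix_content from above), the linear kernel result can be compared against sklearn's own cosine_similarity; since TfidfVectorizer L2-normalises rows by default (norm='l2'), the two should agree:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# linear kernel = plain inner products between the TF-IDF rows
cos_via_kernel = linear_kernel(tfidf_matrix_content, tfidf_matrix_content)
# explicit cosine similarity for comparison
cos_direct = cosine_similarity(tfidf_matrix_content, tfidf_matrix_content)
print(np.allclose(cos_via_kernel, cos_direct))  # expected: True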


This is where my confusion lies.

Effectively, the cosine similarity between two vectors is:

InnerProduct(vec1,vec2) / (VectorSize(vec1) * VectorSize(vec2))

The linear kernel calculates only the InnerProduct, as stated here:

Linear kernel: k(x, y) = x^T · y
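
To make both quantities concrete, here is a small NumPy sketch (vec1 and vec2 are arbitrary example vectors, not taken from the documents above):

import numpy as np

vec1 = np.array([1.0, 2.0, 0.0])
vec2 = np.array([2.0, 1.0, 1.0])

inner = np.dot(vec1, vec2)                                       # linear kernel: just the inner product
cosine = inner / (np.linalg.norm(vec1) * np.linalg.norm(vec2))   # cosine similarity: inner product / product of magnitudes
print(inner, cosine)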

So the questions are:

  1. Why am I not dividing the inner product by the product of the magnitudes of the vectors?

  2. Why does the normalisation exempt me from this requirement?

  3. Now, if I wanted to calculate TS-SS similarity, could I still use the normalised TF-IDF matrix and the cosine values (calculated by the linear kernel only)?

No correct solution

Licensed under: CC-BY-SA with attribution