Question

I want to compute the semantic similarity of two words using cosine similarity over TF-IDF vectors. To do that, I first want to fetch the meaning of each word from Wikipedia or WordNet. After that I want to pre-process the text and compute the TF-IDF. When I googled the problem, I found that computing TF-IDF requires a train set and a test set. In my case, which one is the train set and which one is the test set? And how can I calculate the cosine similarity from the computed result?


Solution

The training phase consists of computing the TF-IDF weights. Each weight is based on how frequently a given word occurs in a document versus how it occurs across all documents. Once you have all the weights, each document has been turned into a vector with N entries, one weight per word in the vocabulary.
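As a rough illustration of that weighting step, here is a minimal sketch in plain Python. It assumes the documents are already tokenized and uses the basic tf * log(N / df) formula; the function name `tfidf_vectors` and the two sample sentences are made up for this example, and real libraries usually apply smoothed variants of the formula.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Hypothetical helper for illustration; uses tf * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors

# Sample data (an assumption, not from the question).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Note that with this unsmoothed formula, a term that appears in every document gets weight 0. With only two documents, all shared terms drop out, so in practice you would probably want a larger collection of documents or a smoothed IDF.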

Now, given two documents i and j, you calculate their similarity with the cosine function. The cosine similarity of two vectors is their dot product divided by the product of their magnitudes.
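The calculation can be sketched in a few lines of plain Python. The function name `cosine_similarity` and the two sample vectors are assumptions for illustration; the vectors stand in for TF-IDF weights of documents i and j over the same vocabulary.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention chosen here: an all-zero vector has similarity 0.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up weight vectors for documents i and j.
doc_i = [0.5, 0.0, 0.2]
doc_j = [0.4, 0.1, 0.0]
similarity = cosine_similarity(doc_i, doc_j)
```

Since TF-IDF weights are non-negative, the result falls between 0 (no shared weighted terms) and 1 (vectors pointing in the same direction).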

Licensed under: CC-BY-SA with attribution