Question

I have trained a word2vec model on a corpus of documents. I then compute the term frequency (the same TF as in TF-IDF) of each word in each document, multiply each word's TF by its corresponding word vector (this is the weighted part), and sum these weighted vectors element-wise to obtain a single vector for the document.

Is this method valid?

An example, to make it clearer. Take the document:

"The car drives on the road"

The TF for each word would be {'the': 0.3333, 'car': 0.1666, 'drives': 0.1666, 'on': 0.1666, 'road': 0.1666} (obtained by dividing each word's count by the total number of words in the document). Given a trained word2vec model, we can then compute:
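The term-frequency step can be sketched in a few lines; `term_frequencies` is a hypothetical helper name, not part of any library:

```python
from collections import Counter

def term_frequencies(document):
    """Relative term frequency: each word's count divided by total word count."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tf = term_frequencies("The car drives on the road")
# 'the' appears 2 of 6 times -> 1/3; every other word 1 of 6 -> 1/6
```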

$$0.3333*\begin{bmatrix} the_0 \\ the_1 \\ \vdots \\ the_n \end{bmatrix} + 0.1666*\begin{bmatrix} car_0 \\ car_1 \\ \vdots \\ car_n \end{bmatrix} + \dots$$

where each column vector is the word vector for that word. The final result is an $n \times 1$ vector representing the document:

$$ \begin{bmatrix} 0.3333*the_0 + 0.1666*car_0 + \dots \\ 0.3333*the_1 + 0.1666*car_1 + \dots \\ \vdots \\ 0.3333*the_n + 0.1666*car_n + \dots \end{bmatrix} $$
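The weighted sum above can be sketched as follows. The small `word_vectors` dict is a stand-in for a trained word2vec model (in practice this would be, e.g., a gensim key-to-vector lookup); the 3-dimensional vectors are made-up illustrative values:

```python
import numpy as np

# Stand-in for a trained word2vec model: word -> n-dimensional vector.
# These vectors are invented purely for illustration.
word_vectors = {
    "the":    np.array([0.1, 0.4, -0.2]),
    "car":    np.array([0.7, -0.1, 0.3]),
    "drives": np.array([0.5, 0.2, 0.1]),
    "on":     np.array([0.0, 0.3, -0.4]),
    "road":   np.array([0.6, -0.2, 0.2]),
}

def document_vector(tf, word_vectors):
    """TF-weighted, element-wise sum of the word vectors in a document."""
    return sum(weight * word_vectors[word] for word, weight in tf.items())

tf = {"the": 2/6, "car": 1/6, "drives": 1/6, "on": 1/6, "road": 1/6}
doc_vec = document_vector(tf, word_vectors)  # one vector per document
```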

I appreciate there are other methods, such as doc2vec, that aim to do much the same thing in a more sophisticated way. But is my method valid, or is there anything blatantly wrong with it?

I have tested this method, including with document similarity metrics (after normalizing the vectors, of course), and it yielded good results in my industrial application. But I would like to understand the academic standing of this approach.

The nice thing about this approach is that, by using word2vec vectors, similarity queries between documents give very good results thanks to semantic similarity (Euclidean closeness) between word vectors, even when different words are used across documents. This is something TF-IDF cannot do, since it treats each word as an independent dimension.
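The similarity query mentioned above reduces to cosine similarity between document vectors (after normalization, a plain dot product). A minimal sketch, with hypothetical document vectors standing in for the weighted sums:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity; equals the dot product once both vectors are normalized."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical document vectors produced by the TF-weighted sum.
doc_a = np.array([0.3, 0.2, -0.1])
doc_b = np.array([0.6, 0.4, -0.2])  # same direction, different magnitude

sim = cosine_similarity(doc_a, doc_b)  # close to 1: documents point the same way
```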

Thanks in advance!


Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange