Question

I am trying to do sentiment analysis. To convert the words to word vectors I am using the word2vec model. Suppose I have all the sentences in a list named 'sentences' and I am passing these sentences to word2vec as follows:

model = word2vec.Word2Vec(sentences, workers=4, min_count=40, size=300, window=5, sample=1e-3)

Since I am a noob to word vectors, I have two doubts.
1- Setting the number of features to 300 defines the dimensionality of a word vector. But what do these features signify? If each word in this model is represented by a 1x300 numpy array, what do those 300 features signify for that word?

2- What does downsampling, as controlled by the 'sample' parameter in the above model, actually do?

Thanks in advance.


Solution

1- The number of features: in neural-network terms it is the number of neurons in the projection (hidden) layer. Because the model is built on the distributional hypothesis, the numerical vector for each word signifies its relation to its context words.
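
For concreteness, here is a minimal sketch of inspecting these learned vectors, assuming gensim 4.x (where the question's size parameter has been renamed vector_size); the toy corpus is invented for illustration and is far too small to yield meaningful vectors:

    from gensim.models import Word2Vec

    # Toy corpus, illustration only; real training needs far more data.
    sentences = [["the", "king", "rules", "the", "kingdom"],
                 ["the", "queen", "rules", "the", "kingdom"],
                 ["a", "man", "and", "a", "woman", "walk"]]

    # vector_size=300 gives each word a 300-dimensional vector,
    # i.e. 300 neurons in the projection layer.
    model = Word2Vec(sentences, vector_size=300, min_count=1, window=5)

    vec = model.wv["king"]   # a (300,) numpy array
    print(vec.shape)         # (300,)
    print(vec[:5])           # first few learned features; not individually interpretable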

These features are learnt by the neural network, as this is an unsupervised method. Each vector encodes several sets of semantic characteristics. For instance, take the classical example V(King) - V(Man) + V(Woman) ≈ V(Queen), with each word represented by a 300-d vector. V(King) will carry semantic characteristics such as royalty, kingdom, masculinity and human-ness distributed across its dimensions; V(Man) will carry masculinity, human-ness and so on. Thus when V(King) - V(Man) is computed, the masculinity and human-ness characteristics largely cancel out, and adding V(Woman), which carries femininity and human-ness, yields a vector very similar to V(Queen). The interesting thing is that these characteristics are encoded consistently across the vector space, so that numerical operations such as addition and subtraction work as intended. This falls out of the unsupervised training procedure.
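
gensim exposes exactly this kind of vector arithmetic through most_similar. A sketch, reusing the model object from above; note that a toy corpus will not produce meaningful analogies, so the commented result assumes a model trained on a large corpus:

    # V(King) - V(Man) + V(Woman): positive terms are added, negative subtracted.
    result = model.wv.most_similar(positive=["king", "woman"],
                                   negative=["man"], topn=3)
    for word, similarity in result:
        print(word, similarity)   # on a large corpus, "queen" ranks near the top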

2- There are two approximation algorithms for the output layer: hierarchical softmax and negative sampling. (Note that in gensim these are selected via the hs and negative parameters; the sample parameter itself controls downsampling of frequent words, discussed in the tips below.) With a full softmax, for each training example the context words are given positive outputs and every other word in the vocabulary is effectively given a negative output, which is very expensive. Hierarchical softmax reduces this cost by organizing the vocabulary in a binary tree, while negative sampling resolves the time complexity differently: rather than the whole vocabulary, only a small sampled subset of words is given negative outputs, which makes training much faster.
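
In gensim the choice between the two is made explicitly via the hs and negative parameters; a sketch, assuming gensim 4.x and reusing the toy sentences from above:

    from gensim.models import Word2Vec

    # Hierarchical softmax: hs=1 and negative=0.
    model_hs = Word2Vec(sentences, vector_size=300, min_count=1, hs=1, negative=0)

    # Negative sampling (the default): hs=0, with `negative` noise words
    # drawn per positive example instead of updating the whole vocabulary.
    model_ns = Word2Vec(sentences, vector_size=300, min_count=1, hs=0, negative=5)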

OTHER TIPS

  1. Under the distributional hypothesis, an individual dimension of a word's vector does not signify much about the word in the real world, so you need not worry about interpreting individual dimensions. If your question is rather how to select the number of dimensions, that is purely an empirical matter for your data, and it can range from 100 to 1000. In many experiments where training is done on Wikipedia text, 300 dimensions tend to give the best results.
  2. The sample parameter is used to prune words with very high frequency. E.g. "the", "is", "was": such stopwords are down-sampled so that they are less often kept in the context window during prediction, and the default value works well at identifying these high-frequency stop words (see the sketch after this list).
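
As a rough sketch of what sample does: the original word2vec C code keeps each occurrence of a word with a probability that shrinks as the word's corpus frequency rises above the threshold, and gensim follows the same rule. The 6% frequency below is an invented example for a common stopword:

    import math

    def keep_probability(freq_fraction, sample=1e-3):
        # Probability of keeping one occurrence of a word whose fraction
        # of all corpus tokens is freq_fraction, per the word2vec
        # subsampling rule.
        if freq_fraction <= sample:
            return 1.0
        return (math.sqrt(freq_fraction / sample) + 1) * (sample / freq_fraction)

    print(keep_probability(0.06))     # a stopword like "the": ~0.15, mostly skipped
    print(keep_probability(0.0001))   # a rare word: 1.0, always kept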