Question

I am trying to do sentiment analysis. In order to convert words into word vectors I am using the word2vec model. Suppose I have all my sentences in a list named 'sentences' and I am passing these sentences to word2vec as follows:

from gensim.models import word2vec

model = word2vec.Word2Vec(sentences, workers=4, min_count=40, size=300, window=5, sample=1e-3)

Since I am new to word vectors I have two doubts.

1- Setting the number of features to 300 defines the features of a word vector. But what do these features mean? If each word in this model is represented by a 1x300 NumPy array, what do these 300 features signify for that word?

2- What does downsampling, as controlled by the 'sample' parameter in the above model, actually mean?

Thanks in advance.


Solution

1- The number of features: in terms of the neural network model it is the number of neurons in the projection (hidden) layer. Since the projection layer is trained according to the distributional hypothesis, the numerical vector for each word captures its relationship with its context words.
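
As a minimal sketch (assuming 'model' is the model trained in the question, and that 'king' occurs often enough in the corpus to survive the min_count cutoff), each word then maps to a 300-dimensional NumPy vector:

vec = model.wv['king']  # look up the learned vector for one vocabulary word
print(vec.shape)        # (300,): one value per neuron in the projection layer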

These features are learnt by the neural network, as this is an unsupervised method. Each vector encodes several sets of semantic characteristics. For instance, take the classical example V(King) - V(Man) + V(Woman) ~ V(Queen), with each word represented by a 300-d vector. V(King) will carry semantic characteristics such as royalty, kingdom, masculinity and human-ness in the vector in a certain order, while V(Man) will carry masculinity, human-ness and work. Thus when V(King) - V(Man) is computed, the masculinity and human-ness characteristics cancel out, and adding V(Woman), which carries femininity and human-ness, yields a vector very similar to V(Queen). The interesting thing is that these characteristics are encoded consistently across the vectors, so that numerical computations such as addition and subtraction work. This follows from the nature of the unsupervised learning method used to train the network.
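
A quick way to check this is gensim's most_similar method. A minimal sketch, assuming the model trained above has 'king', 'man' and 'woman' in its vocabulary (which requires a fairly large training corpus):

# find the word whose vector is closest to V(king) - V(man) + V(woman)
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)  # on a large enough corpus, something like [('queen', 0.7...)]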

2- There are two approximation algorithms: hierarchical softmax and negative sampling. (In gensim they are selected by the 'hs' and 'negative' parameters; the 'sample' parameter actually controls the downsampling of frequent words, as explained in the tips below.) With a plain softmax, for each word vector its context words are given positive outputs and all other words in the vocabulary are given negative outputs, which is prohibitively expensive. Negative sampling resolves this time-complexity issue: rather than the whole vocabulary, only a small sampled subset of the vocabulary is given negative outputs during training, which is much faster. Hierarchical softmax instead cuts the cost by arranging the vocabulary in a binary tree.
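
A minimal sketch of how the two approximations are selected in gensim ('sentences' is the list from the question; the other parameter names are gensim's):

from gensim.models import word2vec

# negative sampling: each positive example is contrasted with 5 sampled noise words
model_ns = word2vec.Word2Vec(sentences, size=300, negative=5, hs=0)

# hierarchical softmax instead of negative sampling
model_hs = word2vec.Word2Vec(sentences, size=300, negative=0, hs=1)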

Other tips

  1. According to the distributional hypothesis, an individual dimension in a word's vector does not signify much about the word in the real world, so you do not need to worry about the individual dimensions. If your question is rather how to select the number of dimensions, that is purely a matter of experiment on your data, and it can go from 100 to 1000. For many experiments where training is done on Wikipedia text, 300 dimensions mostly give the best result.
  2. The 'sample' parameter is used to prune words with very high frequency, e.g. "the", "is", "was". Such stop words are then not considered in the window while predicting the inner word, and the default value works well at identifying these high-frequency stop words; see the sketch below.
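
A minimal sketch of the subsampling rule from the original word2vec paper (gensim's exact formula differs slightly, but the effect is the same): a word with relative corpus frequency f is kept in a window with probability roughly sqrt(sample / f).

import math

def keep_probability(word_frequency, sample=1e-3):
    # probability of keeping a word under the paper's subsampling rule;
    # very frequent words are mostly dropped, rare words are always kept
    return min(1.0, math.sqrt(sample / word_frequency))

print(keep_probability(0.05))   # a "the"-like word is kept only ~14% of the time
print(keep_probability(1e-4))   # a rarer word is always kept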
Licensed under: CC-BY-SA with attribution