Question

In Word2Vec, if I train on a set of sentences multiple times with the order changed (since this improves the vector representations), will the frequency of a word change because of it?

For example, if the word "deer" appears in my corpus 4 times and I set min_count to 5, does training the model 3 times count "deer" with a frequency of 12, so that it is included in the model?

If it knows that it is the same corpus, how can it tell the difference when I retrain the model with a new corpus?


Solution

This question was answered on Google Groups by Gordon Mohr:

Normally there's one read of the corpus to build the vocabulary (which includes initializing the model based on the learned vocabulary size), then any number of extra passes for training. It's only after the one vocabulary-learning scan that word counts are looked at (and compared to min_count for trimming).
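To make the "count once, train many times" point concrete, here is a minimal pure-Python sketch. It is not gensim's implementation: the build_vocab and train functions below are simplified stand-ins written for illustration, and the toy corpus is made up. It only shows that repeated passes never feed back into the word counts.

```python
from collections import Counter

def build_vocab(corpus, min_count):
    """Scan the corpus once, count words, and drop those below min_count."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

def train(corpus, vocab, epochs):
    """Stand-in for training: walks the corpus `epochs` times and returns how
    many in-vocabulary tokens it saw. It never modifies the counts in vocab."""
    seen = 0
    for _ in range(epochs):
        for sentence in corpus:
            seen += sum(1 for word in sentence if word in vocab)
    return seen

# Hypothetical corpus: "deer" occurs 4 times, "the" occurs 10 times.
corpus = [["the", "deer", "runs"]] * 4 + [["the", "forest", "is", "quiet"]] * 6

vocab = build_vocab(corpus, min_count=5)
tokens_seen = train(corpus, vocab, epochs=3)

print("deer" in vocab)  # False: its count is still 4, not 12
print(vocab["the"])     # 10
```

So with min_count=5, "deer" stays out of the vocabulary however many training passes follow.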

If you supply a corpus (as a restartable iterator) as one of the arguments to the initial creation of the Word2Vec model, all these steps are done automatically: one read of the corpus (through the build_vocab() method) to collect words/counts, then one or more passes (as controlled by the 'iter' parameter and done through the train() method) for training. Still, only the count for the single pass over the supplied corpus matters for frequency decisions.

If you don't supply a corpus at model-initialization, you can then call build_vocab(…) and train(…) yourself. It's only what's passed to build_vocab() that matters for retained frequency counts (and the estimate of corpus size). You can then call train(…) in other ways, or repeatedly – it just keeps using the vocabulary from the one earlier build_vocab(…) call.

(Note that train(…) does try to reuse the single-pass corpus size, remembered from the vocab-scanning pass, to give accurate progress-estimates and schedule the decay of the training-rate alpha. So if you give a different-sized corpus to train(…), you should also use its other optional parameters to give it a hint of the size.)
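As a rough illustration of why the size hint matters, the sketch below models a linear learning-rate decay driven by an expected corpus size. The alpha_at function is a hypothetical simplification, not gensim's actual scheduling code, and the corpus sizes are made up; the default values 0.025 and 0.0001 are assumed starting and final learning rates.

```python
def alpha_at(examples_done, expected_total, start_alpha=0.025, min_alpha=0.0001):
    """Linearly decay alpha from start_alpha to min_alpha over expected_total
    examples, holding at min_alpha once the expected total is exceeded."""
    progress = min(1.0, examples_done / expected_total)
    return start_alpha - (start_alpha - min_alpha) * progress

remembered_size = 1000  # size recorded during the vocabulary scan (hypothetical)
actual_size = 2000      # size of a different corpus later passed to training

# Learning rate after 1000 examples, i.e. halfway through the larger corpus:
stale = alpha_at(1000, remembered_size)   # schedule based on the old size
hinted = alpha_at(1000, actual_size)      # schedule based on the correct size

print(stale, hinted)
```

With the stale size, alpha has already reached its floor halfway through the new corpus, so the second half trains at the minimum rate. In the gensim versions I am aware of, the hint is passed to train() as total_examples or total_words.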

Licensed under: CC-BY-SA with attribution