Question

I'm running word2vec over a collection of documents. I understand that the size of the model is the number of dimensions of the vector space that the words are embedded into, and that different dimensions are somewhat related to different, independent "concepts" that a word could be grouped into. But beyond this I can't find any decent heuristics for how exactly to pick the number. There's some discussion here about the vocabulary size: https://stackoverflow.com/questions/45444964/python-what-is-the-size-parameter-in-gensim-word2vec-model-class However, I suspect that vocabulary size is not the most important factor; more important is how many sample documents you have and how long they are. Surely each "dimension" needs sufficient examples to be learnt?

I have a collection of 200,000 documents, averaging about 20 pages in length each, covering a vocabulary of most of the English language. I'm using the word2vec embeddings as a basis for finding distances between sentences and between documents. I'm using Gensim, if it matters, with a size of 240. Is this reasonable? Are there any studies on what heuristics to use to choose the size parameter? Thanks.
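For reference, a minimal sketch of what this setup might look like in Gensim (the 4.x API renamed `size` to `vector_size`); the tiny corpus here is just a placeholder, not the actual 200,000-document collection:

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, each entry would be one tokenized
# sentence or document from the 200,000-document collection.
sentences = [
    ["the", "quick", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=240,  # called "size" in Gensim < 4.0; the parameter in question
    window=5,         # context window (default)
    min_count=5,      # drop rare words; this shrinks the effective vocabulary
    workers=4,
)

# Word vectors can then be looked up and combined (e.g. averaged) to compare
# sentences or documents via cosine similarity.
vector = model.wv["fox"]
```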

No correct solution

Licensed under: CC-BY-SA with attribution