Question

I would like to know the correct procedure for inferring vectors with Gensim's doc2vec.

I have a dataframe df with a single feature, name, made up of two subsets, train and test.

df = train + test

My aim is to find the most similar name in train for a given name in test. To do this I have to train the doc2vec model, and I see two possible choices:

  • train the model on the entire df, then call model.infer_vector() on each name in test and look up its most similar name.
  • train the model on train only, leaving out test, then call model.infer_vector() on each name in test and look up its most similar name.

I suppose the correct procedure is the first one, but I am not sure.

Also, with the first option, the most similar name for a name in test could itself be in test rather than in train.
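One possible way to handle this, sketched under assumptions: if the documents were tagged so that train and test can be told apart, the ranked results could be filtered down to train tags. The helper first_train_hit, the tag names and the scores below are hypothetical; the list only mimics the (tag, similarity) pairs that gensim's most_similar returns.

```python
def first_train_hit(ranked, train_tags):
    """Return the first (tag, score) pair whose tag belongs to train, or None."""
    for tag, score in ranked:
        if tag in train_tags:
            return tag, score
    return None


# Hypothetical ranked output, sorted from most to least similar.
ranked = [("test_7", 0.93), ("train_2", 0.88), ("train_5", 0.61)]
train_tags = {"train_2", "train_5"}

best = first_train_hit(ranked, train_tags)
```

For this to work, the ranked list has to be long enough to contain at least one train document, for example by asking most_similar for more than one result via topn.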

No correct solution

Licensed under: CC-BY-SA with attribution