Question

I am using the Gensim Python toolkit to build a tf-idf model for a set of documents, so I first need to create a dictionary covering all of them. However, I found that Gensim does not apply stemming before creating the dictionary and corpus. Am I right?

Solution

You are correct. Gensim does no preprocessing of its own when building a dictionary or corpus; it simply converts whatever tokens you give it into the different models.

Here is the relevant quote and the link that it is from:

The ways to process documents are so varied and application- and language-dependent that I decided to not constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its “surface” string form: how you get to the features is up to you.

From Strings to Vectors

Other tips

I was struggling with the same issue. To get around it, I first stemmed the documents using NLTK and then processed them with Gensim. That is probably an easy and handy way to accomplish your task.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow