I am using the Gensim Python toolkit to build a tf-idf model for a set of documents, so I first need to create a dictionary covering all of them. However, I found that Gensim does not apply stemming before creating the dictionary and corpus. Am I right?

Solution

You are correct. Gensim does nothing special beyond converting whatever you give it into its various models, so any stemming has to happen before you build the dictionary and corpus.

Here is the relevant quote and the link that it is from:

The ways to process documents are so varied and application- and language-dependent that I decided to not constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its “surface” string form: how you get to the features is up to you.

From Strings to Vectors

Other tips

I was struggling with the same problem. To get around it, I first stemmed the documents using NLTK and then processed them with Gensim. That is probably an easy and handy way to accomplish your task.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow