Question

In my recent studies of machine-learning NLP tasks I found this very nice tutorial on building your first text classifier:

https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

The point is that I always believed you had to choose between bag-of-words, word embeddings, or TF-IDF, but in this tutorial the author uses bag-of-words (CountVectorizer) and then applies TF-IDF on top of the features generated by bag-of-words.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

Is that a valid technique? Why would I do it?


Solution

This is just the standard TF-IDF feature extraction: TF-IDF re-weights the raw document term counts, so counting with CountVectorizer first is exactly what TF-IDF requires. It only looks odd because the two steps are written out separately. scikit-learn provides both TfidfTransformer and TfidfVectorizer; note the documentation of the latter:

Equivalent to CountVectorizer followed by TfidfTransformer.
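You can verify that equivalence directly. A minimal sketch (the three-document corpus is made up for illustration) comparing the two-step pipeline against the single TfidfVectorizer, using default parameters for both:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

corpus = ["the cat sat", "the dog sat", "the cat ran"]

# Two-step: raw term counts first, then TF-IDF re-weighting of those counts.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step: TfidfVectorizer does both stages in a single transformer.
one_step = TfidfVectorizer().fit_transform(corpus)

# With default settings the two feature matrices are identical.
assert np.allclose(two_step.toarray(), one_step.toarray())
```

So the tutorial's pipeline is valid; the author simply chose the two-step form, which is convenient when you want to tune or swap the counting and weighting stages independently.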

Licensed under: CC-BY-SA with attribution