Will a Count vectorizer ever perform (slightly) better than tf-idf?
01-11-2019
Question
For a binary classification task, I have a small data set of about 1,000 texts (~590 positive and ~401 negative instances). With a training set of 800 and a test set of 200, I get slightly better accuracy with a count vectorizer than with tf-idf.
Additionally, the count vectorizer picks out the relevant "words" when training the model, while tf-idf does not pick those relevant words out. Even the confusion matrix for the count vectorizer shows marginally better numbers than the one for tf-idf.
TFIDF confusion matrix
[[ 80  11]
 [  6 103]]
BoW confusion matrix
[[ 81  10]
 [  6 103]]
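For reference, this is roughly the kind of comparison I'm running. This is a minimal sketch using scikit-learn; the corpus here is a toy stand-in (`texts`, `y` are made-up placeholders for my real data), and I'm assuming logistic regression as the classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; the real data set has ~1,000 labeled texts.
texts = ["good movie", "great film", "bad plot", "terrible acting"] * 50
y = [1, 1, 0, 0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, random_state=0, stratify=y)

# Fit the same classifier on top of each vectorizer and compare.
for name, vec in [("BoW", CountVectorizer()), ("TFIDF", TfidfVectorizer())]:
    clf = make_pipeline(vec, LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print(name)
    print(confusion_matrix(y_test, clf.predict(X_test)))
```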
I haven't tried cross-validation yet, but it came as a shock to me that the count vectorizer performed a bit better than tf-idf. Is it because my data set is too small, or because I haven't used any dimensionality reduction to cut down the number of words both classifiers take into account? What am I doing wrong?
I am sorry if this is a naive question, but I am really new to ML.
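Since a single 800/200 split can be noisy at this data size, cross-validation would give a more reliable comparison. A minimal sketch with scikit-learn's `cross_val_score` (again with a made-up toy corpus standing in for my real texts and labels):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; replace with the real ~1,000 texts and labels.
texts = ["good movie", "great film", "bad plot", "terrible acting"] * 50
y = [1, 1, 0, 0] * 50

# 5-fold cross-validated accuracy for each vectorizer with the same classifier.
for name, vec in [("BoW", CountVectorizer()), ("TFIDF", TfidfVectorizer())]:
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, texts, y, cv=5)
    print(f"{name}: mean acc {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the vectorizer inside the pipeline matters here: it is refit on each training fold, so no vocabulary or document-frequency statistics leak from the held-out fold.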
No correct solution