Question

For the task of binary classification, I have a small data set of a total of 1000 texts (~590 positive and ~401 negative instances). With a training set of 800 and a test set of 200, I get slightly better accuracy with the count vectorizer than with TF-IDF.

Additionally, the count vectorizer picks out the relevant "words" when training the model, while TF-IDF does not pick those relevant words out. Even the confusion matrix for the count vectorizer shows marginally better numbers than the one for TF-IDF.

TFIDF confusion matrix
[[ 80  11]
 [  6 103]]
BoW confusion matrix
[[ 81  10]
 [  6 103]] 
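For context, here is a minimal sketch of the kind of comparison described above, assuming a typical scikit-learn setup with logistic regression as the classifier (the actual classifier isn't stated in the question); `texts` and `labels` below are hypothetical stand-ins for the real 1000-document data set:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Hypothetical toy corpus standing in for the real 1000 texts.
texts = ["good movie", "great film", "bad plot", "terrible acting",
         "loved it", "hated it", "fine story", "awful pacing"] * 25
labels = [1, 1, 0, 0, 1, 0, 1, 0] * 25

# 80/20 split, mirroring the 800/200 split in the question.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    clf = LogisticRegression(max_iter=1000)
    # Fit the vectorizer on the training texts only, then reuse it
    # to transform the test texts.
    clf.fit(vec.fit_transform(X_train), y_train)
    preds = clf.predict(vec.transform(X_test))
    print(name, "confusion matrix:\n", confusion_matrix(y_test, preds))
```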

I haven't tried cross-validation yet, but it came as a shock to me that the count vectorizer performed a bit better than TF-IDF. Is it because my data set is too small, or because I haven't used any dimensionality reduction to cut down the number of words taken into account by both classifiers? What is it that I am doing wrong?
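Since cross-validation hasn't been tried yet, a sketch of what that could look like with scikit-learn is below. A single 800/200 split is noisy at this data size, so k-fold scores give a fairer comparison between the two vectorizers; `texts` and `labels` are again hypothetical stand-ins:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real 1000 texts.
texts = ["good movie", "great film", "bad plot", "terrible acting",
         "loved it", "hated it", "fine story", "awful pacing"] * 25
labels = [1, 1, 0, 0, 1, 0, 1, 0] * 25

for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    # Putting the vectorizer inside the pipeline refits it on each
    # training fold, which avoids leaking test-fold vocabulary.
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, texts, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean and spread of the fold scores shows whether the BoW/TF-IDF gap is real or within the noise of one particular split.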

I am sorry if this is a naive question, but I am really new to ML.

No correct solution

Licensed under: CC-BY-SA with attribution