Question

I was looking into text classification and, out of curiosity, searched online for the best models for the task. I found that linear support vector machines and naive Bayes are often cited among the best.

But which are the worst models to use for text classification? And, if possible, why?


Solution

First, the question is too broad because there are many different kinds of text classification tasks. For example, one wouldn't use the same approach for, say, spam detection and author profiling (e.g. predicting the gender of the author), two tasks which are technically both text classification but have little in common (and there are many others).

Second, even for a more specific kind of problem, the question of the type of model is misleading, because much of what makes one ML system perform better than another in text classification comes down to other things: the type and amount of training data, of course, but also, crucially, the features being used. There are many options for representing text as features, and these choices usually have a massive impact on performance. I would even say that most of the time the choice of classification model matters less than the design of the features.
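To make "representing text as features" concrete, here is a minimal sketch in plain Python (the vocabulary and document are made up for illustration) of two common options for the same text: raw term counts versus binary presence/absence. Even this small choice changes the feature vector the classifier sees:

```python
from collections import Counter

def bag_of_words(doc, vocab, binary=False):
    """Map a tokenized document to a feature vector over a fixed vocabulary.

    binary=False -> raw term counts; binary=True -> presence/absence (0/1).
    """
    counts = Counter(doc)
    if binary:
        return [1 if counts[w] > 0 else 0 for w in vocab]
    return [counts[w] for w in vocab]

# Toy example: same document, two different feature representations.
vocab = ["free", "money", "meeting", "now"]
doc = ["free", "free", "money", "now"]

print(bag_of_words(doc, vocab))               # counts:   [2, 1, 0, 1]
print(bag_of_words(doc, vocab, binary=True))  # presence: [1, 1, 0, 1]
```

Real systems would use richer representations still (TF-IDF weighting, n-grams, embeddings), and the gap between these options is often larger than the gap between model families.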

Finally, I'm actually going to answer the question, though probably not in the way the OP expects: the worst model in any classification task is exactly like the best model, except that it permutes the answers so as to make as many wrong predictions as possible (e.g. class 1 -> class 2, class 2 -> class 3, ..., class N -> class 1). Since it's a lot of work to implement the best classifier just to obtain the worst one, a close-to-worst one can be built with a minority baseline classifier: simply predict the least frequent class in the training data for every instance.
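That minority baseline can be sketched in a few lines of plain Python (the class name and toy labels are made up for illustration):

```python
from collections import Counter

class MinorityBaseline:
    """Always predicts the least frequent class seen in training.

    A deliberately bad baseline: on imbalanced data it gets most
    instances wrong by construction.
    """

    def fit(self, labels):
        # min() over the class counts -> the rarest class in the training data
        self.prediction_ = min(Counter(labels).items(), key=lambda kv: kv[1])[0]
        return self

    def predict(self, n_instances):
        return [self.prediction_] * n_instances

# Toy training labels: "spam" is the minority class.
train_labels = ["ham", "ham", "ham", "spam", "ham"]
clf = MinorityBaseline().fit(train_labels)
print(clf.predict(3))  # -> ['spam', 'spam', 'spam']
```

This is the mirror image of the familiar majority baseline: where the majority baseline sets a floor that any useful model should beat, the minority baseline approaches the ceiling of how badly a constant predictor can do.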

I hope a few of the things I said will be helpful, even though it's probably not what the OP wished for! :)

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange