Question

I have a binary classification problem:

  • Approximately 1000 samples in training set
  • 10 attributes, including binary, numeric and categorical

Which algorithm is the best choice for this type of problem?

By default I'm going to start with SVM (after first converting the nominal attribute values to binary features), since it is often considered a strong choice for relatively clean, low-noise data.
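A minimal sketch of that plan, assuming made-up column names and synthetic data: one-hot encode the nominal column, scale the numeric ones, pass the binary flag through untouched, then cross-validate an RBF-kernel SVM.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "flag": rng.integers(0, 2, size=n),                     # already binary
    "color": rng.choice(["red", "green", "blue"], size=n),  # nominal
})
# Synthetic target that depends on the features so the model has signal.
y = ((X["num_a"] + (X["color"] == "red")) > 0.5).astype(int)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["num_a", "num_b"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
], remainder="passthrough")  # the binary flag passes through unchanged

model = Pipeline([("prep", pre), ("svm", SVC(kernel="rbf", C=1.0))])
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Wrapping the preprocessing in a pipeline keeps the encoding inside each cross-validation fold, which avoids leaking information from the held-out data.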


Solution

It's hard to say without knowing a little more about your dataset and how separable it is given your feature vector, but I would probably suggest extremely randomized trees (extreme random forests) over standard random forests because of your relatively small sample set.

Extreme random forests are quite similar to standard random forests, with the one exception that instead of optimizing the splits in each tree, they make splits at random. Initially this seems like a drawback, but it generally means significantly better generalization and speed, though the AUC on your training set is likely to be a little worse.
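To make the comparison concrete, here is a small sketch on synthetic data contrasting a standard random forest with scikit-learn's extra-trees implementation (`ExtraTreesClassifier`), which picks split thresholds at random as described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the problem: ~1000 samples, 10 features.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
et = ExtraTreesClassifier(n_estimators=200, random_state=0)

rf_score = cross_val_score(rf, X, y, cv=5).mean()
et_score = cross_val_score(et, X, y, cv=5).mean()
print("random forest:", rf_score)
print("extra-trees:  ", et_score)
```

On real data the ranking can go either way, which is exactly why cross-validating both is cheap insurance.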

Logistic regression is also a pretty solid bet for these kinds of tasks, though with your relatively low dimensionality and small sample size I would be worried about overfitting. You might want to check out K-Nearest Neighbors, since it often performs very well with low dimensionalities, but it doesn't usually handle categorical variables very well.

If I had to pick one without knowing more about the problem I would certainly place my bets on extreme random forest, as it's very likely to give you good generalization on this kind of dataset, and it also handles a mix of numerical and categorical data better than most other methods.

OTHER TIPS

With few features, a fairly limited sample size, and a binary classifier, logistic regression should be plenty powerful. You can use a more advanced algorithm, but it's probably overkill.

When categorical variables are in the mix, I reach for random decision forests, since they can handle categorical variables directly without the 1-of-n encoding transformation, which loses less information.
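One caveat: scikit-learn's random forest does not consume categorical values natively, so a common approximation of this tip is to integer-code the category into a single column rather than expanding it 1-of-n. The sketch below (invented column names and data) does that; note the trees then treat the codes as ordered numbers, so this only approximates true categorical splits.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
city = rng.choice(["nyc", "la", "chicago", "houston"], size=n)
X = pd.DataFrame({
    "income": rng.normal(50, 10, size=n),
    # One integer column instead of four 1-of-n dummy columns.
    "city_code": pd.Categorical(city).codes,
})
# Synthetic target: depends on the category and the numeric feature.
y = ((city == "nyc") | (X["income"] > 55)).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
score = cross_val_score(rf, X, y, cv=5).mean()
print(score)
```

Libraries that split on categories directly (e.g. H2O's random forest, or gradient-boosting implementations with native categorical support) avoid even this approximation.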

Linear SVM should be a good starting point. Take a look at this guide to choose the right estimator.

I wouldn't recommend using complex methods first. Use fast, simple approaches initially (kNN, NBC, etc.), then progress through linear regression, logistic regression, LDA, CART (RF), KREG, and then to least squares SVM, gradient ascent SVM, ANNs, and then metaheuristics (greedy heuristic hill climbing with GAs, swarm intelligence, ant colony optimization, etc.)
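The early, simple end of that progression can be sketched as a single comparison loop on synthetic data (k-NN and naive Bayes first, then logistic regression, LDA, and a CART-style decision tree); the more complex methods only earn their keep if these leave accuracy on the table.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

models = {
    "knn": KNeighborsClassifier(),
    "naive bayes": GaussianNB(),
    "logistic": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "cart": DecisionTreeClassifier(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {s:.3f}")
```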

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange