Question

I have a data set with a huge number of features (approximately 3,000) and a binary target variable. The reason I have so many features is the one-hot encoding of many categorical variables in my data set.

I think logistic regression might only work with a small number of features.

So, given that I have many features, which algorithm should I use for a better classification score?

My aim is to increase the ROC-AUC metric for this classification task .

Is it better to use an SVM or a neural network?


Solution

The first thing that comes to mind is to try different encodings. There are several ways to deal with high-cardinality categorical data, such as label encoding or the well-known target encoding. Before anything else, I would recommend changing the encoding type.
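For illustration, here is a minimal sketch of smoothed mean target encoding in pandas. The column names (`city`, `target`) and the smoothing weight `m` are made up for the example; in real use, the encoding should be computed inside cross-validation folds to avoid target leakage.

```python
import pandas as pd

# Toy data; "city" stands in for a hypothetical high-cardinality categorical column.
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "target": [1, 0, 1, 1, 0, 1],
})

prior = df["target"].mean()  # global mean of the binary target
stats = df.groupby("city")["target"].agg(["mean", "count"])

# Smoothed target encoding: blend each category's mean with the global mean,
# weighted by how many rows the category has (m controls the smoothing strength).
m = 10
stats["encoded"] = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

# Replace the categorical column with a single numeric column.
df["city_te"] = df["city"].map(stats["encoded"])
print(df)
```

This turns each 3,000-column one-hot block back into a single numeric column per categorical variable.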

But since your question is about which predictor to use on small, sparse data, I would still go with logistic regression, a decision tree, or an SVM. When the data set is small, all algorithms tend to perform quite similarly.
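As a sketch of how to compare these candidates directly on the asker's metric (assuming scikit-learn; the synthetic data here is just a stand-in for the real feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: many mostly-uninformative binary-ish features, binary target.
X, y = make_classification(n_samples=500, n_features=300, n_informative=20,
                           random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    # roc_auc scoring uses SVC.decision_function, so probability=True is not needed
    "SVM": SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: ROC-AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Running such a comparison on your own data is usually more informative than picking an algorithm up front.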

Something like a random forest might also perform well, since it uses bootstrapping, which means fitting each tree on a sample of your data drawn with replacement.
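A minimal random-forest sketch under the same assumptions (scikit-learn, synthetic stand-in data); note that `bootstrap=True` is already the default and is what performs the sampling with replacement:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=300, n_informative=20,
                           random_state=0)

# Each tree is fit on a bootstrap sample (rows drawn with replacement),
# which decorrelates the trees and stabilizes the ensemble.
rf = RandomForestClassifier(n_estimators=300, bootstrap=True, random_state=0)

scores = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
print(f"random forest: ROC-AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```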

Licensed under: CC-BY-SA with attribution