Scikit-learn OneHotEncoder effect on feature selection

https://datascience.stackexchange.com/questions/68931

09-12-2020

Question

If I need to run feature selection on my dataset, isn't it problematic to use OneHotEncoder? Couldn't it then decide to remove one of the encoded columns? How should I deal with this? Thank you.

Solution

Yes, that can happen. It means that the category behind the dropped column has no importance for the target.

Imagine a categorical feature with a lot of categories (high cardinality). Maybe only one of them has no influence on the target, so if you do feature selection, that category's column has a high chance of getting dropped. This is normal; just make sure that the column really carries no information anywhere in the dataset.
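To make this concrete, here is a minimal sketch of a univariate selector dropping individual dummy columns. The column name `color`, the category labels, and the counts are hypothetical toy values chosen so that two categories sit exactly at the overall target rate; `SelectKBest` with `chi2` is just one possible selector, not necessarily the one you are using.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: 100 rows per category with a fixed number of positives.
# "a" and "b" shift the target rate; "c" and "d" sit exactly at the overall rate.
positives = {"a": 90, "b": 10, "c": 50, "d": 50}
colors, y = [], []
for cat, n_pos in positives.items():
    colors += [cat] * 100
    y += [1] * n_pos + [0] * (100 - n_pos)

X = np.array(colors, dtype=object).reshape(-1, 1)
y = np.array(y)

# One-hot encode the single categorical column (one dummy per category).
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(X)
dummy_names = [f"color_{c}" for c in encoder.categories_[0]]

# Keep the two dummy columns with the highest chi-squared score.
selector = SelectKBest(chi2, k=2).fit(X_onehot, y)
kept = [n for n, keep in zip(dummy_names, selector.get_support()) if keep]
dropped = [n for n in dummy_names if n not in kept]

print("kept:", kept)
print("dropped:", dropped)
```

The selector treats each dummy column as an independent feature, so the categorical is split: the dummies for "c" and "d" are discarded while those for "a" and "b" survive.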

OTHER TIPS

I agree with @CarlosMougan that the answer to your question "How should I deal with this?" may well be "There is no problem to deal with." This is essentially allowing your feature selection method to lump categories into the baseline.

However, some folks seem to prefer categoricals' dummy variables to be kept together. I know of three ways to try to do this.

Group lasso, where the penalty tries to enforce that certain groups of variables are either all kept or all dropped from (er, have zero coefficient in) the model.
https://stats.stackexchange.com/q/214325/232706
https://stats.stackexchange.com/q/209009/232706

You could also apply a categorical-friendly univariate analysis (e.g. chi-squared), or a model-based feature selection method in which the model can deal natively with the categorical variables (tree-based models in some implementations).

Finally, with model-based selection techniques, you could try to aggregate the feature importances of dummy variables into one score for each original categorical. It's not clear how well that can work; see
https://stats.stackexchange.com/q/314567/232706
Average of importance gain for a categorical variable
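As a rough sketch of the last idea, the dummy-level importances from a fitted model can be summed per original categorical. Everything here is an assumption for illustration: the feature names `city` and `noise_cat`, the synthetic data, the choice of a random forest, and summing as the aggregation rule (averaging is another option, and as noted above it is not clear how well either works).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 600

# Hypothetical data: "city" determines the target, "noise_cat" is unrelated.
city = rng.choice(["x", "y", "z"], size=n)
noise_cat = rng.choice(["p", "q", "r", "s"], size=n)
y = (city == "x").astype(int)

X_cat = np.column_stack([city, noise_cat])
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(X_cat)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_onehot, y)

# The encoder emits dummies grouped by input column, in order, so each
# categorical owns a contiguous slice of the importance vector.
original_names = ["city", "noise_cat"]
aggregated = {}
start = 0
for name, cats in zip(original_names, encoder.categories_):
    stop = start + len(cats)
    aggregated[name] = float(forest.feature_importances_[start:stop].sum())
    start = stop

print(aggregated)
```

Selection would then be done on `aggregated`, keeping or dropping all dummies of a categorical together.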

Licensed under: CC-BY-SA with attribution
