Question

If I need to run feature selection on my dataset, isn't it problematic to use OneHotEncoder? Couldn't the selection step then decide to remove one of the encoded columns? How should I deal with this? Thank you.

Solution

Yes, that can happen. When feature selection drops one of the encoded columns, it means that belonging to that category has little or no importance for the target.

Imagine a categorical feature with a lot of categories (high cardinality). Maybe one of them has no influence on the target, so if you do feature selection, the column for that category has a high chance of getting dropped. This is normal; just make sure that the column really carries no information in any part of the dataset.
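
To make this concrete, here is a minimal sketch in plain Python. The data, the helper names (`one_hot`, `score`) and the 0.1 threshold are all made up for illustration, and the score used (difference in target mean) is just one simple univariate criterion, not what any particular library does.

```python
# Toy data (invented for illustration): one categorical feature, one binary target.
colors = ["red"] * 4 + ["green"] * 4 + ["blue"] * 4
target = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]


def one_hot(values):
    """Return {level: 0/1 indicator column} for a list of category labels."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def mean(xs):
    return sum(xs) / len(xs)


def score(column, y):
    """Absolute difference in target mean between rows where the dummy is 1 and 0."""
    inside = [t for c, t in zip(column, y) if c == 1]
    outside = [t for c, t in zip(column, y) if c == 0]
    return abs(mean(inside) - mean(outside))


dummies = one_hot(colors)
scores = {level: score(col, target) for level, col in dummies.items()}

# Keep only the dummy columns whose score clears an (arbitrary) threshold.
kept = [level for level, s in scores.items() if s > 0.1]
print(scores)  # blue scores 0.0, green and red score 0.75
print(kept)    # ['green', 'red']
```

Here the "blue" rows have the same target rate as the rest of the data, so the "blue" column is dropped while "red" and "green" stay. Nothing is lost: "blue" simply becomes the reference level.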

OTHER TIPS

I agree with @CarlosMougan that the answer to your question "How should I deal with this?" may well be "There is no problem to deal with." Dropping a single dummy column essentially allows your feature selection method to lump that category into the baseline.

However, some folks seem to prefer categoricals' dummy variables to be kept together. I know of three ways to try to do this.

  1. Group lasso, where the penalty tries to enforce that certain groups of variables are either all kept or all dropped from (er, have zero coefficient in) the model.
    https://stats.stackexchange.com/q/214325/232706
    https://stats.stackexchange.com/q/209009/232706

  2. You could also apply a categorical-friendly univariate analysis (e.g. chi-squared), or a model-based feature selection method in which the model can deal natively with the categorical variables (tree-based models in some implementations).

  3. Finally, with model-based selection techniques, you could try to aggregate the feature importances of dummy variables into one score for each original categorical. It's not clear how well that can work; see
    https://stats.stackexchange.com/q/314567/232706
    Average of importance gain for a categorical variable
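
The core idea behind each of the three options can be sketched in plain Python. Everything here is an assumption made for illustration: the function names, the toy data and the importance values are hypothetical. `group_soft_threshold` is only the proximal (shrinkage) step that gives group lasso its all-or-nothing behaviour, not a full solver, and summing importances is just one possible aggregation.

```python
import math


def group_soft_threshold(coefs, penalty):
    """Option 1: group-lasso shrinkage step for one group of coefficients.

    Every coefficient in the group is scaled by the same factor, so the
    dummies of one categorical are shrunk together or zeroed together.
    """
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= penalty:
        return [0.0 for _ in coefs]
    scale = 1.0 - penalty / norm
    return [scale * c for c in coefs]


def chi_squared(categories, y):
    """Option 2: Pearson chi-squared statistic of a whole categorical vs. a label."""
    n = len(y)
    cat_counts, y_counts, joint = {}, {}, {}
    for c, t in zip(categories, y):
        cat_counts[c] = cat_counts.get(c, 0) + 1
        y_counts[t] = y_counts.get(t, 0) + 1
        joint[(c, t)] = joint.get((c, t), 0) + 1
    stat = 0.0
    for c, nc in cat_counts.items():
        for t, nt in y_counts.items():
            expected = nc * nt / n
            stat += (joint.get((c, t), 0) - expected) ** 2 / expected
    return stat


def aggregate_importances(importances, group_of):
    """Option 3: sum per-dummy importances into one score per original feature."""
    totals = {}
    for column, value in importances.items():
        feature = group_of[column]
        totals[feature] = totals.get(feature, 0.0) + value
    return totals


# Option 1: a weak group is dropped as a whole, a strong group is only shrunk.
weak = group_soft_threshold([0.3, -0.4], penalty=1.0)    # all zeros
strong = group_soft_threshold([3.0, -4.0], penalty=1.0)  # both entries scaled down

# Option 2: one statistic for the categorical as a whole (toy data).
colors = ["red"] * 4 + ["green"] * 4 + ["blue"] * 4
target = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
stat = chi_squared(colors, target)

# Option 3: hypothetical importances as a fitted model might report them.
importances = {"color_red": 0.30, "color_green": 0.25, "color_blue": 0.05, "age": 0.40}
group_of = {"color_red": "color", "color_green": "color", "color_blue": "color", "age": "age"}
totals = aggregate_importances(importances, group_of)
```

In each case the selection decision is made once per original categorical, so its dummy columns are either all kept or all dropped.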

See also these questions that are close to your question:
How to implement feature selection for categorical variables (especially with many categories)?
Feature Selection with one-hot-encoded categorical data
https://stats.stackexchange.com/q/78644/232706
https://stats.stackexchange.com/q/154266/232706

Licensed under: CC-BY-SA with attribution