Question

I am working on a Kaggle dataset (housing price prediction). I have done some pre-processing on the data (missing values, category aggregation, choosing ordinal vs. one-hot features) and am now trying to implement a pipeline to streamline the code. The pipeline consists of a ColumnTransformer with two components: a standard scaler applied to the numerical and ordinal features, and a one-hot encoder for the remaining features. I pass this pipeline to a GridSearchCV object to tune hyperparameters; in this case the model is LASSO, so I am tuning the coefficient of the penalty term.

The problem is that some of the one-hot encoded features are highly skewed, with the counts concentrated in mostly one category. When GridSearchCV runs cross-validation, it raises an error saying that unknown categories were found while validating the model. I think this happens because the training folds sometimes lack data points with specific labels that do show up in the validation fold, so the fitted one-hot encoder doesn't know those categories.

One obvious way to handle this would be to fit a one-hot encoder separately, keep it aside, and then build the pipeline and carry on with the grid search / validation steps. That seems disconnected to me, considering that the notion of a pipeline was defined for exactly this purpose. Maybe I am missing something here. Is there a better (more efficient) way to achieve the above than separating the one-hot encoder from the pipeline?
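For concreteness, here is a minimal sketch of the setup described above; the column names and the alpha grid are placeholders, not taken from the actual notebook:

```python
# Sketch of the described pipeline: ColumnTransformer feeding a LASSO
# model inside GridSearchCV. Column lists are hypothetical examples.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_ord_cols = ["LotArea", "OverallQual"]   # numerical + ordinal features
cat_cols = ["RoofMatl"]                     # nominal features to one-hot encode

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), num_ord_cols),
    ("onehot", OneHotEncoder(), cat_cols),  # default handle_unknown="error"
])

pipe = Pipeline([("prep", preprocess), ("model", Lasso())])

# Tuning the L1 penalty strength; with the default OneHotEncoder, a
# category present only in a validation fold raises an error on fit.
search = GridSearchCV(pipe, {"model__alpha": [0.01, 0.1, 1.0]}, cv=5)
```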

(Histogram of the category counts; the underlying data are listed below.)

For reference, the data for the above histogram,

  • Category: Count
  • CompShg: 1434
  • Tar&Grv: 11
  • WdShngl: 6
  • WdShake: 5
  • Roll: 1
  • ClyTile: 1
  • Metal: 1
  • Membran: 1

Solution

You can use handle_unknown='ignore' in the OneHotEncoder; levels present in the test set but not the train set will be encoded as all-zeros rather than raising an error.

But for the example you provide, I rather doubt it's worth keeping all the levels. The coefficient learned for such a small level's dummy variable will be overly specialized (overfit). Consider alternatives, such as grouping the rare categories together.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange