Pipeline heterogeneous data

https://datascience.stackexchange.com/questions/72286

10-12-2020
|

Pregunta

I'm facing an issue I don't know how to solve, but as I'm a beginner probably there is an easy solution I can't find.

I'm playing with the titanic dataset and I want to work with pipelines (In order to avoid data leakage using cross validation). For that reason I'm using two pipelines (one for numerical, one for categorical) + FeatureUnion().

What's the problem? In the numerical pipeline I fill the NaN values of Age and then I create some buckets for that variable. The result of this pipeline would be a dataframe containing all the numerical features + 1 categorical variable. For encoding categorical variables, I use a pipeline for the categorical variables, and then use FeatureUnion to join both datasets. But the problem is that the new variable I create in the numerical pipeline doesn't go into the categorical pipeline, resulting with a dataframe with one categorical variable that hasn't been encoded. How can I solve this?

CODE:

num_pipeline = Pipeline(steps = [
('selector', DataFrameSelector(numerical_features)),
('imputer', df_imputer(strategy="median")), #Numerical
('new_variables', df_new_variables()) #Numerical
])

cat_pipeline = Pipeline(steps = [
('selector', DataFrameSelector(categorical_features)),
('label_encoder', MultiColumnLabelEncoder()) #Categorical
])

full_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline)
])

Thank you for your time

Best regards

EDIT:

I was thinking about using ColumnTransformer as I think it suits better in my example as I have to apply different transformations for different columns, but the problem is that when working with ColumnTransformer the output would be an array with no columns' names, which I think would be hard to deal if we want to use feature selection. That's why I chose Pipelines rather than ColumnTransformer.

Talking about the option of creating the bucket before going into the pipeline, I can't because it's created based on the variable I'm dealing with missing values.

What would be the best option in this case?

Solución

Approach 1: create features before transforming

If you want to create a categorical variable based on a numerical variable and then treat it in cat_pipeline, you need to do create it before the column transformer.

Implement a transformer (called "bucketer" ?) that takes p variables and transforms it into p+1 (if you want to add the categorical representation and keep the initial numerical feature). This transformer is the FIRST step of your pipe.

Then, create a ColumnTransformer (I think it is more suited for your case, but have not enough details to be sure. I suggest you read this to be sure). This second transformation is the second step in your pipe.

Each branch is fed with respect to what it should output because feature creation (bucketing) was done prior to column transformer.

Approach 2: create features in transforming

Otherwise, you can create two main paths :

One branch will output only categorical features, whatever the input type.
The other will output only numerical features, whatever the input type.

You may want to re-use your feature selector in some branches, I just wanted to illustrate the two possible approaches

Hope this helps

Licenciado bajo: CC-BY-SA con atribución

No afiliado a datascience.stackexchange