Question

Update: The examples in this post were updated

I am reposting this question here after not getting a clear answer in a previous SO post.

I am looking for help building a data preprocessing pipeline using sklearn's ColumnTransformer, where some features are preprocessed sequentially. I am well aware of how to build separate pipelines for different subsets of features. For example, my pipeline may look something like this:

from sklearn.compose import ColumnTransformer 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

ColumnTransformer(remainder='passthrough',
                  transformers=[
                                ('num_impute', SimpleImputer(strategy='median'), ['feat_1', 'feat_2']),
                                ('Std', StandardScaler(), ['feat_3', 'feat_4']),
                                ('Norm', Normalizer(), ['feat_5', 'feat_6']),
                                ])

Notice that each transformer is provided a unique set of features.

The issue I am encountering is how to apply sequential processing for the same features (different combinations of transformations and features). For example,

ColumnTransformer(remainder='passthrough',
                  transformers=[
                                ('num_impute', SimpleImputer(strategy='median'), ['feat_1', 'feat_2', 'feat_5']),
                                ('Std', StandardScaler(), ['feat_1', 'feat_2', 'feat_3', 'feat_4', 'feat_6']),
                                ('Norm', Normalizer(), ['feat_1', 'feat_6']),
                                ])

Notice that feat_1 was provided to all three transformers, feat_2 was provided to two transformers (impute and Std), and feat_6 was provided to two transformers (Std and Norm).

A pipeline like this will produce two duplicate columns each for feat_2 and feat_6, and three duplicate columns for feat_1, because ColumnTransformer applies each transformer to its own copy of the listed columns. Building a separate pipeline for each transformation/feature combination is not scalable.
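To see the duplication concretely, here is a minimal sketch with made-up two-feature data, where feat_1 is listed under two transformers; the output grows to three columns because feat_1 is emitted once per transformer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# toy data: two features, no missing values needed to show the effect
X = pd.DataFrame({'feat_1': [1.0, 2.0, 3.0],
                  'feat_2': [4.0, 5.0, 6.0]})

ct = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('num_impute', SimpleImputer(strategy='median'), ['feat_1']),
        ('Std', StandardScaler(), ['feat_1', 'feat_2']),
    ])

out = ct.fit_transform(X)
# out has 3 columns: imputed feat_1, scaled feat_1, scaled feat_2 --
# feat_1 appears twice, once per transformer it was listed in
```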


Solution

When you want to do sequential transformations, you should use a Pipeline.

from sklearn.pipeline import Pipeline

imp_std = Pipeline(
    steps=[
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]
)

ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('imp_std', imp_std, ['feat_1', 'feat_2']),
        ('std', StandardScaler(), ['feat_3']),
    ]
)
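As a quick sanity check, this first variant can be exercised end to end on a small made-up frame (feat_4 here exists only to be passed through by the remainder):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({'feat_1': [1.0, np.nan, 3.0],
                  'feat_2': [4.0, 5.0, 6.0],
                  'feat_3': [7.0, 8.0, 9.0],
                  'feat_4': [0.0, 1.0, 2.0]})

# impute then scale, applied to feat_1 and feat_2 as one sequential unit
imp_std = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

ct = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('imp_std', imp_std, ['feat_1', 'feat_2']),
        ('std', StandardScaler(), ['feat_3']),
    ])

out = ct.fit_transform(X)
# 4 columns, one per input feature, and the NaN in feat_1 is gone
# because imputation ran before scaling
```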

or

imp = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('imp', SimpleImputer(strategy='median'), ['feat_1', 'feat_2']),
    ]
)

Pipeline(
    steps=[
        ('imp', imp),
        ('std', StandardScaler()),
    ]
)
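One subtlety of this second variant, shown below with toy data: the StandardScaler in the outer Pipeline receives every column coming out of the ColumnTransformer, so the passthrough columns get scaled as well:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({'feat_1': [1.0, np.nan, 3.0],
                  'feat_2': [4.0, 5.0, 6.0],
                  'feat_3': [7.0, 8.0, 9.0]})

imp = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('imp', SimpleImputer(strategy='median'), ['feat_1', 'feat_2']),
    ])

pipe = Pipeline(steps=[('imp', imp), ('std', StandardScaler())])

out = pipe.fit_transform(X)
# every output column now has mean ~0, including feat_3,
# which only passed through the ColumnTransformer unimputed
```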

OTHER TIPS

One way to do this is by creating a separate preprocessing step for each data type; the most common case is having categorical and continuous variables:

from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

cont_prepro = Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])

cat_prepro = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("encoder", OneHotEncoder(handle_unknown="ignore"))])

preprocessing = make_column_transformer((cont_prepro, selector(dtype_exclude="object")), (cat_prepro, selector(dtype_include="object")))

pipe = Pipeline([("preprocessing",preprocessing),("model",LogisticRegression())])

If you want to assign specific columns to each step by listing them, instead of selecting by dtype, pass a list of column names as you already did in your example and remove the selector part.

In your case:

pipe_one = Pipeline([("num_impute",SimpleImputer(strategy='median')),('Std', StandardScaler())])

preprocessing = make_column_transformer((pipe_one,["feat_1","feat_2"]),remainder='passthrough')

pipe = Pipeline([("preprocessing",preprocessing),("model",LogisticRegression())])
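Putting it together, the pipeline above can be fit end to end; the sketch below rebuilds it so the snippet is self-contained and runs it on small made-up data (feat_3 and the labels y are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# rebuild the pipeline from above
pipe_one = Pipeline([("num_impute", SimpleImputer(strategy='median')),
                     ("Std", StandardScaler())])
preprocessing = make_column_transformer((pipe_one, ["feat_1", "feat_2"]),
                                        remainder='passthrough')
pipe = Pipeline([("preprocessing", preprocessing),
                 ("model", LogisticRegression())])

# made-up training data
X = pd.DataFrame({'feat_1': [1.0, np.nan, 3.0, 4.0],
                  'feat_2': [0.0, 1.0, 2.0, 3.0],
                  'feat_3': [5.0, 6.0, 7.0, 8.0]})
y = [0, 1, 0, 1]

pipe.fit(X, y)
preds = pipe.predict(X)  # one prediction per row
```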
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange