How to use standardization / StandardScaler() for train and test?

https://datascience.stackexchange.com//questions/63717

06-12-2019

Question

At the moment I perform the following:

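from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

estimators = []
estimators.append(('standardize', StandardScaler()))
prepare_data = Pipeline(estimators)

n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

for train_index, val_index in tscv.split(df_train):
    # Each subset is fitted and transformed separately
    X_train = prepare_data.fit_transform(df_train[train_index])
    X_val = prepare_data.fit_transform(df_train[val_index])

X_test = prepare_data.fit_transform(df_test)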

Now I would like to know whether this is correct. My concern is that X_train and X_test are transformed separately. At first I thought this was how it should be done, but I am starting to think that I have to apply the mean and standard deviation of the training set to the test set instead. Is that right?

Solution

The recommended way (see 'The Elements of Statistical Learning', section 'The Wrong and Right Way to Do Cross-validation') is to calculate the mean and the standard deviation on the training set only, and then use those values to standardize both the training and the testing sets.

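The idea behind this is to prevent data leakage between the testing and training sets: the scaling parameters must not depend on the testing data, and the testing data must go through exactly the same transformation as the data used to train the model. The same applies inside cross-validation, where each validation fold plays the role of the testing set.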
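Applied to the loop from the question, this could look roughly like the sketch below. The arrays data_train and data_test are synthetic stand-ins for df_train and df_test (assumed here to be purely numeric), and a bare StandardScaler is used instead of the one-step Pipeline for brevity.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for df_train / df_test (assumption: numeric features only)
rng = np.random.default_rng(0)
data_train = rng.normal(loc=10.0, scale=3.0, size=(60, 2))
data_test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

tscv = TimeSeriesSplit(n_splits=5)
for train_index, val_index in tscv.split(data_train):
    scaler = StandardScaler()  # fresh scaler for every fold
    # Learn mean and std from the training part of the fold only
    X_train = scaler.fit_transform(data_train[train_index])
    # Reuse those statistics for the validation part (no refitting)
    X_val = scaler.transform(data_train[val_index])
    # ... fit and score a model on X_train / X_val here ...

# Final evaluation: fit on the whole training set, then only transform the test set
final_scaler = StandardScaler().fit(data_train)
X_train_full = final_scaler.transform(data_train)
X_test = final_scaler.transform(data_test)
```

Note that X_test will generally not have exactly zero mean and unit variance after this; that is expected, because it is scaled with the training statistics rather than its own.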

OTHER TIPS

I guess you are using scikit-learn...

What you have to do is fit the pipeline on X_train and, for X_test, only transform.

The fit method computes the mean and standard deviation of the given data (X_train); the transform method then applies the standardization with these stored values to whatever dataset you pass to it.
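To make the fit/transform split concrete, here is a small sketch on toy numbers (the arrays are made up for illustration). After fit, StandardScaler keeps the learned statistics in its mean_ and scale_ attributes, and transform simply reuses them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, purely illustrative
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)       # stores the training statistics
print(scaler.mean_)       # per-feature mean of X_train
print(scaler.scale_)      # per-feature standard deviation of X_train

# transform computes (x - mean_) / scale_ with the stored values
manual = (X_test - scaler.mean_) / scaler.scale_
X_test_scaled = scaler.transform(X_test)
print(np.allclose(X_test_scaled, manual))
```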

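One caveat: when a scikit-learn Pipeline ends with a predictor rather than a transformer, the pipeline itself does not expose a transform method. The transformations are then embedded in the predict method, which applies all of them and returns the predictions of the last estimator of the Pipeline. Your pipeline contains only a StandardScaler, so calling prepare_data.fit(...) on the training data and prepare_data.transform(...) on the other sets works directly.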

This post explains how to apply only the transformations of a pipeline: https://stackoverflow.com/questions/33469633/how-to-transform-items-using-sklearn-pipeline
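For what it's worth, a minimal sketch of both situations, assuming a reasonably recent scikit-learn in which Pipeline objects can be sliced. The data is synthetic, and LinearRegression is only an illustrative choice for the final estimator; it is not part of the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5])
X_test = rng.normal(size=(10, 3))

# Case 1: transformer-only pipeline, as in the question
prepare_data = Pipeline([("standardize", StandardScaler())])
prepare_data.fit(X_train)                       # statistics from X_train only
X_test_scaled = prepare_data.transform(X_test)  # no refitting on X_test

# Case 2: pipeline ending in a model
model = Pipeline([
    ("standardize", StandardScaler()),
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # scales with training statistics, then predicts
# Slice off the final estimator to run the preprocessing steps only
X_test_only_scaled = model[:-1].transform(X_test)
```

In both cases the test data is never passed to fit or fit_transform, which is the whole point.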

Licensed under: CC-BY-SA with attribution
