Question

We know that the best practice in data preprocessing (such as standardization, normalization, etc.) is to call fit_transform() on the training data and then apply transform() to the testing data, so that the parameters learned from scaling the training data are applied to the testing data. Similar to this:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

The question is: Does it also make sense to perform fit_transform() on the training data but NOT apply transform() to the testing data at all, so that we get to test the model's performance on actual real-world data that has not been transformed?

Thank you


Solution

No, it does not make sense to do this.

Your model has learned how to map one input space to another; that is to say, it is itself a function approximation, and it will likely not know what to do with the unseen, untransformed data.

By not performing the same scaling on the test data, you are introducing systematic errors into the model's inputs. This was pointed out in the comments by nanoman - see that comment for a simple transformation example.
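A rough sketch of this effect (the synthetic dataset and the k-NN classifier here are illustrative choices, not part of the original question): a model trained on standardized features sees raw, unscaled test data as points far outside the distribution it learned, and accuracy collapses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data, shifted/scaled so raw values are nowhere near zero
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X = X * 100 + 1000

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)          # fit on training data only

clf = KNeighborsClassifier().fit(X_train_s, y_train)

# Correct: apply the same transform to the test data
acc_scaled = clf.score(sc.transform(X_test), y_test)
# Incorrect: feed the model raw, untransformed test data
acc_raw = clf.score(X_test, y_test)

print(f"scaled test accuracy: {acc_scaled:.2f}")
print(f"raw test accuracy:    {acc_raw:.2f}")
```

The raw test points sit around 1000 in every feature while the training points (after standardization) sit around 0, so the nearest neighbours of every raw test point are essentially arbitrary.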

To exaggerate the point, imagine you have a language model, translating text from English to French. You apply a single transformation on your English data: you first translate it to Spanish. Now the model is trained to translate Spanish to French, with some accuracy. Now you move on to the test data - still in English - and you do not apply the same transformation as you did to your training data. You are asking the model to translate directly from English to French, instead of Spanish to French, and it is obvious that the results won't be good.

In principle, the idea is the same with any other model and transformation, just that the impact might not always be so visible, i.e. you might get really lucky and not notice a large impact.

The language model might have learnt some elementary linguistics common to all three languages (e.g. overlapping vocabulary or sentence structure), but we cannot expect the model to perform well translating English to French.

Practical Note

You should compute the transformation statistics (e.g. mean and variance for normalisation) on the training data only, use those values to transform the training data itself, and then use the same values to transform the test data.
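Spelled out with NumPy (the random data here is just a placeholder), the mean and standard deviation come from the training set and are reused unchanged on the test set - which is exactly what StandardScaler does internally:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10, scale=3, size=(100, 2))
X_test = rng.normal(loc=10, scale=3, size=(20, 2))

# Statistics computed on the training data ONLY
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma   # same mu/sigma, never recomputed on test

# StandardScaler fit on the training data gives identical results
sc = StandardScaler().fit(X_train)
print(np.allclose(X_train_s, sc.transform(X_train)))
print(np.allclose(X_test_s, sc.transform(X_test)))
```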

Including the test dataset in the transform computation allows information to flow from the test data into the training data, and therefore into the model that learns from it, thus allowing the model to cheat (introducing a bias).

Also, it is important not to confuse transformations with augmentations. Some "transformations" might be used to synthetically create more training data but do not have to be applied at test time - for example, deleting regions of an image in computer vision. Test-time augmentation is something you could read about.

Extra discussion

More complicated models (ones with many, many more parameters) might be able to perform some kind of interpolation, especially if your dataset is N-dimensional with a large N (e.g. > 10).

This has recently been seen with extremely large models, such as OpenAI's GPT-3, which has 175 billion parameters and is therefore able to perform quite well even on completely different tasks, let alone on inputs outside the range of its training set.

OTHER TIPS

There is only one answer to this question: no, it is not acceptable. Whatever transformation you apply to the training data (PCA, scaling, encoding, etc.), you must also apply to the test data.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange