Question

I am currently using an SVM and scaling my training features to the range [0, 1]. I first fit/transform my training set and then apply the same transformation to my testing set. For example:

    from sklearn.preprocessing import MinMaxScaler

    ### Configure transformation and apply to training set
    min_max_scaler = MinMaxScaler(feature_range=(0, 1))
    X_train = min_max_scaler.fit_transform(X_train)

    ### Perform transformation on testing set
    X_test = min_max_scaler.transform(X_test)

Let's assume that a given feature in the training set has a range of [0, 100], and that the same feature in the testing set has a range of [-10, 120]. In the training set that feature will be scaled appropriately to [0, 1], while in the testing set it will be scaled to a range outside the one specified, something like [-0.1, 1.2].
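For concreteness, here is a minimal runnable sketch (with made-up data) that reproduces those numbers: a scaler fitted on a feature spanning [0, 100] maps test values of -10 and 120 to -0.1 and 1.2.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Hypothetical single feature: training range [0, 100], test range [-10, 120]
    X_train = np.array([[0.0], [50.0], [100.0]])
    X_test = np.array([[-10.0], [120.0]])

    scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
    print(scaler.transform(X_train).ravel())  # [0.  0.5 1. ]
    print(scaler.transform(X_test).ravel())   # [-0.1  1.2] -- outside [0, 1]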

I was wondering what the consequences are of the testing set features being outside the range of those used to train the model. Is this a problem?

Solution

Within each class, you'll have distributions of values for the features. That in itself is not a reason for concern.

From a slightly theoretical point of view, you can ask yourself why you should scale your features and why you should scale them in exactly the chosen way.
One reason may be that your particular training algorithm is known to converge faster (or better) with values around 0-1 than with features that cover other orders of magnitude. In that case, you're probably fine. My guess is that your SVM is fine: you want to avoid very large numbers because of the inner product, but a max of 1.2 vs. a max of 1.0 won't make much of a difference.
(OTOH, if you knew, e.g., that your algorithm does not accept negative values, you'd obviously be in trouble.)
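If you did hit that case, one pragmatic workaround (my suggestion, not something from the question itself) is to clip the transformed test values back into the target range; newer scikit-learn versions (0.24+) can also do this via MinMaxScaler(clip=True). A minimal sketch with made-up data:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X_train = np.array([[0.0], [50.0], [100.0]])  # placeholder data
    X_test = np.array([[-10.0], [120.0]])

    scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)

    # Clip transformed test values back into [0, 1]; with scikit-learn >= 0.24
    # the same effect is available via MinMaxScaler(clip=True)
    X_test_clipped = np.clip(scaler.transform(X_test), 0.0, 1.0)
    print(X_test_clipped.ravel())  # [0. 1.]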

The practical question is whether your model performs well for cases that are slightly outside the range covered by training. I believe this can best, and possibly only, be answered by testing with such cases and inspecting the test results for a performance drop on cases outside the training domain. It is a valid concern, and looking into it would be part of validating your model.
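As a rough illustration of such a check, here is a sketch with synthetic data and an SVC standing in for your model (all data and names below are made up): score the in-domain and out-of-domain test samples separately and compare.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)

    # Synthetic data: training covers [0, 100], the test set spills past it
    X_train = rng.uniform(0, 100, size=(200, 2))
    y_train = (X_train[:, 0] > 50).astype(int)
    X_test = rng.uniform(-10, 120, size=(100, 2))
    y_test = (X_test[:, 0] > 50).astype(int)

    scaler = MinMaxScaler().fit(X_train)
    model = SVC().fit(scaler.transform(X_train), y_train)

    # Split the test set into samples inside vs. outside the training ranges
    in_domain = np.all((X_test >= scaler.data_min_) & (X_test <= scaler.data_max_), axis=1)

    X_test_scaled = scaler.transform(X_test)
    print("in-domain accuracy: ", model.score(X_test_scaled[in_domain], y_test[in_domain]))
    print("out-of-domain accuracy:", model.score(X_test_scaled[~in_domain], y_test[~in_domain]))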

Observing differences of the size you describe is IMHO a reason to have a pretty close look at model stability.

OTHER TIPS

This was meant as a comment but it is too long.

The fact that your test set has a different range might be a sign that the training set is not a good representation of the test set. However, if the difference is really small, as in your example, it is likely that it won't affect your predictions. Unfortunately, I don't think I have a good reason to believe it won't affect an SVM under any circumstances.
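One quick way to gauge whether the difference really is small is to measure, per feature, how much of the test set falls outside the training range. A minimal sketch with placeholder data:

    import numpy as np

    rng = np.random.RandomState(0)
    X_train = rng.uniform(0, 100, size=(200, 3))   # placeholder training features
    X_test = rng.uniform(-10, 120, size=(100, 3))  # placeholder test features

    # Per-feature fraction of test values outside the observed training range
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    out_of_range = (X_test < lo) | (X_test > hi)
    print("fraction out of range per feature:", out_of_range.mean(axis=0))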

Notice that the rationale for using MinMaxScaler is (according to the documentation):

The motivation to use this scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.

Therefore, it is important for you to make sure that your data fits that case.

If you are really concerned about having a different range, you should use regular standardization (such as preprocessing.scale or StandardScaler) instead.
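For reference, a minimal sketch of that alternative (the data is made up; StandardScaler follows the same fit-on-train / transform-both pattern as MinMaxScaler):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train = np.array([[0.0], [50.0], [100.0]])  # placeholder data
    X_test = np.array([[-10.0], [120.0]])

    # Standardize to zero mean / unit variance; fit on train, reuse on test
    std_scaler = StandardScaler().fit(X_train)
    X_train_std = std_scaler.transform(X_train)
    X_test_std = std_scaler.transform(X_test)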

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange