How to scale a variable when not knowing the maximum

https://datascience.stackexchange.com//questions/63743

06-12-2019

Question

I have a dataset with different features where some of them are not categorical, so they need to be scaled or normalized (especially the target).

However, normalizing to the range 0-1, for instance, means that the variable's maximum value will be equal to 1 and its minimum to 0.

Now if I receive a new, previously unseen example whose value is higher than the maximum of the training examples, how should this value be normalized?

EDIT

As an example: if my maximum value is 150, it will be scaled to 1.0. Now if I receive a new example with a value of 320, how should it be scaled?

Solution

If your model is in production, you should not refit your scaler. Transform the new example as if 150 were still the maximum value. This yields a scaled value greater than 1, which is a bit problematic (a possible remedy is given below), but you can still label such examples as outliers.
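A minimal sketch of that idea in plain Python. It assumes a training minimum of 0 and made-up training values (the question only gives the maximum of 150); the helper names `fit_min_max`, `scale`, and `is_out_of_range` are hypothetical and not part of any library:

```python
def fit_min_max(values):
    """Learn the scaling parameters from the training data only."""
    return min(values), max(values)


def scale(x, lo, hi):
    """Min-max scale x with the frozen training parameters (no clipping)."""
    return (x - lo) / (hi - lo)


def is_out_of_range(x, lo, hi):
    """Flag values that fall outside the range seen during training."""
    return x < lo or x > hi


# Assumed training values: only the maximum (150) comes from the question.
train = [0.0, 40.0, 90.0, 150.0]
lo, hi = fit_min_max(train)

new_value = 320.0
scaled = scale(new_value, lo, hi)              # 320 / 150, i.e. above 1
flagged = is_out_of_range(new_value, lo, hi)   # True: treat as an outlier
```

scikit-learn's `MinMaxScaler` follows the same pattern: `fit` on the training data once, then call `transform` on new data, which reuses the stored minimum and maximum and can therefore return values above 1.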

Possible solution for that case: if you have a high number of outliers or high-leverage points, you should consider tree ensembles and/or regularized models, which tend to be less sensitive to extreme feature values.

If your predictor is not in production, just add those examples to your training set and fit again, since the sample used for the first training evidently did not reflect reality.
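The refit case might look like the following sketch, again in plain Python with assumed training values (only 150 and 320 come from the question) and a hypothetical helper `min_max_params`. The model itself would then need to be retrained on the rescaled data:

```python
def min_max_params(values):
    """Return the (min, max) pair used for min-max scaling."""
    return min(values), max(values)


# Assumed training values; 150 is the original maximum from the question.
train = [0.0, 40.0, 90.0, 150.0]
new_examples = [320.0]

# Refit the scaling parameters on the enlarged training set.
combined = train + new_examples
lo, hi = min_max_params(combined)

# Every value is rescaled with the new parameters: 320 now maps to 1.0
# and the old maximum of 150 maps to 150 / 320.
rescaled = [(x - lo) / (hi - lo) for x in combined]
```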

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange