Problem

I have 2000 signals in a dataset of shape (2000, 400000), where each signal is recorded in the range (-127, 128). I want to rescale each signal from (-127, 128) to (-1, 1) to save memory and for better visualization. There are two approaches:

Approach 1: Iteratively apply minmax_scale to each signal individually, something like the following:

from sklearn.preprocessing import minmax_scale
data = read_dataset(...)
# Scale each signal (row) to (-1, 1) using that row's own min and max.
for i in range(len(data)):
    data[i] = minmax_scale(data[i], feature_range=(-1, 1))

Approach 2: Fit the whole dataset with MinMaxScaler, something like the following:

from sklearn.preprocessing import MinMaxScaler
data = read_dataset(...)
# MinMaxScaler computes the min and max per column (feature),
# i.e. per time index across all 2000 signals.
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled_data = scaler.fit_transform(data)

I use the first approach because the dataset does not fit in memory, but I am worried that it might be incorrect. I want to make sure my choice is sound in theory.

Thank you very much.


Solution

With the first approach, you are completely disregarding the global scale of the signal and only focusing on the relative scale. This will most probably hurt the performance of whatever system you train on that data (or any analysis you perform on it), compared to training on the globally scaled data, unless the relative values are the only important pieces of information in your signals.
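For instance, here is a minimal sketch (with made-up numbers) of how per-signal scaling maps a strong and a weak signal onto nearly the same span, erasing the amplitude difference between them:

import numpy as np
from sklearn.preprocessing import minmax_scale

# Two hypothetical signals: one spans almost the full recording range,
# the other is a weak signal close to zero.
loud = np.array([-120.0, 0.0, 125.0])
quiet = np.array([-3.0, 0.0, 4.0])

# Scaled per signal, both end up covering the full (-1, 1) range, so the
# fact that `quiet` is ~30x weaker than `loud` is lost.
print(minmax_scale(loud, feature_range=(-1, 1)))   # approx. [-1., -0.02, 1.]
print(minmax_scale(quiet, feature_range=(-1, 1)))  # approx. [-1., -0.14, 1.]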

I noticed that you already know the range in which the data is defined: (-127, 128). If you want to scale your data, why not use a fixed mapping like $((x + 127) \cdot 2 / 255) - 1$?
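For example, a sketch of that fixed mapping (the helper name and the float32 dtype are just illustrative choices):

import numpy as np

def scale_fixed(x):
    # Map values from the known range (-127, 128) linearly onto (-1, 1).
    return (np.asarray(x, dtype=np.float32) + 127.0) * 2.0 / 255.0 - 1.0

# Sanity check on the endpoints of the known range.
print(scale_fixed([-127, 0, 128]))  # approx. [-1., -0.0039, 1.]

Because this mapping uses fixed constants instead of statistics estimated from the data, it can be applied one signal (or one chunk of signals) at a time without changing the result, which also avoids loading the whole dataset into memory.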

Anyway, I don't see how a mere linear change in scale would save memory or help visualization, which seem to be your final goals.
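To make the memory point concrete (assuming the raw samples are stored as small integers, e.g. int16, which their range suggests), rescaling to (-1, 1) forces a floating-point dtype and therefore takes more space per value, not less:

import numpy as np

n_values = 2000 * 400_000  # signals x samples per signal, from the question

# Integer samples in roughly (-127, 128) fit in 2 bytes each (int16), while
# values in (-1, 1) need a float dtype; sklearn's scalers return float64
# (8 bytes) for integer input.
raw_gb = n_values * np.dtype(np.int16).itemsize / 1e9
scaled_gb = n_values * np.dtype(np.float64).itemsize / 1e9
print(raw_gb, scaled_gb)  # 1.6 vs 6.4 (GB)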

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange