Question

I am considering preprocessing techniques for the input data to a convolutional neural network (CNN) that uses sparse datasets and is trained with SGD. In Andrew Ng's Coursera course, Machine Learning, he states that it is important to preprocess the data so it fits into the interval $ \left[ -3, 3 \right] $ when using SGD. However, the most common preprocessing technique is to standardize each feature so that $ \mu = 0 $ and $ \sigma = 1 $. When standardizing a highly sparse dataset, many of the values will not end up in that interval.

I am therefore curious: would it be better to aim for, e.g., $ \mu = 0 $ and $ \sigma = 0.5 $ so that the values lie closer to the interval $ \left[ -3, 3 \right] $? Based on knowledge of SGD, can anyone argue whether it is more important to aim for $ \mu = 0 $ and $ \sigma = 1 $ or for $ \left[ -3, 3 \right] $?


Solution

No, you are misinterpreting his comments. If your data contains outliers, those outliers will lie more than 3 standard deviations from the mean, so after you standardize the data, some values will extend beyond the [-3, 3] region.

He is simply saying that you need to remove your outliers so they don't wreak havoc on your stochastic gradient descent algorithm. He is NOT saying that you need to use some weird scaling algorithm.

You should standardize your data by subtracting the mean and dividing by the standard deviation, and then remove any points that extend beyond [-3,3], which are the outliers.
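As a minimal sketch of that recipe (assuming a dense NumPy array `X` of shape `(n_samples, n_features)`; the function name and threshold parameter are illustrative, not from the original answer):

```python
import numpy as np

def standardize_and_remove_outliers(X, threshold=3.0):
    """Standardize each feature to mean 0 / std 1, then drop any row
    containing a value beyond +/- threshold standard deviations."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0                      # guard against constant features
    Z = (X - mu) / sigma
    mask = (np.abs(Z) <= threshold).all(axis=1)  # keep rows fully inside [-3, 3]
    return Z[mask]

# Example: 1000 samples, 5 features, with a few injected outliers
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[::100] *= 10                                   # make every 100th row an outlier
Z = standardize_and_remove_outliers(X)
print(Z.shape)                                   # fewer than 1000 rows remain
```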

In stochastic gradient descent, the presence of outliers can increase the instability of the minimization and make it thrash around excessively, so it's best to remove them.

If the sparseness of the data prevents removal then... do you need to use stochastic gradient descent, or can you just use gradient descent? Gradient descent (GD) might help to alleviate some of the problems relating to convergence. Finally, if GD is having trouble converging, you could always do a direct solve (e.g. direct matrix inversion) rather than an iterative solve.
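To illustrate the GD-versus-direct-solve trade-off on the simplest case, here is a sketch for a linear least-squares problem (the data and learning rate are made up for illustration; for a CNN itself there is of course no closed-form solve):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

# Direct solve of the least-squares problem (computed stably via lstsq
# rather than explicitly inverting X^T X)
w_direct, *_ = np.linalg.lstsq(X, y, rcond=None)

# Full-batch gradient descent on the same squared-error objective
w = np.zeros(5)
lr = 0.01
for _ in range(5000):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    w -= lr * grad

print(np.allclose(w, w_direct, atol=1e-3))  # both reach the same optimum
```

Because GD uses the full gradient every step, it does not suffer the sampling noise that outliers amplify in SGD, and the direct solve sidesteps iteration entirely.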

Hope this helps!

Licensed under: CC-BY-SA with attribution