Question

Data whitening (feature scaling and mean normalization) is very useful when we use features that represent different characteristics and are on very different scales (e.g., number of rooms in a house and house price).

What about the case where the features represent "similar variables" but are on very different scales? Suppose, for instance, that we have a matrix representing the counts of different species at different moments in an environment, and we want to cluster these species into groups (say, to show that mosquito and bird populations are highly correlated). In this example, the number of mosquitoes is much larger than that of birds (say, ten or a hundred times). Is it a good idea to whiten this data?
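A minimal sketch of the scenario, with hypothetical counts and NumPy: z-scoring each species column puts mosquitoes and birds on the same footing. Note that Pearson correlation is itself scale-invariant, but distance-based grouping (e.g., k-means) is dominated by the large-count species unless you scale first.

```python
import numpy as np

# Rows: time points; columns: species counts (hypothetical data).
# Mosquito counts are ~100x larger than bird counts but follow
# the same seasonal pattern.
season = np.array([1.0, 2.0, 4.0, 3.0, 1.5])
counts = np.column_stack([
    1000 * season + np.array([5.0, -3.0, 8.0, 2.0, -4.0]),      # mosquitoes
    10 * season + np.array([0.2, -0.1, 0.3, 0.1, -0.2]),        # birds
])

# Standardize each column: zero mean, unit variance.
z = (counts - counts.mean(axis=0)) / counts.std(axis=0)

# After scaling, Euclidean distances between columns are no longer
# dominated by the raw mosquito counts, so clustering treats the
# two species symmetrically.
corr = np.corrcoef(z, rowvar=False)[0, 1]
print(round(corr, 3))
```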


Solution

I think data scaling should be applied whenever the numeric range varies from feature to feature, so it should be applied to the data you described.

In my experience with SVM (liblinear), data scaling can improve the accuracy of the trained model by about 10%.

We usually apply regularization to an SVM model, which keeps the weights from growing too large. If the data is not scaled and feature1 is 100 times larger than feature2, then the weight on feature1 should be 100 times smaller than the weight on feature2 to balance their effects (so that w·x is balanced). In this situation, the weight on feature2 will try to grow (if feature2 is informative), but it is constrained by the regularizer, so feature2 cannot show its effect.
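This effect can be demonstrated with a small NumPy sketch (synthetic data, closed-form ridge regression standing in for an L2-regularized linear model): with unscaled features, the regularizer routes almost all of the decision value through the large-scale feature, while after standardization the two equally informative features contribute equally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)

# Two equally informative features on very different scales
# (synthetic): feature1 is ~100x larger than feature2.
X = np.column_stack([
    100 * signal + rng.normal(scale=0.1, size=n),
    signal + rng.normal(scale=0.001, size=n),
])
y = signal  # target driven by the shared signal

def ridge_weights(X, y, lam=1.0):
    # Closed-form L2-regularized least squares:
    # w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Unscaled: the regularizer prefers a small weight on the large
# feature, so feature2's contribution to w.x is negligible.
w_raw = ridge_weights(X, y)
contrib_raw = np.abs(w_raw) * X.std(axis=0)

# Standardized: both features are on the same scale, so they
# receive comparable weights and comparable contributions.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
w_scaled = ridge_weights(Xs, y)
contrib_scaled = np.abs(w_scaled) * Xs.std(axis=0)
```

The per-feature "contribution" here is |w_j| times the feature's standard deviation, i.e., how much each feature typically moves the decision value w·x.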

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow