
I have a data set with 20000 samples, each of which has 12 different features. Each sample is either in category 0 or 1. I want to train a neural network and a decision forest to categorize the samples so that I can compare the results of both techniques.

The first thing I stumbled upon is the proper normalization of the data. One feature is in the range $[0,10^6]$, another one in $[30,40]$ and there is one feature that mostly takes the value 8 and sometimes 7. So as I read in different sources, proper normalization of the input data is crucial for neural networks. As I found out, there are many possible ways to normalize the data, for example:

  1. Min-Max Normalization: The input range is linearly transformed to the interval $[0,1]$ (or alternatively $[-1,1]$, does that matter?)
  2. Z-Score Normalization: The data is transformed to have zero mean and unit variance: $$y_{new}=\frac{y_{old}-\text{mean}}{\sqrt{\text{Var}}}$$
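For concreteness, the two options above can be sketched in plain Python as follows. The function names `min_max` and `z_score` are illustrative only, and the sketch assumes the population variance is meant in the formula and that the feature is not constant.

```python
import statistics

def min_max(xs, lo=0.0, hi=1.0):
    """Linearly map xs onto [lo, hi]; assumes max(xs) > min(xs)."""
    x_min, x_max = min(xs), max(xs)
    span = x_max - x_min
    if span == 0:
        raise ValueError("constant feature: min-max scaling is undefined")
    return [lo + (hi - lo) * (x - x_min) / span for x in xs]

def z_score(xs):
    """Shift to zero mean and scale to unit (population) variance."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("constant feature: z-score is undefined")
    return [(x - mean) / std for x in xs]

wide = [0, 500000, 1000000]   # a feature like the one in [0, 10^6]
narrow = [30, 35, 40]         # a feature like the one in [30, 40]

print(min_max(wide))             # mapped to [0, 1]
print(min_max(wide, -1.0, 1.0))  # mapped to [-1, 1]
print(z_score(narrow))
```

In practice the minimum, maximum, mean and standard deviation would be computed on the training data and then reused unchanged for the test data.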

Which normalization should I choose? Is normalization also needed for decision forests? With Z-Score normalization, the different features of my test data do not lie in the same range. Could this be a problem? Should every feature be normalized with the same algorithm, so that I decide either to use Min-Max for all features or Z-Score for all features?

Are there combinations where the data is mapped to $[-1,1]$ and also has zero mean (which would imply a non-linear transformation of the data and hence a change in the variance and other features of the input data)?

I feel a bit lost because I can't find references which answer these questions.



I disagree with the other comments.

First of all, I see no need to normalize data for decision trees. Decision trees work by calculating a score (usually entropy) for each candidate division of the data $(X\leq x_i,X>x_i)$. Applying a transformation that does not change the order of the data (any monotonically increasing rescaling) produces the same divisions and the same scores, so it makes no difference.

Random forests are just a bunch of decision trees, so it doesn't change this rationale.
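A small sketch of that argument, in plain Python: it searches for the lowest-entropy threshold split on a single feature and shows that the chosen partition is the same before and after an order-preserving transformation. The helper names (`entropy`, `best_split`) and the toy data are made up for illustration; this is not how any particular library implements its trees.

```python
import math

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(xs, ys):
    """Return (indices on the left side, weighted entropy) of the best split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold fits between equal values
        left, right = order[:k], order[k:]
        score = (len(left) * entropy([ys[i] for i in left])
                 + len(right) * entropy([ys[i] for i in right])) / len(xs)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left, best_score

# Toy feature on a [0, 10^6] scale with binary labels (invented data).
xs = [120000.0, 5000.0, 980000.0, 640000.0, 30000.0, 710000.0]
ys = [0, 0, 1, 1, 0, 1]

raw = best_split(xs, ys)
scaled = best_split([x / 1e6 for x in xs], ys)      # min-max style rescaling
logged = best_split([math.log(x) for x in xs], ys)  # non-linear but monotonic

print(raw == scaled == logged)
```

Only the numeric value of the threshold changes under the transformation; which samples fall on each side of it does not.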

Neural networks are a different story. First of all, in terms of prediction, it makes no difference. The neural network can easily counter your normalization since it just scales the weights and changes the bias. The big problem is in the training.

If you use an algorithm like resilient backpropagation to estimate the weights of the neural network, then it makes no difference. The reason is that it uses the sign of the gradient, not its magnitude, when changing the weights in the direction of whatever minimizes your error. This is the default algorithm for the neuralnet package in R, by the way.

When does it make a difference? When you are using traditional backpropagation with sigmoid activation functions, unnormalized inputs can saturate the sigmoid, so that its derivative becomes essentially zero.

Consider the sigmoid function and its derivative (drawn in green and blue, respectively, in the original figure):


What happens if you do not normalize your data is that your data is multiplied by the random weights and you get things like $s'(9999)=0$. The derivative of the sigmoid is (approximately) zero and the training process does not move along. The neural network that you end up with is just a neural network with random weights (there is no training).
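The saturation can be checked numerically. The sketch below uses the standard logistic sigmoid $s(z)=1/(1+e^{-z})$ and its derivative $s'(z)=s(z)(1-s(z))$; the function names are illustrative, and the branch on the sign of `z` is only there to avoid floating-point overflow.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_prime(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Pre-activations of increasing size: the derivative shrinks quickly.
for z in (0.0, 2.0, 10.0, 9999.0):
    print(z, sigmoid_prime(z))
```

The derivative peaks at 0.25 for `z = 0` and is already below `1e-4` at `z = 10`, so a pre-activation in the thousands leaves essentially no gradient for backpropagation to work with.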

Does this help us to know what the best normalization function is? But of course! First of all, it is crucial to use a normalization that centers your data, because most implementations initialize the bias at zero. I would normalize between -0.5 and 0.5, $\frac{X-\min{X}}{\max{X}-\min{X}}-0.5$. But the standard score is also good.
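As a minimal sketch, the centered min-max formula above could be written like this. The name `center_min_max` is made up for illustration, and mapping a constant feature to all zeros is an assumption of this sketch, not something the formula itself defines.

```python
def center_min_max(xs):
    """Map xs to [-0.5, 0.5] via (x - min) / (max - min) - 0.5."""
    x_min, x_max = min(xs), max(xs)
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant feature carries no information, map it to 0.
        return [0.0 for _ in xs]
    return [(x - x_min) / span - 0.5 for x in xs]

feature = [30, 35, 40]  # e.g. the feature in the range [30, 40]
print(center_min_max(feature))
```

Note that this centers the range, not the mean: the result has zero mean only if the data happen to be symmetric around the midpoint of the range.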

The actual normalization is not very crucial because it only influences the initial iterations of the optimization process. As long as the data is centered and most of it is below 1 in absolute value, the choice might mean you have to use slightly fewer or more iterations to get the same result. But the result will be the same, as long as you avoid the saturation problem I mentioned.

One thing not discussed here is regularization. If you use regularization in your objective function, the way you normalize your data will affect the resulting model. I'm assuming you are already familiar with this. If you know that one variable is more prone to cause overfitting, your normalization of the data should take this into account. This is of course completely independent of whether neural networks are being used.


  1. There is no clear-cut answer. What I'd recommend would be to scale your data using different approaches and then use the same model to predict outcomes on your holdout set (RFs would work fine here). That should at least show you which scaling approach is best in your prediction problem.
  2. You don't need to scale your data for Random Forests
  3. The individual ranges shouldn't be a problem as long as they are consistently scaled to begin with. This is just illustrating that there are differences between the variables, just on a more compact scale than before.
  4. Yes - all your data should be scaled with the same approach. Otherwise values in your transformed dataset might reflect not the data itself, but the algorithm used for scaling.

Hopefully this helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange