Question

I am working on the House Prices: Advanced Regression Techniques dataset. While going through some kernels, I noticed that many people convert SalePrice to log(SalePrice); see below:

[Histograms of SalePrice before and after the log transform]

I can see that taking a log transform reduces the skewness of the data and makes it more normal-like. But I was wondering: will it improve my model's performance, or is it useful in any other way? If so, how does a normal distribution of the target variable help?


Solution

Good question. Your interpretation is correct: applying a logarithmic function reduces the skewness of the target variable. Why does that matter?

Transforming the target with a logarithmic function makes the relationship between the features and the target closer to linear, which helps the many models that assume such a relationship. Scikit-Learn has a page describing this phenomenon: https://scikit-learn.org/stable/auto_examples/compose/plot_transformed_target.html
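
As a minimal sketch of that idea (with synthetic data standing in for the House Prices features, and Ridge chosen arbitrarily as the regressor), scikit-learn's TransformedTargetRegressor applies the log transform to the target during fitting and inverts it automatically at prediction time:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data with a right-skewed, price-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.exp(X @ np.array([0.4, 0.3, 0.2, 0.1, 0.05])
           + rng.normal(scale=0.1, size=1000)) * 100_000

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on log1p(y); predictions are mapped back with expm1,
# so metrics are computed on the original price scale.
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)
print("MAE on original scale:", mean_absolute_error(y_test, model.predict(X_test)))
```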

Important to note

If you modify your targets before training, you should apply the inverse transform to your model's outputs to compute the "final" predictions. That way, your performance metrics are computed on the original scale and remain comparable.
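
For a model fitted directly on the log-transformed target (rather than through a wrapper like the one above), the inverse transform is applied by hand before scoring. A small self-contained sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Tiny synthetic stand-in: positive, right-skewed target.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = np.exp(1.0 + X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=500))

# Train on the log of the target...
reg = LinearRegression().fit(X, np.log(y))

# ...then invert the transform before scoring, so the MAE is expressed
# in the same units as the original target.
pred = np.exp(reg.predict(X))
print(mean_absolute_error(y, pred))
```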


Intuitively, imagine that you have a very naive model which returns the average target regardless of the input. If your targets are skewed, you will under- or over-shoot for the majority of predictions. Because of this, the range of your errors will be greater, which worsens scores such as the Mean Absolute Error or Mean Squared Error (MAE/MSE). By normalizing your targets, you reduce the range of your errors, which should ultimately improve your model directly.
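
A quick numeric check of that intuition, using a synthetic log-normal target as a stand-in for SalePrice (the figures in the comments are approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=12, sigma=0.5, size=10_000)  # right-skewed, price-like

# A naive "always predict the mean" model overshoots most samples
# when the target is right-skewed (the mean sits above the median)...
print((y < y.mean()).mean())          # roughly 0.6

# ...but is roughly centred once the target is log-transformed.
log_y = np.log(y)
print((log_y < log_y.mean()).mean())  # roughly 0.5
```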

OTHER TIPS

Well ... there are many aspects from which one can answer this question (like Valentin's answer ... +1!), as Machine Learning and Data Mining are very much about distributions in general. I will just mention a few that come to mind first.

  • Some models assume a Gaussian distribution, e.g. K-means. Imagine you want to apply K-means to this data without the log transform. K-means will have a lot of difficulty with the original feature, but the log transform makes it much better suited, because the "mean" is a better representative of the samples in a Gaussian distribution than in a skewed one (see the sketch after this list).
  • Some statistical analysis techniques also assume Gaussianity. ANOVA works best with (and was actually designed for) normally distributed data, especially with small sample sizes. The reason is simply that it mainly relies on the mean and sample variance to describe the "center" and "variation" of the sample populations, and those summaries make the most sense for a Gaussian distribution.
  • All in all, computation with adjusted features is more robust than with skewed ones. Skewed features have an uneven range, which is an issue especially when that range is huge (as in your example). Adjusted (engineered) features will not necessarily become Gaussian, but they will, for example, have a smaller and more even range.
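
A small illustration of the K-means point above, on a synthetic skewed feature (not the actual dataset); the exact numbers will vary, but clustering the log-transformed values typically recovers the two underlying groups more cleanly:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two price-like groups, both heavily right-skewed on the original scale.
x = np.concatenate([rng.lognormal(11.0, 0.5, 500), rng.lognormal(12.0, 0.5, 500)])
X = x.reshape(-1, 1)
true = np.repeat([0, 1], 500)

# K-means on the raw skewed feature vs. on the log-transformed feature.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
log_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.log(X))

def agreement(labels):
    # Cluster labels are arbitrary, so check both possible assignments.
    return max((labels == true).mean(), (labels != true).mean())

print("raw feature:", agreement(raw_labels))
print("log feature:", agreement(log_labels))
```
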
Licensed under: CC-BY-SA with attribution