Problem

I'm beginning my data science journey and I've run into a challenge that confuses me a bit. I have a dataset with a few features and a target variable whose raw distribution is highly skewed.

[Plot: distribution of raw y]

I've read that it's possible to use a log transformation to normalize the target variable (loss in $) and thus increase the accuracy.

[Plot: distribution of log-transformed y]

When I train my model on y_raw, I get an MAE of 306k. When I log-transform the target with y = y.transform(np.log), I get an MAE of around 2 (in log-transformed units, I suppose?), which is e^2 ≈ 7.39 in the original units. That is a huge drop, from 306k to only 7.39 ($) (or am I getting it wrong?), so I am a bit suspicious about it.
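For concreteness, this is roughly what my setup looks like (the file name, column name, and regressor here are just placeholders, not my exact code):

```python
# Minimal sketch of the workflow described above.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("losses.csv")          # hypothetical file
X = df.drop(columns=["loss"])           # "loss" is a placeholder target name
y_raw = df["loss"]
y_log = np.log(y_raw)                   # same effect as y_raw.transform(np.log)

X_tr, X_te, y_tr_raw, y_te_raw, y_tr_log, y_te_log = train_test_split(
    X, y_raw, y_log, random_state=0
)

model_raw = RandomForestRegressor(random_state=0).fit(X_tr, y_tr_raw)
model_log = RandomForestRegressor(random_state=0).fit(X_tr, y_tr_log)

# Note: these two numbers are in different units (dollars vs. log-dollars).
print(mean_absolute_error(y_te_raw, model_raw.predict(X_te)))   # ~306k in my case
print(mean_absolute_error(y_te_log, model_log.predict(X_te)))   # ~2 in my case
```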

So here are my questions: 1) Am I correct that the drop in error from 306k to only 7.39 is real and valid? 2) How do I make predictions from there? If I feed a sample to my model and receive a log-transformed output, let's say y_log = 10, do I simply invert it as e^10 ≈ 22,026.5, and is that my final prediction?


Solution

Taking the log doesn't by itself produce a normally distributed target; it tends to if the target is log-normally distributed, and yours looks roughly like that, though not exactly. But this distribution isn't actually what matters.

What taking the log does is change your model of how errors arise when fitting a regressor. You're now saying that the target values are $e^{P + \epsilon}$, where $P$ is your model's prediction and $\epsilon$ is Gaussian noise; equivalently, $e^P e^\epsilon$. That is what directly interacts with the assumptions in your regressor.

So what you're finding is that, on average, the predictions are wrong by a factor of about 7.39, not by ±$7.39.
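One way to see this, using the numbers from the question: a log-space MAE of about 2 means

$$\frac{1}{n}\sum_i \left|\log y_i - P_i\right| \approx 2 \quad\Longrightarrow\quad \frac{y_i}{e^{P_i}} = e^{\log y_i - P_i} \approx e^{\pm 2},$$

so a typical true value is roughly $e^2 \approx 7.39$ times larger or smaller than the back-transformed prediction, rather than within $7.39 of it.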

What you really want to do is evaluate MAE between the actual target values and $e^P$. You probably do have a better model, but not that much better.
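As a sketch, assuming a setup like the one in the question (the names model_log, X_te, and y_te_raw are illustrative, not from the original code):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# model_log was fitted on np.log(y); exponentiate to get back to dollars.
pred_dollars = np.exp(model_log.predict(X_te))

# MAE in the same units ($) as the original 306k figure,
# so the two models can be compared directly.
mae_dollars = mean_absolute_error(y_te_raw, pred_dollars)
print(mae_dollars)
```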

Other tips

Of course your error rate is going to decrease. Remember that the change in MAE may simply come from the fact that the scales of your original variable and of its log-transformed version ARE NOT THE SAME, and the mean is scale-dependent.

About your second question: yes, that's exactly it!

If you would like to compare using logarithms versus not, you could take your original model and, when calculating MAE, apply the logarithm to both the predicted and the real values; then you can compare the two models in terms of MAE on the same scale, as sketched below.
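A sketch of that comparison, again with illustrative names (model_raw fitted on the untransformed target, X_te and y_te_raw as a held-out set):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Predictions of the raw-scale model, in dollars.
pred_raw = model_raw.predict(X_te)

# Measure its error on the log scale, so it is comparable to the ~2 figure.
# Clip to avoid taking the log of non-positive predictions, which some
# regressors can occasionally produce.
mae_log_scale = mean_absolute_error(
    np.log(y_te_raw),
    np.log(np.clip(pred_raw, 1e-9, None)),
)
print(mae_log_scale)
```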

Note that before applying the logarithm you had many values close to 0, so your model almost surely learned to output values close to 0 with acceptable accuracy. Once you apply the logarithm, the data become more spread out (differences are "easier" to see).

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange