Question

I am reading these two pages: the xgboost documentation and a post on evaluation metrics.

I have a dataset where I am trying to predict future spend at the user level. A lot of our spend comes from large spenders, who are outliers, so we care about them. I am using XGBoost.

I have tried xgboost with the objective reg:squarederror. This tended to underpredict a little. I then tried reg:squaredlogerror, and this resulted in predictions that underpredict by much more than squarederror did.

I have tried tuning with several different hyperparameter combinations, but none made as big a difference as changing the objective. So I'm dwelling on the objective function and trying to understand whether there's another one out there that would be worth a shot.

In the xgboost docs above, some of the other regression objective options are reg:pseudohubererror and count:poisson.

There is no option, that I can see, for plain MAE. Given that an objective less susceptible to outliers (rmsle) took me further from accuracy while rmse took me closer, would using MAE potentially be worth a shot? In this dataset, outliers are more important, but so are regular users.

What would be a good objective and evaluation metric? Is MAE worth trying? If so, how? Looking at the docs above, I cannot see MAE as an option under regression parameters.


Solution

Here are several things you can try:

  • Use quartic error, $(y - \hat{y})^4$, instead of quadratic error. This penalizes large errors much more heavily than MSE. The catch is that it is not implemented in xgboost, so you would need to write a custom loss (see the first sketch after this list).
  • If your target is always positive, you can use the target as training weights, which gives more weight to the outliers. If it is not always positive, you can use the absolute value of the target as weights. If using the target values directly puts too much weight on the outliers, you might want to transform them first (e.g. with the log or square root), and if some samples have a target of zero, add a small epsilon to all the weights. Note that xgboost can easily be trained with sample weights (see the second sketch below).
  • Try to predict the quantile of the training distribution, then transform your predictions back to the original scale using the training target's empirical cumulative distribution function (see the third sketch below).
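
A minimal sketch of the first idea, a custom quartic objective passed to xgboost's native training API. For the loss $(\hat{y} - y)^4$, the gradient with respect to the prediction is $4(\hat{y} - y)^3$ and the hessian is $12(\hat{y} - y)^2$. The data here is synthetic and hypothetical; swap in your own features and spend target:

```python
import numpy as np
import xgboost as xgb

def quartic_objective(preds, dtrain):
    """Custom objective for the loss (y_hat - y)^4.

    xgboost expects the gradient and hessian of the loss
    with respect to the current predictions.
    """
    y = dtrain.get_label()
    residual = preds - y
    grad = 4.0 * residual ** 3
    hess = 12.0 * residual ** 2
    return grad, hess

# Hypothetical data standing in for user-level spend.
rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.exponential(scale=50.0, size=1000)

dtrain = xgb.DMatrix(X, label=y)
model = xgb.train(
    {"max_depth": 4, "eta": 0.1},
    dtrain,
    num_boost_round=200,
    obj=quartic_objective,
)
```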
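
A minimal sketch of the weighting idea, assuming a non-negative spend target. Here the weights are a log transform of the target plus a small epsilon so that zero-spend users still get a small positive weight; both choices are illustrative, not prescribed:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.exponential(scale=50.0, size=1000)  # hypothetical non-negative spend

# Log-transformed target as weights, with an epsilon for zero-spend users.
weights = np.log1p(y) + 1e-3

model = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=300)
model.fit(X, y, sample_weight=weights)
```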
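
And a minimal sketch of the quantile idea: train on the empirical quantile of each target value, then map predicted quantiles back to the spend scale with the training target's empirical quantile function (the inverse of its CDF). Again, the data and model settings are placeholders:

```python
import numpy as np
import xgboost as xgb
from scipy.stats import rankdata

rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.exponential(scale=50.0, size=1000)  # hypothetical spend target

# Map each target value to its empirical quantile in (0, 1].
y_quantile = rankdata(y) / len(y)

model = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=300)
model.fit(X, y_quantile)

# Map predicted quantiles back to spend via the empirical quantile function.
pred_quantile = np.clip(model.predict(X), 0.0, 1.0)
pred_spend = np.quantile(y, pred_quantile)
```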
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange