Question

I am building a lost-sales estimation model for out-of-stock (OOS) days using XGBoost. The logic is simple: train the model on data from normal days with ample inventory (when sales and demand are the same), then use the trained model to predict demand on out-of-stock days. For model building I split the normal-days data into train and test datasets.

However, I am running into a peculiar problem: the predicted sales for out-of-stock days are very heavily overestimated. Predictions for both train and test days are fine; only the out-of-stock days are a problem. Any tips on what might be going wrong, and how I can debug this in a gradient-boosted tree model?


Solution

Disclaimer: I am not 100% sure my solution will work for all cases, but it solved my problem considerably.

For me, switching to XGBoost-based imputation worked wonderfully. Earlier I was doing several different types of imputation (mean, mode, etc.). Of those, I kept only the imputations I was very confident about; for the rest, I let XGBoost handle the missing values itself. After this change, most of the overestimation cases vanished.

Another benefit of this change: the model-fit metrics became a direct reflection of the overestimation. The models that still overestimated all had a bad fit, whereas the models with no overestimation consistently had a good fit.
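That correlation suggests a simple gate: only trust a model's OOS demand estimates when its held-out fit on normal days is good enough. A minimal sketch of such a check, using RMSE as the fit metric (the function names and threshold are mine, not from the answer):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def trust_oos_predictions(y_test, y_test_pred, max_rmse):
    """Gate the lost-sales estimates on held-out fit quality:
    a model with poor test fit is the one likely to overestimate."""
    return rmse(y_test, y_test_pred) <= max_rmse
```

The threshold `max_rmse` would be tuned per product or per data set; the point is only that fit on the normal-days test split is a usable proxy for whether the OOS predictions can be believed.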

I believe this happens because, when XGBoost also handles the imputation, it uncovers and exploits patterns hidden deep in the data. That lets it impute well in scenarios where my ordinary imputation methods were of little use.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange