XGBoost result interpretation for unbalanced datasets (Accuracy & AUC-ROC)

https://datascience.stackexchange.com/questions/76230

12-12-2020

Question

My dataset has shape 5621 × 8 (binary classification).

Label/target: Success (4324, 77%) and Not success (1297, 23%)

(Success and Not success are equally important for my prediction, i.e. both TP and TN matter.)

I split my data into three parts (train, validation, test).

For train and validation I perform 10-fold CV.
The test set is separate data, which I evaluate on for each fold.

I tuned scale_pos_weight over the range 5 to 80, and

finally fixed it at 75, since that gave the highest average accuracy on my test set (79%) across the 10 folds.
But if I check the average AUC-ROC, it is very poor: only 50% across all 10 folds.

If I do not tune scale_pos_weight, my average accuracy drops to 50% and my average AUC-ROC increases to 70%.

How should I interpret the discrepancy between AUC-ROC and accuracy in these results?

What might be the problem in my case?

Solution

With Success already being the larger class, you probably shouldn't be using a scale_pos_weight larger than one: you want to scale the positive class's contribution to the loss function down rather than up.
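As a rough starting point, the XGBoost parameter documentation suggests sum(negative instances) / sum(positive instances) as a typical value for scale_pos_weight. A minimal sketch with the class counts from the question, assuming Success is encoded as the positive class (label 1); the helper name is made up for illustration:

```python
def suggested_scale_pos_weight(n_negative, n_positive):
    """Heuristic starting value: negatives divided by positives."""
    return n_negative / n_positive


# Class counts from the question.
# Assumption: Success is the positive class (label 1).
n_success = 4324       # positives
n_not_success = 1297   # negatives

weight = suggested_scale_pos_weight(n_not_success, n_success)
print(f"suggested scale_pos_weight: {weight:.2f}")
```

The result is well below 1, so a value like 75 pushes the weighting in the opposite direction from what the class balance calls for.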

I suspect that's what's happening in the first case. With scale_pos_weight=75, the model ends up basically only caring about the positive class, predicts everyone is in the positive class, and so your accuracy is just a little better than the 77% baseline you'd expect with that strategy. Given that, it's not too surprising the AUC is poor, although I wouldn't have expected a drop all the way to the 50% baseline...
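To make those two baselines concrete, here is a small sketch in plain Python that scores an "everyone is positive" predictor on the question's class counts. It assumes Success is label 1, and the two metric helpers are hand-rolled for illustration (ties in the AUC count as half a win), not taken from any library:

```python
from bisect import bisect_left, bisect_right


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = sorted(s for t, s in zip(y_true, scores) if t == 0)
    total = 0.0
    for p in pos:
        lower = bisect_left(neg, p)    # negatives scored strictly below p
        upper = bisect_right(neg, p)   # negatives scored at or below p
        total += lower + 0.5 * (upper - lower)
    return total / (len(pos) * len(neg))


# Assumption: Success = 1 (4324 rows), Not success = 0 (1297 rows).
y_true = [1] * 4324 + [0] * 1297

# A model that has collapsed to "always positive" gives every row the same score.
constant_scores = [1.0] * len(y_true)
constant_preds = [1] * len(y_true)

print(f"accuracy: {accuracy(y_true, constant_preds):.3f}")  # 0.769
print(f"AUC:      {roc_auc(y_true, constant_scores):.3f}")  # 0.500
```

A constant score cannot rank any positive above any negative, which is why the AUC sits at exactly 0.5 even though the accuracy looks respectable.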

If you don't use scale_pos_weight (you said "if I did not tune", but does that mean you left it at the default of 1?), then the model does better at rank-ordering (AUC=70%), but not so well at hard classification. You might want to tweak the prediction threshold here; there's probably a different threshold that will perform better for accuracy score. You could also try scale_pos_weight=0.25 or so; that should make the default cutoff work better, hopefully with little effect on AUC.
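Tweaking the threshold could look something like the sketch below: sweep candidate cutoffs over the predicted probabilities and keep the one with the best accuracy. The function name and the toy labels and probabilities are made up for illustration; in practice the probabilities would come from the model's predicted probabilities on a validation fold, not the test set.

```python
def best_threshold(y_true, proba):
    """Return (threshold, accuracy) for the cutoff that maximizes accuracy.

    A row is predicted positive when its probability is >= threshold.
    On ties in accuracy, the lowest such threshold is kept.
    """
    # Every distinct probability is a candidate; inf means "predict all negative".
    candidates = sorted(set(proba)) + [float("inf")]
    best_t, best_acc = None, -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in proba]
        acc = sum(a == b for a, b in zip(y_true, preds)) / len(y_true)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy data (hypothetical): six positives, two negatives.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
proba = [0.9, 0.8, 0.75, 0.7, 0.65, 0.45, 0.6, 0.3]

# Accuracy at the usual 0.5 cutoff, for comparison.
default_preds = [1 if p >= 0.5 else 0 for p in proba]
default_acc = sum(a == b for a, b in zip(y_true, default_preds)) / len(y_true)

threshold, acc = best_threshold(y_true, proba)
print(f"accuracy at 0.5: {default_acc}")                  # 0.75
print(f"best threshold: {threshold}, accuracy: {acc}")    # 0.45, 0.875
```

Moving the cutoff changes only the hard predictions; the ranking of the scores, and therefore the AUC, stays the same.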

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange