Question

I am working on the Kaggle home loan model and, interestingly enough, GradientBoostingClassifier gets a considerably better score than XGBClassifier, and it also seems to overfit less. (Note: I am running both algorithms with default settings.) From what I've been reading, XGBClassifier is essentially the same as GradientBoostingClassifier, just much faster and more robust. So I am confused about why XGB would overfit so much more than GradientBoostingClassifier when it should be the other way around. What would be a good reason for this?


Solution

GradientBoostingClassifier is slower but more exact in how it searches for splits. In your case, it may simply be finding a better model without suffering from as much overfitting.

Here are some of the main differences you are looking for.

XGBClassifier was designed to be faster, and it takes a few shortcuts to achieve that. For example, to save time it can use an approximate split-finding algorithm (depending on tree_method) rather than exhaustively calculating and evaluating every candidate split. Usually the XGB results will be close and you won't see much difference, but your case could be an exception.
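As a quick sanity check, you can reproduce the comparison you describe. The sketch below is only illustrative: it uses a synthetic dataset as a stand-in for the Kaggle home-loan data (which is assumed, not loaded here) and compares both classifiers with default settings via cross-validated AUC.

# Minimal sketch: compare the two classifiers with default settings.
# The synthetic data below is a placeholder for the actual competition data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

for name, model in [("GradientBoostingClassifier", GradientBoostingClassifier()),
                    ("XGBClassifier", XGBClassifier())]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.4f} (+/- {scores.std():.4f})")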

If you are still curious, you can try changing some parameters of the XGBClassifier model, such as setting tree_method='exact', or you can lower the sketch_eps parameter so that the approximate splits more closely match the GradientBoostingClassifier results (a short sketch of both tweaks follows the parameter description below). Of course, this will slow down the XGBClassifier model as a result.

sketch_eps [default=0.03]

Only used for tree_method=approx.

This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.

Usually the user does not have to tune it, but consider setting it to a lower number for a more accurate enumeration of split candidates.

range: (0, 1)
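Here is a minimal sketch of both tweaks. Note that sketch_eps only applies when tree_method='approx' and, as far as I know, it was removed in more recent XGBoost releases (where max_bin plays a similar role), so treat the second variant as version-dependent.

# Two ways to make XGBClassifier's split search closer to the exhaustive
# search that GradientBoostingClassifier performs.
from xgboost import XGBClassifier

# 1) Exact greedy split enumeration: slower, but no sketching approximation.
xgb_exact = XGBClassifier(tree_method="exact")

# 2) Keep the approximate method but use a finer sketch. sketch_eps only
#    exists in older XGBoost releases, so this line may not work on newer ones.
xgb_fine_approx = XGBClassifier(tree_method="approx", sketch_eps=0.01)

# Fit as usual, e.g. xgb_exact.fit(X_train, y_train)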
