Question

I'm trying to train a gradient boosting model on 50k examples with 100 numeric features. XGBClassifier fits 500 trees in 43 seconds on my machine, while GradientBoostingClassifier fits only 10 trees(!) in 1 minute and 2 seconds :( I didn't bother trying to grow 500 trees, as that would take hours. I'm using the same learning_rate and max_depth settings; see below.

What makes XGBoost so much faster? Does it use some novel implementation of gradient boosting that the sklearn developers don't know about? Or is it "cutting corners" and growing shallower trees?

p.s. I'm aware of this discussion: https://www.kaggle.com/c/higgs-boson/forums/t/10335/xgboost-post-competition-survey but couldn't get the answer there...

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
gamma=0, learning_rate=0.05, max_delta_step=0, max_depth=10,
min_child_weight=1, missing=None, n_estimators=500, nthread=-1,
objective='binary:logistic', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)

GradientBoostingClassifier(init=None, learning_rate=0.05, loss='deviance',
max_depth=10, max_features=None, max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10,
presort='auto', random_state=None, subsample=1.0, verbose=0,
warm_start=False)
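
For reference, here is a minimal sketch of the kind of timing I'm doing, with synthetic data standing in for my real set (the numbers above come from my own data, so exact timings will differ):

import time
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Synthetic stand-in: 50k rows, 100 numeric features (my real data is different).
rng = np.random.RandomState(0)
X = rng.rand(50000, 100)
y = (X[:, :10].sum(axis=1) + 0.5 * rng.randn(50000) > 5).astype(int)

def timed_fit(model):
    start = time.time()
    model.fit(X, y)
    return round(time.time() - start, 1)

# Parameter names as in the dumps above; newer XGBoost versions
# use n_jobs / random_state instead of nthread / seed.
print("XGBClassifier, 500 trees:",
      timed_fit(XGBClassifier(n_estimators=500, learning_rate=0.05,
                              max_depth=10, nthread=-1, seed=0)), "s")
print("GradientBoostingClassifier, 10 trees:",
      timed_fit(GradientBoostingClassifier(n_estimators=10, learning_rate=0.05,
                                           max_depth=10, random_state=0)), "s")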

Solution

Since you mention "numeric" features, I guess your features are not categorical and have high arity (they can take a lot of different values, and thus there are a lot of possible split points). In such a case, growing trees is expensive, since there are [a lot of features $\times$ a lot of split points] to evaluate.
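
To put an order of magnitude on it with the numbers from the question: with 100 continuous features over 50k rows, an exact greedy search can face up to $100 \times 50000 = 5 \times 10^6$ candidate thresholds at the root node alone, and that search is repeated at every node of every tree.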

My guess is that the biggest effect comes from the fact that XGBoost uses an approximation for the split points. If you have a continuous feature with 10000 possible splits, XGBoost considers only "the best" 300 splits by default (this is a simplification). This behaviour is controlled by the sketch_eps parameter, and you can read more about it in the docs. You can try lowering it and check the difference it makes. Since there is no mention of it in the scikit-learn documentation, I guess it is not available there. You can learn what the XGBoost method is in their paper (arXiv).
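
To get a feel for how much the approximate split finding matters, a rough experiment (my sketch, not something from the question or the docs) is to toggle XGBoost's tree_method between exact and approximate split finding and time the fits; note that sketch_eps only applies with tree_method='approx', and depending on your XGBoost version it may have been superseded by max_bin:

import time
import numpy as np
from xgboost import XGBClassifier

# Synthetic data of the same shape as in the question.
rng = np.random.RandomState(0)
X = rng.rand(50000, 100)
y = (X[:, 0] + 0.1 * rng.randn(50000) > 0.5).astype(int)

# "exact" enumerates every candidate threshold; "approx" uses the quantile sketch.
for tree_method in ("exact", "approx"):
    clf = XGBClassifier(n_estimators=50, learning_rate=0.05, max_depth=10,
                        tree_method=tree_method)
    start = time.time()
    clf.fit(X, y)
    print(tree_method, "->", round(time.time() - start, 1), "seconds")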

XGBoost also uses an approximation in the evaluation of such split points. I do not know by which criterion scikit-learn evaluates the splits, but it could explain the rest of the time difference.


Addressing Comments

Regarding the evaluation of split points

However, what did you mean by "XGBoost also uses an approximation on the evaluation of such split points"? As far as I understand, for the evaluation they are using the exact reduction in the optimal objective function, as it appears in eq. (7) of the paper.

In order to evaluate a split point, you would have to compute $L(y,H_{i-1}+h_i)$, where $L$ is the cost function, $y$ the target, $H_{i-1}$ the model built so far, and $h_i$ the current addition. Notice that this is not what XGBoost is doing; they simplify the cost function $L$ with a Taylor expansion, which leads to a very simple function to compute. They have to compute the gradient and the Hessian of $L$ with respect to $H_{i-1}$, and they can reuse those numbers for all potential splits at stage $i$, making the overall computation fast. You can check Loss function Approximation With Taylor Expansion (CrossValidated Q/A) for more details, or the derivation in their paper.
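
For concreteness, here is a sketch of that second-order expansion in the notation of this answer (the per-example gradient and Hessian are written $g_j$ and $k_j$ here to avoid a clash with the tree $h_i$; $\ell$ is the per-example loss and $\Omega$ the regularisation term from the paper):

$$L(y, H_{i-1}+h_i) \approx \sum_j \left[ \ell\big(y_j, H_{i-1}(x_j)\big) + g_j\, h_i(x_j) + \tfrac{1}{2}\, k_j\, h_i(x_j)^2 \right] + \Omega(h_i),$$

with $g_j = \partial \ell(y_j,\hat y_j)/\partial \hat y_j$ and $k_j = \partial^2 \ell(y_j,\hat y_j)/\partial \hat y_j^2$ evaluated at $\hat y_j = H_{i-1}(x_j)$. Since $g_j$ and $k_j$ depend only on $H_{i-1}$, they are computed once per boosting round and reused when scoring every candidate split.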

The point is that they have found a way to approximate $L(y,H_{i-1} + h_i)$ efficiently. If you were to evaluate $L$ fully, without insider knowledge allowing optimisation or the avoidance of redundant computation, it would take more time per split. In this regard, it is an approximation. However, other gradient boosting implementations also use proxy cost functions to evaluate the splits, and I do not know whether the XGBoost approximation is quicker in this regard than the others.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange