Question

So I understand the main difference between Random Forests and GB methods: Random Forests grow trees independently (in parallel), while GB methods grow one tree per iteration. However, I am confused by the vocabulary used with scikit-learn's RF regressor and xgboost's regressor, specifically the part about tuning the number of trees/iterations/boosting rounds. From my understanding, those terms all mean the same thing: they determine how many decision trees the algorithm builds. However, should I be referring to them as ntrees or n_estimators? Or should I simply use early stopping rounds for my xgboost and tune the number of trees only for my RF?

My Random Forest:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor(random_state=13)
# No pipeline here, so parameter names take no "model__" prefix
param_grid = dict(n_estimators=[250, 500, 750, 1000, 1500, 2000],
                  max_depth=[5, 7, 10, 12, 15, 20, 25],
                  min_samples_split=[2, 5, 10],
                  min_samples_leaf=[1, 3, 5],
                  )
gs = GridSearchCV(rf,
                  param_grid=param_grid,
                  scoring='neg_mean_squared_error',
                  n_jobs=-1,
                  cv=5,
                  refit=True,  # single metric, so a plain True suffices
                  )

My xgboost:

from xgboost import XGBRegressor

model = XGBRegressor(random_state=13)
# The sklearn wrapper calls this n_estimators; "ntrees" is not a valid name
param_grid = dict(n_estimators=[500, 750, 1000, 1500, 2000],
                  max_depth=[1, 3, 5, 7, 10],
                  learning_rate=[0.01, 0.025, 0.05, 0.1, 0.15, 0.2],
                  min_child_weight=[1, 3, 5, 7, 10],
                  colsample_bytree=[0.8, 1.0],
                  )
gs = GridSearchCV(model,
                  param_grid=param_grid,
                  scoring='neg_mean_squared_error',
                  n_jobs=-1,
                  cv=5,
                  refit=True,
                  )

Solution

As I understand it, an iteration is the same thing as a boosting round.
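To pin the vocabulary down: in xgboost's scikit-learn wrapper the parameter is n_estimators (there is no ntrees), and it plays the same role as num_boost_round in the native training API. A minimal sketch on synthetic data:

import numpy as np
import xgboost as xgb
from xgboost import XGBRegressor

# Synthetic data purely for illustration
X = np.random.rand(100, 4)
y = np.random.rand(100)

# sklearn-style wrapper: the number of boosting rounds is n_estimators
XGBRegressor(n_estimators=500).fit(X, y)

# native API: the same knob is called num_boost_round
xgb.train({}, xgb.DMatrix(X, label=y), num_boost_round=500)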

However, the number of trees is not necessarily equivalent to the above, because xgboost has a parameter called num_parallel_tree that lets the user build multiple trees per iteration (i.e. think of it as a boosted random forest).

As an example, if the user sets num_parallel_tree = 3 for 500 iterations, then the total number of trees is 1500 (3 × 500) rather than 500.
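You can verify the count yourself. A minimal sketch on synthetic data (num_parallel_tree and get_dump are real xgboost names; the data is made up):

import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(200, 5)
y = np.random.rand(200)

# 3 parallel trees per boosting round, 500 rounds
model = XGBRegressor(n_estimators=500, num_parallel_tree=3, random_state=13)
model.fit(X, y)

# One dump entry per tree: 3 * 500 = 1500
print(len(model.get_booster().get_dump()))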

OTHER TIPS

Following from your comment...

Why would I want to create multiple trees per iteration?

Multiple trees per iteration can mean that the model's predictive power improves much faster than it would with a single tree per round. As an example, compare the predictive power of an individual tree of depth 10 with that of a random forest consisting of 5 trees of depth 10 (in general, of course, and not in edge cases where overfitting is present).

This can mean that fewer boosting rounds are needed before the model becomes "optimal", although any time/resource savings from using fewer boosting rounds may be consumed by the time/resources needed to construct a random forest in each round.
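Relatedly, since the question mentions early stopping: rather than grid searching over the number of rounds, you can set n_estimators high and let a validation set cut boosting off. A sketch on synthetic data, assuming xgboost >= 1.6 (where early_stopping_rounds moved to the constructor):

import numpy as np
from xgboost import XGBRegressor

X, y = np.random.rand(500, 5), np.random.rand(500)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

# Set n_estimators high and let the validation set decide when to stop
model = XGBRegressor(n_estimators=2000, learning_rate=0.05,
                     early_stopping_rounds=50, eval_metric='rmse')
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

print(model.best_iteration)  # number of boosting rounds actually kept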

Isn't the point to make a tree, correct for errors and repeat?

No. The point is to make an estimator, correct for its errors and repeat. The estimator used in each round is not necessarily a tree: xgboost lets the user fit a linear model, a decision tree, or a random forest in each round.
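All three are selected through the booster parameter (plus num_parallel_tree for forests); a configuration-only sketch:

from xgboost import XGBRegressor

linear = XGBRegressor(booster='gblinear')                     # a linear model per round
tree   = XGBRegressor(booster='gbtree')                      # one tree per round (the default)
forest = XGBRegressor(booster='gbtree', num_parallel_tree=5) # a 5-tree forest per round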

And for each iteration, I am making 3 trees?

Yes. Each iteration will produce a random forest consisting of num_parallel_tree trees.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange