Question

I am searching for a theoretical or experimental estimation of the lower bound for the number of trees in a Random Forest classifier.

I usually test different combinations and select the one that gives the best median result under cross-validation.

However, I think that there may be a lower bound on the number of trees to use, given $m$ observations and $n$ features, to reduce the variance contribution to the error. Is there some test or reference I may check to correctly tune my classifier?


Solution

This is not necessarily an answer to your question. Just general thoughts about cross-validating the number of decision trees within a random forest.

I see a lot of people on Kaggle and Stack Exchange cross-validating the number of trees in a random forest. I have also asked a couple of colleagues, and they tell me it is important to cross-validate it to avoid overfitting.

This never made sense to me. Since each decision tree is trained independently, adding more decision trees should just make your ensemble more and more robust.

(This is different from gradient-boosted trees, of which AdaBoost is a special case: there, each new tree is fit to correct the residual errors of the current ensemble, so adding more trees can overfit.)
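A quick way to see why (a sketch, using the variance decomposition given in The Elements of Statistical Learning, §15.2, which the addendum below also cites): if each of the $B$ trees has variance $\sigma^2$ and any pair of trees has positive correlation $\rho$, the variance of their average is

$$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2,$$

which decreases monotonically in $B$ towards the floor $\rho\sigma^2$. Adding trees can only shrink the second term; it never adds variance.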

I did a simple experiment:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np
import matplotlib.pyplot as plt
plt.ioff()

# Digits dataset: 1797 samples, 64 features, 10 classes
digits = load_digits()
X = digits['data']
y = digits['target']

# Cross-validate only the number of trees, keeping them shallow
cv = GridSearchCV(
    RandomForestClassifier(max_depth=4),
    {'n_estimators': np.linspace(10, 1000, 20, dtype=int)},
    scoring='accuracy',
    n_jobs=-1,
    refit=False,
    cv=50,
    verbose=1)
cv.fit(X, y)

# Mean cross-validated accuracy for each value of n_estimators
scores = np.asarray(cv.cv_results_['mean_test_score'])
trees = np.asarray(cv.cv_results_['param_n_estimators'], dtype=int)
o = np.argsort(trees)
scores = scores[o]
trees = trees[o]

plt.clf()
plt.plot(trees, scores)
plt.xlabel('n_estimators')
plt.ylabel('accuracy')
plt.savefig('trees.png')
plt.show()

(Plot: cross-validated accuracy vs. n_estimators)

I am not saying you are committing the fallacy of thinking more trees cause overfitting. You clearly are not, since you asked for a lower bound. This is just something that has been bugging me for a while, and I think it is important to keep in mind.

(Addendum: The Elements of Statistical Learning discusses this on page 596 and agrees with me: «It is certainly true that increasing B [B = number of trees] does not cause the random forest sequence to overfit». The authors do note, however, that «this limit can overfit the data». In other words, other hyperparameters may still lead to overfitting; building a robust ensemble does not rescue you from it. You have to pay attention when cross-validating your other hyperparameters.)

To answer your question: adding decision trees will always benefit your ensemble, making it more and more robust. But, of course, it is questionable whether a marginal 0.00000001 reduction in variance is worth the computational time.

Your question, therefore, as I understand it, is whether you can somehow calculate or estimate the number of decision trees needed to reduce the error variance below a certain threshold.

I very much doubt it. We do not have clear answers to many broad questions in data mining, much less to specific ones like that. As Leo Breiman (the creator of random forests) wrote, there are two cultures in statistical modeling, and random forests are the kind of model that, he says, makes few assumptions but is also very data-specific. That is why, he argues, we cannot resort to hypothesis testing and have to go with brute-force cross-validation.
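That said, a cheaper proxy than a full grid search is to watch the out-of-bag error flatten out as trees are added. This is not from the original answer; it is a minimal sketch using scikit-learn's warm_start and oob_score options, and the step size and stopping threshold are arbitrary choices:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# Grow the same forest incrementally (warm_start reuses the trees already
# fitted) and record the out-of-bag error after each batch of new trees.
rf = RandomForestClassifier(max_depth=4, oob_score=True,
                            warm_start=True, n_jobs=-1, random_state=0)
for n in range(25, 1001, 25):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    oob_error = 1.0 - rf.oob_score_
    print(f'{n:4d} trees: OOB error = {oob_error:.4f}')
    # Stop once the marginal improvement becomes negligible, e.g. when the
    # OOB error has been flat for a few consecutive steps.

This does not give a theoretical lower bound either, but it lets you see where the variance reduction levels off without rerunning cross-validation from scratch for every candidate value.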

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange