Question

  1. I want to avoid overfitting in a random forest. To that end, I intend to tune parameters such as mtry, nodesize, and maxnodes. Could you please help me choose values for these parameters? I am using R.

  2. Also, if possible, please tell me how I can use k-fold cross validation for random forest (in R).


Solution

Relative to other models, Random Forests are less likely to overfit, but it is still something you want to make an explicit effort to avoid. Tuning model parameters is certainly one element of avoiding overfitting, but it isn't the only one. In fact, I would say that your training features are more likely to lead to overfitting than the model parameters, especially with a Random Forest. So I think the key is really having a reliable method to evaluate your model and check for overfitting, which brings us to your second question.

As alluded to above, running cross validation will allow you to avoid overfitting. Choosing your best model based on CV results will lead to a model that hasn't overfit, which isn't necessarily the case for something like the out-of-bag (OOB) error. The easiest way to run CV in R is with the caret package. A simple example is below:

> library(caret)
> data(iris)
> tr <- trainControl(method = "cv", number = 5)
> train(Species ~ ., data = iris, method = "rf", trControl = tr)
Random Forest 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (5 fold) 

Summary of sample sizes: 120, 120, 120, 120, 120 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD  
  2     0.96      0.94   0.04346135   0.06519202
  3     0.96      0.94   0.04346135   0.06519202
  4     0.96      0.94   0.04346135   0.06519202

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2. 
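Note that caret's "rf" method only treats mtry as a tuning parameter; if you also want to control nodesize and maxnodes from your first question, you can pass them to train() and they are handed on to the underlying randomForest() call. A minimal sketch (the grid and the values 5 and 10 below are just placeholders to experiment with, not recommendations):

library(caret)
library(randomForest)

data(iris)
set.seed(42)                                          # make the CV folds reproducible

tr <- trainControl(method = "cv", number = 5)

fit <- train(Species ~ ., data = iris,
             method    = "rf",
             trControl = tr,
             tuneGrid  = expand.grid(mtry = 1:4),     # candidate mtry values to cross-validate
             nodesize  = 5,                           # passed through to randomForest(): minimum terminal node size
             maxnodes  = 10)                          # passed through to randomForest(): cap on terminal nodes

fit              # CV accuracy for each mtry
fit$finalModel   # the forest refit on all the data with the winning mtry

You can rerun this with different nodesize/maxnodes values and compare the cross-validated accuracy across runs; the point is that you are choosing parameters against held-out folds rather than the training error.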

OTHER TIPS

@xof6 is correct in the sense that the more depth the model has, the more it tends to overfit, but I wanted to add some more parameters that might be useful to you. I do not know which R package you are using and I am not familiar with R at all, but there must be counterparts of these parameters implemented there.

Number of trees - The bigger this number, the less likely the forest is to overfit. As each decision tree learns some aspect of the training data, you are getting more options to choose from, so to speak.

Number of features - This number controls how many features each individual tree considers. As it grows, the trees get more and more complicated, hence they may learn patterns that are not there in the test data. It will take some experimenting to find the right value, but such is machine learning. Experiment with the general depth as well, as we mentioned; see the sketch below for how these map onto R.
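Since the question is about R, here is a rough sketch of how those knobs map onto the randomForest package (ntree for the number of trees, mtry for the number of features tried at each split, and maxnodes/nodesize to limit depth); the values are only placeholders to experiment with:

library(randomForest)

data(iris)
set.seed(1)

fit <- randomForest(Species ~ ., data = iris,
                    ntree    = 500,  # number of trees: more trees stabilise the forest
                    mtry     = 2,    # features considered at each split ("number of features" above)
                    maxnodes = 8,    # caps tree size, i.e. limits how deep/complex each tree can get
                    nodesize = 5)    # larger terminal nodes give shallower, simpler trees

fit        # prints the out-of-bag (OOB) error estimate and confusion matrix
plot(fit)  # OOB error as the forest grows, tree by tree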

Here is a nice link on that on Stack Exchange: https://stats.stackexchange.com/questions/111968/random-forest-how-to-handle-overfitting. However, my general experience is that the more depth the model has, the more it tends to overfit.

I always decrease mtry until the error on the training dataset starts to increase, then I lower nodesize and depth until the difference between the training and test error stops decreasing.
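One way to put that tip into practice, as a rough sketch (arbitrary candidate values and a hypothetical train/test split; the same loop works for mtry or maxnodes): watch the gap between training and test error as you vary nodesize.

library(randomForest)

data(iris)
set.seed(7)
idx      <- sample(nrow(iris), 100)   # hypothetical 100/50 train/test split
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

for (ns in c(1, 5, 10, 20)) {         # candidate nodesize values
  fit       <- randomForest(Species ~ ., data = train_df, nodesize = ns)
  train_err <- mean(predict(fit, train_df) != train_df$Species)
  test_err  <- mean(predict(fit, test_df)  != test_df$Species)
  cat(sprintf("nodesize=%2d  train error=%.3f  test error=%.3f  gap=%.3f\n",
              ns, train_err, test_err, test_err - train_err))
}

When the gap stops shrinking, making the trees simpler is no longer buying you anything.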

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange