Question

I have been reading around about Random Forests but I cannot really find a definitive answer about the problem of overfitting. According to Breiman's original paper, they should not overfit as the number of trees in the forest increases, but there seems to be no consensus about this, which is causing me quite some confusion.

Maybe someone more expert than me can give me a more concrete answer or point me in the right direction to better understand the problem.

Solution

Every ML algorithm with high enough complexity can overfit. However, the OP is asking specifically whether an RF will overfit as the number of trees in the forest increases.

In general, ensemble methods reduce the prediction variance, improving the accuracy of the ensemble. If we denote the variance of the expected generalization error of an individual randomized model by

σ²(x),

then the variance of the expected generalization error of an ensemble of M such models corresponds to

ρ(x) σ²(x) + (1 − ρ(x)) σ²(x) / M,

where ρ(x) is Pearson's correlation coefficient between the predictions of two randomized models trained on the same data with two independent seeds. If we increase the number of DTs in the RF (larger M), the variance of the ensemble decreases whenever ρ(x) < 1. Therefore, the variance of an ensemble is strictly smaller than the variance of an individual model.

In a nutshell, increasing the number of individual randomized models in an ensemble will never increase the generalization error.
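
As a quick numerical sanity check of that formula (not part of the original answer; the values ρ = 0.3 and σ² = 1 below are arbitrary), one can average M equally correlated predictors and compare the empirical variance of the average with ρσ² + (1 − ρ)σ²/M:

    # Sketch: empirical vs. theoretical variance of an average of M predictors
    # that each have variance sigma^2 and pairwise correlation rho.
    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, rho = 1.0, 0.3
    n_draws = 20_000

    for M in (1, 10, 100, 500):
        # Each predictor = shared component (gives the correlation) + its own noise.
        shared = rng.normal(scale=np.sqrt(rho * sigma2), size=(n_draws, 1))
        own = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(n_draws, M))
        ensemble = (shared + own).mean(axis=1)
        theory = rho * sigma2 + (1 - rho) * sigma2 / M
        print(f"M={M:4d}  empirical var={ensemble.var():.4f}  theory={theory:.4f}")

The variance drops from σ² towards ρσ² as M grows, which is exactly why adding trees does not hurt: it can only shrink the variance term.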

OTHER TIPS

You may want to check Cross Validated - a StackExchange website covering many things, including machine learning.

In particular, this question (with exactly the same title) has already been answered multiple times. Check these links: https://stats.stackexchange.com/search?q=random+forest+overfit

But I can give you the short answer: yes, it does overfit, and sometimes you need to control the complexity of the trees in your forest, or even prune them when they grow too much - but this depends on the library you use for building the forest. E.g. in randomForest in R you can only control the complexity (via parameters such as nodesize and maxnodes); you cannot prune the trees after they are grown.
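
As a sketch of what "controlling the complexity" looks like (scikit-learn shown here as an assumption; in R's randomForest the analogous knobs are nodesize and maxnodes):

    # Sketch: limiting tree complexity in a random forest (pre-pruning).
    from sklearn.ensemble import RandomForestRegressor

    # Fully grown trees (the default): each tree can fit the training data very closely.
    rf_full = RandomForestRegressor(n_estimators=100, random_state=0)

    # Constrained trees: shallower trees with larger leaves, which acts like pruning.
    rf_constrained = RandomForestRegressor(
        n_estimators=100,
        max_depth=6,          # cap the depth of each tree
        min_samples_leaf=20,  # require at least 20 samples per leaf
        random_state=0,
    )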

  1. The Random Forest does overfit.
  2. The Random Forest does not increase its generalization error when more trees are added to the model. The generalization variance goes to zero as more trees are used.

I've made a very simple experiment. I have generated the synthetic data:

y = 10 * x + noise

I've trained two Random Forest models:

  • one with full trees
  • one with pruned trees
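
A minimal sketch of such an experiment (assuming scikit-learn; the noise level and the min_samples_leaf value below are illustrative, not the author's exact settings):

    # Sketch: full vs. "pruned" (complexity-constrained) random forests on y = 10*x + noise.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=1000)
    y = 10 * x + rng.normal(scale=5.0, size=1000)
    X_tr, X_te, y_tr, y_te = train_test_split(x.reshape(-1, 1), y,
                                              test_size=0.3, random_state=0)

    models = {
        "full trees": RandomForestRegressor(n_estimators=50, random_state=0),
        "pruned trees": RandomForestRegressor(n_estimators=50, min_samples_leaf=25,
                                              random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        train_mse = np.mean((model.predict(X_tr) - y_tr) ** 2)
        test_mse = np.mean((model.predict(X_te) - y_te) ** 2)
        print(f"{name:12s}  train MSE={train_mse:7.2f}  test MSE={test_mse:7.2f}")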

The model with full trees has lower train error but higher test error than the model with pruned trees. The responses of both models:

[figure: predicted responses of the full-tree model vs. the pruned-tree model]

It is clear evidence of overfitting. Then I took the hyper-parameters of the overfitted model and checked the error while adding one tree at each step. I got the following plot:

[figure: test error while growing the number of trees]

As you can see, the test error does not change as more trees are added, but the model remains overfitted. Here is the link to the experiment I've made.
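
A sketch of the same check (an assumption, using scikit-learn's warm_start to add one tree at a time to the overfitted configuration):

    # Sketch: grow the overfitted forest one tree at a time and track the test error.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=1000)
    y = 10 * x + rng.normal(scale=5.0, size=1000)
    X_tr, X_te, y_tr, y_te = train_test_split(x.reshape(-1, 1), y,
                                              test_size=0.3, random_state=0)

    rf = RandomForestRegressor(n_estimators=1, warm_start=True, random_state=0)
    test_mse = []
    for n_trees in range(1, 101):
        rf.set_params(n_estimators=n_trees)
        rf.fit(X_tr, y_tr)          # with warm_start=True only the new tree is fitted
        test_mse.append(np.mean((rf.predict(X_te) - y_te) ** 2))

    # test_mse drops quickly and then flattens out: adding more trees does not
    # make the (still overfitted) forest worse.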

STRUCTURED DATASET -> MISLEADING OOB ERRORS

I've found an interesting case of RF overfitting in my work. When the data are structured, RF overfits on the OOB observations.

Details:

I try to predict electricity prices on the electricity spot market for each single hour (each row of the dataset contains the price and system parameters (load, capacities, etc.) for that single hour).
Electricity prices are created in batches (24 prices are set on the electricity market in one fixing, at one moment in time).
So the OOB observations for each tree are a random subset of the set of hours, but in reality you predict the next 24 hours all at once (first you obtain all system parameters, then you predict 24 prices, then there is a fixing which produces those prices). It is therefore easier to make OOB predictions than to predict the whole next day: the OOB observations are not held out in 24-hour blocks but dispersed uniformly, and since the prediction errors are autocorrelated, it is easier to predict the price for a single missing hour than for a whole block of missing hours.

Easier to predict in case of error autocorrelation:
known, known, prediction, known, prediction - OOB case
Harder one:
known, known, known, prediction, prediction - real-world prediction case
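
A minimal synthetic sketch of that effect (a toy AR(1) setup, not the actual electricity dataset): with a strongly autocorrelated series and a lagged-price feature, the OOB error looks much better than the error of forecasting a whole 24-hour block at once, because OOB predictions get to use the true neighbouring prices:

    # Sketch: OOB error vs. real 24-hour-block forecast error on an autocorrelated series.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    # AR(1)-style "price" series: consecutive hours are strongly correlated.
    n = 2000
    noise = rng.normal(scale=1.0, size=n)
    price = np.zeros(n)
    for t in range(1, n):
        price[t] = 0.9 * price[t - 1] + noise[t]

    # Single lagged feature: the previous hour's price.
    X = price[:-1].reshape(-1, 1)
    y = price[1:]

    # Hold out the last 24 hours as the block we actually need to forecast.
    X_tr, y_tr = X[:-24], y[:-24]
    rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X_tr, y_tr)

    # OOB-style error: held-out hours are scattered and their lag feature is the
    # TRUE previous price, so the task is easy.
    oob_rmse = np.sqrt(np.mean((rf.oob_prediction_ - y_tr) ** 2))

    # Real-world error: forecast the next 24 hours recursively, feeding the model
    # its OWN predictions as the lag feature (the true prices are not known yet).
    preds, lag = [], y_tr[-1]
    for _ in range(24):
        lag = rf.predict(np.array([[lag]]))[0]
        preds.append(lag)
    block_rmse = np.sqrt(np.mean((np.array(preds) - y[-24:]) ** 2))

    print(f"OOB RMSE (scattered hours): {oob_rmse:.3f}")
    print(f"24-hour block RMSE:         {block_rmse:.3f}")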

I hope it's interesting.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange