Question

I am new to Machine Learning and I am trying to understand the Out of Bag (OOB) error in Random Forests and how it can be used.

Let's say we have a dataset. First, we use the whole dataset (without splitting it) to train a Random Forest and obtain its Out of Bag error. Then we split the dataset, train a Neural Network on the training part, and test it on the test part.

Can I choose between the two models by comparing the Out of Bag error of the Random Forest with the test error of the Neural Network? Does that make sense?

Solution

We generally rely on held-out samples to validate a model.
We make a train/test split so the model can be evaluated on a separate, unseen dataset.
If we are doing hyperparameter tuning, we keep a further set aside as a validation set to assess the result of new hyperparameters.
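
For concreteness, here is a minimal sketch of such a split using scikit-learn; the synthetic dataset and the split ratios are illustrative assumptions, not part of the original answer:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Synthetic data, purely for illustration
    X, y = make_classification(n_samples=1000, random_state=0)

    # Hold out 20% of the data as the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Carve a validation set out of the remaining training data
    # for hyperparameter tuning
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=0)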

A Random Forest builds each tree on a bagged (bootstrap) sample of the original training data.
Bagging means sampling with replacement, i.e. you pick one data point, put it back, and then pick the next.

In this process, many data points are sampled more than once and many are not sampled at all.
For a bootstrap sample of size n, a given point is missed by every draw with probability (1 - 1/n)^n ≈ e^(-1) ≈ 0.37, so roughly 63% of the distinct data points are selected.
The other ~37% of data points, the ones never selected, are called the Out of Bag samples.
Hence, by the way bagging and Random Forests are designed, we get another set of data on which to do a level of validation.
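
A quick numerical check of the ~63% figure (a sketch, assuming NumPy; the sample size and seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Bootstrap sample: n draws with replacement from n points
    indices = rng.integers(0, n, size=n)

    in_bag = np.unique(indices).size / n
    print(f"in-bag fraction:     {in_bag:.3f}")      # ~0.632
    print(f"out-of-bag fraction: {1 - in_bag:.3f}")  # ~0.368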

What this means:

  • For each tree, roughly 37% of the data points were never seen during training, so they are available to validate the model.
  • But the OOB estimate is not made with the fully grown ensemble: each data point is scored using only the trees in the forest for which that point was omitted during training (see the sketch after this list).
  • It is not equivalent to K-Fold cross-validation or a train/test split on the fully built forest, but it gives a decent idea of the validation error to expect.
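
As a sketch of how to obtain this estimate in practice, scikit-learn exposes it through the oob_score option; the synthetic data here is an assumption for illustration:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    # oob_score=True scores each training point using only the trees
    # whose bootstrap sample omitted that point
    rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0)
    rf.fit(X, y)

    print(f"OOB accuracy: {rf.oob_score_:.3f}")
    print(f"OOB error:    {1 - rf.oob_score_:.3f}")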