Question

I've recently started playing around with the randomForest package in R. After growing my forest, I tried predicting the response using the same dataset (i.e. the training dataset), which gave me a confusion matrix different from the one printed with the forest object itself. I thought there might be something wrong with the newdata argument, but I followed the example given in the documentation to a T and it gave the same problem. Here's an example using the iris dataset; it is the same example the authors used in their documentation, except that I use the same dataset to train and predict. So the question is: why are those two confusion matrices not identical?

library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
#grow forest
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
print(iris.rf)

Call:
 randomForest(formula = Species ~ ., data = iris[ind == 1, ]) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         45          0         0  0.00000000
versicolor      0         39         1  0.02500000
virginica       0          3        32  0.08571429

#predict using the training data again...
iris.pred <- predict(iris.rf, iris[ind == 1,])
table(observed = iris[ind==1, "Species"], predicted = iris.pred)

           predicted
observed     setosa versicolor virginica
  setosa         45          0         0
  versicolor      0         40         0
  virginica       0          0        35

Solution

You'll note that in the first summary, the confusion matrix is labelled the OOB estimate.

This stands for Out-of-Bag, which is not the same as predicting each training observation with the full forest. The latter is obviously a biased (overly optimistic) estimate of accuracy; the OOB estimate is much less so (OOB has its critics as well, but it's at least more reasonable).

Basically, when you print the summary itself, each observation is tested only on the trees for which it was not part of the bootstrap sample, i.e. it was "out of bag". So the OOB prediction for each observation uses only a subset of the trees in your forest (roughly one third on average, since each bootstrap sample draws about 63% of the distinct observations).
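
You can verify this directly: calling predict on the forest with no newdata argument returns the stored OOB predictions (this is documented behavior of predict.randomForest), which reproduce the confusion matrix from the printed summary. A quick sketch, reusing the iris.rf and ind objects from the question:

#with no newdata, predict() returns the out-of-bag predictions
#already stored in the forest object
oob.pred <- predict(iris.rf)
table(observed = iris[ind == 1, "Species"], predicted = oob.pred)
#this matches iris.rf$confusion (minus the class.error column)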

When you call predict on the training data directly, every tree votes, including the trees in whose construction each observation was actually used, so it's not surprising that this version gets each one correct while the OOB version has some misclassified.
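
If you want an honest accuracy estimate without relying on OOB, note that your own sampling step already set aside the rows with ind == 2, which the forest never saw. A minimal sketch building on the question's code:

#the rows with ind == 2 were never used to grow the forest,
#so they act as a genuinely unseen test set
iris.test.pred <- predict(iris.rf, iris[ind == 2, ])
table(observed = iris[ind == 2, "Species"], predicted = iris.test.pred)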

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow