Question

I tried to use random forests for regression. The original data is a data frame of 218 rows and 9 columns. The first 8 columns are categorical values ( can be either A, B, C, or D), and the last column V9 has numerical values that can go from 10.2 to 999.87.

When I used random forests on a training set, which represents 2/3 of the original data and which is randomly selected, I got the following results.

>r=randomForest(V9~.,data=trainingData,mytree=4,ntree=1000,importance=TRUE,do.trace=100)
       |      Out-of-bag   |
  Tree |      MSE  %Var(y) |
   100 | 6.927e+04    98.98 |
   200 | 6.874e+04    98.22 |
   300 | 6.822e+04    97.48 |
   400 | 6.812e+04    97.34 |
   500 | 6.839e+04    97.73 |
   600 | 6.852e+04    97.92 |
   700 | 6.826e+04    97.54 |
   800 | 6.815e+04    97.39 |
   900 | 6.803e+04    97.21 |
  1000 | 6.796e+04    97.11 |

I do not know if the high variance percentage means that the model is good or not. Also, since MSE is high, I suspect that the regression model is not really good. Any idea about how to read the results above? Do they mean that the model is not good?

Was it helpful?

Solution

Like @Joran told, %Var is the amount of total variance of Y explained by your random forest model. After the adjust, apply the model to your validation data (1/3 remain):

RFestimated = predict(r, data=ValidationData)

It is interesting also to check the residual:

qqnorm((RFestimated - ValidationData$V9)/sd(RFestimated-ValidationData$V9))

qqline((RFestimated-ValidationData$V9)/sd(RFestimated-ValidationData$V9))

the estimated versus observed values:

plot(ValidationData$V9, RFestimated)

and the RMSE:

RMSE <- (sum((RFestimated-ValidationData$V9)^2)/length(Validation$v9))^(1/2)

I hope this help!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top