How to know if a regression model generated by random forests is good? (MSE and %Var(y)) [closed]

https://stackoverflow.com/questions/16548882

29-05-2022

Question

I tried to use random forests for regression. The original data is a data frame of 218 rows and 9 columns. The first 8 columns are categorical (each value is A, B, C, or D), and the last column, V9, is numeric, with values ranging from 10.2 to 999.87.

When I fit a random forest on a training set (a randomly selected 2/3 of the original data), I got the following results.

>r=randomForest(V9~.,data=trainingData,mytree=4,ntree=1000,importance=TRUE,do.trace=100)
       |      Out-of-bag   |
  Tree |      MSE  %Var(y) |
   100 | 6.927e+04    98.98 |
   200 | 6.874e+04    98.22 |
   300 | 6.822e+04    97.48 |
   400 | 6.812e+04    97.34 |
   500 | 6.839e+04    97.73 |
   600 | 6.852e+04    97.92 |
   700 | 6.826e+04    97.54 |
   800 | 6.815e+04    97.39 |
   900 | 6.803e+04    97.21 |
  1000 | 6.796e+04    97.11 |

I do not know if the high variance percentage means that the model is good or not. Also, since MSE is high, I suspect that the regression model is not really good. Any idea about how to read the results above? Do they mean that the model is not good?

Solution

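As @Joran said, %Var relates the model's error to the total variance of Y. Be careful with the direction, though: in the do.trace output, %Var(y) appears to be the out-of-bag MSE expressed as a percentage of Var(y), not the percentage of variance explained. Every row of your table gives MSE / (%Var(y) / 100) ≈ 7.0e+04, which is what you would expect if that quantity is the variance of V9. Read that way, lower is better, and a value of about 97% means the forest explains only about 3% of the variance, so the model is barely better than predicting the mean. After fitting, apply the model to your validation data (the remaining 1/3):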

RFestimated <- predict(r, newdata = ValidationData)
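
As a side check on the trace table in the question, the sketch below recomputes what the %Var(y) column implies. It assumes that %Var(y) is the out-of-bag MSE as a percentage of the variance of V9; the variable names (`mse`, `pct_var`, `implied_var_y`, `pct_explained`) are made up for illustration, and the numbers are copied from the question's output.

```r
# Values copied from the do.trace output in the question.
mse     <- c(6.927e4, 6.874e4, 6.822e4, 6.812e4, 6.839e4,
             6.852e4, 6.826e4, 6.815e4, 6.803e4, 6.796e4)
pct_var <- c(98.98, 98.22, 97.48, 97.34, 97.73,
             97.92, 97.54, 97.39, 97.21, 97.11)

# Assumption: %Var(y) = 100 * MSE / Var(y).
# If that holds, this ratio should be (nearly) the same in every row.
implied_var_y <- mse / (pct_var / 100)
print(round(implied_var_y))

# Share of variance explained under that reading, for the 1000-tree forest.
pct_explained <- 100 - pct_var[length(pct_var)]
print(pct_explained)
```

If the implied variance is essentially constant across rows, the column is just a rescaled MSE, and 100 minus its value is the more familiar "% variance explained".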

It is also worth checking the residuals, for example with a normal Q-Q plot of the standardized residuals:

qqnorm((RFestimated - ValidationData$V9)/sd(RFestimated-ValidationData$V9))

qqline((RFestimated-ValidationData$V9)/sd(RFestimated-ValidationData$V9))

the estimated versus observed values:

plot(ValidationData$V9, RFestimated)

and the RMSE:

RMSE <- sqrt(mean((RFestimated - ValidationData$V9)^2))
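
An RMSE is easier to judge next to a baseline. The following is a minimal sketch, not part of the original answer: `validation_metrics` is a hypothetical helper that compares the model's RMSE with the RMSE of always predicting the mean of the observed values, and the toy vectors are invented for illustration.

```r
# Hypothetical helper: compare predictions with a mean-only baseline.
validation_metrics <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  # RMSE of the model's predictions.
  rmse_model <- sqrt(mean((predicted - observed)^2))
  # RMSE of a trivial model that always predicts mean(observed).
  rmse_baseline <- sqrt(mean((mean(observed) - observed)^2))
  c(rmse_model    = rmse_model,
    rmse_baseline = rmse_baseline,
    # Fraction of variance explained relative to the baseline.
    r_squared     = 1 - (rmse_model / rmse_baseline)^2)
}

# Toy numbers for illustration only, not taken from the question's data.
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)
m <- validation_metrics(observed, predicted)
print(m)
```

With the objects from this answer, the call would be `validation_metrics(ValidationData$V9, RFestimated)`. A model RMSE close to the baseline RMSE (an `r_squared` near 0) suggests the forest adds little beyond predicting the mean.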

I hope this helps!

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow