Question

I'm working with a data set that has a lot of NA's. I know that the first 6 columns do NOT have any NA's. Since the first column is an ID column I'm omitting it.

I run the following code to select only lines that have values in the response column:

sub1 <- TrainingData[which(!is.na(TrainingData[,70])),]

I then use sub1 as the data set in a randomForest using this code:

set.seed(448)
RF <- randomForest(sub1[, 2:6], sub1[, 70],
    do.trace=TRUE, importance=TRUE, ntree=10, forest=TRUE)

Then I run this code to check the output for NA's:

> length(which(is.na(RF$predicted)))
[1] 65

I can't figure out why I'd be getting NA's if the data going in is clean.

Any suggestions?


Solution

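I think you should use more trees. The values in RF$predicted are predictions for the out-of-bag (OOB) set: each case is predicted only by the trees whose bootstrap sample did not include it. Because those samples are drawn at random, with a very small number of trees some cases are never out-of-bag, so they get no prediction and show up as NA.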
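
To get a rough feel for how many NA's to expect, here is a minimal sketch in base R. It assumes the default bootstrap (n draws with replacement from n rows), and the row count of 6000 is a made-up figure, since the question does not say how many rows sub1 has. The helper names p_in_bag and p_never_oob are just for illustration.

```r
# Probability that a given row is in-bag (drawn at least once) for one tree,
# assuming the default bootstrap: n draws with replacement from n rows.
p_in_bag <- function(n) 1 - (1 - 1 / n)^n

# Probability that a row is in-bag for every tree, i.e. never out-of-bag,
# which is what leaves its OOB prediction as NA.
p_never_oob <- function(n, ntree) p_in_bag(n)^ntree

n <- 6000  # hypothetical row count; the question does not state nrow(sub1)

p_never_oob(n, ntree = 10)   # about 0.01, so roughly 1% of rows get NA
p_never_oob(n, ntree = 500)  # effectively zero

# Quick simulation to check the formula: flag rows that are in-bag in all 10 trees.
set.seed(448)
in_every_bag <- rep(TRUE, n)
for (tree in seq_len(10)) {
  drawn <- unique(sample.int(n, n, replace = TRUE))
  in_every_bag <- in_every_bag & (seq_len(n) %in% drawn)
}
sum(in_every_bag)  # count of rows that would have an NA OOB prediction
```

So with ntree=10 a small share of NA's is expected even on clean data. Refitting with a larger ntree (the package default is 500) should make them disappear, and RF$oob.times, which counts how often each case was out-of-bag, should be 0 for exactly the rows that are NA.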

Licensed under: CC-BY-SA with attribution