Question

I am trying to use the rfe function from the caret package to run feature selection on 400 variables belonging to about 50 different classes, with a total of 8000 samples. If I subset my data to about 5 classes and 10 variables, everything works well. However, when I include my entire dataset, R runs for about 32 hours before I get a warning saying that the R GUI has stopped working. I tried subsetting my data to 100 variables and 1000 samples, and the same thing happened. I also tried a completely different dataset of 44 variables and roughly 3000 samples belonging to 44 classes. Again, after a day or so, R becomes unresponsive and shuts down. I have tried the same code on the iris data set for reproducibility:

library(caret)

iris$Species <- as.factor(as.character(iris$Species))
IND.svm <- rfe(iris[,-1], iris$Species,
               sizes = c(2, 5, 10, 30),
               rfeControl = rfeControl(functions = caretFuncs,
                                       verbose = FALSE, number = 2000),
               method = "svmRadial")

I am running Windows 7, so I cannot use the recommended doMC package. I am using the latest 64-bit version of R on a machine with 32 GB of RAM, still with no success. Is there something I am overlooking here?


Solution

I'd say that the problem is that you are doing 2000 bootstrap samples. Let's say that the argument tuneLength has a value of T and you are testing 5 subset sizes. For these specifications, you are fitting 10000*T SVM models for a data set with 8000 samples and 400 variables.
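To make that count concrete, here is a small back-of-the-envelope sketch in R. It only reproduces the arithmetic above; the value used for T (tune_length = 3) is an assumption chosen for illustration, not something stated in the question.

```r
# Rough count of SVM fits, following the reasoning above:
# resamples x subset sizes x tuning grid size (T).
n_subset_sizes <- 5   # subset sizes evaluated in each resample
tune_length    <- 3   # T; assumed value, for illustration only

fits_2000 <- 2000 * n_subset_sizes * tune_length  # number = 2000, as in the question
fits_50   <- 50   * n_subset_sizes * tune_length  # number = 50

cat("Fits with 2000 resamples:", fits_2000, "\n")
cat("Fits with 50 resamples:  ", fits_50, "\n")
cat("Reduction factor:        ", fits_2000 / fits_50, "\n")
```

Under these assumptions, dropping from 2000 to 50 resamples cuts the number of model fits by a factor of 40.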

Maybe I low-ball it, but I don't usually do more than 50 resamples (unless the training set is really small). You are basically trying to estimate the mean here (unlike more traditional uses of the bootstrap) and 25 or 50 should be enough, especially for that sample size.
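As a minimal sketch of what that could look like on the iris example, the call below uses 25 bootstrap resamples instead of 2000. The predictor selection (columns 1 to 4), the subset sizes c(2, 3), and the seed are my own choices for this illustration, not taken from the question, and the sketch assumes the caret and kernlab packages are installed.

```r
library(caret)

set.seed(1)  # arbitrary seed, only so the resamples are repeatable

# 25 bootstrap resamples instead of 2000
ctrl <- rfeControl(functions = caretFuncs,
                   method = "boot",
                   number = 25,
                   verbose = FALSE)

x <- iris[, 1:4]     # the four numeric predictors (assumed intent)
y <- iris$Species    # outcome

# method = "svmRadial" is passed through to train() by caretFuncs
fit <- rfe(x, y,
           sizes = c(2, 3),   # sizes smaller than the 4 available predictors
           rfeControl = ctrl,
           method = "svmRadial")

print(fit)
```

If the resampled accuracy estimates look too noisy at 25, number can be raised to 50 before trying anything much larger.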

Remember, rfe replicates the entire feature selection process for each resample, so the computations really add up.

Max

Licensed under: CC-BY-SA with attribution