Question

I would like to run a random forest on a large data set: 100k rows by 400 columns. Fitting it with randomForest takes a long time. Can I use the parRF method from the caret package to reduce the running time? What is the right syntax for that? Here is an example data frame:

dat <- read.table(text = " TargetVar  Var1    Var2       Var3
 0        0        0         7
 0        0        1         1
 0        1        0         3
 0        1        1         7
 1        0        0         5
 1        0        1         1
 1        1        0         0
 1        1        1         6
 0        0        0         8
 0        0        1         5
 1        1        1         4
 0        0        1         2
 1        0        0         9
 1        1        1         2  ", header = TRUE)

I tried:

library('caret')
m<-randomForest(TargetVar ~ Var1 + Var2 + Var3, data = dat, ntree=100, importance=TRUE, method='parRF')

But I don't see much of a difference. Any ideas?


Solution

The reason you don't see a difference is that you aren't actually using the caret package. You load it into your environment with library(), but then you call randomForest() directly, which doesn't go through caret, so the method='parRF' argument has no effect there.

I suggest starting by creating a data frame (or data.table) that contains only your input variables, and a vector containing your outcomes. This follows the recently updated caret docs.

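# Predictors only; indexing by name keeps the original column names.
x <- dat[, c("Var1", "Var2", "Var3")]
# Assumption: TargetVar is a class label, so make it a factor for classification.
y <- factor(dat$TargetVar)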

Next, verify that you have the parRF method available. I didn't until I updated my caret package to the most recent version (6.0-29).

library("randomForest")
library("caret")
names(getModelInfo())

You should see parRF in the output. Now you're ready to create your training model.
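
One step worth checking before training: parRF spreads the trees across workers through foreach, so it only runs in parallel if a parallel backend has been registered. The sketch below assumes the doParallel package is installed (it is not mentioned in the original answer, and other foreach backends would also work), and the worker count of 2 is an arbitrary choice for illustration.

```r
library(doParallel)  # assumption: doParallel is installed; it also attaches foreach

# Start a small cluster of worker processes (2 is an illustrative value).
cl <- parallel::makeCluster(2)

# Register the cluster so that %dopar% loops use it.
registerDoParallel(cl)

# Confirm how many workers foreach will use.
getDoParWorkers()

# After training has finished, release the workers with:
#   parallel::stopCluster(cl)
```

Without a registered backend, foreach falls back to sequential execution, which would explain seeing no speed-up even with method="parRF".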

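library(foreach)

# mtry is the only tuning parameter parRF accepts in tuneGrid (2 is an illustrative value).
rfParam <- expand.grid(mtry = 2)

# ntree and importance are passed through train() to randomForest().
m <- train(x, y, method="parRF", tuneGrid=rfParam, ntree=100, importance=TRUE)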
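
A quick way to check which columns a tuneGrid may contain for a given method is caret's modelLookup(). A minimal sketch follows; for parRF it should list mtry as the only tuning parameter, which means other randomForest arguments such as ntree are given directly to train() rather than placed in the grid.

```r
library(caret)

# List the tuning parameters caret exposes for the parRF method.
params <- modelLookup("parRF")
print(params)

# Column names a tuneGrid for this method is expected to have.
as.character(params$parameter)
```

The same lookup works for any method name returned by names(getModelInfo()).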

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow