You need to profile the code to be sure, but I would be surprised if the timing difference did not come from R having to turn each factor variable into two binary variables as it prepares the model matrix.
Try
Rprof("rpartProfile.Rprof")
rpart( y ~ ., data = x.fac, method = 'anova')
Rprof()
summaryRprof("rpartProfile.Rprof")
and look to see where the time is being spent. Which I have now done:
> summaryRprof("rpartProfile.Rprof")
$by.self
self.time self.pct total.time total.pct
"[[<-.data.frame" 786.46 72.45 786.56 72.46
"rpart.matrix" 294.26 27.11 1081.78 99.66
"model.frame.default" 1.04 0.10 3.00 0.28
"terms.formula" 0.96 0.09 0.96 0.09
"as.list.data.frame" 0.46 0.04 0.46 0.04
"makepredictcall.default" 0.46 0.04 0.46 0.04
"rpart" 0.44 0.04 1085.38 99.99
"[[.data.frame" 0.16 0.01 0.42 0.04
"<Anonymous>" 0.16 0.01 0.18 0.02
"match" 0.14 0.01 0.22 0.02
"print" 0.12 0.01 0.12 0.01
"model.matrix.default" 0.10 0.01 0.44 0.04
....
$by.total
total.time total.pct self.time self.pct
"rpart" 1085.38 99.99 0.44 0.04
"rpart.matrix" 1081.78 99.66 294.26 27.11
"[[<-" 786.62 72.47 0.06 0.01
"[[<-.data.frame" 786.56 72.46 786.46 72.45
"model.frame.default" 3.00 0.28 1.04 0.10
"eval" 3.00 0.28 0.04 0.00
"eval.parent" 3.00 0.28 0.00 0.00
"model.frame" 3.00 0.28 0.00 0.00
"terms.formula" 0.96 0.09 0.96 0.09
"terms" 0.96 0.09 0.00 0.00
"makepredictcall" 0.50 0.05 0.04 0.00
"as.list.data.frame" 0.46 0.04 0.46 0.04
"makepredictcall.default" 0.46 0.04 0.46 0.04
"as.list" 0.46 0.04 0.00 0.00
"vapply" 0.46 0.04 0.00 0.00
"model.matrix.default" 0.44 0.04 0.10 0.01
"[[" 0.44 0.04 0.02 0.00
"model.matrix" 0.44 0.04 0.00 0.00
....
$sample.interval
[1] 0.02
$sampling.time
[1] 1085.5
Note from the above that a big chunk of time is spent in function rpart.matrix
:
> rpart:::rpart.matrix
function (frame)
{
if (!inherits(frame, "data.frame") || is.null(attr(frame,
"terms")))
return(as.matrix(frame))
for (i in 1:ncol(frame)) {
if (is.character(frame[[i]]))
frame[[i]] <- as.numeric(factor(frame[[i]]))
else if (!is.numeric(frame[[i]]))
frame[[i]] <- as.numeric(frame[[i]])
}
X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
class(X) <- c("rpart.matrix", class(X))
X
}
But it is the for
loop in that function where most of the time is spent, essentially converting each column and adding them back to the data frame.