In general, a call to Rprof() (in package utils, which should be on the search() path by default) will show where the time overhead in a function call comes from:
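For a reproducible run, some example data is needed first; the variable names x1, x2 and x3 are assumed here, and any numeric vectors of equal length would do (a fairly large n is used so the sampling profiler collects enough samples):

```r
set.seed(1)
n <- 1e6          # large enough that the call takes a measurable time
x1 <- rnorm(n)    # hypothetical predictors, one of many possible choices
x2 <- rnorm(n)
x3 <- rnorm(n)
```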
Rprof("Rprof.out")
m1 <- model.matrix( ~ x1 + x2 + x3)
Rprof(NULL)
summaryRprof("Rprof.out")
giving
> summaryRprof("Rprof.out")
$by.self
                        self.time self.pct total.time total.pct
"model.matrix.default"       0.12    42.86       0.28    100.00
"na.omit.data.frame"         0.06    21.43       0.14     50.00
"[.data.frame"               0.04    14.29       0.08     28.57
"anyDuplicated.default"      0.04    14.29       0.04     14.29
"as.list.data.frame"         0.02     7.14       0.02      7.14

$by.total
                        total.time total.pct self.time self.pct
"model.matrix.default"        0.28    100.00      0.12    42.86
"model.matrix"                0.28    100.00      0.00     0.00
"na.omit.data.frame"          0.14     50.00      0.06    21.43
"model.frame"                 0.14     50.00      0.00     0.00
"model.frame.default"         0.14     50.00      0.00     0.00
"na.omit"                     0.14     50.00      0.00     0.00
"[.data.frame"                0.08     28.57      0.04    14.29
"["                           0.08     28.57      0.00     0.00
"anyDuplicated.default"       0.04     14.29      0.04    14.29
"anyDuplicated"               0.04     14.29      0.00     0.00
"as.list.data.frame"          0.02      7.14      0.02     7.14
"as.list"                     0.02      7.14      0.00     0.00
"vapply"                      0.02      7.14      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 0.28
Thus a large amount of time is spent checking for NAs with na.omit.data.frame and subsetting the data frame with [.data.frame, both within model.frame.default. The proportions of time will vary with the sample size n, but tend towards a limit for large n.
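Note that the total sampling time above is only 0.28 seconds, i.e. 14 samples at the default 0.02-second interval, so the percentages carry considerable sampling noise. One way to get a more reliable profile is to repeat the call several times and, optionally, use a finer sampling interval via Rprof's interval argument (a sketch; the number of repetitions is arbitrary):

```r
Rprof("Rprof.out", interval = 0.01)   # sample twice as often as the default
for (i in 1:20)                       # repeat to accumulate more samples
    m1 <- model.matrix(~ x1 + x2 + x3)
Rprof(NULL)
out <- summaryRprof("Rprof.out")
out$sampling.time                     # total time, now spread over 20 calls
```

Very small intervals are of limited use: the resolution of the timer and the cost of writing each sample put a practical lower bound on interval.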