Question

Consider the following demo data:

set.seed(1)
n <- 1000000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

If I build a matrix mt with model.matrix(), it takes forever:

system.time(mt <- model.matrix(~x1+x2+x3))
   user  system elapsed 
  0.916   0.185   1.135 

But if I do the same with matrix, it is pretty fast:

system.time(mt2 <- matrix(c(rep(1, n), x1, x2, x3), byrow=FALSE, ncol=4))
   user  system elapsed 
  0.085   0.021   0.105 

Why the difference? Is whatever makes model.matrix() slow really necessary for lm() and related functions?


Solution

You can see what is happening with debugonce(model.matrix.default) and, inside the debugger, tracemem(data).

model.matrix.default calls model.frame, which returns a data.frame. Within model.matrix.default, that data.frame is then copied at least three times.
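A minimal sketch of how to observe those copies, reusing the question's demo data (the exact tracemem output format varies by R version):

```r
set.seed(1)
n <- 1e6
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)

# Build the model frame once, then watch it being duplicated.
mf <- model.frame(~ x1 + x2 + x3)
tracemem(mf)                                  # prints "tracemem[...]" on every copy
mm <- model.matrix(~ x1 + x2 + x3, data = mf)
untracemem(mf)
```

Each "tracemem[...]" line corresponds to one duplication of the million-row data.frame, which is where much of the extra time goes.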

Why does lm use model.matrix at all? lm is usually called with a data.frame, list, or environment as the data argument. Going through model.frame and returning a data.frame ensures that the terms in the formula can be found by subsequent calls to lm (and to methods such as predict and update), and that they will reference the same values.
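For example, the model frame is what lets lm handle missing values and keeps later method calls consistent with the fit (a small illustrative sketch, not from the original answer):

```r
d <- data.frame(x = rnorm(10), y = rnorm(10))
d$x[3] <- NA                               # one missing predictor value

fit <- lm(y ~ x, data = d)                 # model.frame drops the NA row
nrow(model.matrix(fit))                    # 9 rows, matching length(fitted(fit))
predict(fit, newdata = data.frame(x = 0))  # terms re-evaluated in newdata
```

A hand-built matrix() does none of this bookkeeping, which is a large part of why it is faster.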

OTHER TIPS

In general, a call to Rprof (in the utils package, which is on the search() path by default) will show where the time is being spent in a function call:

Rprof("Rprof.out")
m1 <- model.matrix( ~ x1 + x2 + x3)
Rprof(NULL)
summaryRprof("Rprof.out")

giving

> summaryRprof("Rprof.out")
$by.self
                        self.time self.pct total.time total.pct
"model.matrix.default"       0.12    42.86       0.28    100.00
"na.omit.data.frame"         0.06    21.43       0.14     50.00
"[.data.frame"               0.04    14.29       0.08     28.57
"anyDuplicated.default"      0.04    14.29       0.04     14.29
"as.list.data.frame"         0.02     7.14       0.02      7.14

$by.total
                        total.time total.pct self.time self.pct
"model.matrix.default"        0.28    100.00      0.12    42.86
"model.matrix"                0.28    100.00      0.00     0.00
"na.omit.data.frame"          0.14     50.00      0.06    21.43
"model.frame"                 0.14     50.00      0.00     0.00
"model.frame.default"         0.14     50.00      0.00     0.00
"na.omit"                     0.14     50.00      0.00     0.00
"[.data.frame"                0.08     28.57      0.04    14.29
"["                           0.08     28.57      0.00     0.00
"anyDuplicated.default"       0.04     14.29      0.04    14.29
"anyDuplicated"               0.04     14.29      0.00     0.00
"as.list.data.frame"          0.02      7.14      0.02     7.14
"as.list"                     0.02      7.14      0.00     0.00
"vapply"                      0.02      7.14      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 0.28

Thus a large proportion of the time is spent checking for NAs with na.omit.data.frame and subsetting the data.frame with [.data.frame, both inside model.frame.default. The proportions will vary with the sample size n, but tend towards a limit for large n.
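If that overhead matters and the design matrix is already a clean numeric matrix (no NAs, no factors, no special formula terms), one option, not covered in the original answer, is to skip the formula machinery and call lm.fit directly with the fast matrix from the question (y here is an assumed response vector):

```r
y <- rnorm(n)                                # assumed response, for illustration
X <- cbind("(Intercept)" = 1, x1, x2, x3)    # the hand-built design matrix
fit <- lm.fit(X, y)
fit$coefficients
```

lm.fit returns a plain list rather than an "lm" object, so methods such as summary and predict are not available; that convenience is exactly what the model.frame/model.matrix overhead pays for.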

Licensed under: CC-BY-SA with attribution