Question

Consider the following demo data:

set.seed(1)
n <- 1000000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

If I build a matrix mt with model.matrix(), it takes forever:

system.time(mt <- model.matrix(~x1+x2+x3))
   user  system elapsed 
  0.916   0.185   1.135 

But if I do the same with matrix, it is pretty fast:

system.time(mt2 <- matrix(c(rep(1, n), x1, x2, x3), byrow=FALSE, ncol=4))
   user  system elapsed 
  0.085   0.021   0.105 

Why the difference? Is whatever makes model.matrix() slow really necessary for lm() and related functions?


Solution

You can see what is happening with debugonce(model.matrix.default) and, inside the debugger, tracemem(data).

model.matrix.default calls model.frame, which returns a data.frame. Within model.matrix.default, that data.frame is then copied at least three times.
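A minimal sketch of how to observe those copies, reusing the question's demo data (the exact tracemem output format varies by R version):

```r
set.seed(1)
n <- 1e6
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)

# Build the model frame once, then watch it being duplicated.
mf <- model.frame(~ x1 + x2 + x3)
tracemem(mf)                                  # prints "tracemem[...]" on every copy
mm <- model.matrix(~ x1 + x2 + x3, data = mf)
untracemem(mf)
```

Each "tracemem[...]" line corresponds to one duplication of the million-row data.frame, which is where much of the extra time goes.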

Why does lm use model.matrix at all? lm is usually called with a data.frame, list, or environment as the data argument. Going through model.frame and returning a data.frame ensures that the terms in the formula can be found by subsequent calls to lm (and to methods such as predict and update), and that they will reference the same values.
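For example, the model frame is what lets lm handle missing values and keeps later method calls consistent with the fit (a small illustrative sketch, not from the original answer):

```r
d <- data.frame(x = rnorm(10), y = rnorm(10))
d$x[3] <- NA                               # one missing predictor value

fit <- lm(y ~ x, data = d)                 # model.frame drops the NA row
nrow(model.matrix(fit))                    # 9 rows, matching length(fitted(fit))
predict(fit, newdata = data.frame(x = 0))  # terms re-evaluated in newdata
```

A hand-built matrix() does none of this bookkeeping, which is a large part of why it is faster.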

OTHER TIPS

In general, a call to Rprof (in the utils package, which is on the search() path by default) will show where the time is being spent in a function call:

Rprof("Rprof.out")
m1 <- model.matrix( ~ x1 + x2 + x3)
Rprof(NULL)
summaryRprof("Rprof.out")

giving

> summaryRprof("Rprof.out")
$by.self
                        self.time self.pct total.time total.pct
"model.matrix.default"       0.12    42.86       0.28    100.00
"na.omit.data.frame"         0.06    21.43       0.14     50.00
"[.data.frame"               0.04    14.29       0.08     28.57
"anyDuplicated.default"      0.04    14.29       0.04     14.29
"as.list.data.frame"         0.02     7.14       0.02      7.14

$by.total
                        total.time total.pct self.time self.pct
"model.matrix.default"        0.28    100.00      0.12    42.86
"model.matrix"                0.28    100.00      0.00     0.00
"na.omit.data.frame"          0.14     50.00      0.06    21.43
"model.frame"                 0.14     50.00      0.00     0.00
"model.frame.default"         0.14     50.00      0.00     0.00
"na.omit"                     0.14     50.00      0.00     0.00
"[.data.frame"                0.08     28.57      0.04    14.29
"["                           0.08     28.57      0.00     0.00
"anyDuplicated.default"       0.04     14.29      0.04    14.29
"anyDuplicated"               0.04     14.29      0.00     0.00
"as.list.data.frame"          0.02      7.14      0.02     7.14
"as.list"                     0.02      7.14      0.00     0.00
"vapply"                      0.02      7.14      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 0.28

Thus a large proportion of the time is spent checking for NAs with na.omit.data.frame and subsetting the data.frame with [.data.frame, both inside model.frame.default. The proportions will vary with the sample size n, but tend towards a limit for large n.
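If that overhead matters and the design matrix is already a clean numeric matrix (no NAs, no factors, no special formula terms), one option, not covered in the original answer, is to skip the formula machinery and call lm.fit directly with the fast matrix from the question (y here is an assumed response vector):

```r
y <- rnorm(n)                                # assumed response, for illustration
X <- cbind("(Intercept)" = 1, x1, x2, x3)    # the hand-built design matrix
fit <- lm.fit(X, y)
fit$coefficients
```

lm.fit returns a plain list rather than an "lm" object, so methods such as summary and predict are not available; that convenience is exactly what the model.frame/model.matrix overhead pays for.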

Licensed under: CC-BY-SA with attribution