So there are a couple of things. First, calling data.frame(...) is very expensive, so calling it at each iteration will slow things down considerably. Lists, on the other hand, are extremely cheap to create, so try to keep everything in lists until the end and convert to a data frame once.
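A quick way to see the difference for yourself (an illustrative micro-benchmark with arbitrary values, not part of the solution below):

library(microbenchmark)
microbenchmark(
  data.frames = for (i in 1:1000) x <- data.frame(a = 1, b = 2, c = 3),
  lists       = for (i in 1:1000) x <- list(a = 1, b = 2, c = 3),
  times = 10
)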
Second, it might just be that running 2 * 106 million regressions (two models per protein/lncRNA combination, or roughly 200 million calls to lm(...)) is taking most of the time.
I would be inclined to try it this way:
## not tested...
df1 <- t(diff.exp.protein.expr)   # samples in rows, proteins in columns
df2 <- t(lncRNA.expr)             # samples in rows, lncRNAs in columns
df  <- cbind(df1, df2, colData)   # one data frame with everything lm(...) needs
## every protein/lncRNA combination, as character (not factor) columns
names <- expand.grid(x = colnames(df1), y = colnames(df2), stringsAsFactors = FALSE)
get.anova <- function(n) {
  ## n[1] is a protein name, n[2] an lncRNA name
  form.1 <- as.formula(paste0(n[1], "~gender+age+treatment+", n[2]))
  form.2 <- as.formula(paste0(n[1], "~gender+age+treatment+", n[2], "+treatment:", n[2]))
  lm1  <- lm(form.1, data = df)   # reduced model
  lm2  <- lm(form.2, data = df)   # full model, with the treatment interaction
  coef <- summary(lm2)$coefficients
  ## return a list, not a data.frame: the names, the F-test p-value, and the
  ## estimates/standard errors of the 5th and 6th coefficients
  list(n[1], n[2], anova(lm1, lm2)$"Pr(>F)"[2],
       coef[5, 1], coef[5, 2], coef[6, 1], coef[6, 2])
}
result <- do.call(rbind, apply(names, 1, get.anova))   # fit all combinations
result <- data.frame(result)                           # convert once, at the end
colnames(result) <- c("protein", "lncRNA", "p.value", "est.1", "se.1", "est.2", "se.2")
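Note that because get.anova returns lists, the columns of result will themselves be list-columns; if you need ordinary atomic columns downstream, one (untested) way to flatten them is:

result[] <- lapply(result, unlist)   # flatten list-columns to atomic vectors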
This was not tested because the dataset you provided is not big enough: the models are left with no residual degrees of freedom, and consequently there is no row 6 in the coefficients table. Your real dataset will not have that problem.
EDIT (Response to OP's comment and provision of data).
With the dataset provided in your comment, we can benchmark both the original code (based on your updated post) and the new version (the code above, updated to reflect your new model formulas). In both cases I use just the first 10 rows from each dataset (100 combinations).
f.original <- function() lm.anova(diff.exp.protein.expr.sub[1:10,],lncRNA.expr.sub[1:10,],colData)
f.new <- function() do.call(rbind,apply(names,1,get.anova))
library(microbenchmark)
microbenchmark(f.original(), f.new(), times=10)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  f.original() 2.622461 2.712527 2.824419 2.869193 2.914099    10
#       f.new() 2.049756 2.074909 2.144422 2.191546 2.224831    10
So we can see that the new version, which returns lists instead of data frames, is about 25% faster.
Profiling both approaches using Rprof shows that the original version spends about 50% of its time in lm(...), whereas the new version spends about 60% of its (shorter) time in lm(...). This makes sense: the lm(...) calls themselves cannot be avoided, so the most you could possibly gain over the new version is the remaining ~40% of overhead. In other words, the calls to lm(...) are the bottleneck: 200 million calls to lm(...) just take a long time.
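If you want to reproduce the profiling, a minimal sketch with base R's Rprof (assuming f.new() as defined above):

Rprof("profile.out")                   # start sampling the call stack
invisible(f.new())                     # run the code being profiled
Rprof(NULL)                            # stop profiling
summaryRprof("profile.out")$by.total   # time per function, including callees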
You could consider a parallel computing approach, for example using the foreach package, but before going down that road, IMO you should consider other strategies for getting to your end result.
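That said, here is an untested sketch of what the foreach route might look like, using the doParallel backend (the cluster size is an illustrative assumption):

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores() - 1)   # leave one core free
registerDoParallel(cl)

result <- foreach(i = 1:nrow(names), .combine = rbind,
                  .export = "df") %dopar% {   # get.anova needs df on the workers
  get.anova(unlist(names[i, ]))              # one protein/lncRNA pair per task
}

stopCluster(cl)

With ~100 million combinations, the per-task overhead of foreach is substantial, so in practice you would want to split names into larger blocks and process many rows per task.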