data.table alternatives to nested for-loop?

Question 1

So there are a couple of things.

First, calling data.frame(...) is very expensive, so calling it at each iteration will slow things down considerably. Lists, on the other hand, are extremely efficient. So try to keep everything in lists until the end.

Second, it might just be that running 2 * 106 million regressions is taking most of the time.

I would be inclined to try it this way:

## not tested...
df1 <- t(diff.exp.protein.expr)
df2 <- t(lncRNA.expr)

df  <- cbind(df1,df2,colData)
names <- expand.grid(x=colnames(df1),y=colnames(df2),stringsAsFactors=FALSE)
get.anova <- function(n){
  form.1 <- as.formula(paste0(n[1],"~gender+age+treatment+",n[2]))
  form.2 <- as.formula(paste0(n[1],"~gender+age+treatment+",n[2],"+treatment:",n[2]))
  lm1    <- lm(form.1,data=df)
  lm2    <- lm(form.2,data=df)
  coef <- summary(lm2)$coefficients
  list(n[1],n[2],anova(lm1,lm2)$"Pr(>F)"[2],coef[5,1],coef[5,2],coef[6,1],coef[6,2])
}
result <- do.call(rbind,apply(names,1,get.anova))
result <- data.frame(result)
colnames(result) <- c("protein","lncRNA","p.value","est.1","se.1","est.2","se.2")

This was not tested because the dataset you provided was not big enough: the models have < 0 df for error. Consequently, there is no row 6 in the coefficients table. Your real dataset will not have that problem.

EDIT (Response to OP's comment and provision of data).

With the dataset provided in your comment, we can benchmark both the original code (based on your updated post), and the new version (based on the code above, updated to reflect your new model formulas). In both case I use just 10 rows from each dataset (100 combinations).

f.original <- function() lm.anova(diff.exp.protein.expr.sub[1:10,],lncRNA.expr.sub[1:10,],colData)
f.new      <- function() do.call(rbind,apply(names,1,get.anova))
library(microbenchmark)
microbenchmark(f.original(), f.new(), times=10)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  f.original() 2.622461 2.712527 2.824419 2.869193 2.914099    10
#       f.new() 2.049756 2.074909 2.144422 2.191546 2.224831    10

So we can see that the new version, which returns lists instead of data frames, is about 25% faster.

Profiling both approaches, using Rprof, shows that the original version spends about 50% of it's time in lm(...), whereas the new version spends about 60% of it's (shorter) time in lm(...). This makes sense and suggests that the best you could possibly do is an improvement of about 30% over the new version. In other words: the calls to lm(...) are the bottlenecks: 200 million calls to lm(...) just takes a long time.

You could consider a parallel computing approach, for example using the foreach package, but before going down that road IMO you should consider other strategies for getting to your end result.

Question 2

Your problem seems to be the solved by my package MatrixEQTL.

http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

http://cran.r-project.org/web/packages/MatrixEQTL/index.html

You would need to use the modelLINEAR_CROSS model to test for the significance of the interaction term.

Please feel free to ask any questions about the package.

Question 3

The main time consumer here is not the nested loop but lm. And here are a couple of things you can optimize (but see also Andrey Shabalin's answer - he may have an easier solution for your case).

You have two lm's, in simplified form:

        lm1 <- lm(Y~ X1 + X2 + X3 + X4)
        lm2 <- lm(Y~ X1 + X2 + X3 + X4 + X3:X4)

Then you make a summary of lm2 and compare lm1 with anova. Because you extract just the p-value from anova, but the p-value is identical to the p-value for interaction term in lm2, then doing anova is redundant. And lm1 is consequently also redundant. Then using rbind to increment a data frame is a waste of time, so I'd suggest using a list and just add a new element at each iteration. so your code within the loop could be simplified to:

# lm1 <- ... # skip this 
#model 2 (has an additional interaction term at the end)
lm2 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment+lncRNA.expr[j,]+lncRNA.expr[j,]:colData$treatment)
#get the summary of model 2, * and coefficients from that summary *
lms <- coef(summary(lm2))
tempdff[[length(tempdff)+1]]  <- data.frame(rownames(diff.exp.protein.expr)[i],
     rownames(lncRNA.expr)[j], lms[6,4],lms[5,1:2],lms[6,1:2])

In addition, you have the result as a data.frame - using a list instead would save some time too.

A next step could be examining what the lm and summary.lm are doing. You don't need all of what it does, you just use b's and their standard errors, as well as the p-value from the last row. You might be able to skip some of the computations that lm and summary.lm do .. with the help of perhaps some matrix algebra.