Question

I have two matrices diff.exp.protein.expr and lncRNA.expr each of which have 64 columns but different number of rows.

>dim(diff.exp.protein.expr)
[1] 14000 64

>dim(lncRNA.expr)
[1] 7600 64

I am using these matrices as input in two separate linear models where I compare each row of diff.exp.protein.expr with all rows of lncRNA.expr (=2*106 million tests). Eventually, I want to compare whether the two linear models are statistically different using anova for which I have written a function which is as follows:

lm.anova <- function(diff.exp.protein.expr,lncRNA.expr,colData) 
{
    tempdff <- data.frame() #create an empty dataframe
    for(i in 1:nrow(diff.exp.protein.expr)) #traverse through 1st matrix
    {
        for(j in 1:nrow(lncRNA.expr)) #traverse through 2nd matrix
        {
            #model 1
            lm1 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment)
            #model 2 (has an additional interaction term at the end)
            lm2 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment+lncRNA.expr[j,]+colData$treatment*lncRNA.expr[j,])
            #get the summary of model 2
            lm.model <- summary(lm2)
            #get the rownames of ith row of matrix 1 and jth row of matrix 2
            #get the pvalue from the anova model
            #get the estimate and std. error for last two terms of model 2
            #add these values as a row to tempdff
            tempdff <- rbind(tempdff,data.frame(rownames(diff.exp.protein.expr)[i],rownames(lncRNA.expr)[j],
                                      anova(lm1,lm2)$"Pr(>F)"[2]),lm.model$coefficients[4,1:2],lm.model$coefficients[6,1:2])
        }
    }
    return(tempdff) #return results
}

lm.anova.res <- lm.anova(diff.exp.protein.expr,lncRNA.expr,colData) #call function

And this is how the data, that goes into the function, looks like:

>head(diff.exp.protein.expr)[,1:5] #log transformed values

                     C00060   C00079   C00135   C00150   C00154
ENSG00000000005.5  5.187977 6.323024 6.022986 5.376513 4.810042
ENSG00000000460.12 7.071433 7.302448 7.499133 7.441582 7.439453
ENSG00000000938.8  8.713285 8.584996 8.982816 9.787420 8.823927
ENSG00000001497.12 9.569952 9.658418 9.392670 9.394250 9.087214
ENSG00000001617.7  9.215362 9.417367 8.949943 8.172713 9.198314
ENSG00000001626.10 6.363638 6.038328 6.698733 5.202132 5.336239

>head(lncRNA.expr)[,1:5] #log transformed values
                     C00060   C00079   C00135   C00150   C00154
ENSG00000100181.17 7.477983 7.776463 7.950514 8.098205 7.278615
ENSG00000115934.11 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000122043.6  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000124915.6  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000125514.5  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000129816.5  4.104380 4.104380 4.104380 4.104380 4.104380

>head(colData)[,1:5] #sample information for each column of the two input matrices
sample_name status age gender      site treatment
     C00060     NF  48 Female Cleveland      case
     C00079     NF  26 Female Cleveland      case
     C00135     NF  55 Female Cleveland      case
     C00150     NF  61   Male Cleveland      ctrl
     C00154     NF  50   Male Cleveland      case
     C00176     NF  60 Female Cleveland      ctrl

I wrote this code when I had very few tests (~50) to be performed. Now, I have it all figured out except that I don't know how can I make my code efficient because using a for-loop in this case where 14000*7600 tests have to be performed doesn't make any sense. It would take forever to run. I would like to know what are other alternatives by which I can really speed up this code. Any suggestion would be appreciated.

UPDATE 1: I have updated my linear models a bit. Now, anova(lm1,lm2)"Pr(>F)" does not give the same value as that for the interaction term in model2.

UPDATE 2: I have updated my linear models so that lm1 is nested in lm2.

Thanks,

Was it helpful?

Solution

So there are a couple of things.

First, calling data.frame(...) is very expensive, so calling it at each iteration will slow things down considerably. Lists, on the other hand, are extremely efficient. So try to keep everything in lists until the end.

Second, it might just be that running 2 * 106 million regressions is taking most of the time.

I would be inclined to try it this way:

## not tested...
df1 <- t(diff.exp.protein.expr)
df2 <- t(lncRNA.expr)

df  <- cbind(df1,df2,colData)
names <- expand.grid(x=colnames(df1),y=colnames(df2),stringsAsFactors=FALSE)
get.anova <- function(n){
  form.1 <- as.formula(paste0(n[1],"~gender+age+treatment+",n[2]))
  form.2 <- as.formula(paste0(n[1],"~gender+age+treatment+",n[2],"+treatment:",n[2]))
  lm1    <- lm(form.1,data=df)
  lm2    <- lm(form.2,data=df)
  coef <- summary(lm2)$coefficients
  list(n[1],n[2],anova(lm1,lm2)$"Pr(>F)"[2],coef[5,1],coef[5,2],coef[6,1],coef[6,2])
}
result <- do.call(rbind,apply(names,1,get.anova))
result <- data.frame(result)
colnames(result) <- c("protein","lncRNA","p.value","est.1","se.1","est.2","se.2")

This was not tested because the dataset you provided was not big enough: the models have < 0 df for error. Consequently, there is no row 6 in the coefficients table. Your real dataset will not have that problem.

EDIT (Response to OP's comment and provision of data).

With the dataset provided in your comment, we can benchmark both the original code (based on your updated post), and the new version (based on the code above, updated to reflect your new model formulas). In both case I use just 10 rows from each dataset (100 combinations).

f.original <- function() lm.anova(diff.exp.protein.expr.sub[1:10,],lncRNA.expr.sub[1:10,],colData)
f.new      <- function() do.call(rbind,apply(names,1,get.anova))
library(microbenchmark)
microbenchmark(f.original(), f.new(), times=10)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  f.original() 2.622461 2.712527 2.824419 2.869193 2.914099    10
#       f.new() 2.049756 2.074909 2.144422 2.191546 2.224831    10

So we can see that the new version, which returns lists instead of data frames, is about 25% faster.

Profiling both approaches, using Rprof, shows that the original version spends about 50% of it's time in lm(...), whereas the new version spends about 60% of it's (shorter) time in lm(...). This makes sense and suggests that the best you could possibly do is an improvement of about 30% over the new version. In other words: the calls to lm(...) are the bottlenecks: 200 million calls to lm(...) just takes a long time.

You could consider a parallel computing approach, for example using the foreach package, but before going down that road IMO you should consider other strategies for getting to your end result.

OTHER TIPS

Your problem seems to be the solved by my package MatrixEQTL.

http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

http://cran.r-project.org/web/packages/MatrixEQTL/index.html

You would need to use the modelLINEAR_CROSS model to test for the significance of the interaction term.

Please feel free to ask any questions about the package.

The main time consumer here is not the nested loop but lm. And here are a couple of things you can optimize (but see also Andrey Shabalin's answer - he may have an easier solution for your case).

You have two lm's, in simplified form:

        lm1 <- lm(Y~ X1 + X2 + X3 + X4)
        lm2 <- lm(Y~ X1 + X2 + X3 + X4 + X3:X4)

Then you make a summary of lm2 and compare lm1 with anova. Because you extract just the p-value from anova, but the p-value is identical to the p-value for interaction term in lm2, then doing anova is redundant. And lm1 is consequently also redundant. Then using rbind to increment a data frame is a waste of time, so I'd suggest using a list and just add a new element at each iteration. so your code within the loop could be simplified to:

# lm1 <- ... # skip this 
#model 2 (has an additional interaction term at the end)
lm2 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment+lncRNA.expr[j,]+lncRNA.expr[j,]:colData$treatment)
#get the summary of model 2, * and coefficients from that summary *
lms <- coef(summary(lm2))
tempdff[[length(tempdff)+1]]  <- data.frame(rownames(diff.exp.protein.expr)[i],
     rownames(lncRNA.expr)[j], lms[6,4],lms[5,1:2],lms[6,1:2]) 

In addition, you have the result as a data.frame - using a list instead would save some time too.

A next step could be examining what the lm and summary.lm are doing. You don't need all of what it does, you just use b's and their standard errors, as well as the p-value from the last row. You might be able to skip some of the computations that lm and summary.lm do .. with the help of perhaps some matrix algebra.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top