Question

I come from a Java/Python comp sci theory background so I am still getting used to the various R packages and how they can save run time in functions.

Basically, I am working on a few projects and all of them involve taking individual factors in a long-list data set (15,000 to 200,000 factors) and performing calculations on individual factors in an equally-large data set, and concurrently storing the results of those calculations in an exponentially-longer data frame.

So far I have been using nested while loops and concatenating into a growing list, but that is taking days. Ive recently learned about 'lapply' and the 'data.frame' options in R, and I would love to see an example of how to apply (no pun intended) them to the following basic correlation function:

Corr<-function(miRdf, mRNAdf)
{
j=1  
k=1
m=1
n=1
c=0
corrList=NULL
while(n<=71521)
{
  while(m<=1477)
  {    
  corr=cor(as.numeric(miRdf[k,2:13]), as.numeric(mRNAdf[j,2:13]), use ="complete.obs")
  corrList<-c(corrList, corr)
  j=j+1
  c=c+1
  print(c)  #just a counter to see how far the function has run
  m=m+1
}
k=k+1
n=n+1
j=1
m=1         #to reset the inner while loop
}
corrList<-matrix(unlist(corrList), ncol=1477, byrow=FALSE)
colnames(corrList)<-miRdf[,1]
rownames(corrList)<-mRNAdf[,1]
write.csv(corrList, "testCorrWhole.csv")
} 

As you can see, the nested while loop results in 105,636,517 (71521x1477) miRNA vs mRNA expression-value correlation scores that need to be performed and stored in a data frame that is 1477 cols x 71521 rows in order to generate a scoring matrix.

My question is, can anyone shed light on how to turn the above monstrosity into an efficient function that utilizes 'lapply' instead of the while loops, and uses the 'data.table' set() function to do away with the inefficiency of concatenating a list during every pass through the loop?

Thank you in advance!

Was it helpful?

Solution

Your names end with 'df', which makes it seem like your data are a data.frame. But @Troy's answer uses a matrix. A matrix is appropriate when the data are homogeneous, and generally matrix operations are much faster than data.frame operations. So you can see already that if you'd provided a small example of your data set (e.g., dput(mRNAdf[1:10,]) that people might be in a better position to help you; this is what they're asking for.

In large numerical calculations it makes sense to 'hoist' any repeated calculations outside the loop, so they are performed only once. Repeated calculations in your case include sub-setting to columns 2:13, and coercion to numeric. With this idea, and guessing that you actually have a data.frame where each column is already a numeric vector, I'd start with

mRNAmatrix <- as.matrix(mRNAdf[,2:13])
miRmatrix <- as.matrix(miRdf[,2:13])

From the help page ?cor we see that the arguments can be a matrix, and if so the correlation is calculated between columns. You're interested in the result when the arguments are transposed relative to your current representation. So

result <- cor(t(mRNAmatrix), t(miRmatrix), use="complete.obs")

This is fast enough for your purposes

> m1 = matrix(rnorm(71521 * 12), 71521)
> m2 = matrix(rnorm(1477 * 12), 1477)
> system.time(ans <- cor(t(m1), t(m2)))
   user  system elapsed 
  9.124   0.200   9.340 
> dim(ans)
[1] 71521  1477

result is the same as your corrList -- it's not a list, but a matrix; probably the row and column names have been carried forward. You'd write this to a file as you do above, write.csv(result, "testCorrWhole.csv")

OTHER TIPS

UPDATED BELOW TO SHOW PARALLEL PROCESSING - ABOUT A 60% SAVING

Using apply() might not be quick enough for you. Here's how to do it, though. Will have a think about performance since this example (1M output correlations in 1000x1000 grid) takes over a minute on laptop.

miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf)     # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)

corRow<-function(y){
    apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
  }

system.time(apply(mRNAdf,1,function(x)corRow(x[1])))
# user  system elapsed 
# 72.94    0.00   73.39

And with parallel::parApply on a 4 core Win64 laptop

require(parallel) ## Library to allow parallel processing

miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf)     # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)

corRow<-function(y){
    apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
  }


      # Make a cluster from all available cores
      cl=makeCluster(detectCores()) 
      # Use clusterExport() to distribute the function and data.frames needed in the apply() call
      clusterExport(cl,c("corRow","miRdf","mRNAdf"))
      # time the call
      system.time(parApply(cl,mRNAdf,1,function(x)corRow(x[[1]])))

      # Stop the cluster
      stopCluster(cl)

      # time the call without clustering
      system.time(apply(mRNAdf,1,function(x)corRow(x[[1]])))

      ## WITH CLUSTER (4)
      user  system elapsed 
      0.04    0.03   29.94 

      ## WITHOUT CLUSTER
      user  system elapsed 
      73.96    0.00   74.46     
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top