Question

I have a data frame that is some 35,000 rows, by 7 columns. it looks like this:

head(nuc)

  chr feature    start      end   gene_id    pctAT    pctGC length
1   1     CDS 67000042 67000051 NM_032291 0.600000 0.400000     10
2   1     CDS 67091530 67091593 NM_032291 0.609375 0.390625     64
3   1     CDS 67098753 67098777 NM_032291 0.600000 0.400000     25
4   1     CDS 67101627 67101698 NM_032291 0.472222 0.527778     72
5   1     CDS 67105460 67105516 NM_032291 0.631579 0.368421     57
6   1     CDS 67108493 67108547 NM_032291 0.436364 0.563636     55

gene_id is a factor, that has about 3,500 unique levels. I want to, for each level of gene_id get the min(start), max(end), mean(pctAT), mean(pctGC), and sum(length).

I tried using lapply and do.call for this, but it's taking forever +30 minutes to run. the code I'm using is:

nuc_prof = lapply(levels(nuc$gene_id), function(gene){
  t = nuc[nuc$gene_id==gene, ]
  return(list(gene_id=gene, start=min(t$start), end=max(t$end), pctGC =
              mean(t$pctGC), pct = mean(t$pctAT), cdslength = sum(t$length))) 
})
nuc_prof = do.call(rbind, nuc_prof)

I'm certain I'm doing something wrong to slow this down. I haven't waited for it to finish as I'm sure it can be faster. Any ideas?

Was it helpful?

Solution

Since I'm in an evangelizing mood ... here's what the fast data.table solution would look like:

library(data.table)
dt <- data.table(nuc, key="gene_id")

dt[,list(A=min(start),
         B=max(end),
         C=mean(pctAT),
         D=mean(pctGC),
         E=sum(length)), by=key(dt)]
#      gene_id        A        B         C         D   E
# 1: NM_032291 67000042 67108547 0.5582567 0.4417433 283
# 2:       ZZZ 67000042 67108547 0.5582567 0.4417433 283

OTHER TIPS

do.call can be extremely slow on large objects. I think this is due to how it constructs the call, but I'm not certain. A faster alternative would be the data.table package. Or, as @Andrie suggested in a comment, use tapply for each calculation and cbind the results.

A note on your current implementation: rather than doing the subsetting in your function, you could use the split function to break up your data.frame into a list of data.frames you can loop over.

g <- function(tnuc) {
  list(gene_id=tnuc$gene_id[1], start=min(tnuc$start), end=max(tnuc$end),
       pctGC=mean(tnuc$pctGC), pct=mean(tnuc$pctAT), cdslength=sum(tnuc$length))
}
nuc_prof <- lapply(split(nuc, nuc$gene_id), g)

As others have mentioned - do.call has issues with large objects, and I recently discovered exactly how slow it is on large data sets. To illustrate the problem, here's a benchamark using a simple summary call with a large regression object (a cox regression using the rms-package):

> model <- cph(Surv(Time, Status == "Cardiovascular") ~  
+              Group + rcs(Age, 3) + cluster(match_group),
+              data=full_df, 
+              x=TRUE, y=TRUE)

> system.time(s_reg <- summary(object = model))
   user  system elapsed 
   0.00    0.02    0.03 
> system.time(s_dc <- do.call(summary, list(object = model)))
   user  system elapsed 
 282.27    0.08  282.43 
> nrow(full_df)
[1] 436305

While the data.table solution is an excellent approach to above it does not contain the full functionality of the do.call and I therefore thought that I'll share my fastDoCall function - a modification of Hadley Wickhams suggested hack on the R-mailing list. It is available in the Gmisc-package 1.0-version (not yet released on CRAN but you can find it here). The benchmark is:

> system.time(s_fc <- fastDoCall(summary, list(object = model)))
   user  system elapsed 
   0.03    0.00    0.06 

The full code for the function is as below:

fastDoCall <- function(what, args, quote = FALSE, envir = parent.frame()){
  if (quote)
    args <- lapply(args, enquote)

  if (is.null(names(args))){
    argn <- args
    args <- list()
  }else{
    # Add all the named arguments
    argn <- lapply(names(args)[names(args) != ""], as.name)
    names(argn) <- names(args)[names(args) != ""]
    # Add the unnamed arguments
    argn <- c(argn, args[names(args) == ""])
    args <- args[names(args) != ""]
  }

  if (class(what) == "character"){
    if(is.character(what)){
      fn <- strsplit(what, "[:]{2,3}")[[1]]
      what <- if(length(fn)==1) {
        get(fn[[1]], envir=envir, mode="function")
      } else {
        get(fn[[2]], envir=asNamespace(fn[[1]]), mode="function")
      }
    }
    call <- as.call(c(list(what), argn))
  }else if (class(what) == "function"){
    f_name <- deparse(substitute(what))
    call <- as.call(c(list(as.name(f_name)), argn))
    args[[f_name]] <- what
  }else if (class(what) == "name"){
    call <- as.call(c(list(what, argn)))
  }

  eval(call,
       envir = args,
       enclos = envir)
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top