lapply and do.call running very slowly?
-
14-06-2021 - |
Question
I have a data frame that is some 35,000 rows, by 7 columns. it looks like this:
head(nuc)
chr feature start end gene_id pctAT pctGC length
1 1 CDS 67000042 67000051 NM_032291 0.600000 0.400000 10
2 1 CDS 67091530 67091593 NM_032291 0.609375 0.390625 64
3 1 CDS 67098753 67098777 NM_032291 0.600000 0.400000 25
4 1 CDS 67101627 67101698 NM_032291 0.472222 0.527778 72
5 1 CDS 67105460 67105516 NM_032291 0.631579 0.368421 57
6 1 CDS 67108493 67108547 NM_032291 0.436364 0.563636 55
gene_id is a factor, that has about 3,500 unique levels. I want to, for each level of gene_id get the min(start)
, max(end)
, mean(pctAT)
, mean(pctGC)
, and sum(length)
.
I tried using lapply and do.call for this, but it's taking forever +30 minutes to run. the code I'm using is:
nuc_prof = lapply(levels(nuc$gene_id), function(gene){
t = nuc[nuc$gene_id==gene, ]
return(list(gene_id=gene, start=min(t$start), end=max(t$end), pctGC =
mean(t$pctGC), pct = mean(t$pctAT), cdslength = sum(t$length)))
})
nuc_prof = do.call(rbind, nuc_prof)
I'm certain I'm doing something wrong to slow this down. I haven't waited for it to finish as I'm sure it can be faster. Any ideas?
Solution
Since I'm in an evangelizing mood ... here's what the fast data.table
solution would look like:
library(data.table)
dt <- data.table(nuc, key="gene_id")
dt[,list(A=min(start),
B=max(end),
C=mean(pctAT),
D=mean(pctGC),
E=sum(length)), by=key(dt)]
# gene_id A B C D E
# 1: NM_032291 67000042 67108547 0.5582567 0.4417433 283
# 2: ZZZ 67000042 67108547 0.5582567 0.4417433 283
OTHER TIPS
do.call
can be extremely slow on large objects. I think this is due to how it constructs the call, but I'm not certain. A faster alternative would be the data.table
package. Or, as @Andrie suggested in a comment, use tapply
for each calculation and cbind
the results.
A note on your current implementation: rather than doing the subsetting in your function, you could use the split
function to break up your data.frame into a list of data.frames you can loop over.
g <- function(tnuc) {
list(gene_id=tnuc$gene_id[1], start=min(tnuc$start), end=max(tnuc$end),
pctGC=mean(tnuc$pctGC), pct=mean(tnuc$pctAT), cdslength=sum(tnuc$length))
}
nuc_prof <- lapply(split(nuc, nuc$gene_id), g)
As others have mentioned - do.call
has issues with large objects, and I recently discovered exactly how slow it is on large data sets. To illustrate the problem, here's a benchamark using a simple summary call with a large regression object (a cox regression using the rms-package):
> model <- cph(Surv(Time, Status == "Cardiovascular") ~
+ Group + rcs(Age, 3) + cluster(match_group),
+ data=full_df,
+ x=TRUE, y=TRUE)
> system.time(s_reg <- summary(object = model))
user system elapsed
0.00 0.02 0.03
> system.time(s_dc <- do.call(summary, list(object = model)))
user system elapsed
282.27 0.08 282.43
> nrow(full_df)
[1] 436305
While the data.table
solution is an excellent approach to above it does not contain the full functionality of the do.call
and I therefore thought that I'll share my fastDoCall
function - a modification of Hadley Wickhams suggested hack on the R-mailing list. It is available in the Gmisc-package 1.0-version (not yet released on CRAN but you can find it here). The benchmark is:
> system.time(s_fc <- fastDoCall(summary, list(object = model)))
user system elapsed
0.03 0.00 0.06
The full code for the function is as below:
fastDoCall <- function(what, args, quote = FALSE, envir = parent.frame()){
if (quote)
args <- lapply(args, enquote)
if (is.null(names(args))){
argn <- args
args <- list()
}else{
# Add all the named arguments
argn <- lapply(names(args)[names(args) != ""], as.name)
names(argn) <- names(args)[names(args) != ""]
# Add the unnamed arguments
argn <- c(argn, args[names(args) == ""])
args <- args[names(args) != ""]
}
if (class(what) == "character"){
if(is.character(what)){
fn <- strsplit(what, "[:]{2,3}")[[1]]
what <- if(length(fn)==1) {
get(fn[[1]], envir=envir, mode="function")
} else {
get(fn[[2]], envir=asNamespace(fn[[1]]), mode="function")
}
}
call <- as.call(c(list(what), argn))
}else if (class(what) == "function"){
f_name <- deparse(substitute(what))
call <- as.call(c(list(as.name(f_name)), argn))
args[[f_name]] <- what
}else if (class(what) == "name"){
call <- as.call(c(list(what, argn)))
}
eval(call,
envir = args,
enclos = envir)
}