Question

Whenever I replace a for loop with an apply statement, my R scripts usually run faster, but here's an exception. I'm still inexperienced in using the apply family correctly, so what can I do to the apply statements to make them outperform (i.e., run faster than) the for loop?

Example data:

vc<-as.character(c("120,129,129,114","103,67,67,67,67,10,10,10,12","2,1,1,1,2,4,3,1,1,1,3,2,1,1","1,3,1,1,1,1,1,4",NA,"5","1,1,99","2,2,2,16,11,11,11,11,11,29,29,26,26,26,26,26,26,26,26,26,26,31,24,29,29,29,29,40,24,23,3,3,3,6,6,4,5,4,4,3,3,4,4,6,8,8,6,6,6,5,3,3,4,4,5,5,4,4,4,4,6,11,10,11,10,14,2,2,22,22,22,22,24,24,24,23,24,24,24,23,24,23,23,23,24,25,27,27,24,24,26,24,25,25,24,25,26,29,31,32,32,32,32,33,32,35,35,35,52,44,37,26","20,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,19,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,19,19,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,1,1,1,12,10","67,63,73,70,75,135,94,94,96,94,95,96,96,97,94,94,94,94,24,24,24,24,24,24,24,24,24,24,24,1,1,1"))

The goal is to populate a numeric matrix m.res where each row contains the top3 values of each element in vc. Here's the for loop:

fx.test1 <- function(vc)
{
    m.res <- matrix(ncol = 3, nrow = length(vc))
    for (j in 1:length(vc)) {
        vn <- as.numeric(unlist(strsplit(vc[j], split = ",")))
        vn[is.na(vn)] <- 0
        vn2 <- rev(sort(vn))
        m.res[j, ] <- vn2[1:3]
    }
    return(m.res)
}

And below is my "apply solution". Why is it slower? How can I make it faster? Thank you!

fx.test2 <- function(vc)
{
    vc[is.na(vc)] <- "0"
    ls.vc <- sapply(vc, function(x) tail(sort(as.numeric(unlist(strsplit(x, split = ",")))), 3),
                    simplify = TRUE)
    # pad rows that have fewer than 3 values with zeros
    ls.vc2 <- lapply(ls.vc, function(x) c(as.numeric(x), rep(0, times = 3 - length(x))))
    m.res <- as.matrix(t(as.data.frame(ls.vc2)))
    return(m.res)
}

system.time(m.res<-fx.test1(vc))
#   user  system elapsed 
#  0.001   0.000   0.001 

system.time(m.res<-fx.test2(vc))
#   user  system elapsed 
#  0.003   0.000   0.003

UPDATE: I followed @John's suggestions and generated two trimmed and truly equivalent functions. Indeed, I was able to speed up the lapply version somewhat, but it is still SLOWER than the for loop. If you have any ideas for how to optimize these functions for speed, please let me know. Thank you all.

fx.test3 <- function(vc)
{
    L <- strsplit(vc, split = ",")
    m.res <- matrix(ncol = 3, nrow = length(vc))
    for (j in 1:length(vc)) {
        m.res[j, ] <- sort(c(as.numeric(L[[j]]), rep(0, 3)), decreasing = TRUE)[1:3]
    }
    return(m.res)
}



fx.test4 <- function(vc)
{
    L <- strsplit(vc, split = ",")
    D <- t(as.data.frame(lapply(L, function(X) {
        sort(c(as.numeric(X), rep(0, 3)), decreasing = TRUE)[1:3]
    })))
    row.names(D) <- NULL
    m.res <- as.matrix(D)
    return(m.res)
}

system.time(fx.test3(vc))
#   user  system elapsed 
#  0.001   0.000   0.001

system.time(fx.test4(vc))
#   user  system elapsed 
#  0.002   0.000   0.002 

Solution

UPDATE2 & potential answer:

I now simplified fx.test4 as follows, and it is now equivalent in speed to the for loop. So it was the extra conversion steps that made the lapply solution slower, as @John pointed out. In addition, the assumption that *apply HAD to be faster was probably faulty, as discussed by @Ari B. Friedman and @SimonO101. Thank you all!

fx.test5 <- function(vc)
{
    L <- strsplit(vc, split = ",")
    m.res <- t(sapply(seq_along(L), function(X) {
        sort(c(as.numeric(L[[X]]), rep(0, 3)), decreasing = TRUE)[1:3]
    }))
    return(m.res)
}

fx.test5(vc)
      [,1] [,2] [,3]
 [1,]  129  129  120
 [2,]  103   67   67
 [3,]    4    3    3
 [4,]    4    3    1
 [5,]    0    0    0
 [6,]    5    0    0
 [7,]   99    1    1
 [8,]   52   44   40
 [9,]   20   19   19
[10,]  135   97   96

system.time(fx.test5(vc))
   user  system elapsed 
  0.001   0.000   0.001 

UPDATE3: Indeed, on a longer example, the *apply function is faster (by a hair).

system.time(fx.test3(vc2))
#   user  system elapsed 
#  3.596   0.006   3.601 
system.time(fx.test5(vc2))
#   user  system elapsed 
#  3.355   0.006   3.359
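
(vc2 itself is not shown in the post. A hypothetical way to build a comparable long input, plus a more robust timing with the microbenchmark package, might look like this sketch:)

```r
# vc2 is not defined above; one illustrative way to build a long test input
vc2 <- rep(vc, 20000)

# microbenchmark repeats each call several times for stabler timings
# (assumes the microbenchmark package is installed)
library(microbenchmark)
microbenchmark(fx.test3(vc2), fx.test5(vc2), times = 10)
```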

OTHER TIPS

Your problem can be solved using the concat.split function from the splitstackshape package:

library(splitstackshape)
kk<-data.frame(vc)
nn<-concat.split(kk,split.col="vc",sep=",")
head(nn[1:10,1:4])
                           vc vc_1 vc_2 vc_3
1             120,129,129,114  120  129  129
2 103,67,67,67,67,10,10,10,12  103   67   67
3 2,1,1,1,2,4,3,1,1,1,3,2,1,1    2    1    1
4             1,3,1,1,1,1,1,4    1    3    1
5                        <NA>   NA   NA   NA
6                           5    5   NA   NA

You can then manipulate the nn data frame to pull out the top values from each row.
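
For example, the row-wise top three could be extracted like this (a sketch; it assumes nn behaves as a plain data frame with the numeric vc_* columns sitting after the original vc column):

```r
num <- as.matrix(nn[, -1])        # drop the original vc column
num[is.na(num)] <- 0              # treat missing entries as 0
m.res <- t(apply(num, 1, function(x) sort(x, decreasing = TRUE)[1:3]))
```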

You're doing lots of stuff in your loops, whether apply or for, that shouldn't be there. The main advantage of apply is not so much that it is faster than for, but that it encourages a style that keeps things vectorized as much as possible (i.e., as little work inside your loops as possible). The thing R is particularly slow at is interpreting function calls, and each time through a loop it must re-interpret every function call it encounters. Sometimes loops are unavoidable, but they should be made as small as possible.

Your strsplit can be hoisted outside the first sapply, so it is called only once. Then you also don't need unlist before as.numeric. You can also sort with decreasing = TRUE and take the first three values instead of additionally calling tail (although tail may be about as fast as a [1:3] selector). All of that saves function interpretation inside your loop being repeated over and over.

You don't have to pre-allocate your matrix because you're going to generate the values all at once and shape them into a matrix.

See if following that advice speeds things up.
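
Put together, that advice might look like the following sketch (fx.john is an illustrative name, not code from the answer; vapply is used here so the output shape is declared up front and no reshaping step is needed):

```r
fx.john <- function(vc) {
  L <- strsplit(vc, split = ",")   # one strsplit call, outside the loop
  # padding with three zeros handles elements with fewer than three values;
  # sort() drops the NA produced by the all-NA element
  top3 <- function(x) sort(c(as.numeric(x), 0, 0, 0), decreasing = TRUE)[1:3]
  t(vapply(L, top3, numeric(3)))   # vapply fixes the result type as numeric(3)
}
```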

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow