Whenever I replace a for loop with an apply statement, my R scripts usually run faster, but here's an exception. I'm still inexperienced in using the apply family correctly, so what can I do to make the apply statements outperform (i.e. run faster than) the for loop?
Example data:
vc<-as.character(c("120,129,129,114","103,67,67,67,67,10,10,10,12","2,1,1,1,2,4,3,1,1,1,3,2,1,1","1,3,1,1,1,1,1,4",NA,"5","1,1,99","2,2,2,16,11,11,11,11,11,29,29,26,26,26,26,26,26,26,26,26,26,31,24,29,29,29,29,40,24,23,3,3,3,6,6,4,5,4,4,3,3,4,4,6,8,8,6,6,6,5,3,3,4,4,5,5,4,4,4,4,6,11,10,11,10,14,2,2,22,22,22,22,24,24,24,23,24,24,24,23,24,23,23,23,24,25,27,27,24,24,26,24,25,25,24,25,26,29,31,32,32,32,32,33,32,35,35,35,52,44,37,26","20,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,19,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,19,19,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,1,1,1,12,10","67,63,73,70,75,135,94,94,96,94,95,96,96,97,94,94,94,94,24,24,24,24,24,24,24,24,24,24,24,1,1,1"))
The goal is to populate a numeric matrix m.res where each row contains the top 3 values of the corresponding element of vc. Here's the for loop:
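fx.test1<-function(vc)
{
# one row per element of vc, one column per top value
m.res<-matrix(ncol=3, nrow=length(vc))
for (j in 1:length(vc))
{vn<-as.numeric(unlist(strsplit(vc[j], split=",")))
vn[is.na(vn)]<-0; vn2<-rev(sort(vn))
# elements with fewer than 3 values leave NA in the remaining columns
m.res[j,]<-vn2[1:3]
}
return(m.res)
}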
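To make the target concrete, here is a minimal sketch of what one row of m.res should hold for a single element. The helper name `top3` is hypothetical and not part of my functions, and padding with zeros assumes the values are never negative, as in the example data.

```r
# Hypothetical helper: top three values of one comma-separated string,
# padded with zeros when fewer than three numbers are present.
top3 <- function(s) {
  vn <- as.numeric(unlist(strsplit(s, split = ",")))
  vn <- vn[!is.na(vn)]                       # an NA element contributes no values
  sort(c(vn, rep(0, 3)), decreasing = TRUE)[1:3]
}

top3("1,1,99")        # 99 1 1
top3("5")             # 5 0 0
top3(NA_character_)   # 0 0 0
```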
And below is my "apply solution". Why is it slower? How can I make it faster? Thank you!
fx.test2<-function(vc)
{
m.res<-matrix(ncol=3, nrow=length(vc))
vc[is.na(vc)]<-"0"
# tail(sort(x), 3) keeps the 3 largest values in ascending order;
# simplify=FALSE so the result is always a list
ls.vc<-sapply(vc, function(x) tail(sort(as.numeric(unlist(strsplit(x, split=",")))),3), simplify=FALSE)
#names(ls.vc)<-seq(1:length(vc))
# pad elements that have fewer than 3 values with zeros
ls.vc2<-lapply(ls.vc, function(x) c(as.numeric(x), rep(0, times = 3 - length(x))))
m.res<-as.matrix(t(as.data.frame(ls.vc2)))
return(m.res)
}
system.time(m.res<-fx.test1(vc))
# user system elapsed
# 0.001 0.000 0.001
system.time(m.res<-fx.test2(vc))
# user system elapsed
# 0.003 0.000 0.003
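Timings of a few milliseconds on a 10-element vector are close to the resolution of system.time, so a comparison on a larger input, repeated several times, is probably more informative. A rough sketch, assuming a synthetic input and a hypothetical helper `time.median`; the stand-in workload `split.only` is only there to keep the example self-contained and would be swapped for the functions being compared:

```r
# Hypothetical helper: median elapsed time of f(x) over several repetitions.
time.median <- function(f, x, reps = 5) {
  elapsed <- replicate(reps, system.time(f(x))[["elapsed"]])
  median(elapsed)
}

# Assumption: a larger synthetic input made by repeating a few strings.
vc.big <- rep(c("120,129,129,114", "1,1,99", NA, "5"), times = 5000)

# Stand-in workload for illustration only; replace with the function to time.
split.only <- function(v) strsplit(v, split = ",")

time.median(split.only, vc.big)
```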
UPDATE: I followed @John's suggestions and wrote two trimmed and truly equivalent functions. I was indeed able to speed up the lapply function somewhat, but it's still SLOWER than the for loop. If you happen to have any ideas for how to optimize these functions for speed, please let me know. Thank you all.
fx.test3<-function(vc)
{
L<-strsplit(vc,split=",")
m.res<-matrix(ncol=3, nrow=length(vc))
for (j in 1:length(vc))
{
m.res[j,]<-sort(c(as.numeric(L[[j]]),rep(0,3)), decreasing=TRUE)[1:3]
}
return(m.res)
}
fx.test4<-function(vc)
{
L<-strsplit(vc, split=",")
D<-t(as.data.frame(lapply(L, function(X) {sort(c(as.numeric(X),rep(0,3)),decreasing=TRUE)[1:3]})))
row.names(D)<-NULL
m.res<-as.matrix(D)
return(m.res)
}
system.time(fx.test3(vc))
# user system elapsed
# 0.001 0.000 0.001
system.time(fx.test4(vc))
# user system elapsed
# 0.002 0.000 0.002
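One variant I could also try is vapply, which returns a numeric matrix directly and so skips the as.data.frame step in fx.test4. This is only a sketch: the name `fx.vapply` is hypothetical, and I have not benchmarked it against the functions above.

```r
# Hypothetical variant (not one of the functions timed above): vapply returns
# a 3 x n numeric matrix directly, so no data.frame is built.
fx.vapply <- function(vc) {
  L <- strsplit(vc, split = ",")
  top <- vapply(L, function(x) {
    # sort() drops NA, so an NA element yields three zeros
    sort(c(as.numeric(x), rep(0, 3)), decreasing = TRUE)[1:3]
  }, numeric(3))
  t(top)  # transpose to one row per element of vc
}

fx.vapply(c("1,1,99", NA, "5"))
```

The `numeric(3)` template makes vapply check that every element yields exactly three numbers, which is the same shape fx.test3 fills in row by row.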