Question

I'm pretty confused. I want to speed up my algorithm using mclapply from the parallel package, but when I compare running times, apply still wins.

I'm smoothing log2ratio data with rq.fit.fnb from the quantreg package, which is called by my function quantsm, and I'm wrapping my data into a matrix/list so I can use apply/lapply (and mclapply).

I prepare my data like this:

q = matrix(data, ncol=N)        # wrapping into matrix (using N = 2, 4, 6 or 8)
ql = as.list(as.data.frame(q))  # making list
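
To see what each wrapper contains, here is a minimal sketch with a toy vector (the length 12 and N = 4 here are illustrative only):

toy    <- 1:12
q_toy  <- matrix(toy, ncol = 4)          # 3 rows of length 4
ql_toy <- as.list(as.data.frame(q_toy))  # 4 list elements (the columns), each of length 3
dim(q_toy)        # 3 4
lengths(ql_toy)   # every column holds 3 values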

And the timing comparison:

apply=system.time(apply(q, 1, FUN=quantsm, 0.50, 2))
lapply=system.time(lapply(ql, FUN=quantsm, 0.50, 2))
mc2lapply=system.time(mclapply(ql, FUN=quantsm, 0.50, 2, mc.cores=2))
mc4lapply=system.time(mclapply(ql, FUN=quantsm, 0.50, 2, mc.cores=4))
mc6lapply=system.time(mclapply(ql, FUN=quantsm, 0.50, 2, mc.cores=6))
mc8lapply=system.time(mclapply(ql, FUN=quantsm, 0.50, 2, mc.cores=8))
timing=rbind(apply,lapply,mc2lapply,mc4lapply,mc6lapply,mc8lapply)

Function quantsm:

quantsm <- function (y, p = 0.5, lambda) {
   # Quantile smoothing
   # Input: response y, quantile level p (0<p<1), smoothing parameter lambda
   # Result: quantile curve

   # Augment the data for the difference penalty
   m <- length(y)
   E <- diag(m);
   Dmat <- diff(E);
   X <- rbind(E, lambda * Dmat)
   u <- c(y, rep(0, m - 1))

   # Call quantile regression
   q <- rq.fit.fnb(X, u, tau = p)
   q
}
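
Stand-alone, quantsm can be called like this (a sketch: the quantreg package must be loaded so that rq.fit.fnb is available, and the input series below is made up):

library(quantreg)
set.seed(1)
y_example <- cumsum(rnorm(100))                 # made-up noisy series, for illustration only
fit <- quantsm(y_example, p = 0.5, lambda = 2)
length(fit$coefficients)                        # 100: one smoothed value per observation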

Function rq.fit.fnb (quantreg library):

rq.fit.fnb <- function (x, y, tau = 0.5, beta = 0.99995, eps = 1e-06) 
{
    n <- length(y)
    p <- ncol(x)
    if (n != nrow(x)) 
        stop("x and y don't match n")
    if (tau < eps || tau > 1 - eps) 
        stop("No parametric Frisch-Newton method.  Set tau in (0,1)")
    rhs <- (1 - tau) * apply(x, 2, sum)
    d <- rep(1, n)
    u <- rep(1, n)
    wn <- rep(0, 10 * n)
    wn[1:n] <- (1 - tau)
    z <- .Fortran("rqfnb", as.integer(n), as.integer(p), a = as.double(t(as.matrix(x))), 
        c = as.double(-y), rhs = as.double(rhs), d = as.double(d), 
        as.double(u), beta = as.double(beta), eps = as.double(eps), 
        wn = as.double(wn), wp = double((p + 3) * p), it.count = integer(3), 
        info = integer(1), PACKAGE = "quantreg")
    coefficients <- -z$wp[1:p]
    names(coefficients) <- dimnames(x)[[2]]
    residuals <- y - x %*% coefficients
    list(coefficients = coefficients, tau = tau, residuals = residuals)
}

For a data vector of length 2000 I get:

(values are elapsed time in seconds; the columns correspond to the number of columns of the smoothed matrix/list)

           2cols 4cols 6cols 8cols
apply      0.178 0.096 0.069 0.056
lapply    16.555 4.299 1.785 0.972
mc2lapply 11.192 2.089 0.927 0.545
mc4lapply 10.649 1.326 0.694 0.396
mc6lapply 11.271 1.384 0.528 0.320
mc8lapply 10.133 1.390 0.560 0.260

For a data vector of length 4000 I get:

            2cols  4cols  6cols 8cols
apply       0.351  0.187  0.137 0.110
lapply    189.339 32.654 14.544 8.674
mc2lapply 186.047 20.791  7.261 4.231
mc4lapply 185.382 30.286  5.767 2.397
mc6lapply 184.048 30.170  8.059 2.865
mc8lapply 182.611 37.617  7.408 2.842

Why is apply so much more efficient than mclapply? Maybe I'm just making a common beginner mistake.

Thank you for your responses.

Solution

It looks like mclapply compares reasonably well against lapply, but lapply does not compare well against apply. The likely reason is that apply is iterating over the rows of q while lapply and mclapply are iterating over its columns, so the per-call work is very different.
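
To make this concrete: with a 2000-long vector and N = 2, apply(q, 1, ...) fits 1000 vectors of length 2, while lapply(ql, ...) fits only 2 vectors of length 1000, and because quantsm builds an m x m identity matrix and an augmented design internally, its cost grows quickly with the length m of its input. A rough sketch, reusing the names from the question with made-up data:

data <- rnorm(2000)                   # stand-in for the real log2ratio values
N    <- 2
q    <- matrix(data, ncol = N)
ql   <- as.list(as.data.frame(q))

nrow(q)          # 1000 rows of length 2    -> the units apply(q, 1, ...) works on
lengths(ql)      # 2 columns of length 1000 -> the units lapply/mclapply work on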

If you really do want to iterate over the rows of q, you could create ql using:

ql <- lapply(seq_len(nrow(q)), function(i) q[i, ])
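
With a row list like that, mclapply works on the same units as apply, so the two can be compared on equal terms (a sketch; ql_rows is a name introduced here, and the parallel package must be loaded for mclapply):

library(parallel)
ql_rows <- lapply(seq_len(nrow(q)), function(i) q[i, ])
system.time(res_mc <- mclapply(ql_rows, FUN = quantsm, 0.50, 2, mc.cores = 4))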

If you want to iterate over the columns of q, then you should set MARGIN=2 in apply, as suggested by @flodel.

Both lapply and mclapply will iterate over the columns of a data frame, so you can create ql with:

ql <- as.data.frame(q)

This makes sense since a data frame actually is a list.
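
A quick check makes this concrete (illustrative only):

df <- as.data.frame(q)
is.list(df)      # TRUE: a data frame is a list of its columns
length(df)       # number of columns, i.e. the elements lapply/mclapply iterate over
lengths(df)      # length of each column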

Licensed under: CC-BY-SA with attribution