Question

Consider the following matrix,

nc <- 5000
nr <- 1024
m <- matrix(rnorm(nc*nr), ncol=nc)

I wish to take the difference between the rowMeans of two groups of identical size taken at random in this matrix.

n <- 1000 # group size

system.time(replicate(100, {
   ind1 <- sample(seq.int(nc), n) 
   ind2 <- sample(seq.int(nc), n)
   rowMeans(m[, ind1]) - rowMeans(m[, ind2])
}))

It is quite slow, unfortunately I didn't understand the output of Rprof (it seemed most of the time was spent on is.data.frame??)

Suggestions for something more efficient?

I have contemplated the following:

  • Rcpp: from my online readings I believe R's rowMeans is quite efficient, so it's not clear it would help at this step. I'd like to be convinced of where the bottleneck really is first, maybe my whole design is suboptimal. If most of the time is spent in making copies for each of the smaller matrices, would Rcpp perform better?

  • updating to R-devel, there seems to be a new .rowMeans function more efficient yet. Has anyone tried it?

Thanks.

Was it helpful?

Solution

Each rowSums() call on a subset of columns from m can be seen as the matrix multiplication between m and a vector of 0 or 1 indicating the selected columns. If you juxtapose all those vectors, you end up with a multiplication between two matrices (which is much more efficient):

ind1 <- replicate(100, seq.int(nc) %in% sample(seq.int(nc), n)) 
ind2 <- replicate(100, seq.int(nc) %in% sample(seq.int(nc), n))
output <- m %*% (ind1 - ind2)

OTHER TIPS

You don't need 2 calls to rowMeans. You can do the subtraction first and call rowMeans on the result.

x1 <- rowMeans(m[,ind1])-rowMeans(m[,ind2])
x2 <- rowMeans(m[,ind1]-m[,ind2])
all.equal(x1,x2)
# [1] TRUE

is.data.frame is part of the checks done in rowMeans.

UPDATE: regarding .rowMeans in R-devel, it looks like it's just a straight call to the internal code (assuming do_colsum hasn't changed). It's defined as:

.rowMeans <- function(X, m, n, na.rm = FALSE)
    .Internal(rowMeans(X, m, n, na.rm))

In your case, m=1024 and n=1000.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top