Replacing a loop in R: multivariate k-nearest neighbor regression example

Question

One option in these situations is to build a big matrix and manipulate the indices:

y2 <- array(colMeans(matrix(y[t(nn), ], nrow = ncol(nn))), dim(y.new))
identical(y2, y.new)
## [1] TRUE
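To see why the one-liner works, it can help to split it into named steps. The sketch below is only an illustration: the toy data (6 training rows, 2 response columns, 4 query points, k = 3) and the intermediate names `stacked` and `blocks` are made up here, not taken from the question.

```r
# Hypothetical toy data: 6 training rows, 2 response columns,
# 4 query points with k = 3 neighbours each.
set.seed(1)
y     <- matrix(rnorm(12), nrow = 6, ncol = 2)   # responses, 6 x 2
nn    <- t(replicate(4, sample(6, 3)))           # neighbour indices, 4 x 3
y.new <- matrix(NA_real_, nrow = nrow(nn), ncol = ncol(y))

# Reference result: the loop from the question.
for (i in seq_len(nrow(nn)))
  y.new[i, ] <- colMeans(y[nn[i, ], , drop = FALSE])

k <- ncol(nn)

# Step 1: t(nn) is read column by column when used as a row index, so the
# rows of y come out grouped by query point (k rows for point 1, then k rows
# for point 2, and so on).
stacked <- y[t(nn), , drop = FALSE]              # (4 * 3) x 2

# Step 2: reshape so that every column holds exactly k values, namely the
# neighbours of one query point for one response column.
blocks <- matrix(stacked, nrow = k)              # 3 x (4 * 2)

# Step 3: one colMeans call averages every block; array() folds the result
# back into the shape of y.new.
y2 <- array(colMeans(blocks), dim(y.new))

all.equal(y2, y.new)
```

The reshape in step 2 relies on R storing matrices column by column, which is why `nn` has to be transposed first.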

In this case, my code runs about twice as fast as yours:

library(microbenchmark)

microbenchmark(
  loop = for (i in 1:nrow(nn))
    y.new[i, ] <- colMeans(y[nn[i, ], , drop = FALSE]),
  matrix = y2 <- array(colMeans(matrix(y[t(nn), ], nrow = ncol(nn))), dim(y.new))
)
## Unit: microseconds
##    expr    min      lq  median     uq     max neval
##    loop 43.680 47.8805 49.1675 49.975 128.698   100
##  matrix 23.807 25.4330 25.9985 26.761  80.491   100

The loop in this case isn't really that bad. As long as each iteration does a substantial amount of work (here, subsetting a matrix and calling colMeans), the per-iteration overhead is small compared to the work itself. The loops you really need to avoid in R are the ones where each iteration does only a small amount of work: there the overhead of iterating in R becomes the bottleneck, and avoiding the loop can give a dramatic performance improvement.
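For contrast, here is a minimal sketch of the kind of loop that is worth avoiding, where each iteration does almost nothing. The vector `x` and the arithmetic are hypothetical and unrelated to the k-nearest-neighbour problem.

```r
# Hypothetical example: each iteration performs a single scalar multiply-add,
# so most of the time goes into R's per-iteration overhead.
x <- runif(1e5)

out_loop <- numeric(length(x))
for (i in seq_along(x))
  out_loop[i] <- 2 * x[i] + 1

# Vectorised equivalent: the same arithmetic, with the iteration done in
# compiled code.
out_vec <- 2 * x + 1

all.equal(out_loop, out_vec)
```

Wrapping both versions in microbenchmark(), as above, is one way to measure the gap on your own machine.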

The advantage of the loop is that it is very clear what you are doing, whereas my code is pretty incomprehensible. However, doing matrix index manipulation like this will usually be faster, sometimes by a lot, because you're only subsetting the y matrix once, as opposed to once each time through the loop.
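How large the gap is depends on the size of your data, so it is worth timing both versions at a realistic scale. A rough sketch, assuming made-up sizes and the hypothetical wrapper names `knn_loop` and `knn_matrix`:

```r
# Loop version: subsets y once per query point.
knn_loop <- function(y, nn) {
  out <- matrix(NA_real_, nrow(nn), ncol(y))
  for (i in seq_len(nrow(nn)))
    out[i, ] <- colMeans(y[nn[i, ], , drop = FALSE])
  out
}

# Matrix version: subsets y once in total, then reshapes and averages.
knn_matrix <- function(y, nn) {
  stacked <- y[t(nn), , drop = FALSE]
  array(colMeans(matrix(stacked, nrow = ncol(nn))), c(nrow(nn), ncol(y)))
}

# Hypothetical sizes; replace them with your own.
set.seed(42)
n_train <- 5000; n_query <- 2000; p <- 3; k <- 10
y  <- matrix(rnorm(n_train * p), n_train, p)
nn <- matrix(sample(n_train, n_query * k, replace = TRUE), n_query, k)

res_loop   <- knn_loop(y, nn)
res_matrix <- knn_matrix(y, nn)
all.equal(res_loop, res_matrix)

# Time 20 repetitions of each; the numbers will vary by machine.
system.time(for (r in 1:20) knn_loop(y, nn))
system.time(for (r in 1:20) knn_matrix(y, nn))
```

Keeping the index manipulation inside a small, commented function also limits the readability cost to one place.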