Question

I'm new to R and am trying to replace the loop in the code below with something more efficient. For context, this is a simple, synthetic example of a k-nearest neighbor regression with a multivariate (3-dimensional) target.

rm(list=ls())
set.seed(1)

# Fast nearest neighbor package
library(FNN)
k <- 3

# Synthetic 5d predictor and noisy 3d target data
x <- matrix(rnorm(50), ncol=5)
y <- 5*x[,1:3] + matrix(rnorm(30), ncol=3)
print(x)
print(y)

# New synthetic 5d predictor data (4 cases)
x.new <- matrix(rnorm(20), ncol=5)
print(x.new)

# Identify k-nearest neighbors
nn <- knnx.index(data=x, query=x.new, k=k)
print(nn)

At present, I am taking the unweighted average of the k-nearest neighbours (nn) by the following loop:

# Unweighted k-nearest neighbor regression predictions based on y and nn
y.new <- matrix(0, ncol=ncol(y), nrow=nrow(x.new))
for(i in 1:nrow(nn))
    y.new[i,] <- colMeans(y[nn[i,],,drop=FALSE])

print(y.new)

but there must be a simple way to avoid looping here. Thanks.


Solution

One option in these situations is to build a big matrix and manipulate the indices:

y2 <- array(colMeans(matrix(y[t(nn), ], nrow = ncol(nn))), dim(y.new))
identical(y2, y.new)
## [1] TRUE
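Since the one-liner is hard to read, here is a sketch that unpacks it step by step. The toy `y` and `nn` below are made-up values (not the data from the question) and the intermediate names `idx`, `stacked` and `blocks` are only for illustration; the toy `nn` stands in for the output of `knnx.index()` so the block runs without FNN.

```r
# Toy data (hypothetical values, chosen so the averages are easy to check by eye)
y  <- matrix(1:12, ncol = 3)                    # 4 training targets, d = 3
nn <- matrix(c(1, 2,
               3, 4,
               1, 4), ncol = 2, byrow = TRUE)   # n = 3 queries, k = 2 neighbours

# Step 1: t(nn) used as an index is read column by column, i.e. query by query
idx <- as.vector(t(nn))                         # 1 2 3 4 1 4

# Step 2: a single subsetting call stacks all neighbour rows, grouped by query
stacked <- y[idx, ]                             # (n * k) x d = 6 x 3

# Step 3: reshape so each column holds the k values of one (query, dimension) pair
blocks <- matrix(stacked, nrow = ncol(nn))      # k x (n * d) = 2 x 9

# Step 4: average over the k neighbours and fold the result back to n x d
y2 <- array(colMeans(blocks), c(nrow(nn), ncol(y)))

# Reference: the explicit loop from the question
y.loop <- matrix(0, nrow = nrow(nn), ncol = ncol(y))
for (i in seq_len(nrow(nn)))
  y.loop[i, ] <- colMeans(y[nn[i, ], , drop = FALSE])

print(all.equal(y2, y.loop))
```

The reshape in step 3 relies on R storing matrices in column-major order, which is why `nn` has to be transposed first: that keeps the k neighbours of each query adjacent.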

In this case, my code runs about twice as fast as yours:

library(microbenchmark)

microbenchmark(
  loop = for (i in 1:nrow(nn))
    y.new[i, ] <- colMeans(y[nn[i, ], , drop = FALSE]),
  matrix = y2 <- array(colMeans(matrix(y[t(nn), ], nrow = ncol(nn))), dim(y.new))
)
## Unit: microseconds
##    expr    min      lq  median     uq     max neval
##    loop 43.680 47.8805 49.1675 49.975 128.698   100
##  matrix 23.807 25.4330 25.9985 26.761  80.491   100

The loop in this case isn't really that bad. In general, as long as each iteration does a lot of work (here, subsetting a matrix and calling colMeans), the per-iteration overhead is small compared to the actual meat of the loop. The cases where you really need to avoid loops in R are those where each iteration does only a small amount of work: then the overhead of iterating in R becomes the bottleneck, and removing the loop can give a dramatic performance improvement.
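To make the "small amount of work per iteration" case concrete, here is a minimal sketch with made-up data: squaring a vector one element at a time versus in a single vectorised call. The size `n` and the names `sq.loop` and `sq.vec` are my own choices for illustration, and the timings will vary by machine, so I'm not quoting any numbers.

```r
set.seed(1)
n <- 1e5
x <- runif(n)   # hypothetical input vector

# Loop: each iteration does almost nothing, so interpreter overhead dominates
t.loop <- system.time({
  sq.loop <- numeric(n)
  for (i in seq_len(n)) sq.loop[i] <- x[i]^2
})

# Vectorised: the same arithmetic runs in a single call into compiled code
t.vec <- system.time(sq.vec <- x^2)

# Same result either way; only the time spent differs
print(all.equal(sq.loop, sq.vec))
print(rbind(loop = t.loop, vectorised = t.vec)[, "elapsed"])
```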

The advantage of the loop is that it is very clear what you are doing, whereas my code is pretty incomprehensible. However, doing matrix index manipulation like this will usually be faster, sometimes by a lot, because you're only subsetting the y matrix once, as opposed to once each time through the loop.
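If you want something between the two, one possible compromise (a sketch of a different approach, not what I benchmarked above) is to loop over the k neighbour columns instead of the query rows. Since k is usually much smaller than the number of queries, `y` is subset only k times and the code stays readable. The synthetic `y` and `nn` below are assumptions: the random `nn` is a stand-in for the output of `knnx.index()`, and `y.sum` and `y3` are names I made up.

```r
set.seed(1)
n.train <- 10; n.new <- 4; d <- 3; k <- 3

# Synthetic targets and a random index matrix standing in for knnx.index() output
y  <- matrix(rnorm(n.train * d), ncol = d)
nn <- t(replicate(n.new, sample(n.train, k)))   # n.new x k

# Accumulate the j-th nearest neighbour of every query at once, k times in total
y.sum <- matrix(0, nrow = nrow(nn), ncol = ncol(y))
for (j in seq_len(ncol(nn))) {
  y.sum <- y.sum + y[nn[, j], , drop = FALSE]
}
y3 <- y.sum / ncol(nn)

# Reference: the row-wise loop from the question
y.loop <- matrix(0, nrow = nrow(nn), ncol = ncol(y))
for (i in seq_len(nrow(nn)))
  y.loop[i, ] <- colMeans(y[nn[i, ], , drop = FALSE])

print(all.equal(y3, y.loop))
```

Note that the sums are accumulated in a different order here, so compare with `all.equal()` rather than `identical()`.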

Licensed under: CC-BY-SA with attribution