How can I vectorize this code in R? Maybe with the apply() function?

https://datascience.stackexchange.com/questions/8435

16-10-2019
|

Question

I am really struggling to replicate the output of the dist() function in R code without using 1 or 2 for loops. (If you're wondering why I'm doing this, it's so that I can play around with the distance calculation, and also to improve my R skills - so please only solutions that involve R!)

Overview: matrix is passed to dist(), which calculates the Euclidean distance row-wise and outputs a full distance matrix of the distance between each row (e.g. distance between rows 1 and 50 will be in distancematrix[1, 50] and distancematrix[50, 1]). The fast code looks like this:

distancematrix <- as.matrix(dist(myMatrix, method="euclidean", diag = T, upper = T))

I have successfully produced the same output in R using the following code:

for (i in 1:nrow(myMatrix)) {
  for (j in 1:nrow(myMatrix)) {
    distancematrix[i, j] <- sum(abs(myMatrix[i,] - myMatrix[j,])) 
  }
}

However, using two nested for loops is much slower than using dist(). I have read a lot about using apply() to optimise slower for loops, but I haven't been able to get my head around it so far. I believe that at least one of the for loops is definitely avoidable by just outputting a vector, and dealing with it at the end. However, I cannot for the life of me work out how to remove both for loops.

Does anyone have any thoughts?

Solution

First of all it should be noted that the code you posted does not actually replicate the output of the dist function, because the line:

distancematrix[i, j] <- sum(abs(myMatrix[i,] - myMatrix[j,]))

does not calculate the Euclidean distance; it should be:

distancematrix[i, j] <- sqrt(sum((myMatrix[i,] - myMatrix[j,]) ^ 2))

Here are two solutions that rely on apply. They are simplified, and in particular do not take advantage of the symmetry of the distance matrix (which, if considered, would lead to a 2-fold speedup). First, generate some test data:

# Number of data points
N <- 2000
# Dimensionality
d <- 10
# Generate data
myMatrix = matrix(rnorm(N * d), nrow = N)

For convenience, define:

# Wrapper for the distance function
d_fun <- function(x_1, x_2) sqrt(sum((x_1 - x_2) ^ 2))

The first approach is a combination of apply and sapply:

system.time(
    D_1 <- 
        apply(myMatrix, 1, function(x_i) 
            sapply(1:nrow(myMatrix), function(j) d_fun(x_i, myMatrix[j, ]))
    )
)

   user  system elapsed 
 14.041   0.100  14.001

while the second uses only apply (but going over the indices, which are paired using expand.grid):

system.time(
    D_2 <- 
        matrix(apply(expand.grid(i = 1:nrow(myMatrix), j = 1:nrow(myMatrix)), 1, function(I) 
            d_fun(myMatrix[I[["i"]], ], myMatrix[I[["j"]], ])
        )
    )
)

   user  system elapsed 
 39.313   0.498  39.561

However, as expected both are much slower than dist:

system.time(
    distancematrix <- as.matrix(
        dist(myMatrix, method = "euclidean", diag = T, upper = T)
    )
)

   user  system elapsed 
  0.337   0.054   0.388

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange