Question

I am using rdist iteratively to compute nearest neighbours for huge datasets. Currently I have a comparatively small matrix of 634,000 vectors (rows) with 6 columns.

As mentioned, I'm using rdist to compute the distance from each vector to every other vector, with each row's distance computation being one step. At every step I also run a function that computes the sums of the k = 1, 2, 3, 4 nearest-neighbour distances, plus the total row sum (effectively k = all neighbours).

### My function to compute the k-nearest-neighbour sum from a distance vector

    knn <- function (vec, k) {
      ## 1:k+1 parses as (1:k)+1, i.e. positions 2..(k+1) of the sorted
      ## vector, skipping the zero self-distance in position 1
      sum(sort(vec)[1:k+1])
    }
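
Note the operator precedence: 1:k+1 parses as (1:k) + 1, i.e. indices 2 through k+1, which is what skips the self-distance. A quick console check:

    k <- 3
    1:k + 1                          # 2 3 4
    sort(c(0, 5, 2, 9, 4))[1:k + 1]  # 2 4 5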

### My function to compute nearest neighbours iteratively for every vector
    myfunc <- function (tab) {

      rowsums <- numeric(nrow(tab))                     ### Here I will save total sums
      knnsums_log <- matrix(nrow = nrow(tab), ncol = 4) ### Matrix for storing each of my kNN sums

      for (i in 1:nrow(tab)) {                          ### For loop to compute distances and total sums
        q <- as.matrix(rdist(tab[i, ], tab))            ### rdist() from e.g. the fields package
        rowsums[i] <- rowSums(q)

        for (k in 1:4) {                                ### Nested loop to run my knn function
          knnsums_log[i, k] <- knn(q, k)
        }
      }

      return(cbind(rowsums, knnsums_log))
    }
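
For concreteness, a minimal run on toy data; this assumes rdist() comes from the fields package, and the data frame name toy is just for illustration:

    library(fields)                                     ## assumed source of rdist()

    set.seed(1)
    toy <- as.data.frame(matrix(rnorm(60), nrow = 10))  ## 10 points, 6 columns
    head(myfunc(toy))                                   ## total sum, then the k = 1..4 sums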

A sample of what the data looks like (634k rows of this)

        X1   X2 X3 X4   X5           X6
    1 0.00 0.02  0  0 0.02 -0.263309267
    2 0.00 0.02  0  0 0.02 -0.171764667
    3 0.00 0.02  0  0 0.02 -0.128784869
    4 0.00 0.02  0  0 0.02 -0.905651733

For those unfamiliar with it, rdist computes the Euclidean distances between its arguments, and it works far faster than a custom-written function. It is more applicable here than dist, because dist only computes distances within a single matrix. I know that is technically what I'm doing, but dist tries to store the entire distance object in memory, and that is far, far too large to even consider.
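
A quick back-of-the-envelope check of why that is hopeless:

    n <- 634000
    n * (n - 1) / 2              # ~2.01e11 pairwise distances
    8 * n * (n - 1) / 2 / 1e12   # ~1.61 TB stored as 8-byte doubles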

How can I make the above work better? I've tried messing around with apply functions but got nothing useful. I hope I have explained everything clearly. If my math is correct, a worst-case guesstimate is that it will take over a week to run this code. I have very powerful servers to crunch this on, but no GPUs. I have not tried multicore (I should have 12 cores available), but I don't know how I would delegate the work per core.

Thank you for your help.


Solution

Several tips:

0) Profile your code using Rprof, with the line.profiling option (a short sketch follows this list).

1) Matrices in R are stored column-major. Because you compare your vectors with one another, it will be much faster if you store them as columns of your matrix.

2) I do not know where the rdist function comes from, but you should avoid the as.matrix(rdist(tab[i,],tab)) call, which copies the result into a new matrix.

3) You can optimize your knn() function, which sorts the same vector four times (see the sketch after this list).

4) Why not just rdist(tab)?
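
A minimal sketch of tips 0 and 3. The file name prof.out and the helper knn_sums() are illustrative, and line-level profiling only attributes lines when the code was loaded with source():

    ## Tip 0: line-level profiling
    Rprof("prof.out", line.profiling = TRUE)
    output <- myfunc(tab)
    Rprof(NULL)
    summaryRprof("prof.out", lines = "show")

    ## Tip 3: sort the distances once; cumsum() then yields all four kNN sums
    knn_sums <- function(d, kmax = 4) {
      nn <- sort(d, partial = 2:(kmax + 1))[2:(kmax + 1)]  ## skip the zero self-distance
      cumsum(nn)                                           ## sums for k = 1..kmax
    }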

OTHER TIPS

So I've been working on this for a while and testing. For anyone else stuck on a similar problem, here are two more optimized versions of the code. I've significantly decreased the computation time, but it still blows up with too many data entries. My next step is to attempt to implement this with Rcpp and, if possible, make use of the 12 cores I have available (the end goal being to compute 1-2 million entries in a reasonable time frame); a sketch of the multicore idea follows the first version below. I'm not sure about the best way to proceed on either point, but here is my code. Thank you for the help!

    ##################################
    ############## Optimized code
    library(compiler)                        ## for cmpfun()

    t.m <- t(test_euclid_log)                ## transpose: one point per column

    knn_log <- function (vec, k) {
      sum(vec[1:k+1])                        ## positions 2..(k+1) of a pre-sorted vector
    }
    knn_log <- cmpfun(knn_log)

    ## Euclidean distances from vector x to every column of t.m;
    ## x recycles down each column because matrices are column-major
    distf <- function(x, t.m) sqrt(colSums((x - t.m)^2))
    distf <- cmpfun(distf)

    myfunc <- function (tab) {
      rowsums <- numeric(nrow(tab))
      knnsums_log <- matrix(nrow = nrow(tab), ncol = 4)
      for (i in 1:nrow(tab)) {
        q <- apply(tab[i, , drop = FALSE], 1, distf, t.m = t.m)  ## drop = FALSE keeps a 1-row matrix
        rowsums[i] <- colSums(q)
        q <- sort(q)                         ## sort once, reuse for all four k
        for (kn in 1:4) {
          knnsums_log[i, kn] <- knn_log(q, kn)
        }
      }
      return(cbind(rowsums, knnsums_log))
    }
    myfunc <- cmpfun(myfunc)
    system.time(output <- myfunc(t))         ## t here is the data matrix (it shadows base::t)
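
On the multicore question: since each row's result is independent, the loop parallelizes naturally. A hedged sketch using the base parallel package (fork-based mclapply(), so Unix-alikes only; run_parallel and ncores are illustrative names, and t.m must already hold the full transposed data so each chunk is still compared against every point):

    library(parallel)

    run_parallel <- function(tab, ncores = 12) {
      ## one chunk of row indices per core
      idx_chunks <- split(seq_len(nrow(tab)),
                          cut(seq_len(nrow(tab)), ncores, labels = FALSE))
      ## each worker runs the compiled myfunc() on its chunk of rows
      pieces <- mclapply(idx_chunks,
                         function(idx) myfunc(tab[idx, , drop = FALSE]),
                         mc.cores = ncores)
      do.call(rbind, pieces)                 ## stack the per-chunk results
    }

    ## system.time(output <- run_parallel(t, ncores = 12))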

And my attempt using the apply family:

    ############### Vectorized
    ## split(t, row(t)) below yields each row as a plain numeric vector,
    ## so distf() can be applied to it directly
    myfuncvec <- function (tab) {
      kn <- 1:4
      q <- distf(tab, t.m)
      rowsums <- sum(q)
      q <- sort(q)
      knnsums_log <- vapply(kn, knn_log, vec = q, FUN.VALUE = c(0))
      return(c(rowsums, knnsums_log))
    }
    myfuncvec <- cmpfun(myfuncvec)

    t1 <- split(t, row(t))                   ## list of rows as numeric vectors
    system.time(out <- vapply(t1, myfuncvec, FUN.VALUE = c(0, 0, 0, 0, 0)))
    out <- t(out)                            ## back to one row per input point

For reference, the first version seems to be the faster of the two.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow