# Loop in C to make RScript more efficient

### Question

I am trying to compute the number of pairwise differences between each row in a table of 100 rows x 2500 Columns.

I have a small RScript that does this but the run time is (obviously) extremely high! I am trying to write a loop in C but I keep getting errors (compileCode).

Do you have any idea of how I can "convert" the following loop to C?

``````
pw.dist <- function (vec1, vec2) {
  return( length(which(vec1!=vec2)) )
}

N.row <- dim(table)[1]
pw.dist.table <- array( dim = c(dim(table)[1], dim(table)[1]))

for (i in 1:N.row) {
  for (j in 1:N.row) {
    pw.dist.table[i,j] <- pw.dist(table[i,-c(1)], table[j,-c(1)])
  }
}
``````

I am trying something like:

``````
sig <- signature(N.row="integer", table="integer", pw.dist.table="integer")
code <- "
for( int i = 0; i < (*N.row) - 1; i++ ) {
for( int j = i + 1; j < *N.row; j++ ) {
int pw.dist.table = table[j] - table[i];
}
}
"
f <- cfunction( sig, code, convention=".C" )
``````

I am a complete newbie when it comes to programming!

Thanks in advance. JMFA

### Solution

Before trying to optimize the code, it is always a good idea to check where the time is spent.

``````
Rprof()
... # Your loops
Rprof(NULL)
summaryRprof()
``````

In your case, the loop is not slow, but your distance function is.

``````
$by.total
          total.time total.pct self.time self.pct
"pw.dist"      37.98     98.85      0.54     1.41
"which"        37.44     97.45     34.02    88.55
"!="            3.12      8.12      3.12     8.12
``````

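You can rewrite it as follows. `sum()` counts the `TRUE` values of a logical vector directly, so the costly `which` call is no longer needed, and the whole computation then takes about 1 second.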

``````
# Sample data
n <- 100
k <- 2500
d <- matrix(sample(1:10, n*k, replace=TRUE), nrow=n, ncol=k)
# Function to compute the number of differences
f <- function(i,j) sum(d[i,]!=d[j,])
# You could use a loop, instead of outer,
# it should not make a big difference.
d2 <- outer( 1:n, 1:n, Vectorize(f) )
``````
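If you still want the C loop the question asks for, the same count can be written directly in C. The following is only a sketch, not code from the original answer. The function name `pw_dist_table` and its parameter names are made up for illustration. It assumes the matrix arrives as a flat `int` vector in R's column-major order (element `[i, c]` at `table[i + c * n_row]`) and that every argument is a pointer, as with the `.C` convention. Note that C identifiers cannot contain dots, so names such as `N.row` and `pw.dist.table` have to be renamed.

```c
#include <stddef.h>

/* Count, for every pair of rows, the number of columns in which they differ.
 * table : n_row x n_col integer matrix, flattened column-major (R's layout)
 * out   : n_row x n_row result matrix, same layout, filled by this function
 * All arguments are pointers because the .C convention passes them that way.
 * The name pw_dist_table is hypothetical, not taken from the original post. */
void pw_dist_table(const int *n_row, const int *n_col,
                   const int *table, int *out)
{
    int n = *n_row;
    int k = *n_col;

    for (int i = 0; i < n; i++) {
        out[i + (size_t)i * n] = 0;          /* a row never differs from itself */
        for (int j = i + 1; j < n; j++) {
            int diff = 0;
            for (int c = 0; c < k; c++) {
                /* element [i, c] lives at table[i + c * n] */
                if (table[i + (size_t)c * n] != table[j + (size_t)c * n])
                    diff++;
            }
            out[i + (size_t)j * n] = diff;   /* the count is symmetric, */
            out[j + (size_t)i * n] = diff;   /* so fill both triangles  */
        }
    }
}
```

Only the upper triangle is computed and then mirrored, which halves the work compared with the double loop over all `i` and `j`. How the function is compiled and called from R (for example through `inline::cfunction` with `convention=".C"`) is left out here and should be checked against the `inline` documentation.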

### OTHER TIPS

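Vincent above has the right idea. In addition, you can take advantage of how R recycles a vector against a matrix to get even faster results: with the transposed matrix `dt`, the comparison `dt[,i] != dt` checks row `i` of `d` against every row at once, and `colSums` then counts the differences for each row.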

``````
n <- 100
k <- 2500
d <- matrix(sample(1:10, n*k, replace=TRUE), nrow=n, ncol=k)
# f as defined in the answer above
f <- function(i,j) sum(d[i,]!=d[j,])
system.time(d2 <- outer( 1:n, 1:n, Vectorize(f) ))
# Precompute the transpose of the matrix; you can just replace
# dt with t(d) if you want to avoid this
system.time(dt <- t(d))
system.time(sapply(1:n, function(i) colSums( dt[,i] != dt)))
``````

Output:

``````
#> system.time(d2 <- outer( 1:n, 1:n, Vectorize(f) ))
#   user  system elapsed
#    0.4     0.0     0.4
#> system.time(dt <- t(d))
#   user  system elapsed
#      0       0       0
#> system.time(sapply(1:n, function(i) colSums( dt[,i] != dt)))
#   user  system elapsed
#   0.08    0.00    0.08
``````

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow