R: Operating on a subset of columns from a dataframe with ddply

https://stackoverflow.com/questions/17130275

31-05-2022

Question

I have a large-ish dataframe (40000 observations of 800 variables) and wish to operate on a range of columns of every observation with something akin to a dot product. This is how I implemented it:

matrixattempt <- as.matrix(dframe)
takerow <- function(k) {as.vector(matrixattempt[k,])}
takedot0 <- function(k) {sqrt(sum(data0averrow * takerow(k)[2:785]))}

for (k in 1:40000) {
  print(k)
  dframe$dot0aver[k] <- takedot0(k)
}

The print is just to keep track of what's going on. data0averrow is a pre-defined numeric vector of the same length as takerow(k)[2:785].

This runs, and a few spot checks suggest the results are correct, but it is very slow.

I searched for dot product for a subset of columns, and found this question, but could not figure out how to apply it to my setup. ddply sounds like it should work faster (although I do not want to do splitting and would have to use the same define-id trick that the referenced questioner did). Any insight/hints?

Solution

Try this:

sqrt(colSums(t(matrixattempt[, 2:785])  * data0averrow))

or equivalently:

sqrt(matrixattempt[, 2:785] %*% data0averrow)
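To see that both vectorised forms agree with the original loop, here is a small sketch on made-up data. The sizes (5 rows, 6 columns, columns 2:5 as the range) and the names m, cols and w are assumptions for illustration only; they stand in for matrixattempt, 2:785 and data0averrow.

```r
# Toy stand-ins (hypothetical sizes, not the asker's data).
set.seed(1)
m    <- matrix(runif(30), nrow = 5, ncol = 6)  # plays the role of matrixattempt
cols <- 2:5                                    # plays the role of 2:785
w    <- runif(length(cols))                    # plays the role of data0averrow

# Original approach: one dot product per row, in an explicit loop.
loop_result <- numeric(nrow(m))
for (k in seq_len(nrow(m))) {
  loop_result[k] <- sqrt(sum(w * m[k, cols]))
}

# Form 1: after transposing, each column is one original row, so w is
# recycled down every column and colSums() gives one dot product per row.
colsums_result <- sqrt(colSums(t(m[, cols]) * w))

# Form 2: matrix-vector product; drop() turns the n x 1 result into a vector.
matmul_result <- sqrt(drop(m[, cols] %*% w))

# Both vectorised forms should match the loop up to floating-point tolerance.
stopifnot(isTRUE(all.equal(loop_result, colsums_result)),
          isTRUE(all.equal(loop_result, matmul_result)))
```

Note that %*% returns an n x 1 matrix rather than a plain vector, so wrapping it in drop() or as.vector() is a reasonable precaution before storing it in a data frame column.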

OTHER TIPS

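Use matrix multiplication on the selected columns and assign the result to the whole column:

dframe$dot0aver <- sqrt(as.vector(
                     matrixattempt[, 2:785] %*% data0averrow))

This is the square root of the dot product of data0averrow with columns 2:785 of each row. Note that the column range goes after the comma (matrixattempt[, 2:785]); putting it before the comma would select rows 2 to 785 instead. No rowSums is needed, because the matrix-vector product already yields one dot product per row.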
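As a minimal end-to-end sketch of filling a new column for every row at once, consider the made-up three-row data frame below. The names dframe_toy, weights_toy and feature_cols are hypothetical stand-ins for dframe, data0averrow and 2:785.

```r
# Hypothetical small data frame: a label column followed by feature columns.
dframe_toy <- data.frame(label = c(0, 1, 0),
                         x1 = c(1, 2, 3),
                         x2 = c(4, 5, 6),
                         x3 = c(7, 8, 9))
weights_toy  <- c(1, 0, 2)  # plays the role of data0averrow
feature_cols <- 2:4         # plays the role of 2:785

# Convert only the needed columns to a matrix, then do a single
# matrix-vector product instead of looping over rows.
features <- as.matrix(dframe_toy[, feature_cols])
dframe_toy$dot0aver <- sqrt(as.vector(features %*% weights_toy))

# Row-wise dot products are 15, 18 and 21, so this prints their square roots.
print(dframe_toy$dot0aver)
```

Converting only the feature columns keeps any non-numeric columns elsewhere in the data frame from turning the whole matrix into character data.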

Licensed under: CC-BY-SA with attribution
