Question

I have a data set with dozens of columns and thousands of rows. Here I present just a toy example:

trN <- c(0,0,0,0,1,1,1,1)
tt <- c(1,2,3,4,1,2,3,4)
varX <- c(1,5,NA,9,2,NA,8,4)
d <- data.frame(trN, tt, varX)

The first thing I do is spline-interpolate column varX as a function of column tt for each level of trN, an operation that is easily done with ddply from the plyr package:

library(plyr)
ddply(d, .(trN), mutate, varXint = spline(tt, varX, xout = tt)$y)

But suppose that I would also like to change the dimension (number of rows) of the new data frame. For instance, I would like the set of values specifying where interpolation is to take place (xout) to have a different length than tt. Obviously, the approach below doesn't work, because with mutate the new column must have the same length as the columns of the original data frame:

ddply(d, .(trN), mutate, varXint = spline(tt, varX, xout = seq(1, 4, by = 1.5))$y)

Does anyone have a suitable solution or suggestion? I would prefer a solution based on the plyr package, because then I can take advantage of its built-in parallelization.

Solution

Try a simple data.table first:

library(data.table)
dt = data.table(d)

# I added xout since I assumed you want that
dt[, list(varXint = spline(tt, varX, xout = seq(1, 4, by = 0.5))$y,
          xout = seq(1, 4, by = 0.5)),
     by = trN]
#    trN  varXint xout
# 1:   0 1.000000  1.0
# 2:   0 3.166667  1.5
# 3:   0 5.000000  2.0
# 4:   0 6.500000  2.5
# 5:   0 7.666667  3.0
# 6:   0 8.500000  3.5
# 7:   0 9.000000  4.0
# 8:   1 2.000000  1.0
# 9:   1 5.250000  1.5
#10:   1 7.333333  2.0
#11:   1 8.250000  2.5
#12:   1 8.000000  3.0
#13:   1 6.583333  3.5
#14:   1 4.000000  4.0
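
If you want to stay within plyr to reuse its parallelization, you can get a comparable result by swapping mutate for summarise, which does not require the output to match the input length. This is only a minimal sketch of that idea (the xout grid and column names are just illustrative):

library(plyr)

# summarise builds a fresh data frame per trN group, so each group can
# return 7 interpolated rows even though the input group has only 4 rows
ddply(d, .(trN), summarise,
      xout    = seq(1, 4, by = 0.5),
      varXint = spline(tt, varX, xout = seq(1, 4, by = 0.5))$y)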

And if your bottleneck is indeed the computation inside each group rather than just the grouping itself, then check out e.g. "multicore and data.table in R" or "data.table and parallel computing".
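
Alternatively, the plyr route above can be parallelized by registering a foreach backend and passing .parallel = TRUE. This is only a hedged sketch; the doParallel backend and the core count are assumptions you should adapt to your setup:

library(plyr)
library(doParallel)
registerDoParallel(cores = 2)  # assumed core count; adjust to your machine

# same summarise call as above, now dispatched over the registered backend
ddply(d, .(trN), summarise,
      xout    = seq(1, 4, by = 0.5),
      varXint = spline(tt, varX, xout = seq(1, 4, by = 0.5))$y,
      .parallel = TRUE)

On this toy data with two groups the parallel overhead will dominate, but with thousands of trN values it can pay off.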
