All the solutions posted so far compute the cumulative sum of the entire Y
variable, which can be inefficient when the data frame is very large but the running total reaches the target near the beginning. In that case, an Rcpp solution that stops scanning as soon as the target is reached could be more efficient:
library(Rcpp)
get_min_cum2 = cppFunction("
int gmc2(NumericVector X, NumericVector Y, int start, int total) {
  double running = 0.0;
  for (int idx = 0; idx < Y.size(); ++idx) {
    if (X[idx] >= start) {      // only rows at or after the start value count
      running += Y[idx];
      if (running >= total) {   // stop at the first row that reaches the target
        return X[idx];
      }
    }
  }
  return -1; // Running total never reaches the target
}")
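To make the early exit easier to see outside of R, here is a minimal standalone C++ sketch of the same loop. It is an illustration and not part of the Rcpp solution above: the names `ScanResult` and `first_x_reaching_total` are hypothetical, plain `std::vector` stands in for the Rcpp vector types, and the `visited` counter is added only to show how many rows are read before the function returns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the matching X value plus a count of rows read.
struct ScanResult {
    int x;                // first X at which the running total reaches `total`, or -1
    std::size_t visited;  // number of rows examined before stopping
};

// Hypothetical standalone analogue of gmc2: walk the rows in order, add up Y
// for rows with X >= start, and return as soon as the running total reaches
// `total`. Rows after that point are never read.
ScanResult first_x_reaching_total(const std::vector<int>& X,
                                  const std::vector<double>& Y,
                                  int start, double total) {
    assert(X.size() == Y.size());  // assumption: X and Y are parallel columns
    double running = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        if (X[i] >= start) {
            running += Y[i];
            if (running >= total) {
                return {X[i], i + 1};  // early exit
            }
        }
    }
    return {-1, X.size()};  // running total never reaches `total`
}
```

With X = {1, 2, 3, 4, 5, 6}, Y = {1, 1, 2, 3, 4, 5}, start = 3 and total = 5, the sketch returns x = 4 after reading 4 of the 6 rows. A cumulative sum over the whole column would read every row regardless of where the answer lies.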
Comparison with microbenchmark:
get_min_cum <- function(start, total)
  with(dat[dat$X >= start, ], X[min(which(cumsum(Y) > total))])
get_min_dt <- function(start, total)
  dt[X >= start, X[cumsum(Y) >= total][1]]
set.seed(144)
dat = data.frame(X=1:1000000, Y=abs(rnorm(1000000)))
library(data.table)
dt = data.table(dat)
get_min_cum(3, 17)
# [1] 29
get_min_dt(3, 17)
# [1] 29
get_min_cum2(dat$X, dat$Y, 3, 17)
# [1] 29
library(microbenchmark)
microbenchmark(get_min_cum(3, 17), get_min_dt(3, 17),
get_min_cum2(dat$X, dat$Y, 3, 17))
# Unit: milliseconds
#                               expr        min         lq    median         uq      max neval
#                 get_min_cum(3, 17) 125.324976 170.052885 180.72279 193.986953 418.9554   100
#                  get_min_dt(3, 17) 100.990098 149.593250 162.24523 176.661079 399.7531   100
#  get_min_cum2(dat$X, dat$Y, 3, 17)   1.157059   1.646184   2.30323   4.628371 256.2487   100
In this case, the Rcpp solution is roughly 70-80x faster than the other approaches by median time (about 2.3 ms versus 162-181 ms).