More Efficient Way To Do A Conditional Running Total In R

Question 1

All the solutions posted so far compute the cumulative sum over the entire Y variable, which is wasteful when the data frame is very large but the answer lies near the beginning. In that case, an Rcpp solution that stops as soon as the running total reaches the threshold can be more efficient:

library(Rcpp)
get_min_cum2 = cppFunction("
int gmc2(NumericVector X, NumericVector Y, int start, int total) {
    double running = 0.0;
    for (int idx=0; idx < Y.size(); ++idx) {
        if (X[idx] >= start) {
            running += Y[idx];
            if (running >= total) {
                return X[idx];
            }
        }
    }
    return -1;  // Running total never reaches the requested total
}")

Comparison with microbenchmark:

get_min_cum <- 
 function(start,total) 
   with(dat[dat$X>=start,],X[min(which(cumsum(Y)>total))])
get_min_dt <- function(start, total)
   dt[X >= start, X[cumsum(Y) >= total][1]]

set.seed(144)
dat = data.frame(X=1:1000000, Y=abs(rnorm(1000000)))
library(data.table)
dt = data.table(dat)
get_min_cum(3, 17)
# [1] 29
get_min_dt(3, 17)
# [1] 29
get_min_cum2(dat$X, dat$Y, 3, 17)
# [1] 29

library(microbenchmark)
microbenchmark(get_min_cum(3, 17), get_min_dt(3, 17),
               get_min_cum2(dat$X, dat$Y, 3, 17))
# Unit: milliseconds
#                               expr        min         lq    median         uq      max neval
#                 get_min_cum(3, 17) 125.324976 170.052885 180.72279 193.986953 418.9554   100
#                  get_min_dt(3, 17) 100.990098 149.593250 162.24523 176.661079 399.7531   100
#  get_min_cum2(dat$X, dat$Y, 3, 17)   1.157059   1.646184   2.30323   4.628371 256.2487   100

In this benchmark, the Rcpp solution is roughly 70-80x faster at the median (about 2.3 ms versus 160-180 ms) than the other approaches.
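
For readers without a C++ toolchain, the same early-exit idea can be sketched in plain R. This is only an illustration: get_min_loop and the toy data frame are made up here and are not part of the answers in this thread, and an interpreted R loop will not necessarily match the Rcpp timings shown above.

```r
# Hypothetical helper (not from the original answers): scan the rows in order
# and stop as soon as the running total of Y over rows with X >= start
# reaches `total`.
get_min_loop <- function(X, Y, start, total) {
  running <- 0
  for (idx in seq_along(Y)) {
    if (X[idx] >= start) {
      running <- running + Y[idx]
      if (running >= total) {
        return(X[idx])  # early exit: later rows are never touched
      }
    }
  }
  NA  # running total never reaches `total`
}

# Small made-up data set, for illustration only
toy <- data.frame(X = 1:8, Y = c(4, 2, 5, 6, 7, 1, 3, 9))
res_loop <- get_min_loop(toy$X, toy$Y, 3, 17)
res_none <- get_min_loop(toy$X, toy$Y, 3, 1000)
print(res_loop)
print(res_none)
```

Returning NA rather than -1 for the "never reached" case is a design choice of this sketch; it keeps the result usable with is.na() in ordinary R code.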

Question 2

Try this, for example. It uses cumsum and vectorized logical subsetting:

 get_min_cum <- 
 function(start,total) 
   with(dat[dat$X>=start,],X[min(which(cumsum(Y)>total))])

 get_min_cum(3,17) 
 # [1] 5
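
To make the steps of the cumsum approach visible, here is a sketch that breaks the one-liner into its intermediate values. The toy data frame is an assumption made up for illustration, not the data from the question, and which(...)[1] is used in place of min(which(...)) so that an unreachable total gives NA without the warning that min() raises on an empty vector.

```r
# Made-up data, for illustration only
toy <- data.frame(X = 1:8, Y = c(4, 2, 5, 6, 7, 1, 3, 9))

kept      <- toy[toy$X >= 3, ]        # keep only rows with X >= 3
running   <- cumsum(kept$Y)           # running total of Y over those rows
first_hit <- which(running > 17)[1]   # position of first total above 17
answer    <- kept$X[first_hit]        # X value at that position

print(running)
print(answer)
```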

Question 3

Here you go (using data.table because of ease of syntax):

library(data.table)
dt = data.table(df)

dt[X >= 3, X[cumsum(Y) >= 17][1]]
#[1] 5

Question 4

Well, here's one way:

i <- 3
j <- 17
min(df[i:nrow(df),]$X[cumsum(df$Y[i:nrow(df)])>j])
# [1] 5

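This takes df$X for rows i:nrow(df) and indexes it by cumsum(df$Y[i:nrow(df)]) > j, where the cumulative sum also starts at row i. That yields every df$X value at which the running total exceeds j, and min(...) returns the smallest of them. Note that i is used as a row position here, so this matches the other answers only when X equals the row number.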
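
The difference between indexing by row position and filtering by the value of X can be seen in a small sketch. The data frame toy2 is made up for illustration, with X deliberately different from the row number.

```r
# Made-up data where X is NOT the row number
toy2 <- data.frame(X = c(2, 4, 6, 8, 10), Y = c(5, 6, 7, 1, 3))
i <- 4
j <- 3

# Row-position version (as in the answer above): starts at the 4th row
by_position <- min(toy2[i:nrow(toy2), ]$X[cumsum(toy2$Y[i:nrow(toy2)]) > j])

# Value-based version: starts at the first row where X >= 4
by_value <- with(toy2[toy2$X >= i, ], X[which(cumsum(Y) > j)[1]])

print(by_position)
print(by_value)
```

The two results differ on this data, which is why the row-position form should only be used when X is known to coincide with the row index.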

Question 5

# Returns the row index of the first match (column names X and Y as in the other answers)
with(df, which( cumsum( (X>=3)*Y) >= 17)[1] )
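
As a sketch of why this works: multiplying by the logical vector turns Y into 0 for rows below the start value, so the cumulative sum can run over the whole column without subsetting. The data frame toy3 is made up for illustration, and the last step (looking up X at the returned row) is an addition of this sketch, not part of the answer above.

```r
# Made-up data, for illustration only; X differs from the row number
toy3 <- data.frame(X = seq(10, 80, by = 10), Y = c(4, 2, 5, 6, 7, 1, 3, 9))

mask     <- toy3$X >= 30               # FALSE FALSE TRUE TRUE ...
masked_y <- mask * toy3$Y              # logical coerced to 0/1, zeroing early rows
running  <- cumsum(masked_y)           # running total that ignores early rows
row_hit  <- which(running >= 17)[1]    # row index of the first match
x_hit    <- toy3$X[row_hit]            # corresponding X value

print(masked_y)
print(row_hit)
print(x_hit)
```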