Question

I often run into situations where I need to check whether some condition holds for any or all elements of a very large vector or list. For example, to check if a list contains any/only NULL elements I would use:

any(vapply(x, is.null, logical(1)))
all(vapply(x, is.null, logical(1)))

However, this is inefficient because it always checks every element of the list. A smarter implementation would stop checking as soon as the first NULL or non-NULL element is found, i.e. the equivalent of:

is.null(x[[1]]) || is.null(x[[2]]) || is.null(x[[3]]) || ...
is.null(x[[1]]) && is.null(x[[2]]) && is.null(x[[3]]) && ...

Doing this with a for loop is slow. There are some special cases provided by base R; for example, anyNA is an efficient version of any(is.na(.)) that does exactly this (a quick demonstration follows the examples below). But I was wondering if we could implement this more generally and provide optimized functions for checking a condition:

all_fast(x, is.null)
any_fast(x, is.null)

But also:

all_fast(x, function(z) {length(z) == 2})
all_fast(x, is, "POSIXt")
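To illustrate the anyNA remark from above: anyNA short-circuits at the first NA, while any(is.na(.)) must first build the full logical vector. A quick check (assuming the microbenchmark package is installed):

x <- c(NA, numeric(1e7))
library(microbenchmark)
microbenchmark(anyNA(x), any(is.na(x)))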

Solution

Here's the naive way,

all0 <- function(x, FUN)
    all(vapply(x, FUN, logical(1)))

and an R loop...

all1 <- function(x, FUN) {
    for (xi in x)
        if (!FUN(xi))
            return(FALSE)
    TRUE
}

...which can be compiled

library(compiler)
all1c <- cmpfun(all1)

...or written in C

library(inline)
allc <- cfunction(signature(x="list", fun="function"), "
    /* build the call fun(NULL) once, then update its argument in place */
    SEXP call = PROTECT(lang2(fun, R_NilValue));
    int len = Rf_length(x);
    for (int i = 0; i < len; ++i) {
        SETCADR(call, VECTOR_ELT(x, i));
        /* evaluate fun(x[[i]]) and exit on the first FALSE */
        if (!LOGICAL(eval(call, R_GlobalEnv))[0]) {
            UNPROTECT(1);
            return Rf_ScalarLogical(FALSE);
        }
    }
    UNPROTECT(1);
    return Rf_ScalarLogical(TRUE);")
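The any counterparts are the same loops with the test inverted; for example, in R:

any1 <- function(x, FUN) {
    for (xi in x)
        if (FUN(xi))
            return(TRUE)
    FALSE
}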

We need to measure the performance, so

library(microbenchmark)

The worst case would seem to be when the condition holds for every element, so there is no early exit

n <- 100000
x0 <- x <- vector("list", n)
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: milliseconds
##               expr      min       lq   median       uq      max neval
##   all0(x, is.null) 47.48038 50.58960 52.34946 54.10116 61.94736   100
##   all1(x, is.null) 41.52370 44.40024 45.25135 46.68218 53.22317   100
##  all1c(x, is.null) 33.76666 35.03008 35.71738 36.41944 45.37174   100
##   allc(x, is.null) 13.95340 14.43153 14.78244 15.94688 19.41072   100

so we're only 2x as fast in C compared to the compiled R version -- there is an R function call on each test, so we're only saving on the looping per se. The best case is when we exit immediately, which clearly shows the advantage of the loop, but then neither compilation nor C code helps us

x[[1]] <- FALSE
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: microseconds
##               expr       min         lq     median        uq       max neval
##   all0(x, is.null) 45376.760 45772.5020 46108.5795 46655.005 54242.687   100
##   all1(x, is.null)     1.566     1.9550     2.6335    12.015    14.177   100
##  all1c(x, is.null)     1.367     1.7340     2.0345     9.359    17.438   100
##   allc(x, is.null)     1.229     1.6925     4.6955    11.628    23.378   100

Here's an intermediate case, which does not really contain any surprises -- the C loop is about 2x faster than the compiled R loop, so it gets there about 2x as quickly.

x <- x0
x[[length(x)/2]] <- FALSE
microbenchmark(all0(x, is.null), all1(x, is.null), all1c(x, is.null),
               allc(x, is.null))
## Unit: milliseconds
##               expr      min       lq    median        uq       max neval
##   all0(x, is.null) 46.85690 49.92969 51.045519 52.653137 59.445611   100
##   all1(x, is.null) 20.90066 21.92357 22.582636 23.077863 25.974395   100
##  all1c(x, is.null) 16.51897 17.44539 17.825551 18.119202 20.535709   100
##   allc(x, is.null)  6.98468  7.18392  7.312575  8.290859  9.460558   100

Explicitly testing for NULL at the C level (VECTOR_ELT(x, i) == R_NilValue) is very fast, so C code that compares the value to NULL is about 100x faster than the corresponding R code. It would seem that allNULL might be a worthwhile generalization if speed were of the essence, but the case for a general-purpose C-level all doesn't seem so compelling. And of course the C code doesn't deal with NA or error conditions.
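For illustration, such an allNULL could be written with the same inline setup; a minimal sketch (the name allNULLc is just for illustration, and no PROTECT is needed since nothing is allocated before returning):

allNULLc <- cfunction(signature(x = "list"), "
    int len = Rf_length(x);
    for (int i = 0; i < len; ++i)
        if (VECTOR_ELT(x, i) != R_NilValue)
            return Rf_ScalarLogical(FALSE);
    return Rf_ScalarLogical(TRUE);")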

OTHER TIPS

Jeroen rightly says that

However this is inefficient because it always checks each element in the list. A smarter implementation would stop checking when the first NULL or non NULL element was found.

and the Rcpp sugar versions have been doing that for a few years. I have a benchmark comparison somewhere.

Edit: Found it. It is a really old example that predates our use of the rbenchmark or microbenchmark packages, and it is still in the Rcpp package in the examples/SugarPerformance directory. When I run it now, the relevant line is (edited to fit the line width here)

  runs              expr hand.written       sugar        R hnd/sugar    R/sugar
1 5000    any(x * y < 0)  0.000128746 0.000232458  7.52280  0.553846 32361.9631

We used this in a lot of early talks as the "gain" seems so impressive. But even a single R run takes only about 1.5 milliseconds (7.52 seconds over 5000 runs), so unless you really call it repeatedly, the gain is not worth the effort.
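For concreteness, the sugar side of that comparison looks essentially like this (a sketch; the function name is made up, and the actual file in examples/SugarPerformance differs):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
bool any_negative_product(NumericVector x, NumericVector y) {
    // x * y < 0 builds a lazy sugar expression; any() traverses it
    // and can stop at the first TRUE element
    return is_true(any(x * y < 0));
}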

And as Martin shows in his answer, just byte-compiling (which wasn't available yet either when we set the example up in early 2010) is helpful too.

The 'any' version:

res <- FALSE
for (i in seq_along(x)) {
    if (is.null(x[[i]])) { res <- TRUE; break }
}
res

lapply and vapply are just for-loops internally, so you are only losing the syntactic compression they offer, but you are gaining the ability to break out of the loop on the first occurrence of the defining condition. For the 'all' version you would start with res <- TRUE and set it to FALSE, as spelled out below.
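Spelled out, that 'all' variant is:

res <- TRUE
for (i in seq_along(x)) {
    if (!is.null(x[[i]])) { res <- FALSE; break }
}
res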

FWIW, although this is less flexible, it is much faster to avoid R's evaluation mechanism when possible. I provide a simple Rcpp solution for comparison with Martin's answer, but specialized to the 'all NULL' case.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
SEXP all_fast(SEXP x, SEXP fun) {
    // build the call fun(NULL) once, then update its argument in place
    SEXP call = PROTECT(Rf_lang2(fun, R_NilValue));
    int len = Rf_length(x);
    for (int i = 0; i < len; ++i) {
        SETCADR(call, VECTOR_ELT(x, i));
        // evaluate fun(x[[i]]) and exit on the first FALSE
        if (!LOGICAL(Rf_eval(call, R_GlobalEnv))[0]) {
            UNPROTECT(1);
            return Rf_ScalarLogical(FALSE);
        }
    }
    UNPROTECT(1);
    return Rf_ScalarLogical(TRUE);
}

// [[Rcpp::export]]
bool all_null(List x) {
  int n = x.size();
  for (R_len_t i = 0; i < n; ++i) {
    // plain pointer comparison with R_NilValue -- no R-level function call
    if (x[i] != R_NilValue) return false;
  }
  return true;
}

/*** R
n <- 100000
x0 <- x <- vector("list", n)
all_fast(x, is.null)
all_null(x)
library(microbenchmark)
microbenchmark(
  all_fast(x, is.null),
  all_null(x)
)
*/

gives me

> Rcpp::sourceCpp('~/Desktop/all_fast.cpp')

> n <- 100000

> x0 <- x <- vector("list", n)

> all_fast(x, is.null)
[1] TRUE

> all_null(x)
[1] TRUE

> library(microbenchmark)

> microbenchmark(
+   all_fast(x, is.null),
+   all_null(x)
+ )
Unit: microseconds
                 expr      min        lq   median        uq      max neval
 all_fast(x, is.null) 6703.948 6962.7355 7051.680 7231.1805 13100.41   100
          all_null(x)  280.816  283.8025  292.531  303.3125   340.19   100

If you have a set of functions that are invoked very commonly, it could be worth the effort to write your own simple Rcpp wrappers. You lose flexibility, but you do gain a substantial amount of speed.

Whether the saved microseconds add up enough to be worth it depends on your use case / data size, though.

Although I think Martin's C answer is the best answer here, it is worth being aware that specialized implementations for some common cases can pay off.

A package that implemented these concepts would be nice: the 'generic' version Martin provides, plus 'tuned' versions for common cases. E.g.: all_null, all_na, all_inherits, all_odd, ...
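As a taste of what such 'tuned' versions could look like, here is a hypothetical all_inherits in Rcpp (the name and signature are made up for illustration, by analogy with all_null above):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
bool all_inherits(List x, std::string klass) {
    // check the class attribute at the C++ level and
    // exit on the first element that does not inherit from klass
    for (R_xlen_t i = 0; i < x.size(); ++i) {
        RObject obj = x[i];
        if (!obj.inherits(klass.c_str())) return false;
    }
    return true;
}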

Licensed under: CC-BY-SA with attribution