Question

Is there an implementation of colMeans in R that includes an alpha trimmed mean parameter?

If not, how could I make one?


The original question has been answered in the comments below.

apply(x, 2, mean, trim=.05) is not as optimized as colMeans;

what is an implementation of equal efficiency?

Was it helpful?

Solution

Here are some examples of different ways to calculate trimmed colMeans, with a comparison of their performance.

m <- matrix(runif(1000000), nc=1000)
trim <- 0.1

Using apply:

out1 <- apply(m, 2, mean, trim=trim)

Using sapply:

out2 <- sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim))

Using Rcpp:

library(inline)
library(RcppArmadillo)

f <- 'using namespace arma;
      mat x = sort(as<mat>(x_));
      double trim = as<double>(trim_); 
      int low;
      if(x.n_rows % 2 == 1) {
        low = ceil(trim * x.n_rows) - 1;
      } else {
        low = ceil(trim * x.n_rows);
      }
      int high = ceil((1 - trim) * x.n_rows) - 1;
      return(wrap(mean(x.rows(low, high))));'

trim.colMeans <- cxxfunction(signature(x_='matrix', trim_='numeric'), 
                             f, plugin="RcppArmadillo")

out3 <- trim.colMeans(m, trim)

Comparison

identical(out1, out2)
[1] TRUE
identical(out1, c(out3))
[1] TRUE

library(microbenchmark)

microbenchmark(apply=apply(m, 2, mean, trim=trim),
               sapply=sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim)),
               Rcpp=trim.colMeans(m, trim),
               colMeans=colMeans(m))


Unit: microseconds
     expr       min          lq     median          uq        max neval
    apply 68907.162 100439.4775 102555.396 109044.4025 136034.067   100
   sapply 64675.928  66383.6010  66937.615  68152.1115  98680.906   100
     Rcpp 43614.629  44297.6980  44761.360  45164.4850  46883.602   100
 colMeans   782.458    805.7995    828.538    988.4625   1452.877   100

I'm sure my Rcpp implementation is sub-optimal, so feel free to chime in with improvements. As you can see, none of these methods is as efficient as an untrimmed calculation of colMeans, yet I suspect equivalent efficiency is impossible, since additional calculations must be made, including sorting and subsetting of the matrix. This penalty for trimming data is evident when benchmarking the mean of a vector vs. the trimmed counterpart:

v <- runif(1000)
microbenchmark(mean(v), mean(v, trim=0.1))

Unit: microseconds
                expr    min     lq median     uq     max neval
             mean(v)  5.722  6.325  6.927  7.229 124.989   100
 mean(v, trim = 0.1) 42.165 43.671 44.574 44.876  84.630   100
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top