Question

Suppose, I've a data.frame as follows:

set.seed(45)
DF <- data.frame(x=1:10, strata2013=sample(letters[1:3], 10, TRUE))

    x strata2013
1   1          b
2   2          a
3   3          a
4   4          b
5   5          b
6   6          a
7   7          a
8   8          b
9   9          a
10 10          a

And I'd like to get the counts for each unique value in the column strata2013, then, using data.table (for speed), one could do it in this manner:

DT <- as.data.table(DF)
DT[, .N, by=strata2013]
   strata2013 N
1:          b 4
2:          a 6

Now, I'd like to try and accomplish this in Rcpp, as a learning exercise. I've written and tried out the code shown below which is supposed to provide the same output, but instead it gives me an error. Here's the code:

#include <Rcpp.h>
using namespace Rcpp;  

// [[Rcpp::export]]
NumericVector LengthStrata (CharacterVector uniqueStrata, DataFrame dataset ) {
  int n = uniqueStrata.size();
  NumericVector Nh(n);
  Rcpp::CharacterVector strata=dataset["strate2013"];
  for (int i = 0; i < n; ++i) {
    Nh[i]=strata(uniqueStrata(i)).size();
  }
  return Nh;
}

Here is the error message:

conversion from 'Rcpp::Vector<16>::Proxy {aka Rcpp::internal::string_proxy<16>}' 
to 'const size_t { aka const long long unsigned int}' is ambiguous

What am I doing wrong? Thank you very much for your help.

Was it helpful?

Solution

If I understand correctly, you're hoping that strata( uniqueStrata(i) ) will subset the vector, similar to how R's subsetting operates. This is unfortunately not the case; you would have to perform the subsetting 'by hand'. Rcpp doesn't have 'generic' subsetting operates available yet.

When it comes to using Rcpp, you really want to leverage the C++ standard library where possible. The de-facto C++ way of generating these counts would be to use a std::map (or std::unordered_map, if you can assume C++11), with something like the following. I include a benchmark for interest.

Note from Dirk: unordered_map is actually available from tr1 for pre-C++11, so one can include it using e.g. #include <tr1/unordered_map>

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
  Rcpp::CharacterVector strata = dataset["strata2013"];
  int n = strata.size();
  std::map<SEXP, int> counts;
  for (int i = 0; i < n; ++i) {
    ++counts[ strata[i] ];
  }
  return wrap(counts);
}

/*** R
library(data.table)
library(microbenchmark)
set.seed(45)
DF <- data.frame(strata2013=sample(letters, 1E5, TRUE))
DT <- data.table(DF)
LengthStrata(DF)
DT[, .N, by=strata2013]
microbenchmark(
  LengthStrata(DF),
  DT[, .N, by=strata2013]
)
*/

gives me

Unit: milliseconds
                      expr      min       lq   median       uq       max neval
          LengthStrata(DF) 3.267131 3.831563 3.934992 4.101050 11.491939   100
 DT[, .N, by = strata2013] 1.980896 2.360590 2.480884 2.687771  3.052583   100

The Rcpp solution is slower in this case likely due to the time it takes to move R objects to and from the C++ containers, but hopefully this is instructive.

Aside: This is, in fact, already included in Rcpp as the sugar table function, so if you want to skip the learning experience, you can use a pre-baked solution as

#include <Rcpp.h>
using namespace Rcpp;  

// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
  Rcpp::CharacterVector strata = dataset["strata2013"];
  return table(strata);
}

Sugar improves the speed of the Rcpp function:

 Unit: milliseconds
                      expr      min       lq   median       uq       max neval
          LengthStrata(DF) 5.548094 5.870184 6.014002 6.448235  6.922062   100
 DT[, .N, by = strate2013] 6.526993 7.136290 7.462661 7.949543 81.233216   100

OTHER TIPS

I am not sure I understand what you are trying to do. And when strata is a vector

     Rcpp::CharacterVector strata=df["strate2013"];

then I am not sure what

     strata(uniqueStrata(i)).size()

is supposed to do. Maybe you could describe in words (or in R with some example code and data) what you are trying to do here.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top