Question

Is it possible to write a C++ function that takes an R data frame as input, takes a subset of it, and returns the result as a new data frame? My code below may make the question clearer:

code:

# Suppose I have the data frame below created in R:
myDF = data.frame(id = rep(c(1,2), each = 5), alph = letters[1:10], mess = rnorm(10))

# Suppose I want to write a C++ function that gets id as input and returns
# a sub-dataframe corresponding to that id (**if it is possible to return a
# DataFrame from C++**)

// Auxiliary function: returns the elements of vecMain whose IDVec entry equals ID
arma::vec myVecSubset(arma::vec vecMain, arma::vec IDVec, int ID){
  arma::uvec AuxVec = find(IDVec == ID);
  arma::vec rslt = arma::vec(AuxVec.size());
  for (int i = 0; i < AuxVec.size(); i++){
    rslt[i] = vecMain[AuxVec[i]];
  }
  return rslt;
}

// Here is my C++ function:
Rcpp::DataFrame myVecSubset(Rcpp::DataFrame myDF, int ID){
  arma::vec id = Rcpp::as<arma::vec>(myDF["id"]);
  arma::vec alph = Rcpp::as<arma::vec>(myDF["alpha"]);
  arma::vec mess = Rcpp::as<arma::vec>(myDF["mess"]);

  // here I take a sub-vector:
  arma::vec id_sub = myVecSubset(id, id, ID);
  arma::vec alph_sub = myVecSubset(alph, id, ID);
  arma::vec mess_sub = myVecSubset(mess, id, ID);

  // here is the CHALLENGE: How to combine these vectors into a new data frame???
  ???
}

In summary, there are two main questions:

1) Is there a better way to take the sub-dataframe above in C++? (I wish I could simply write myDF[myDF$id == ID, ]!)

2) Is there any way to combine id_sub, alph_sub, and mess_sub into an R data frame and return it?

I really appreciate your help.

Solution 2

You don't need Rcpp and RcppArmadillo for that; you can just use R's subset or perhaps dplyr::filter. This is likely to be more efficient than your code, which has to deep-copy data from the data frame into Armadillo vectors, create new Armadillo vectors, and then copy these back into R vectors so that you can build the data frame. That is a lot of wasted work. Another source of waste is that you compute the same find(IDVec == ID) index three times, once per column.
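
The repeated lookup can be avoided by computing the matching row positions once and reusing them for every column. A minimal sketch in plain C++, with std::vector standing in for the R and Armadillo vector types; the helper names matching_rows and take_rows are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: 0-based positions i where ids[i] == target.
std::vector<std::size_t> matching_rows(const std::vector<double>& ids,
                                       double target) {
    std::vector<std::size_t> rows;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (ids[i] == target) rows.push_back(i);
    }
    return rows;
}

// Hypothetical helper: gather the selected rows of a single column.
// Works for any element type, so numeric and string columns share it.
template <typename T>
std::vector<T> take_rows(const std::vector<T>& column,
                         const std::vector<std::size_t>& rows) {
    std::vector<T> out;
    out.reserve(rows.size());
    for (std::size_t r : rows) {
        assert(r < column.size());  // every row position must be in range
        out.push_back(column[r]);
    }
    return out;
}
```

The index is built with one pass over the id column; each further column then costs only a gather over the selected rows.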

Anyway, to answer your question, just use DataFrame::create.

DataFrame::create( _["id"] = id_sub, _["alpha"] = alph_sub, _["mess"] = mess_sub ) ;

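Also, note that in your code alph will be a factor when stringsAsFactors is TRUE (the default before R 4.0.0), so arma::vec alph = Rcpp::as<arma::vec>(myDF["alpha"]); is not likely to do what you want. The column is also named alph, not alpha, so the lookup myDF["alpha"] will not find it.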
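
To see why a numeric conversion of a factor is not useful: a factor stores 1-based integer codes plus a vector of level labels, so converting it to numbers yields the codes, not the letters. A minimal plain C++ sketch of the decoding step (decode_factor is a made-up name, std::vector stands in for the R vectors, and NA codes are not handled):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: map each 1-based factor code to its level label,
// which is what as.character() does for a factor on the R side.
std::vector<std::string> decode_factor(const std::vector<int>& codes,
                                       const std::vector<std::string>& levels) {
    std::vector<std::string> labels;
    labels.reserve(codes.size());
    for (int code : codes) {
        // Codes are positions into levels, starting at 1.
        assert(code >= 1 && static_cast<std::size_t>(code) <= levels.size());
        labels.push_back(levels[code - 1]);
    }
    return labels;
}
```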

OTHER TIPS

To add to Romain's answer, you can try calling the [ operator through Rcpp. Once we understand how df[x, ] is evaluated (i.e., it is really a call to "[.data.frame"(df, x, R_MissingArg)), this is easy to do:

#include <Rcpp.h>
using namespace Rcpp;

Function subset("[.data.frame");

// [[Rcpp::export]]
DataFrame subset_test(DataFrame x, IntegerVector y) {
  return subset(x, y, R_MissingArg);
}

/*** R
df <- data.frame(x=1:3, y=letters[1:3])
subset_test(df, c(1L, 2L))
*/

gives me

> df <- data.frame(x=1:3, y=letters[1:3])
> subset_test(df, c(1L, 2L))
  x y
1 1 a
2 2 b

Calling back into R from Rcpp is generally slower than staying in C++, but depending on how much of a bottleneck this is, it could still be fast enough for you.

Be careful, though: because this function calls R's own [ operator, integer indices are 1-based rather than 0-based.
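
The off-by-one is easy to trip over when the positions are computed on the C++ side. A minimal sketch in plain C++ of shifting 0-based positions before handing them to R's [ operator; the helper name to_one_based is made up for illustration, and std::vector<int> stands in for IntegerVector:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: shift 0-based C++ positions to the 1-based
// positions that R's `[` expects.
std::vector<int> to_one_based(const std::vector<int>& zero_based) {
    std::vector<int> one_based;
    one_based.reserve(zero_based.size());
    for (int pos : zero_based) {
        assert(pos >= 0);  // negative indices mean "drop these rows" in R
        one_based.push_back(pos + 1);
    }
    return one_based;
}
```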

Here is a complete test file. It does not need your extractor function and simply re-assembles the subsets. It does, however, need the very newest Rcpp, currently on GitHub, where Kevin has added some work on subset indexing, which is just what we need here:

#include <Rcpp.h>

/*** R
##  Suppose I have the data frame below created in R:
##  NB: stringsAsFactors set to FALSE
##  NB: setting seed as well
set.seed(42)
myDF <- data.frame(id = rep(c(1,2), each = 5), 
                   alph = letters[1:10], 
                   mess = rnorm(10), 
                   stringsAsFactors = FALSE)
*/

// [[Rcpp::export]]
Rcpp::DataFrame extract(Rcpp::DataFrame D, Rcpp::IntegerVector idx) {

  Rcpp::IntegerVector     id = D["id"];
  Rcpp::CharacterVector alph = D["alph"];
  Rcpp::NumericVector   mess = D["mess"];

  return Rcpp::DataFrame::create(Rcpp::Named("id")    = id[idx],
                                 Rcpp::Named("alpha") = alph[idx],
                                 Rcpp::Named("mess")  = mess[idx]);
}

/*** R
extract(myDF, c(2,4,6,8))
*/

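With that file, we get the expected result (note that the indices passed to extract() are 0-based on the C++ side, so 2 selects the third row):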

R> library(Rcpp)
R> sourceCpp("/tmp/sepher.cpp")

R> ##  Suppose I have the data frame below created in R:
R> ##  NB: stringsAsFactors set to FALSE
R> ##  NB: setting seed as well
R> set.seed(42)

R> myDF <- data.frame(id = rep(c(1,2), each = 5), 
+                    alph = letters[1:10], 
+                    mess = rnorm(10), 
+               .... [TRUNCATED] 

R> extract(myDF, c(2,4,6,8))
  id alpha     mess
1  1     c 0.363128
2  1     e 0.404268
3  2     g 1.511522
4  2     i 2.018424
R>
R> packageDescription("Rcpp")$Version   ## unreleased version
[1] "0.11.1.1"
R> 

I needed something similar a few weeks ago (though without character vectors) and used Armadillo's elem() member function with an unsigned-integer vector (uvec) as the index.

Licensed under: CC-BY-SA with attribution