Question

The objective is to create indicator variables for a factor/string column in a data frame. That data frame has more than 2 million rows, and since I'm running R on Windows I don't have the option of using plyr with .parallel=TRUE. So I'm taking the "divide and conquer" route with plyr and reshape2.

Running melt and cast runs out of memory, and using

library(plyr)

ddply(idata.frame(items), "ID", function(x) {
  colSums(model.matrix(~ x$element - 1)) > 0
}, .progress = "text")

or

ddply(idata.frame(items), "ID", function(x) {
  elements %in% x$element
}, .progress = "text")

does take a while. The fastest approach so far is the tapply call below. Do you see a way to speed this up? The %in% version runs faster than the model.matrix call. Thanks.

set.seed(123)

dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

prds <- unique(dd$prd)

tapply(dd$prd, dd$id, function(x) prds %in% x)
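(For reference, tapply() returns a list of logical vectors here; stacking it gives the indicator matrix directly. This is just a sketch of that conversion, with ind as an illustrative name:)

```r
set.seed(123)

dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

prds <- unique(dd$prd)

# One logical vector per id, stacked into one row per id,
# one column per product:
ind <- do.call(rbind, tapply(dd$prd, dd$id, function(x) prds %in% x))
colnames(ind) <- prds
ind
```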

Solution

For this problem, the packages bigmemory and bigtabulate might be your friends. Here is a slightly more ambitious example:

library(bigmemory)
library(bigtabulate)

set.seed(123)

dd <- data.frame(
  id  = sample(1:15, size = 2e6, replace = TRUE),
  prd = letters[sample(1:15, size = 2e6, replace = TRUE)]
)

prds <- unique(dd$prd)

library(rbenchmark)

benchmark(
  bigtable(dd, c(1, 2)) > 0,
  table(dd[, 1], dd[, 2]) > 0,
  xtabs(~ id + prd, data = dd) > 0,
  tapply(dd$prd, dd$id, function(x) prds %in% x)
)

And the results of benchmarking (I'm learning new things all the time):

                                            test replications elapsed relative user.self sys.self user.child sys.child
1                      bigtable(dd, c(1, 2)) > 0          100  54.401 1.000000    51.759    3.817          0         0
2                    table(dd[, 1], dd[, 2]) > 0          100 112.361 2.065422   107.526    6.614          0         0
4 tapply(dd$prd, dd$id, function(x) prds %in% x)          100 178.308 3.277660   166.544   13.275          0         0
3                xtabs(~id + prd, data = dd) > 0          100 229.435 4.217478   217.014   16.660          0         0

And that shows bigtable winning by a considerable margin. The result here is pretty much that every prd occurs for every ID, but see ?bigtable for details on the format of the output.
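As a side note on that output format: each of the tabulation approaches produces a count matrix, and thresholding with "> 0" yields the same logical indicator matrix in every case. A minimal base-R sketch (using table() on the small toy data so it runs without bigtabulate installed; bigtable(dd, c(1, 2)) gives a matrix of the same shape):

```r
set.seed(123)

dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

# Counts per (id, prd) pair; "> 0" turns the counts into
# TRUE/FALSE indicators.
counts <- table(dd$id, dd$prd)
ind <- counts > 0
ind
```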

OTHER TIPS

Can you say a little bit more about how the problem scales in terms of the number of levels, number of IDs, etc.? (If you keep the number of levels fixed, then for enough individuals the indicator matrix you're computing will approach all TRUE/all 1 ...) I expected xtabs to be faster, but it isn't for an example of this size ...

library(rbenchmark)
benchmark(
          tapply( dd$prd , dd$id , function(x) prds %in% x ),
          xtabs(~id+prd,data=dd)>0)

     test        replications elapsed relative 
1 tapply(...)             100   0.053 1.000000
2 xtabs(...) > 0          100   0.120 2.264151  

Your use of the %in% function seems backwards to me. If you want a true/false result for each row of the data, then you should use either %in% as a vectorized operation or ave. Although it's not needed here, you might want ave if a more complex function had to be applied within every group.

set.seed(123)

dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

prds <- unique(dd$prd)
target.prds <- prds[1:2]
dd$prd.in.trgt <- with( dd, prd %in% target.prds)
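And a minimal sketch of the ave() variant mentioned above, for when the per-group computation is more involved than a plain %in% (the column name n.trgt.in.id is made up for illustration):

```r
set.seed(123)

dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

target.prds <- unique(dd$prd)[1:2]

# ave() applies FUN within each id group and returns a vector aligned
# with the rows of dd -- here, how many target products each id's rows hit:
dd$n.trgt.in.id <- ave(dd$prd %in% target.prds, dd$id, FUN = sum)
```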
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow