big ddply, appropriate alternative

https://stackoverflow.com/questions/14563674

05-03-2022

Question

I have a list of data.frames. Each data.frame is not very big (~150,000 rows), but my list has thousands of them.

A data.frame looks like:

comp <- read.table(text = " G T H S B
                             1 1 1 1 x1
                             1 1 1 2 x2
                             1 2 6 1 x3
                             1 2 6 2 x4
                             2 1 7 1 x1
                             2 2 8 2 x2
                             2 2 8 1 x1
                             2 3 9 2 x2",header=TRUE,stringsAsFactors=FALSE)

So the list is:

complist <- list(comp,comp,comp)

For every data.frame (comp), I want to know the length of B, i.e. the number of rows, for each S in each H in each T in each G.

For my small practice example I use:

library(plyr)
# For each data.frame, count the rows in every (G, T, H, S) group
listresults <- lapply(complist, function(x) {
  ddply(x, .(G, T, H, S),
        function(z) data.frame(resultcol = length(z$B)))
})

But on my larger list this takes a bruuutally long time. Could someone help me find a quicker way? aggregate is not an option here, and I have been failing with a sapply(split()) alternative to ddply. Suggestions, even without actual code, would be just as helpful to me.

Solution

This is a situation where data.table might be a very good option. data.table has consistently been shown to be blisteringly fast, much more so than plyr. There are many examples of this here on SO.

For more information, you can check out the documentation of data.table, or look at the [r] and [data.table] tags on SO.
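As a rough sketch of what this could look like for the question's complist (this is not code from the original answer): stack the list into a single data.table with rbindlist(), tagging each row with the position of the data.frame it came from, then count rows per group in one pass with .N. The column name id is my own choice, and the approach assumes the stacked table fits in memory, which may not hold for thousands of 150,000-row data.frames.

```r
library(data.table)

# Sample data from the question
comp <- read.table(text = "G T H S B
1 1 1 1 x1
1 1 1 2 x2
1 2 6 1 x3
1 2 6 2 x4
2 1 7 1 x1
2 2 8 2 x2
2 2 8 1 x1
2 3 9 2 x2", header = TRUE, stringsAsFactors = FALSE)
complist <- list(comp, comp, comp)

# Stack all data.frames into one data.table; 'id' (a name chosen here for
# illustration) records which list element each row came from
stacked <- rbindlist(complist, idcol = "id")

# One grouped pass over everything: .N is the number of rows in each group
counts <- stacked[, list(resultcol = .N), by = c("id", "G", "T", "H", "S")]
print(counts)
```

If the per-data.frame results are needed separately afterwards, counts can be subset by id, e.g. counts[id == 1].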

Other answers

Using data.table may make this faster. Here's how you can do it.

library(data.table)
# Count rows per group in each data.frame; .N is the size of each group
o <- lapply(complist, function(df) {
    dt <- data.table(df, key=c("S", "H", "T", "G"))
    dt[, list(resultcol = .N), by=c("S", "H", "T", "G")]
})

One attempt using data.table:

library(data.table)
lapply(complist, function(df) {
  df <- data.table(df, key=c("G","T","H","S"))
  # length(B) is the row count per group; the unnamed result column is called V1
  df[,length(B),by=c("G","T","H","S")]
})

Disclaimer: this is the first time I'm using data.table, so be careful with this answer :)
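Given that caveat, one way to check such answers is to run the plyr and data.table versions side by side, confirm they return the same counts, and time them on your own data. The sketch below does this on the question's toy list; the variable names are mine, and timings on eight-row data.frames say nothing about the real list, so swap in the actual complist before drawing conclusions.

```r
library(plyr)
library(data.table)

# Sample data from the question
comp <- read.table(text = "G T H S B
1 1 1 1 x1
1 1 1 2 x2
1 2 6 1 x3
1 2 6 2 x4
2 1 7 1 x1
2 2 8 2 x2
2 2 8 1 x1
2 3 9 2 x2", header = TRUE, stringsAsFactors = FALSE)
complist <- list(comp, comp, comp)
by_cols <- c("G", "T", "H", "S")

# plyr version, as in the question
t_plyr <- system.time(
  res_plyr <- lapply(complist, function(x)
    ddply(x, .(G, T, H, S), function(z) data.frame(resultcol = length(z$B))))
)

# data.table version; keyby sorts the output by the grouping columns
t_dt <- system.time(
  res_dt <- lapply(complist, function(x)
    as.data.table(x)[, list(resultcol = .N), keyby = c("G", "T", "H", "S")])
)

# Compare element by element after sorting the plyr output the same way
same <- mapply(function(a, b) {
  a <- a[do.call(order, unname(a[by_cols])), ]
  isTRUE(all.equal(a, as.data.frame(b), check.attributes = FALSE))
}, res_plyr, res_dt)

stopifnot(all(same))
print(rbind(plyr = t_plyr, data.table = t_dt))
```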

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow