Regarding your data.table solution, you don't need to set a key for aggregation operations. You can do the aggregation directly:
indexVars = paste0('f', 1:4) ## no sep argument needed with paste0
dtDup <- as.data.table(dfDup) ## faster than data.table(.)
dtDupAgg = dtDup[, list(data = sum(data)), by = c(indexVars)]
data.table version 1.9.2+ also implements a function setDT that converts data.frames to data.tables by reference (which means there is no copy, so the conversion takes almost no time; this is especially useful on large data.frames).
So, instead of doing:
dtDup <- as.data.table(dfDup)
dtDup[...]
You could do:
## data.table v1.9.2+
setDT(dfDup) ## faster than as.data.table(.)
dfDup[...] ## dfDup is now a data.table, converted by reference
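If you want to convince yourself that setDT really converts in place rather than copying, here is a quick sketch (using a small made-up data.frame, since dfDup itself isn't shown here):

```r
library(data.table)

## small hypothetical stand-in for dfDup
df <- data.frame(f1 = c(1, 1, 2), data = c(10, 20, 30))

setDT(df)   ## converts df to a data.table by reference, no copy made
class(df)   ## now inherits from both "data.table" and "data.frame"
```

After the call, df can be used with data.table syntax directly, e.g. `df[, sum(data), by = f1]`.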
On your first question: plyr is not known for its speed. Check Why is plyr so slow? (and the many informative comments there) for more info. You may be interested in dplyr, which is orders of magnitude faster than plyr, but still slower than data.table, IMHO. Here's the equivalent dplyr version:
dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data))
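Note that %.% was dplyr's chaining operator at the time; it has since been deprecated in favour of the %>% pipe, so in current dplyr the same aggregation would look like this (sketched on a small hypothetical data.frame with the same column layout as dfDup):

```r
library(dplyr)

## hypothetical stand-in for dfDup: columns f1..f4 plus a data column
dfDup <- expand.grid(f1 = 1:2, f2 = 1:2, f3 = 1:2, f4 = 1:2)
dfDup$data <- runif(nrow(dfDup))

## modern equivalent of the %.% chain above
ans <- dfDup %>%
  group_by(f1, f2, f3, f4) %>%
  summarise(data = sum(data), .groups = "drop")
```

The `.groups = "drop"` argument (dplyr 1.0+) just silences the grouping message and returns an ungrouped result.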
Here's a benchmark between data.table and dplyr on your data (all timings are the minimum of three consecutive runs):
## data.table v1.9.2+
system.time(ans1 <- dtDup[, list(data=sum(data)), by=c(indexVars)])
# user system elapsed
# 0.049 0.009 0.057
## dplyr (commit ~1360 from github)
system.time(ans2 <- dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data)))
# user system elapsed
# 0.374 0.013 0.389
I really don't have the patience to run the plyr version (I stopped it after 93 seconds of the first run). As you can see, dplyr is much faster than plyr, but ~7x slower than data.table here.
Check that the results are equal to be sure:
all.equal(as.data.frame(ans1[order(f1,f2,f3,f4)]),
as.data.frame(ans2))
# [1] TRUE
HTH