Regarding your data.table solution, you don't need to set a key for aggregation operations. You can do the aggregation directly:
indexVars = paste0('f', 1:4) ## no sep argument needed with paste0
dtDup <- as.data.table(dfDup) ## faster than data.table(.)
dtDupAgg = dtDup[, list(data = sum(data)), by = c(indexVars)]
data.table version 1.9.2+ also implements a function setDT that converts data.frames to data.tables by reference (which means there is no copy, so the conversion takes almost no time; this is especially useful on large data.frames).
So, instead of doing:
dtDup <- as.data.table(dfDup)
dtDup[...]
You could do:
## data.table v1.9.2+
setDT(dfDup) ## faster than as.data.table(.)
dfDup[...] ## dfDup is now a data.table, converted by reference
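If you want to convince yourself that setDT really converts in place rather than copying, here is a quick sketch (using a small made-up data.frame, since dfDup itself isn't shown here):

```r
library(data.table)

## small hypothetical stand-in for dfDup
df <- data.frame(f1 = c(1, 1, 2), data = c(10, 20, 30))

setDT(df)   ## converts df to a data.table by reference, no copy made
class(df)   ## now inherits from both "data.table" and "data.frame"
```

After the call, df can be used with data.table syntax directly, e.g. `df[, sum(data), by = f1]`.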
On your first question: plyr is not known for its speed. Check Why is plyr so slow? (and the many informative comments there) for more info. You may be interested in dplyr, which is orders of magnitude faster than plyr, but still slower than data.table, IMHO. Here's the equivalent dplyr version:
dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data))
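Note that %.% was dplyr's chaining operator at the time; it has since been deprecated in favour of the %>% pipe, so in current dplyr the same aggregation would look like this (sketched on a small hypothetical data.frame with the same column layout as dfDup):

```r
library(dplyr)

## hypothetical stand-in for dfDup: columns f1..f4 plus a data column
dfDup <- expand.grid(f1 = 1:2, f2 = 1:2, f3 = 1:2, f4 = 1:2)
dfDup$data <- runif(nrow(dfDup))

## modern equivalent of the %.% chain above
ans <- dfDup %>%
  group_by(f1, f2, f3, f4) %>%
  summarise(data = sum(data), .groups = "drop")
```

The `.groups = "drop"` argument (dplyr 1.0+) just silences the grouping message and returns an ungrouped result.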
Here's a benchmark between data.table and dplyr on your data (all timings are the minimum of three consecutive runs):
## data.table v1.9.2+
system.time(ans1 <- dtDup[, list(data=sum(data)), by=c(indexVars)])
# user system elapsed
# 0.049 0.009 0.057
## dplyr (commit ~1360 from github)
system.time(ans2 <- dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data)))
# user system elapsed
# 0.374 0.013 0.389
I really don't have the patience to run the plyr version (I stopped it after 93 seconds of the first run). As you can see, dplyr is much faster than plyr, but ~7x slower than data.table here.
Check that the results are equal to be sure:
all.equal(as.data.frame(ans1[order(f1,f2,f3,f4)]),
as.data.frame(ans2))
# [1] TRUE
HTH