Since you say you have too many rows to do this, I'll suggest a data.table
solution:
require(data.table)
DT <- data.table(df) # where `df` is your data.frame
DT[, diff(value), by=type]$V1
# [1] 2 6 1
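For concreteness, here is a minimal sketch with a small made-up data.frame that reproduces the output above. The column values are an assumption for illustration only, not the data from the question:

```r
library(data.table)

# Hypothetical input: three types with two values each
df <- data.frame(type  = c("a", "a", "b", "b", "c", "c"),
                 value = c(1, 3, 2, 8, 4, 5))

DT <- data.table(df)

# diff() is evaluated once per group; an unnamed j result lands in column V1
res <- DT[, diff(value), by = type]$V1
print(res)
# [1] 2 6 1
```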
Benchmarking this on simulated data of your dimensions, it takes close to 20 seconds (the bottleneck should be the calls to diff):
require(data.table)
set.seed(45)
types <- sapply(1:5e5, function(x) paste0(sample(letters, 5, TRUE), collapse=""))
DT <- data.table(value=sample(100, 5e6, TRUE), type=sample(types, 5e6, TRUE))
system.time(t1 <- DT[, diff(value), by=type]$V1)
# user system elapsed
# 18.610 0.238 19.166
To compare against the other answer, which uses tapply:
system.time(t2 <- tapply(DT[["value"]], DT[["type"]], diff))
# user system elapsed
# 48.471 0.664 51.673
Also, tapply orders the results by type, whereas data.table without a key preserves the original order.
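The ordering difference can be seen on a tiny example. This is only a sketch: the data and the variable names (by_tapply, by_dt, by_dt_keyed) are made up for illustration, with type "b" deliberately appearing before type "a":

```r
library(data.table)

# Hypothetical data in which type "b" appears before type "a"
df <- data.frame(type  = c("b", "b", "a", "a"),
                 value = c(1, 4, 10, 12),
                 stringsAsFactors = FALSE)

# tapply sorts the groups by type: a first, then b
by_tapply <- unlist(tapply(df$value, df$type, diff))

# An unkeyed data.table keeps groups in order of first appearance: b, then a
by_dt <- data.table(df)[, diff(value), by = type]$V1

# Keying on type sorts the rows, so the groups come out as a, then b
by_dt_keyed <- data.table(df, key = "type")[, diff(value), by = type]$V1

print(by_tapply)    # a = 2, b = 3
print(by_dt)        # 3 2
print(by_dt_keyed)  # 2 3
```

So if you need the data.table result in the same order as tapply, setting a key on type (or sorting afterwards) is one way to get it.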
Edit: Following @eddi's comment:
system.time(t3 <- DT[, value[-1]-value[-.N], by=type]$V1)
# user system elapsed
# 6.221 0.195 6.641
That's a 3x improvement from removing the call to diff. Thanks, @eddi.
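As a quick sanity check that the two expressions compute the same thing, here is a small sketch on made-up data (the table size and the names with_diff and without_diff are assumptions chosen so it runs instantly):

```r
library(data.table)

set.seed(1)
# Small hypothetical table, far smaller than the benchmark data
DT <- data.table(value = sample(100, 20, TRUE),
                 type  = sample(c("x", "y", "z"), 20, TRUE))

with_diff <- DT[, diff(value), by = type]$V1

# value[-1] drops the first element and value[-.N] drops the last,
# so their difference is the vector of successive differences per group
without_diff <- DT[, value[-1] - value[-.N], by = type]$V1

print(identical(with_diff, without_diff))
```

The speedup presumably comes from skipping the per-group overhead of calling diff; the arithmetic itself is the same.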