You don't need to do this when you use data.tables. Instead, you should set a key or use an ad-hoc by (as I show in the example below). Grouping this way is one of the foundations of how data.table operates.
Toy example using by:
Look at the toy example below. We sum rating by the id and grp variables. Where a combination of the grouping variables occurs more than once, its ratings are summed; a combination that occurs only once is treated on its own. Note the values of rating and sum_rating in the last row, which has a unique combination of grouping variables (every other combination covers two rows, as in your example):
# Load the package that provides data.table() and the := operator
library( data.table )
# Make this data reproducible
set.seed(1)
dt <- data.table( id = c( rep( 1:2 , 2 ) , 1 ) , grp = c( rep( 1:2 , 2 ) , 3 ) , rating = sample( 5 , 5 , TRUE ) )
#    id grp rating
# 1:  1   1      4
# 2:  2   2      1
# 3:  1   1      3
# 4:  2   2      4
# 5:  1   3      4
# Sum by 'id' and 'grp'...
dt[ , sum_rating := sum( rating ) , by = list( id , grp ) ]
dt
#    id grp rating sum_rating
# 1:  1   1      4          7
# 2:  2   2      1          5
# 3:  1   1      3          7
# 4:  2   2      4          5
# 5:  1   3      4          4   <===== rating and sum_rating are the same because this is a unique row
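The example above uses the ad-hoc by route. For completeness, here is a minimal sketch of the other option mentioned at the top, setting a key first. It assumes the same column names as the toy data, but the ratings are typed in by hand (an assumption made so the result does not depend on the random seed). Note that setkey() sorts the table by reference, so the row order differs from the ad-hoc version.

```r
library(data.table)

# Same shape as the toy data above; ratings are hard-coded for determinism
dt <- data.table(
  id     = c(1, 2, 1, 2, 1),
  grp    = c(1, 2, 1, 2, 3),
  rating = c(4, 1, 3, 4, 4)
)

# Set the key on the grouping columns; this reorders dt by (id, grp) in place
setkey(dt, id, grp)

# Group by the key columns and add the per-group sum by reference
dt[, sum_rating := sum(rating), by = key(dt)]

dt
```

Either route gives the same per-group sums. The keyed version is mainly worth it when you will group or join on the same columns repeatedly.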