Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)

Question 1

Update: Issue #495 is now solved by this recent commit, so the following works fine:

require(data.table) # v1.9.7+
set.seed(1L)
dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
                v1 = rnorm(100), 
                v2 = rnorm(100), 
                v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1),  lapply(.SD,mean)), by = grp, .SDcols = sd.cols]

However, note that in this case the second column (auto-named V2) would be returned as a list column. That's because you're effectively doing list(val, list()). What you probably intend to do is:

dt[, c(list(v1=sum(v1)), lapply(.SD, mean)), by=grp, .SDcols = sd.cols]
#    grp        v1          v2         v3
# 1:   a -6.440273  0.16993940  0.2173324
# 2:   b  4.304350 -0.02553813  0.3381612
# 3:   c  0.377974 -0.03828672 -0.2489067
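The difference between the two forms can be seen in base R alone, without data.table. This is a minimal sketch; `vals` is a hypothetical stand-in for .SD and the numbers are made up for illustration.

```r
# Base R only: the same list shapes that j would return per group.
vals <- list(v2 = c(1, 2, 3), v3 = c(10, 20, 30))  # hypothetical stand-in for .SD

nested <- list(v1 = 6, lapply(vals, mean))     # list(val, list()): a list inside a list
flat   <- c(list(v1 = 6), lapply(vals, mean))  # c(list(val), list()): one flat list

length(nested)  # 2: v1 plus one unnamed element that is itself a list
names(flat)     # "v1" "v2" "v3": one element per intended output column
str(flat)
```

Because data.table builds one column per top-level element of the list returned by j, only the flat form gives separate v2 and v3 columns.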

See history for older answer.

Question 2

Try this:

dt[,list(sum(v1), mean(v2), mean(v3)), by=grp]

In data.table, wrapping expressions in list() in the second argument (j) lets you specify the columns that make up the resulting data.table.
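As a small illustration, naming the elements of list() names the result columns; unnamed elements get the defaults V1, V2, and so on. The table `toy` below is hypothetical and exists only for this sketch.

```r
library(data.table)

# Small hypothetical table, used only for illustration.
toy <- data.table(grp = c("a", "a", "b", "b"),
                  v1  = c(1, 2, 3, 4),
                  v2  = c(10, 20, 30, 40),
                  v3  = c(5, 5, 7, 9))

# Each named element of list() becomes one column of the result.
res <- toy[, list(v1 = sum(v1), v2 = mean(v2), v3 = mean(v3)), by = grp]
print(res)
```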

For what it's worth, .SD can be quite slow [^1], so you may want to avoid it unless you truly need all of the data in the subsetted data.table, as you might for a more sophisticated function.
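Whether this matters depends on your data and your data.table version, so it is worth measuring rather than assuming. The sketch below is one way to compare the two forms on a hypothetical table `big` with many small groups; the sizes are arbitrary and no particular timing is implied.

```r
library(data.table)

set.seed(1L)
n <- 1e5L
# Hypothetical case: many groups, each with only a few rows.
big <- data.table(id = sample(n %/% 2L, n, replace = TRUE),
                  v2 = rnorm(n),
                  v3 = rnorm(n))

# Same aggregation written with .SD and with explicit columns.
t_sd   <- system.time(r_sd   <- big[, lapply(.SD, mean), by = id, .SDcols = c("v2", "v3")])
t_list <- system.time(r_list <- big[, list(v2 = mean(v2), v3 = mean(v3)), by = id])

# Elapsed seconds for each form; results vary by machine and version.
print(c(SD = t_sd[["elapsed"]], list = t_list[["elapsed"]]))
```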

Another option, if you have many columns in .SDcols, would be to do the merge in one line using data.table's X[Y] join syntax.

For example:

dt[, list(v1 = sum(v1)), by=grp][dt[, lapply(.SD, mean), by=grp, .SDcols=sd.cols]]

To use this join from data.table, you first need to call setkey() on your data.table so it knows which column to match on.

So really, first you need:

setkey(dt, grp)

Then you can use the line above to produce an equivalent result.
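Put together, the steps might look like the sketch below. The table `toy` and the intermediate names `sums` and `means` are hypothetical, introduced only to make the pieces visible. The `on =` variant is an alternative I am adding here, not part of the approach above; it names the join column directly instead of relying on a key.

```r
library(data.table)

# Small hypothetical table, used only for illustration.
toy <- data.table(grp = c("b", "a", "b", "a"),
                  v1  = c(3, 1, 4, 2),
                  v2  = c(30, 10, 40, 20),
                  v3  = c(7, 5, 9, 5))
sd.cols <- c("v2", "v3")

setkey(toy, grp)  # sorts by grp and marks it as the key

sums  <- toy[, list(v1 = sum(v1)), by = grp]                    # one function on v1
means <- toy[, lapply(.SD, mean), by = grp, .SDcols = sd.cols]  # another on .SDcols

joined    <- sums[means]              # keyed X[Y] join on grp
joined_on <- sums[means, on = "grp"]  # same join, naming the column explicitly

# Single-call form for comparison.
single <- toy[, c(list(v1 = sum(v1)), lapply(.SD, mean)), by = grp, .SDcols = sd.cols]

print(joined)
print(single)
```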

[^1]: I find this to be especially true as your number of groups approaches the total number of rows. For example, this might happen when your key is an individual ID and many individuals have just one or two observations.