data.table converts factors in `by` to underlying integer?

Question

There are three questions in your post. I'll answer them in order.

The first issue, a factor column referenced in j not retaining its class, is fixed in 1.9.3 (bug #5437, IIRC). It was a small regression introduced by various enhancements in 1.9.0 (and some changes for R 3.1.0, IIRC). Tests have now been added to catch this as well.

require(data.table) ## 1.9.3
dt <- data.table(a=factor(rep(c("a", "b"), 5)), b=1:10)
str(dt[, list(mean(b), a), by=a])

# Classes ‘data.table’ and 'data.frame':    2 obs. of  3 variables:
#  $ a : Factor w/ 2 levels "a","b": 1 2
#  $ V1: num  5 6
#  $ a : Factor w/ 2 levels "a","b": 1 2
#  - attr(*, ".internal.selfref")=<externalptr>

The issue with .BY is also fixed in 1.9.3:

dt[, print(.BY), by=a]
# $a
# [1] a
# Levels: a b

# $a
# [1] b
# Levels: a b

# Empty data.table (0 rows) of 1 col: a

As for the third question, the error from indexing a inside j:

dt[, list(mean(b), a[[2]]), by=a]
# Error in `[[.default`(a, 2) : subscript out of bounds

This is because columns named in by are, by default, available in j as length-1 vectors, so a[[2]] is out of bounds. After all, it's the variable you're grouping by, and it takes a single value within each group.
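
A quick way to see this for yourself is to compare the length of the grouping column with the group size. This is only a sketch using the same dt as above; the output column names len_a and n_rows are made up for illustration.

```r
library(data.table)

dt <- data.table(a = factor(rep(c("a", "b"), 5)), b = 1:10)

# Inside j, the grouping column `a` is a length-1 vector,
# while .N still reports the number of rows in the group.
res <- dt[, list(len_a = length(a), n_rows = .N), by = a]
print(res)
```

Here len_a is 1 for both groups even though each group has 5 rows, which is why a[[2]] has nothing to return.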

However, I've raised potential issues with this feature with @Matt and @eddi. You can find a brief discussion between me and @eddi here, under the comments. I've also written to Matt about this, and it is currently under discussion. It will be resolved and documented soon, whatever the resolution is.

My stance as of now is that columns in by should not mask those of dt. This started from bug #5191, which is basically this:

DT <- data.table(x=1:5, y=6:10)
DT[, sum(x), by=x%%3L]
#    x V1
# 1: 1  1
# 2: 2  2
# 3: 0  0

whereas the correct result should have been:

DT <- data.table(x=1:5, y=6:10)
DT[, sum(x), by=list(grp=x%%3L)]
#    grp V1
# 1:   1  5
# 2:   2  7
# 3:   0  3

The results were wrong because the by column, which is automatically named x, masks the column x of DT within each group, so sum(x) sees only the length-1 grouping value rather than the group's rows. In this case, it happened because we allow expressions in by.

However, the problem is not limited to expressions in by. Consider this case:

DT[, sum(y), by=list(y=x)]
#    y V1
# 1: 1  1
# 2: 2  2
# 3: 3  3
# 4: 4  4
# 5: 5  5

What happened here is that naming the by column y resulted in y from DT being masked.
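
For comparison, here is a minimal sketch of the same call with a by name that does not collide with any column of DT. The name grp is just an illustrative choice, as in the earlier x%%3L example.

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10)

# `grp` is not a column of DT, so `y` in j refers to DT's own y.
# Each group holds a single row, so V1 equals that row's y (6 through 10).
res <- DT[, sum(y), by = list(grp = x)]
print(res)
```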

IMHO, by shouldn't mask the columns of DT that are used in j at all. Instead, anyone who needs to refer to the grouping variable should use the already existing .BY variable (or simply take the first element with [1L]), as follows:

DT[, print(.BY$x), by=x]
# [1] 1
# [1] 2
# [1] 3
# [1] 4
# [1] 5
# Empty data.table (0 rows) of 1 col: x

This is just my opinion, and there may be other arguments for retaining the current feature and fixing only these buggy cases. We'll have to discuss it, reach a resolution, and document it accordingly.

I'll update this post once that's been done :).