Question

Consider:

dt <- data.table(a=factor(rep(c("a", "b"), 5)), b=1:10)
dt[, list(mean(b), a), by=a]

Produces:

   a V1 a
1: a  5 1
2: b  6 2

Also, str() on the result shows:

Classes 'data.table' and 'data.frame':  2 obs. of  3 variables:
 $ a : Factor w/ 2 levels "a","b": 1 2
 $ V1: num  5 6
 $ a : int  1 2
 - attr(*, ".internal.selfref")=<externalptr> 

Note the last column: it comes back as an integer rather than a factor. The by column itself is fine; the problem arises when you re-use the by column explicitly in j. I believe the .BY variable has the same issue. This is in 1.9.2 with R 3.0.2 and RStudio on Win 7 (though also observed on Mac OS 10.8). I think this used to work in earlier versions, but I'm going from memory and am not sure which.

Posting here first in case I'm doing something stupid.

Also, it seems that the ungrouped by variable is no longer available. For example:

dt[, list(mean(b), a[[2]]), by=a]

produces an out of bounds error, though perhaps that was always the case. I would have expected a in j to be fully evaluated in dt, so a[[2]] should work (in my head anyway, perhaps it never did and was never intended to).

Session info:

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] graphics  grDevices utils     datasets  stats     methods   base     

other attached packages:
[1] data.table_1.9.2

loaded via a namespace (and not attached):
[1] Rcpp_0.11.1    functional_0.4 plyr_1.8.1     reshape2_1.2.2
[5] stringr_0.6.2  tools_3.0.2   

Solution

There are three questions in your post. I'll answer them in order.

The first issue, a factor by column not retaining its class when referred to in j, is fixed in 1.9.3 (bug #5437 IIRC). It was a tiny regression caused by various enhancements in 1.9.0 (and some changes for R 3.1.0, IIRC). Tests have now been added to catch this as well.

require(data.table) ## 1.9.3
dt <- data.table(a=factor(rep(c("a", "b"), 5)), b=1:10)
str(dt[, list(mean(b), a), by=a])

# Classes ‘data.table’ and 'data.frame':    2 obs. of  3 variables:
#  $ a : Factor w/ 2 levels "a","b": 1 2
#  $ V1: num  5 6
#  $ a : Factor w/ 2 levels "a","b": 1 2
#  - attr(*, ".internal.selfref")=<externalptr> 

The issue with .BY is also fixed in 1.9.3:

dt[, print(.BY), by=a]
# $a
# [1] a
# Levels: a b

# $a
# [1] b
# Levels: a b

# Empty data.table (0 rows) of 1 col: a

As for the third question, a[[2]] in j still gives an error:

dt[, list(mean(b), a[[2]]), by=a]
# Error in `[[.default`(a, 2) : subscript out of bounds

This is because variables/columns in by are by default available as length=1 vectors. After all, it's the variable you're grouping by.
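
To make the length-1 behaviour concrete, here is a minimal sketch. It assumes a data.table version where the by column is exposed in j as a length-1 vector, as described above. The names len and full are only illustrative, and going through dt$a is one possible way to reach the ungrouped column, not an officially recommended idiom.

```r
library(data.table)

dt <- data.table(a = factor(rep(c("a", "b"), 5)), b = 1:10)

# Inside j, the grouping column `a` is a length-1 vector for each group,
# even though every group has 5 rows (.N).
len <- dt[, list(len = length(a), n = .N), by = a]

# The full, ungrouped column can still be reached explicitly through the
# table object, so indexing its second element works.
full <- dt[, list(mean(b), second = dt$a[[2]]), by = a]
```

Because dt$a is looked up outside the grouping machinery, second holds the same value (the second element of the whole column) in every group.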

However, I've raised potential issues with this feature with @Matt and @eddi. You can find a brief discussion between me and @eddi here under comments. I've also written to Matt about this, and it is currently under discussion. It will soon be resolved and documented, whatever the resolution is.

My stance as of now is that columns in by should not mask those of dt. This started from bug #5191, which is basically this:

DT <- data.table(x=1:5, y=6:10)
DT[, sum(x), by=x%%3L]
#    x V1
# 1: 1  1
# 2: 2  2
# 3: 0  0

The result should have been the following (obtained here by naming the by expression grp, so that it no longer masks x):

DT <- data.table(x=1:5, y=6:10)
DT[, sum(x), by=list(grp=x%%3L)]
#    grp V1
# 1:   1  5
# 2:   2  7
# 3:   0  3

The results weren't right because the by column x masks the column x in DT within each group, so sum(x) only sees the length-1 grouping value. In this case, it happened because we allow expressions in by.

However, the problem is not limited to expressions in by. Consider this case:

DT[, sum(y), by=list(y=x)]
#    y V1
# 1: 1  1
# 2: 2  2
# 3: 3  3
# 4: 4  4
# 5: 5  5

What happened here is that naming the by column y resulted in y from DT being masked.
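
Until this is resolved, one way to sidestep the masking is to give the by column a name that does not clash with any column used in j. The sketch below is only an illustration of that workaround; the names grp and total are made up for the example.

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10)

# Name the grouping column `grp` so that it cannot mask `y` (or `x`) in j.
# sum(y) then sees the real `y` values of each group.
res <- DT[, list(total = sum(y)), by = list(grp = x)]
```

Each group holds a single row here, so total simply reproduces the original y values 6 to 10.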

IMHO, by shouldn't mask the columns from DT that will be used in j at all. Instead, anyone who needs to refer to the grouping variable should use the already existing .BY variable (or simply take the first element with [1L]), as follows:

DT[, print(.BY$x), by=x]
# [1] 1
# [1] 2
# [1] 3
# [1] 4
# [1] 5
# Empty data.table (0 rows) of 1 col: x
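
The same idea can be used inside a real computation rather than a print call. This is a small sketch, assuming a data.table version in which .BY behaves as shown above. The column names from_by, from_col and total are hypothetical, and the data is changed slightly so that some groups have more than one row.

```r
library(data.table)

DT <- data.table(x = c(1L, 1L, 2L, 2L, 3L), y = 6:10)

# Two ways of referring to the grouping value inside j:
#   .BY$x  : the named list of by values for the current group
#   x[1L]  : the first element of x within the group
res <- DT[, list(from_by = .BY$x, from_col = x[1L], total = sum(y)), by = x]
```

Both from_by and from_col match the grouping column x, and total is the per-group sum of y (13, 17 and 10).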

This is just my opinion and there might be other arguments for retaining the current feature and just fixing these potential buggy cases. We'll have to discuss it and get this resolved and depending on what we conclude, document it accordingly.

I'll update this post once that's been done :).

Licensed under: CC-BY-SA with attribution