You don't have to avoid duplicate keys. As long as the result does not grow beyond max(nrow(x), nrow(i)) rows, you won't get this error, even if you have duplicates; the check is basically a precautionary measure. (Newer data.table versions document the limit as nrow(x)+nrow(i).)
When you have duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows early on how many rows the join will produce, it raises this error and asks you to pass allow.cartesian=TRUE if you're really sure.
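As a small sketch of the first point, the join below has duplicate keys on both sides but stays under the limit, so it runs without allow.cartesian. The tables and values are made up for illustration, and the result is kept within the stricter max(nrow(x), nrow(i)) bound so it should pass under either version of the limit.

```r
library(data.table)

# Both tables contain duplicate keys ("b" appears twice in each).
DT1 <- data.table(x = c("a", "a", "a", "b", "b"), y = 1:5, key = "x")
DT2 <- data.table(x = c("b", "b"), key = "x")

# Each of the 2 rows in DT2 matches the 2 "b" rows in DT1: 2 * 2 = 4 rows.
# 4 <= max(nrow(DT1), nrow(DT2)) = 5, so allow.cartesian is not needed.
res <- DT1[DT2]
nrow(res)
# [1] 4
```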
Here's an (exaggerated) example that illustrates the idea behind this error message:
library(data.table)
DT1 <- data.table(x=rep(letters[1:2], c(1e2, 1e7)),
                  y=1L, key="x")
DT2 <- data.table(x=rep("b", 3), key="x")
# not run
# DT1[DT2] ## error
dim(DT1[DT2, allow.cartesian=TRUE])
# [1] 30000000 2
The duplicates in DT2 resulted in 3 times the total number of "b" rows in DT1 (=1e7), i.e. 3e7 rows. Imagine performing the join with 1e4 values in DT2: the result would explode! To guard against this, there's the allow.cartesian argument, which is FALSE by default.
That being said, I think Matt once mentioned that it may be possible to raise the error only for "large" joins (that is, joins that result in a huge number of rows, with the cutoff perhaps set arbitrarily). If and when that is implemented, joins that don't combinatorially explode would run without this error message.
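If you'd like to check ahead of time whether a join like the one above would explode, one possible approach is to count rows per key on each side and sum the products. This is only a sketch, not something data.table does for you: the names cnt1, cnt2 and est are made up for illustration, the tables are scaled down from the example above, and the estimate counts matching rows only (keys in i with no match in x would each add one NA row to x[i]).

```r
library(data.table)

# Scaled-down version of the example: 1e4 "b" rows instead of 1e7.
DT1 <- data.table(x = rep(letters[1:2], c(1e2, 1e4)), y = 1L, key = "x")
DT2 <- data.table(x = rep("b", 3), key = "x")

# Rows per key value on each side.
c1 <- DT1[, .(cnt1 = .N), by = x]
c2 <- DT2[, .(cnt2 = .N), by = x]

# For every key present in both tables, the join yields cnt1 * cnt2 rows.
# as.numeric() avoids integer overflow for very large products.
est <- merge(c1, c2, by = "x")[, sum(as.numeric(cnt1) * cnt2)]
est  # 3 * 1e4 = 30000 rows expected

# The actual join agrees with the estimate.
nrow(DT1[DT2, allow.cartesian = TRUE]) == est
```

Comparing est against nrow(DT1) and nrow(DT2) then tells you whether allow.cartesian=TRUE is a deliberate choice or a sign of unexpected duplicates.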